

3.11 Subsetting Variables

Availability: (ncap2), ncbo, ncea, ncecat, ncflint, ncks, ncpdq, ncra, ncrcat, ncwa
Short options: ‘-v’, ‘-x’
Long options: ‘--variable’, ‘--exclude’ or ‘--xcl’
Subsetting variables means explicitly specifying which variables an operator includes in or excludes from its actions. Subsetting is implemented with the ‘-v var[,...]’ and ‘-x’ options. A list of variables to extract is specified following the ‘-v’ option, e.g., ‘-v time,lat,lon’. Not using the ‘-v’ option is equivalent to specifying all variables. The ‘-x’ option causes the list of variables specified with ‘-v’ to be excluded rather than extracted. Thus ‘-x’ saves typing when you want to extract more than half of the variables in a file, since it is then shorter to list the variables to drop.

Variables explicitly specified for extraction with ‘-v var[,...]’ must be present in the input file or an error will result. Variables explicitly specified for exclusion with ‘-x -v var[,...]’ need not be present in the input file. Remember, if averaging or concatenating large files stresses your system's memory or disk resources, then the easiest solution is often to use the ‘-v’ option to retain only the most important variables (see Memory Requirements).

Due to its special capabilities, ncap2 interprets the ‘-v’ switch differently (see ncap2 netCDF Arithmetic Processor). For ncap2, the ‘-v’ switch takes no arguments and indicates that only user-defined variables should be output. ncap2 neither accepts nor understands the ‘-x’ switch.

As of NCO 2.8.1 (August, 2003), variable name arguments of the ‘-v’ switch may contain extended regular expressions. As of NCO 3.9.6 (January, 2009), variable name arguments to ncatted may contain extended regular expressions. For example, ‘-v '^DST'’ selects all variables beginning with the string ‘DST’. Extended regular expressions are defined by the GNU egrep command. The meta-characters used to express pattern matching operations are ‘^$+?.*[]{}|’. If the regular expression pattern matches any part of a variable name then that variable is selected. This capability is called wildcarding, and is very useful for subsetting large data files.

Because of its wide availability, NCO uses the POSIX regular expression library regex. Regular expressions of arbitrary complexity may be used. Since netCDF variable names are relatively simple constructs, only a few varieties of variable wildcards are likely to be useful. For convenience, we define the most useful pattern matching operators here:

^
Matches the beginning of a string
$
Matches the end of a string
.
Matches any single character

The most useful repetition and combination operators are:

?
The preceding regular expression is optional and matched at most once
*
The preceding regular expression will be matched zero or more times
+
The preceding regular expression will be matched one or more times
|
The preceding regular expression will be joined to the following regular expression. The resulting regular expression matches any string matching either subexpression.

To illustrate the use of these operators in extracting variables, consider a file with variables Q, Q01--Q99, Q100, QAA--QZZ, Q_H2O, X_H2O, Q_CO2, X_CO2:
     ncks -v 'Q.?' in.nc              # Variables that contain Q
     ncks -v '^Q.?' in.nc             # Variables that start with Q
     ncks -v '^Q+.?.' in.nc           # Q0--Q9, Q01--Q99, QAA--QZZ, etc. (not Q alone)
     ncks -v '^Q..' in.nc             # Q01--Q99, QAA--QZZ, etc.
     ncks -v '^Q[0-9][0-9]' in.nc     # Q01--Q99, Q100
     ncks -v '^Q[[:digit:]]{2}' in.nc # Q01--Q99, Q100
     ncks -v 'H2O$' in.nc             # Q_H2O, X_H2O
     ncks -v 'H2O$|CO2$' in.nc        # Q_H2O, X_H2O, Q_CO2, X_CO2
     ncks -v '^Q[0-9][0-9]$' in.nc    # Q01--Q99
     ncks -v '^Q[0-6][0-9]|7[0-3]' in.nc # Q01--Q73, Q100
     ncks -v '(Q[0-6][0-9]|7[0-3])$' in.nc # Q01--Q73
     ncks -v '^[A-Z]_[A-Z0-9]{3}$' in.nc # Q_H2O, X_H2O, Q_CO2, X_CO2
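
A pattern can be tried out before it is handed to an operator. The sketch below is not part of NCO: it assumes that ‘grep -E’ is a close enough stand-in for the POSIX regex library, and it uses a hypothetical helper, match_vars, together with a short hand-written sample of the variable names described above. Like NCO, ‘grep -E’ selects a name when the pattern matches any part of it.

```shell
# Hypothetical sample of the variable names in in.nc, one per line.
vars='Q
Q01
Q73
Q99
Q100
QAA
QZZ
Q_H2O
X_H2O
Q_CO2
X_CO2'

# match_vars PATTERN: print on one line the sample names that PATTERN selects.
# grep -E reports a line when the pattern matches anywhere within it.
match_vars() {
    printf '%s\n' "$vars" | grep -E -- "$1" | paste -s -d ' ' -
}

match_vars '^Q[0-9][0-9]$'          # prints: Q01 Q73 Q99
match_vars '(Q[0-6][0-9]|7[0-3])$'  # prints: Q01 Q73
match_vars 'H2O$|CO2$'              # prints: Q_H2O X_H2O Q_CO2 X_CO2
```

Once a pattern selects the expected names, the same quoted string can be passed to ‘-v’.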

Beware: two of the most frequently used repetition pattern matching operators, ‘*’ and ‘?’, are also valid pattern matching operators for filename expansion (globbing) at the shell-level. Confusingly, their meanings in extended regular expressions and in shell-level filename expansion are significantly different. In an extended regular expression, ‘*’ matches zero or more occurrences of the preceding regular expression. Thus ‘Q*’ selects all variables, and ‘Q+.*’ selects all variables containing ‘Q’ (the ‘+’ ensures the preceding item matches at least once). To match zero or one occurrence of the preceding regular expression, use ‘?’. Documentation for the UNIX egrep command details the extended regular expressions which NCO supports.
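
The difference can be seen by applying the same pattern text under both sets of rules. The following is only an illustrative sketch: the helpers glob_match and ere_match and the four sample names are hypothetical, shell pattern rules are exercised through a ‘case’ statement (which must match the whole name, as filename expansion does), and ‘grep -E’ stands in for the regular expression matching that NCO performs.

```shell
# Hypothetical sample names, separated by spaces.
names='Q Q1 Q01 X_H2O'

# glob_match PATTERN: names that PATTERN matches in full under shell rules.
glob_match() {
    out=''
    for n in $names; do
        case $n in
            $1) out="$out $n" ;;   # unquoted $1 is treated as a shell pattern
        esac
    done
    printf '%s\n' "${out# }"
}

# ere_match PATTERN: names containing a match for the extended regular expression.
ere_match() {
    printf '%s\n' $names | grep -E -- "$1" | paste -s -d ' ' -
}

glob_match 'Q*'    # prints: Q Q1 Q01          (Q followed by any string)
ere_match  'Q*'    # prints: Q Q1 Q01 X_H2O    (zero or more Qs: always matches)
glob_match 'Q?'    # prints: Q1                (Q followed by exactly one character)
ere_match  'Q?'    # prints: Q Q1 Q01 X_H2O    (zero or one Q: always matches)
ere_match  'Q+.*'  # prints: Q Q1 Q01          (at least one Q required)
```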

One must be careful to protect any special characters in the regular expression specification from being interpreted (globbed) by the shell. This is accomplished by enclosing special characters within single or double quotes:

     ncra -v Q?? in.nc out.nc   # Error: Shell attempts to glob wildcards
     ncra -v '^Q+..' in.nc out.nc # Correct: NCO interprets wildcards
     ncra -v '^Q+..' in*.nc out.nc # Correct: NCO interprets, Shell globs

The final example shows that commands may use a combination of variable wildcarding and shell filename expansion (globbing). For globbing, ‘*’ and ‘?’ have nothing to do with the preceding regular expression! In shell-level filename expansion, ‘*’ matches any string, including the null string, and ‘?’ matches any single character. Documentation for bash and csh describes the rules of filename expansion (globbing).
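
What the shell does to an unquoted pattern can be observed without running any NCO operator. This sketch assumes a Bourne-style shell with ‘mktemp’ available, and the file names Q01 and Q02 are hypothetical; it works in a scratch directory so that no real files are involved.

```shell
# Work in a scratch directory so the demonstration touches no real files.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
touch Q01 Q02              # hypothetical files whose names fit the pattern Q??

unquoted=$(echo Q??)       # the shell expands the pattern before echo runs
quoted=$(echo 'Q??')       # the quotes pass the pattern through unchanged

echo "unquoted: $unquoted"   # prints: unquoted: Q01 Q02
echo "quoted:   $quoted"     # prints: quoted:   Q??

cd - >/dev/null
rm -rf "$tmpdir"
```

When no file name fits an unquoted pattern, bash by default passes the pattern through unchanged whereas csh reports an error, so an unquoted pattern may appear to work in one directory and fail in another. Quoting removes that dependence on directory contents.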