Occasionally one desires to digest (i.e., concatenate or average)
hundreds or thousands of input files.
Unfortunately, data archives (e.g., NASA EOSDIS) may not
name netCDF files in a format understood by the ‘-n loop’
switch (see Specifying Input Files) that automagically generates
arbitrary numbers of input filenames.
The ‘-n loop’ switch has the virtue of being concise,
and of minimizing the command line.
This helps keep the output file small since the command line is stored
as metadata in the history
attribute
(see History Attribute).
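For example, assuming twelve hypothetical history files named
h0001.nc through h0012.nc, one seed filename plus the switch
(arguments: number of files, number of digits, numeric increment)
expands to the full list:

# Sketch: expands to h0001.nc, h0002.nc, ..., h0012.nc
ncecat -n 12,4,1 h0001.nc foo.nc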
However, the ‘-n loop’ switch is useless when there is no
simple, arithmetic pattern to the input filenames (e.g.,
h00001.nc, h00002.nc, ... h90210.nc).
Moreover, filename globbing does not work when the input files are too
numerous or their names are too lengthy (when strung together as a
single argument) to be passed by the calling shell to the NCO
operator [1].
When this occurs, the ANSI C-standard argc-argv method of passing
arguments from the calling shell to a C-program (i.e., an NCO
operator) breaks down.
There are (at least) three alternative methods of specifying the input
filenames to NCO in environment-limited situations.
The recommended method for sending very large numbers (hundreds or
more, typically) of input filenames to the multi-file operators is
to pass the filenames with the UNIX standard input
feature, aka stdin
:
# Pipe large numbers of filenames to stdin
/bin/ls | grep ${CASEID}_'......'.nc | ncecat -o foo.nc
This method avoids all constraints on command line size imposed by
the operating system.
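The filenames need not come from /bin/ls; any command that writes
whitespace-separated filenames to stdout suffices.
For instance, this sketch assumes the desired names have already been
saved, one per line, in a hypothetical file file_list.txt:

# Pipe a pre-computed file list (hypothetical file_list.txt) to stdin
cat file_list.txt | ncecat -o foo.nc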
A drawback to this method is that the history
attribute
(see History Attribute) does not record the name of any input
files since the names were not passed on the command line.
This makes determining the data provenance at a later date difficult.
To remedy this situation, multi-file operators store the number of
input files in the nco_input_file_number
global attribute and the
input file list itself in the nco_input_file_list
global attribute
(see File List Attributes).
Although this does not preserve the exact command used to generate the
file, it does retain all the information required to reconstruct the
command and determine the data provenance.
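These attributes may be examined afterwards with, e.g., ncks.
The grep here merely filters the metadata listing for the two
provenance attributes:

# Dump global metadata, keep only the input-file provenance attributes
ncks -M foo.nc | grep nco_input_file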
A second option is to use the UNIX xargs command.
This simple example selects as input to xargs all the
filenames in the current directory that match a given pattern.
For illustration, consider a user trying to average millions of files
which each have a six-character filename.
If the shell buffer cannot hold the results of the corresponding
globbing operator, ??????.nc, then the filename globbing technique
will fail.
Instead we express the filename pattern as an extended regular
expression, ......\.nc (see Subsetting Variables).
We use grep to filter the directory listing for this pattern
and to pipe the results to xargs which, in turn, passes the
matching filenames to an NCO multi-file operator, e.g., ncecat.
# Use xargs to transfer filenames on the command line
/bin/ls | grep ${CASEID}_'......'.nc | xargs -x ncecat -o foo.nc
The single quotes protect the only sensitive parts of the extended
regular expression (the grep argument), and allow shell
interpolation (the ${CASEID}
variable substitution) to
proceed unhindered on the rest of the command.
xargs uses the UNIX pipe feature to append the
suitably filtered input file list to the end of the ncecat
command options.
The -o foo.nc
switch ensures that the input files supplied by
xargs are not confused with the output file name.
xargs does, unfortunately, have its own limit (usually about
20,000 characters) on the size of command lines it can pass.
Give xargs the ‘-x’ switch to ensure it dies if it
reaches this internal limit.
When this occurs, use either the stdin
method above, or the
symbolic link presented next.
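GNU xargs (though not necessarily other implementations) can
report the limits it will enforce, which helps predict whether the
‘-x’ switch will trigger:

# GNU xargs prints its command-line size limits (no input consumed)
xargs --show-limits < /dev/null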
Even when its internal limits have not been reached, the
xargs technique may not be sophisticated enough to handle all
situations.
A full scripting language like Perl can handle any level of
complexity of filtering input filenames, and any number of filenames.
The technique of last resort is to write a script that creates
symbolic links between the irregular input filenames and a set of
regular, arithmetic filenames that the ‘-n loop’ switch
understands.
For example, the following Perl script creates a monotonically
enumerated symbolic link to up to one million .nc files in a
directory.
If there are 999,999 netCDF files present, the links are named
000001.nc to 999999.nc:
# Create enumerated symbolic links
/bin/ls | grep \.nc | perl -e \
'$idx=1;while(<STDIN>){chop;symlink $_,sprintf("%06d.nc",$idx++);}'
ncecat -n 999999,6,1 000001.nc foo.nc
# Remove symbolic links when finished
/bin/rm ??????.nc
The ‘-n loop’ option tells the NCO operator to
automatically generate the filenames of the symbolic links.
This circumvents any OS and shell limits on command line size.
The symbolic links are easily removed once NCO is finished.
One drawback to this method is that the history
attribute
(see History Attribute) retains the filename list of the symbolic
links, rather than the data files themselves.
This makes it difficult to determine the data provenance at a later date.
[1] The exact length that exceeds the operating system's internal
limit on command-line length varies from OS to OS
and from shell to shell.
GNU bash
may not have any arbitrary fixed limits to the
size of command line arguments.
Many OSs cannot handle command line arguments (including
results of file globbing) exceeding 4096 characters.
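On POSIX-compliant systems the kernel's limit on the combined length
of arguments and environment may be queried with getconf, though
shells may impose stricter limits of their own:

# Query the kernel limit on argument-plus-environment length (bytes)
getconf ARG_MAX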