

2.7 Large Numbers of Files

Occasionally one desires to digest (i.e., concatenate or average) hundreds or thousands of input files. Unfortunately, data archives (e.g., NASA EOSDIS) may not name netCDF files in a format understood by the ‘-n loop’ switch (see Specifying Input Files) that automagically generates arbitrary numbers of input filenames. The ‘-n loop’ switch has the virtue of being concise, and of minimizing the command line. This helps keep the output file small since the command line is stored as metadata in the history attribute (see History Attribute). However, the ‘-n loop’ switch is useless when the input filenames lack a simple, arithmetic pattern (such as h00001.nc, h00002.nc, ... h90210.nc). Moreover, filename globbing does not work when the input files are too numerous or their names are too lengthy (when strung together as a single argument) to be passed by the calling shell to the NCO operator [1]. When this occurs, the ANSI C-standard argc-argv method of passing arguments from the calling shell to a C-program (i.e., an NCO operator) breaks down. There are (at least) three alternative methods of specifying the input filenames to NCO in environment-limited situations.
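Before choosing among these methods, it can help to estimate whether a given file list would fit on the command line at all. The following is a rough sketch, assuming a POSIX system where getconf ARG_MAX reports the maximum combined size in bytes of the arguments and environment. The helper name fits_on_command_line is hypothetical and not part of NCO, and the comparison is only an estimate because the environment and per-argument overhead also count against the limit.

```shell
# Rough check: does a list of filenames (one per line on stdin) fit within ARG_MAX?
# fits_on_command_line is a hypothetical helper, not part of NCO.
fits_on_command_line() {
    limit=$(getconf ARG_MAX)        # bytes available for arguments plus environment
    needed=$(wc -c | tr -d ' ')     # bytes in the newline-separated list read from stdin
    [ "$needed" -lt "$limit" ]
}

# Example: test the .nc files in the current directory
if ls | grep '\.nc$' | fits_on_command_line; then
    echo "file list probably fits on the command line"
else
    echo "file list is too long; use stdin, xargs, or symbolic links"
fi
```

A list that passes this check can still fail in practice when the environment is large, so treat a result close to the limit as a failure.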

The recommended method for sending very large numbers (hundreds or more, typically) of input filenames to the multi-file operators is to pass the filenames with the UNIX standard input feature, aka stdin:

     # Pipe large numbers of filenames to stdin
     /bin/ls | grep ${CASEID}_'......\.nc' | ncecat -o foo.nc

This method avoids all constraints on command line size imposed by the operating system. A drawback to this method is that the history attribute (see History Attribute) does not record the names of any input files since the names were not passed on the command line. This makes determining the data provenance at a later date difficult. To remedy this situation, multi-file operators store the number of input files in the nco_input_file_number global attribute and the input file list itself in the nco_input_file_list global attribute (see File List Attributes). Although this does not preserve the exact command used to generate the file, it does retain all the information required to reconstruct the command and determine the data provenance.
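One way to make the stdin method easier to audit is to save the file list before piping it to the operator, and then to inspect the file-list attributes in the output. The sketch below assumes NCO and the netCDF ncdump utility are installed; the helper list_case_files, the file name file_list.txt, and the default CASEID value are illustrative choices, not NCO conventions.

```shell
# list_case_files is a hypothetical helper (not part of NCO): print the names in
# directory $1 that look like <CASEID>_<six characters>.nc, where CASEID is $2.
list_case_files() {
    ls "$1" | grep "^${2}_......\.nc\$"
}

CASEID=${CASEID:-h}                                     # illustrative default
list_case_files . "$CASEID" > file_list.txt || true     # grep exits 1 when nothing matches

# Run the operator only if NCO is installed and the list is non-empty
if command -v ncecat >/dev/null 2>&1 && [ -s file_list.txt ]; then
    cat file_list.txt | ncecat -o foo.nc                # filenames arrive on stdin
    ncdump -h foo.nc | grep nco_input_file              # show the file-list attributes
fi
```

Keeping file_list.txt alongside the output gives a second record of the inputs in addition to the global attributes.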

A second option is to use the UNIX xargs command. This simple example selects as input to xargs all the filenames in the current directory that match a given pattern. For illustration, consider a user trying to average millions of files which each have a six character filename. If the shell buffer cannot hold the results of the corresponding globbing operator, ??????.nc, then the filename globbing technique will fail. Instead we express the filename pattern as an extended regular expression, ......\.nc (see Subsetting Variables). We use grep to filter the directory listing for this pattern and to pipe the results to xargs which, in turn, passes the matching filenames to an NCO multi-file operator, e.g., ncecat.

     # Use xargs to transfer filenames on the command line
     /bin/ls | grep ${CASEID}_'......\.nc' | xargs -x ncecat -o foo.nc

The single quotes protect the only sensitive parts of the extended regular expression (the grep argument), and allow shell interpolation (the ${CASEID} variable substitution) to proceed unhindered on the rest of the command. xargs uses the UNIX pipe feature to append the suitably filtered input file list to the end of the ncecat command options. The -o foo.nc switch ensures that the input files supplied by xargs are not confused with the output file name. xargs does, unfortunately, have its own limit (usually about 20,000 characters) on the size of command lines it can pass. Give xargs the ‘-x’ switch to ensure it dies if it reaches this internal limit. When this occurs, use either the stdin method above, or the symbolic link method presented next.
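By default xargs handles an over-long list by running the command several times, each time with a portion of the list, which for ncecat would mean several runs writing the same output file. A simple sanity check before running the real command is to substitute echo for the operator and count the output lines, since each invocation of echo prints one line. The sketch below uses the POSIX ‘-s’ option to impose an artificially small limit so that the effect is visible with only ten names. The helpers count_xargs_runs and names are hypothetical, and the filenames are generated with printf purely for illustration.

```shell
# Count how many separate commands xargs would run for a list of names.
# Each echo invocation prints exactly one line, so the line count equals
# the number of invocations. Extra arguments are passed through to xargs.
count_xargs_runs() {
    xargs "$@" echo | wc -l | tr -d ' '
}

# Ten illustrative filenames, 000001.nc through 000010.nc (not real data files)
names() { printf '%06d.nc\n' 1 2 3 4 5 6 7 8 9 10; }

names | count_xargs_runs           # default limits: the whole list fits in one run
names | count_xargs_runs -s 60     # artificially small limit: the list is split
```

If the count for the real file list is greater than one, the list does not fit in a single command line and one of the other methods is the safer choice.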

Even when its internal limits have not been reached, the xargs technique may not be sophisticated enough to handle all situations. A full scripting language like Perl can handle any level of complexity of filtering input filenames, and any number of filenames. The technique of last resort is to write a script that creates symbolic links between the irregular input filenames and a set of regular, arithmetic filenames that the ‘-n loop’ switch understands. For example, the following Perl script creates a monotonically enumerated symbolic link to each of up to one million .nc files in a directory. If there are 999,999 netCDF files present, the links are named 000001.nc to 999999.nc:

     # Create enumerated symbolic links
     /bin/ls | grep '\.nc$' | perl -e \
     '$idx=1;while(<STDIN>){chop;symlink $_,sprintf("%06d.nc",$idx++);}'
     ncecat -n 999999,6,1 000001.nc foo.nc
     # Remove symbolic links when finished
     /bin/rm ??????.nc

The ‘-n loop’ option tells the NCO operator to automatically generate the filenames of the symbolic links. This circumvents any OS and shell limits on command line size. The symbolic links are easily removed once NCO is finished. One drawback to this method is that the history attribute (see History Attribute) retains the filename list of the symbolic links, rather than the data files themselves. This makes it difficult to determine the data provenance at a later date.
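The same idea can be sketched in plain shell, with two changes that are assumptions of this example rather than part of the method described here: the links are created in a separate scratch directory so that cleanup cannot touch real data files, and a text file mapping each link to its target is written to ease the provenance problem. The helper name make_enumerated_links and the file name link_map.txt are hypothetical.

```shell
# make_enumerated_links is a hypothetical helper (not part of NCO).
# For every .nc file in directory $1, create a symbolic link NNNNNN.nc in
# directory $2 and append a "link target" pair to $2/link_map.txt.
# $1 and $2 should be different directories.
make_enumerated_links() {
    src=$(cd "$1" && pwd)           # absolute path so the links resolve from anywhere
    idx=1
    ls "$src" | grep '\.nc$' | while read -r name; do
        link=$(printf '%06d.nc' "$idx")
        ln -s "$src/$name" "$2/$link"
        printf '%s %s\n' "$link" "$src/$name" >> "$2/link_map.txt"
        idx=$((idx + 1))
    done
}

# Usage sketch (paths and <count> are placeholders):
#   mkdir links && make_enumerated_links /path/to/data links
#   cd links && ncecat -n <count>,6,1 000001.nc ../foo.nc
#   cd .. && rm -r links     # keep a copy of link_map.txt first if provenance matters
```

The number of lines in link_map.txt is the file count to give as the first argument of the ‘-n loop’ switch.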


Footnotes

[1] The exact length which exceeds the operating system internal limit for command line lengths varies from OS to OS and from shell to shell. GNU bash may not have any arbitrary fixed limits to the size of command line arguments. Many OSs cannot handle command line arguments (including results of file globbing) exceeding 4096 characters.