

3.22 Chunking

Availability: ncap2, ncbo, ncea, ncecat, ncflint, ncks, ncpdq, ncra, ncrcat, ncwa
Short options: none
Long options: ‘--cnk_dmn dmn_nm,cnk_sz’, ‘--chunk_dimension dmn_nm,cnk_sz’,
‘--cnk_map cnk_map’, ‘--chunk_map cnk_map’,
‘--cnk_plc cnk_plc’, ‘--chunk_policy cnk_plc’,
‘--cnk_scl cnk_sz’, ‘--chunk_scalar cnk_sz’

All netCDF4-enabled NCO operators that define variables support a plethora of chunksize options. Chunking can significantly accelerate or degrade read/write access to large datasets. Dataset chunking issues are described in detail here.

The NCO chunking implementation is designed to be flexible. Users control three aspects of it: the chunking policy, the chunking map, and the chunksize. The first two are high-level mechanisms that apply to an entire file, while the third allows per-dimension specification of parameters. The implementation is a hybrid of the ncpdq packing policies (see ncpdq netCDF Permute Dimensions Quickly) and the hyperslab specifications (see Hyperslabs). Each aspect is intended to have a sensible default, so most users need to set only one switch to obtain sensible chunking. Power users can tune the three switches in tandem to obtain optimal performance.

The user specifies the desired chunking policy with the ‘--cnk_plc’ long option (or its equivalent, ‘--chunk_policy’) and its cnk_plc argument; there is no short option. Five chunking policies are currently implemented:

Chunk All Variables [default]
Definition: Chunk all variables possible
Alternate invocation: ncchunk
cnk_plc key values: ‘all’, ‘cnk_all’, ‘plc_all’
Mnemonic: All

Chunk Variables with at least Two Dimensions
Definition: Chunk all variables possible with at least two dimensions
Alternate invocation: none
cnk_plc key values: ‘g2d’, ‘cnk_g2d’, ‘plc_g2d’
Mnemonic: Greater than or equal to 2 Dimensions

Chunk Variables with at least Three Dimensions
Definition: Chunk all variables possible with at least three dimensions
Alternate invocation: none
cnk_plc key values: ‘g3d’, ‘cnk_g3d’, ‘plc_g3d’
Mnemonic: Greater than or equal to 3 Dimensions

Chunk Variables Containing Explicitly Chunked Dimensions
Definition: Chunk all variables possible that contain at least one dimension whose chunksize was explicitly set with the ‘--cnk_dmn’ option.
Alternate invocation: none
cnk_plc key values: ‘xpl’, ‘cnk_xpl’, ‘plc_xpl’
Mnemonic: EXPLicitly specified dimensions

Unchunking
Definition: Unchunk all variables
Alternate invocation: ncunchunk
cnk_plc key values: ‘uck’, ‘cnk_uck’, ‘plc_uck’, ‘unchunk’
Mnemonic: UnChunK

Equivalent key values are fully interchangeable. Multiple equivalent options are provided to satisfy disparate needs and tastes of NCO users working with scripts and from the command line.
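
The interchangeability of the key values can be illustrated with a small shell sketch. The function name cnk_plc_canonical is hypothetical and not part of NCO; it simply maps each key listed above to the first key given for its policy, which may be useful when a script accepts a policy name from its own caller before passing it to ‘--cnk_plc’.

```shell
# Hypothetical helper (not part of NCO): map any documented cnk_plc key
# to the first key listed for its policy.
cnk_plc_canonical() {
  case "$1" in
    all|cnk_all|plc_all)         echo all ;;
    g2d|cnk_g2d|plc_g2d)         echo g2d ;;
    g3d|cnk_g3d|plc_g3d)         echo g3d ;;
    xpl|cnk_xpl|plc_xpl)         echo xpl ;;
    uck|cnk_uck|plc_uck|unchunk) echo uck ;;
    *) echo "unknown chunking policy: $1" >&2; return 1 ;;
  esac
}

cnk_plc_canonical plc_g2d   # prints g2d
cnk_plc_canonical unchunk   # prints uck
```

NCO itself accepts any of the equivalent keys directly, so a helper like this is only a convenience for scripts that want to validate or log the policy.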

The chunking algorithms must know the chunksizes of each dimension of each variable to be chunked. The correspondence between the input variable shape and the chunksizes is called the chunking map. The user specifies the desired chunking map with the ‘--cnk_map’ long option (or its equivalent, ‘--chunk_map’) and its cnk_map argument; there is no short option. Four chunking maps are currently implemented:

Chunksize Equals Dimension Size [default]
Definition: Chunksize defaults to dimension size. Explicitly specify chunksizes for particular dimensions with ‘--cnk_dmn’ option.
cnk_map key values: ‘dmn’, ‘cnk_dmn’, ‘map_dmn’
Mnemonic: DiMeNsion

Chunksize Equals Dimension Size except Record Dimension
Definition: Chunksize equals dimension size except record dimension has size one. Explicitly specify chunksizes for particular dimensions with ‘--cnk_dmn’ option.
cnk_map key values: ‘rd1’, ‘cnk_rd1’, ‘map_rd1’
Mnemonic: Record Dimension size 1

Chunksize Equals Scalar Size Specified
Definition: Chunksize for all dimensions is set with the ‘--cnk_scl’ option.
cnk_map key values: ‘xpl’, ‘cnk_xpl’, ‘map_xpl’
Mnemonic: EXPLicitly specified dimensions

Chunksize Product Equals Scalar Size Specified
Definition: The product of the chunksizes for each variable (approximately) equals the size specified with the ‘--cnk_scl’ option. For a variable of rank R (i.e., with R non-degenerate dimensions), the chunksize in each non-degenerate dimension is the Rth root of cnk_scl.
cnk_map key values: ‘prd’, ‘cnk_prd’, ‘map_prd’
Mnemonic: PRoDuct

It is possible to combine the above chunking map algorithms with user-specified per-dimension (but not per-variable) chunksizes that override specific chunksizes determined by the maps above. The user specifies the per-dimension chunksizes with the (equivalent) long options ‘--cnk_dmn’ or ‘--chunk_dimension’. The option takes two comma-separated arguments, dmn_nm,cnk_sz, which are the dimension name and its chunksize, respectively. The ‘--cnk_dmn’ option may be used as many times as necessary.
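
The arithmetic behind the ‘prd’ map can be sketched in shell. The helper prd_chunksize is hypothetical and not part of NCO; it computes only the ideal Rth root of cnk_scl. Since the text above says the product is approximately equal to cnk_scl and does not state how NCO rounds, the value is printed with two decimals rather than rounded to an integer.

```shell
# Hypothetical helper (not part of NCO): ideal, unrounded per-dimension
# chunksize for the 'prd' map, i.e. the Rth root of cnk_scl for rank R.
# Usage: prd_chunksize cnk_scl rank
prd_chunksize() {
  # Reject ranks below one to avoid dividing by zero
  [ "$2" -ge 1 ] || { echo "rank must be at least 1" >&2; return 1; }
  awk -v scl="$1" -v rank="$2" 'BEGIN { printf "%.2f\n", exp(log(scl) / rank) }'
}

prd_chunksize 8 3      # prints 2.00
prd_chunksize 4096 2   # prints 64.00
```

For example, with ‘--cnk_scl=8’ a variable with three non-degenerate dimensions would ideally receive a chunksize of 2 in each of them, giving a chunk of 2 x 2 x 2 = 8 elements.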

     # Simple chunking and unchunking
     ncks -O -4 --cnk_plc=all     in.nc out.nc # Chunk in.nc
     ncks -O -4 --cnk_plc=unchunk in.nc out.nc # Unchunk in.nc
     
     # Chunk data then unchunk it, printing informative metadata
     ncks -O -4 -D 4 --cnk_plc=all ~/nco/data/in.nc ~/foo.nc
     ncks -O -4 -D 4 --cnk_plc=uck ~/foo.nc ~/foo.nc
     
     # More complex chunking procedures, with informative metadata
     ncks -O -4 -D 4 --cnk_scl=8 ~/nco/data/in.nc ~/foo.nc
     ncks -O -4 -D 4 --cnk_scl=8 /data/zender/dstmch90/dstmch90_clm.nc ~/foo.nc
     ncks -O -4 -D 4 --cnk_dmn lat,64 --cnk_dmn lon,128 /data/zender/dstmch90/dstmch90_clm.nc ~/foo.nc
     ncks -O -4 -D 4 --cnk_plc=uck ~/foo.nc ~/foo.nc
     ncks -O -4 -D 4 --cnk_plc=g2d --cnk_map=rd1 --cnk_dmn lat,32 --cnk_dmn lon,128 /data/zender/dstmch90/dstmch90_clm_0112.nc ~/foo.nc
     
     # Chunking works with all operators...
     ncap2 -O -4 -D 4 --cnk_scl=8 -S ~/nco/data/ncap2_tst.nco ~/nco/data/in.nc ~/foo.nc
     ncbo -O -4 -D 4 --cnk_scl=8 -p ~/nco/data in.nc in.nc ~/foo.nc
     ncecat -O -4 -D 4 -n 12,2,1 --cnk_dmn lat,32 -p /data/zender/dstmch90 dstmch90_clm01.nc ~/foo.nc
     ncflint -O -4 -D 4 --cnk_scl=8 -p ~/nco/data in.nc in.nc ~/foo.nc
     ncpdq -O -4 -D 4 -P all_new --cnk_scl=8 -L 5 ~/nco/data/in.nc ~/foo.nc
     ncrcat -O -4 -D 4 -n 12,2,1 --cnk_dmn lat,32 -p /data/zender/dstmch90 dstmch90_clm01.nc ~/foo.nc
     ncwa -O -4 -D 4 -a time --cnk_plc=g2d --cnk_map=rd1 --cnk_dmn lat,32 --cnk_dmn lon,128 /data/zender/dstmch90/dstmch90_clm_0112.nc ~/foo.nc
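
Because ‘--cnk_dmn’ may be repeated, scripts often need to assemble one option per dimension. The following is a minimal sketch under that assumption; cnk_dmn_cmd is a hypothetical helper, not an NCO command, and it only prints the resulting ncks command line so that it can be inspected before being run.

```shell
# Hypothetical helper (not part of NCO): print, but do not run, an ncks
# command that applies one --cnk_dmn option per "name,size" pair.
# Usage: cnk_dmn_cmd input_file output_file name,size [name,size ...]
cnk_dmn_cmd() {
  in_fl=$1; out_fl=$2; shift 2
  cmd="ncks -O -4"
  for pair in "$@"; do
    cmd="$cmd --cnk_dmn $pair"
  done
  echo "$cmd $in_fl $out_fl"
}

cnk_dmn_cmd in.nc out.nc lat,64 lon,128
# prints: ncks -O -4 --cnk_dmn lat,64 --cnk_dmn lon,128 in.nc out.nc
```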

One aspect of chunking may surprise users: record dimensions are always chunked with a chunksize of one. Hence all variables that contain the record dimension are also stored as chunked, since data must be stored with chunking either in all dimensions or in none. Unless otherwise specified by the user, the other (fixed, non-record) dimensions of such variables are assigned default chunksizes. The HDF5 layer does all this automatically to optimize the on-disk variable/file storage geometry of record variables. Consequently, files created without any explicit instructions to activate chunking may nevertheless contain chunked variables.
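
One way to see which variables ended up chunked is to inspect the special attributes that the netCDF utility ncdump reports with its ‘-s’ option. The sketch below assumes ncdump is installed alongside NCO; list_chunksizes is a hypothetical helper that filters ‘ncdump -hs’ output, and the sample input is made up purely for illustration.

```shell
# Hypothetical helper (not part of NCO): read `ncdump -hs` output on stdin
# and print "variable: chunksizes" for each variable reporting _ChunkSizes.
list_chunksizes() {
  sed -n 's/^[[:space:]]*\([^:[:space:]]*\):_ChunkSizes = \(.*\) ;$/\1: \2/p'
}

# Typical use (requires ncdump and a netCDF4 file):
#   ncdump -hs ~/foo.nc | list_chunksizes

# Illustrative input, made up for this sketch:
printf '\tfloat T(time, lat, lon) ;\n\t\tT:_Storage = "chunked" ;\n\t\tT:_ChunkSizes = 1, 64, 128 ;\n' | list_chunksizes
# prints: T: 1, 64, 128
```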