3.22 Chunking
Availability: ncap2, ncbo, ncea,
ncecat, ncflint, ncks, ncpdq,
ncra, ncrcat, ncwa
Short options: none
Long options: ‘--cnk_dmn dmn_nm,cnk_sz’,
‘--chunk_dimension dmn_nm,cnk_sz’,
‘--cnk_map cnk_map’, ‘--chunk_map cnk_map’,
‘--cnk_plc cnk_plc’, ‘--chunk_policy cnk_plc’,
‘--cnk_scl cnk_sz’, ‘--chunk_scalar cnk_sz’
All netCDF4-enabled NCO operators that define variables
support a plethora of chunksize options.
Chunking can significantly accelerate or degrade read/write access
to large datasets.
Dataset chunking issues are described in detail
here.
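To illustrate why chunk shape can speed up or slow down access, the following minimal sketch counts how many chunks a hyperslab read intersects. It is a back-of-the-envelope model, not part of NCO: the function name `chunks_touched`, the variable shape, and the two chunk shapes are assumptions chosen for illustration.

```python
from math import prod

def chunks_touched(chunk_shape, start, count):
    """Count the chunks that intersect a hyperslab.

    chunk_shape, start and count are per-dimension sequences; start is
    zero-based and count is the number of elements read per dimension.
    """
    per_dim = []
    for chunk, first, length in zip(chunk_shape, start, count):
        last = first + length - 1
        # Index of the last chunk minus index of the first chunk, plus one
        per_dim.append(last // chunk - first // chunk + 1)
    return prod(per_dim)

# Hypothetical variable with dimensions (time, lat, lon) = (365, 64, 128)
map_chunks = (1, 64, 128)     # one horizontal map per chunk
series_chunks = (365, 8, 8)   # full time series of an 8x8 patch per chunk

# Read a time series at a single grid point
point_start, point_count = (0, 10, 20), (365, 1, 1)
# Read one complete horizontal map at the first time step
map_start, map_count = (0, 0, 0), (1, 64, 128)

print(chunks_touched(map_chunks, point_start, point_count))     # 365
print(chunks_touched(series_chunks, point_start, point_count))  # 1
print(chunks_touched(map_chunks, map_start, map_count))         # 1
print(chunks_touched(series_chunks, map_start, map_count))      # 128
```

Each chunk touched must be read (and decompressed, if compression is on) in full, so a chunk shape that suits one access pattern can be costly for another.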
The NCO chunking implementation is designed to be flexible.
Users control three aspects of the chunking implementation.
These are known as the chunking policy, chunking map,
and chunksize.
The first two are high-level mechanisms that apply to an entire file,
while the third allows per-dimension specification of parameters.
The implementation borrows from both the ncpdq packing policies
(see ncpdq netCDF Permute Dimensions Quickly) and the hyperslab
specifications (see Hyperslabs).
Each aspect is intended to have a sensible default, so that most users
will only need to set one switch to obtain sensible chunking.
Power users can tune the three switches in tandem to obtain optimal
performance.
The user specifies the desired chunking policy with the long option
‘--cnk_plc’ (or its synonym ‘--chunk_policy’) and its cnk_plc argument.
There is no short option for the chunking policy.
Five chunking policies are currently implemented:
- Chunk All Variables [default]
- Definition: Chunk all variables possible
Alternate invocation: ncchunk
cnk_plc key values: ‘all’, ‘cnk_all’, ‘plc_all’
Mnemonic: All
- Chunk Variables with at least Two Dimensions
- Definition: Chunk all variables possible with at least two dimensions
Alternate invocation: none
cnk_plc key values: ‘g2d’, ‘cnk_g2d’, ‘plc_g2d’
Mnemonic: Greater than or equal to 2 Dimensions
- Chunk Variables with at least Three Dimensions
- Definition: Chunk all variables possible with at least three dimensions
Alternate invocation: none
cnk_plc key values: ‘g3d’, ‘cnk_g3d’, ‘plc_g3d’
Mnemonic: Greater than or equal to 3 Dimensions
- Chunk Variables Containing Explicitly Chunked Dimensions
- Definition: Chunk all variables possible that contain at least one
dimension whose chunksize was explicitly set with the ‘--cnk_dmn’ option.
Alternate invocation: none
cnk_plc key values: ‘xpl’, ‘cnk_xpl’, ‘plc_xpl’
Mnemonic: EXPLicitly specified dimensions
- Unchunking
- Definition: Unchunk all variables
Alternate invocation: ncunchunk
cnk_plc key values: ‘uck’, ‘cnk_uck’, ‘plc_uck’, ‘unchunk’
Mnemonic: UnChunK
Equivalent key values are fully interchangeable.
Multiple equivalent options are provided to satisfy disparate needs
and tastes of NCO users working with scripts and from the
command line.
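The five policies can be summarized as a selection rule over variables. The sketch below is an illustrative model only, not NCO source code: the names `POLICY_KEYS`, `CANONICAL`, and `variables_to_chunk` are hypothetical, and it assumes that "all variables possible" excludes scalar (rank-0) variables.

```python
# Key values from the list above, grouped under one canonical name each.
POLICY_KEYS = {
    "all": ("all", "cnk_all", "plc_all"),
    "g2d": ("g2d", "cnk_g2d", "plc_g2d"),
    "g3d": ("g3d", "cnk_g3d", "plc_g3d"),
    "xpl": ("xpl", "cnk_xpl", "plc_xpl"),
    "uck": ("uck", "cnk_uck", "plc_uck", "unchunk"),
}
CANONICAL = {key: name for name, keys in POLICY_KEYS.items() for key in keys}

def variables_to_chunk(cnk_plc, variables, explicit_dims=()):
    """Return names of variables that a policy would chunk.

    variables maps a variable name to its tuple of dimension names;
    explicit_dims lists dimensions given with --cnk_dmn.
    """
    policy = CANONICAL[cnk_plc]
    explicit = set(explicit_dims)
    selected = []
    for name, dims in variables.items():
        rank = len(dims)
        if policy == "all":
            chosen = rank >= 1  # assumption: scalars cannot be chunked
        elif policy == "g2d":
            chosen = rank >= 2
        elif policy == "g3d":
            chosen = rank >= 3
        elif policy == "xpl":
            chosen = any(dim in explicit for dim in dims)
        else:  # "uck": every variable is unchunked
            chosen = False
        if chosen:
            selected.append(name)
    return selected

# Hypothetical file contents
variables = {
    "scl": (),
    "lat": ("lat",),
    "area": ("lat", "lon"),
    "temp": ("time", "lat", "lon"),
}
print(variables_to_chunk("all", variables))                # ['lat', 'area', 'temp']
print(variables_to_chunk("cnk_g2d", variables))            # ['area', 'temp']
print(variables_to_chunk("plc_g3d", variables))            # ['temp']
print(variables_to_chunk("xpl", variables, ["time"]))      # ['temp']
print(variables_to_chunk("unchunk", variables))            # []
```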
The chunking algorithms must know the chunksizes of each dimension of
each variable to be chunked.
The correspondence between the input variable shape and the chunksizes
is called the chunking map.
The user specifies the desired chunking map with the long option
‘--cnk_map’ (or its synonym ‘--chunk_map’) and its cnk_map argument.
There is no short option for the chunking map.
Four chunking maps are currently implemented:
- Chunksize Equals Dimension Size [default]
- Definition: Chunksize defaults to dimension size.
Explicitly specify chunksizes for particular dimensions with
‘--cnk_dmn’ option.
cnk_map key values: ‘dmn’, ‘cnk_dmn’, ‘map_dmn’
Mnemonic: DiMeNsion
- Chunksize Equals Dimension Size except Record Dimension
- Definition: Chunksize equals dimension size, except that the record dimension has chunksize one.
Explicitly specify chunksizes for particular dimensions with
‘--cnk_dmn’ option.
cnk_map key values: ‘rd1’, ‘cnk_rd1’, ‘map_rd1’
Mnemonic: Record Dimension size 1
- Chunksize Equals Scalar Size Specified
- Definition: Chunksize for all dimensions is set with the
‘--cnk_scl’ option.
cnk_map key values: ‘xpl’, ‘cnk_xpl’, ‘map_xpl’
Mnemonic: EXPLicitly specified dimensions
- Chunksize Product Equals Scalar Size Specified
- Definition: The product of the chunksizes for each variable
(approximately) equals the size specified with the ‘--cnk_scl’
option.
For a variable of rank R (i.e., with R non-degenerate
dimensions), the chunksize in each non-degenerate dimension is the
Rth root of cnk_scl.
cnk_map key values: ‘prd’, ‘cnk_prd’, ‘map_prd’
Mnemonic: PRoDuct
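The four maps can be read as rules that turn a variable's shape into a set of chunksizes. The following sketch models those rules under stated assumptions and is not NCO's implementation: `integer_root` and `chunk_sizes` are hypothetical names, the Rth root is rounded down to an integer, degenerate (size-one) dimensions get chunksize one under the product map, and no clamping or validation against dimension sizes is attempted.

```python
def integer_root(value, degree):
    """Largest integer k with k**degree <= value (value >= 1)."""
    k = max(1, round(value ** (1.0 / degree)))
    while k ** degree > value:
        k -= 1
    while (k + 1) ** degree <= value:
        k += 1
    return k

def chunk_sizes(cnk_map, dims, record_dim=None, cnk_scl=None, overrides=None):
    """Model of the chunking maps for one variable.

    dims maps dimension name to size, in variable order; overrides holds
    per-dimension chunksizes as given with --cnk_dmn.
    """
    if cnk_map in ("xpl", "prd") and cnk_scl is None:
        raise ValueError("this map needs a --cnk_scl value")
    if cnk_map == "dmn":      # chunksize equals dimension size
        sizes = dict(dims)
    elif cnk_map == "rd1":    # as dmn, but record dimension gets size one
        sizes = {n: (1 if n == record_dim else s) for n, s in dims.items()}
    elif cnk_map == "xpl":    # scalar map; key value as listed above
        sizes = {n: cnk_scl for n in dims}
    elif cnk_map == "prd":    # product of chunksizes approximates cnk_scl
        rank = sum(1 for s in dims.values() if s > 1)
        root = integer_root(cnk_scl, rank) if rank else 1
        sizes = {n: (root if s > 1 else 1) for n, s in dims.items()}
    else:
        raise ValueError(f"unknown chunking map: {cnk_map!r}")
    # Per-dimension overrides replace whatever the map chose
    for name, size in (overrides or {}).items():
        if name in sizes:
            sizes[name] = size
    return sizes

dims = {"time": 10, "lat": 64, "lon": 128}  # hypothetical variable shape
print(chunk_sizes("dmn", dims))                     # {'time': 10, 'lat': 64, 'lon': 128}
print(chunk_sizes("rd1", dims, record_dim="time"))  # {'time': 1, 'lat': 64, 'lon': 128}
print(chunk_sizes("xpl", dims, cnk_scl=8))          # {'time': 8, 'lat': 8, 'lon': 8}
print(chunk_sizes("prd", dims, cnk_scl=64))         # {'time': 4, 'lat': 4, 'lon': 4}
```

In the last call the variable has rank 3, so each chunksize is the cube root of 64, and the product of the three chunksizes recovers 64.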
It is possible to combine the above chunking map algorithms with
user-specified per-dimension (but not per-variable) chunksizes that
override specific chunksizes determined by the maps above.
The user specifies the per-dimension chunksizes with the (equivalent)
long options ‘--cnk_dmn’ or ‘--chunk_dimension’.
The option takes two comma-separated arguments,
dmn_nm,cnk_sz, which are the dimension name and its
chunksize, respectively.
The ‘--cnk_dmn’ option may be used as many times as necessary.
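As a sketch of the option syntax, the snippet below parses repeated ‘--cnk_dmn dmn_nm,cnk_sz’ arguments and applies them on top of chunksizes chosen by a map. It uses Python's standard `argparse` module purely for illustration; it is not how NCO parses its command line. The function name `parse_chunk_dimensions` is hypothetical, and letting the last value win for a repeated dimension is an assumption.

```python
import argparse

def parse_chunk_dimensions(argv):
    """Collect repeated --cnk_dmn / --chunk_dimension options into a dict."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--cnk_dmn", "--chunk_dimension", dest="cnk_dmn",
                        action="append", default=[], metavar="dmn_nm,cnk_sz")
    args = parser.parse_args(argv)
    sizes = {}
    for item in args.cnk_dmn:
        dmn_nm, sep, cnk_sz = item.partition(",")
        if not sep or not dmn_nm:
            raise ValueError(f"expected dmn_nm,cnk_sz but got {item!r}")
        sizes[dmn_nm] = int(cnk_sz)  # assumption: last value wins on repeats
    return sizes

overrides = parse_chunk_dimensions(
    ["--cnk_dmn", "lat,64", "--chunk_dimension", "lon,128"])
print(overrides)  # {'lat': 64, 'lon': 128}

# Chunksizes a map might have chosen for a hypothetical variable
map_sizes = {"time": 1, "lat": 180, "lon": 360}
# Per-dimension overrides replace only the dimensions that were named
final_sizes = {**map_sizes, **overrides}
print(final_sizes)  # {'time': 1, 'lat': 64, 'lon': 128}
```

Because the overrides are keyed by dimension name, they apply to every variable that contains that dimension, which matches the per-dimension (not per-variable) behavior described above.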
# Debugging
ncks -O -4 -D 4 --cnk_scl=8 ~/nco/data/in.nc ~/foo.nc
ncks -O -4 -D 4 --cnk_scl=8 /data/zender/dstmch90/dstmch90_clm.nc ~/foo.nc
ncks -O -4 -D 4 --cnk_dmn lat,64 --cnk_dmn lon,128 /data/zender/dstmch90/dstmch90_clm.nc ~/foo.nc
ncks -O -4 -D 4 --cnk_plc=uck ~/foo.nc ~/foo.nc
ncks -O -4 -D 4 --cnk_plc=g2d --cnk_map=rd1 --cnk_dmn lat,64 --cnk_dmn lon,128 /data/zender/dstmch90/dstmch90_clm.nc ~/foo.nc
# Chunk data then unchunk it back to its original state:
ncks -O -4 -D 4 --cnk_plc=all ~/nco/data/in.nc ~/foo.nc
ncks -O -4 -D 4 --cnk_plc=uck ~/foo.nc ~/foo.nc
# Basic usage
ncks --cnk_plc=all in.nc out.nc # Chunk in.nc
ncks --cnk_plc=unchunk in.nc out.nc # Unchunk in.nc