Availability: ncap2, ncbo, ncea, ncecat,
ncflint, ncpdq, ncra, ncrcat, ncwa
Short options: ‘-t’
Long options: ‘--thr_nbr’, ‘--threads’, ‘--omp_num_threads’
If not specified by the user with ‘-t’, thr_nbr is obtained from the
OMP_NUM_THREADS environment variable, if present, or from the
OS, if not.
NCO may modify thr_nbr according to its own internal
settings before it requests any threads from the system.
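The order of precedence described above can be sketched in a few lines of Python. This is only an illustration, not NCO's code: the function name `initial_thr_nbr` is hypothetical, and the CPU count is an assumed stand-in for "from the OS", since the actual default is chosen by the OpenMP runtime.

```python
import os


def initial_thr_nbr(user_request=None, environ=None):
    """Return the initial thread request, before any operator-specific limits.

    Hypothetical helper: it mirrors the precedence described in the text
    and is not taken from the NCO sources.
    """
    environ = os.environ if environ is None else environ

    # 1. An explicit user request (for example via '-t') takes precedence.
    if user_request is not None:
        if user_request < 1:
            raise ValueError("thr_nbr must be a positive integer")
        return user_request

    # 2. Otherwise use OMP_NUM_THREADS, if it holds a single positive integer.
    #    (This sketch ignores other forms the variable may take.)
    raw = environ.get("OMP_NUM_THREADS", "").strip()
    if raw.isdigit() and int(raw) > 0:
        return int(raw)

    # 3. Otherwise fall back to the OS.
    #    ASSUMPTION: the CPU count approximates the system default.
    return os.cpu_count() or 1
```

Passing a plain dictionary as `environ` makes it easy to try the fallbacks without touching the real environment, e.g. `initial_thr_nbr(None, {"OMP_NUM_THREADS": "8"})`.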
Certain operators contain hard-coded limits on the number of threads they
request.
These limits reflect our experience and common sense, and are meant to reduce
potentially wasteful system usage by inexperienced users.
For example, ncrcat is extremely I/O-intensive, so we restrict
thr_nbr <= 2 for ncrcat.
This is based on the notion that the best performance that can be
expected from an operator which does no arithmetic is to have one thread
reading and one thread writing simultaneously.
In the future (perhaps with netCDF4), we hope to
demonstrate significant threading improvements with operators
like ncrcat
by performing multiple simultaneous writes.
Compute-intensive operators (ncap, ncwa and ncpdq)
benefit most from threading.
The greatest increases in throughput due to threading occur on
large datasets where each thread performs at least millions
of floating-point operations.
Otherwise, the system overhead of setting up threads probably outweighs
the speed enhancements due to SMP parallelism.
However, we have not yet demonstrated that the SMP parallelism
scales well beyond four threads for these operators.
Hence we restrict thr_nbr <= 4 for all operators.
We encourage users to play with these limits (edit file
nco_omp.c) and send us their feedback.
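The two limits quoted above (at most 2 threads for ncrcat, at most 4 for every operator) amount to a simple clamp. The sketch below shows that logic only; the names `GLOBAL_MAX`, `OPERATOR_MAX` and `limit_thr_nbr` are hypothetical and do not correspond to anything in nco_omp.c.

```python
# Limits as stated in the text; names and structure are illustrative only.
GLOBAL_MAX = 4                  # thr_nbr <= 4 for all operators
OPERATOR_MAX = {"ncrcat": 2}    # thr_nbr <= 2 for the I/O-bound ncrcat


def limit_thr_nbr(operator, thr_nbr):
    """Clamp an initial thread request to the operator-specific ceiling."""
    if thr_nbr < 1:
        raise ValueError("thr_nbr must be a positive integer")
    # The effective ceiling is the stricter of the global and operator limits.
    ceiling = min(GLOBAL_MAX, OPERATOR_MAX.get(operator, GLOBAL_MAX))
    return min(thr_nbr, ceiling)
```

With these values a request for 8 threads becomes 2 for ncrcat and 4 for ncwa, while a request for 3 threads to ncwa passes through unchanged.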
Once the initial thr_nbr has been adjusted for any operator-specific limits, NCO asks the system to allocate a team of thr_nbr threads for the body of the code. The operating system then decides how many threads to allocate based on this request. Users may track this information by running the operator with dbg_lvl > 0.
By default, threaded operators attach one global attribute,
nco_openmp_thread_number, to any file they create or modify.
This attribute contains the number of threads the operator used to
process the input files.
This information helps to verify that the answers with threaded and
non-threaded operators are equal to within machine precision.
This information is also useful for benchmarking.
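Threaded and non-threaded answers may differ in the last bits because a reduction split across threads adds its terms in a different order. The sketch below imitates this with contiguous chunks standing in for threads, then compares the results with a relative tolerance. It is an assumption-laden illustration, not how NCO partitions work; the function names, sample data and tolerance are all hypothetical.

```python
import math


def serial_sum(values):
    """Add the values left to right, as a single thread would."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, n_chunks):
    """Sum n_chunks contiguous pieces separately, then combine the partial sums.

    Each chunk plays the role of one thread's share of the work.
    """
    size = math.ceil(len(values) / n_chunks)
    partials = [serial_sum(values[i:i + size])
                for i in range(0, len(values), size)]
    return serial_sum(partials)


# Sample data chosen only for illustration.
values = [0.1 * i for i in range(1, 100_001)]
one_thread = serial_sum(values)
four_threads = chunked_sum(values, 4)

# Compare with a generous relative tolerance rather than with '=='.
agree = math.isclose(one_thread, four_threads, rel_tol=1e-9)
print("results agree within tolerance:", agree)
```

When comparing real output files, the nco_openmp_thread_number attribute records which thread count produced each file, so the two sides of such a comparison can be identified.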