

3.3 OpenMP Threading

Availability: ncap2, ncbo, ncea, ncecat, ncflint, ncpdq, ncra, ncrcat, ncwa
Short options: ‘-t’
Long options: ‘--thr_nbr’, ‘--threads’, ‘--omp_num_threads’
NCO supports shared memory parallelism (SMP) when compiled with an OpenMP-enabled compiler. Thread requests and allocations occur in two stages. First, users may request a specific number of threads, thr_nbr, with the ‘-t’ switch (or its long option equivalents, ‘--thr_nbr’, ‘--threads’, and ‘--omp_num_threads’). If the user does not specify thr_nbr, OpenMP obtains it from the OMP_NUM_THREADS environment variable, if present, or from the OS, if not.
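The lookup order for the initial thr_nbr can be sketched as follows. This is a minimal illustration in Python, not NCO source code: the function name resolve_requested_threads and its parameters are hypothetical, and using the CPU count as the "from the OS" fallback is an assumption about what a typical OpenMP runtime does.

```python
import os


def resolve_requested_threads(cli_thr_nbr=None, environ=None):
    """Return the initial thread request, mimicking the order described above.

    cli_thr_nbr: value given with '-t' (None if the switch was not used).
    environ:     mapping to read OMP_NUM_THREADS from (defaults to os.environ).
    Hypothetical helper for illustration only.
    """
    env = os.environ if environ is None else environ

    # 1. An explicit '-t' (or long-option) request takes precedence.
    if cli_thr_nbr is not None:
        return int(cli_thr_nbr)

    # 2. Otherwise fall back to OMP_NUM_THREADS, if it is set.
    env_value = env.get("OMP_NUM_THREADS")
    if env_value:
        # The variable may hold a comma-separated list for nested
        # parallelism; only the first entry matters in this sketch.
        return int(env_value.split(",")[0])

    # 3. Otherwise let the system decide (assumed here: the CPU count).
    return os.cpu_count() or 1
```

For example, resolve_requested_threads(3, {"OMP_NUM_THREADS": "8"}) yields 3, because the command-line request overrides the environment variable.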

NCO may modify thr_nbr according to its own internal settings before it requests any threads from the system. Certain operators contain hard-coded limits on the number of threads they request. These limits reflect our experience and common sense, and they reduce potentially wasteful system usage by inexperienced users. For example, ncrcat is extremely I/O-intensive, so we restrict thr_nbr <= 2 for ncrcat. The reasoning is that the best performance an operator which does no arithmetic can expect is to have one thread reading while another thread writes. In the future (perhaps with netCDF4), we hope to demonstrate significant threading improvements with operators like ncrcat by performing multiple simultaneous writes.

Compute-intensive operators (ncap, ncwa and ncpdq) benefit most from threading. The greatest increases in throughput occur on large datasets where each thread performs at least millions of floating point operations. On smaller workloads, the system overhead of setting up threads probably outweighs the speed gained from SMP parallelism. However, we have not yet demonstrated that SMP parallelism scales well beyond four threads for these operators. Hence we restrict thr_nbr <= 4 for all operators. We encourage users to experiment with these limits (edit file nco_omp.c) and send us their feedback.
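The operator-specific limits described above amount to clamping the requested thr_nbr. The Python sketch below illustrates the idea only; the real logic lives in nco_omp.c and may differ. The names OPERATOR_LIMITS, DEFAULT_LIMIT and clamp_threads are hypothetical, and the only values taken from the text are the limit of 2 for ncrcat and the limit of 4 for all operators.

```python
# Limits quoted in the text: ncrcat is capped at 2, every operator at 4.
OPERATOR_LIMITS = {"ncrcat": 2}  # hypothetical table, for illustration
DEFAULT_LIMIT = 4


def clamp_threads(operator, thr_nbr):
    """Reduce a thread request to the operator's limit (never below 1)."""
    limit = OPERATOR_LIMITS.get(operator, DEFAULT_LIMIT)
    return max(1, min(thr_nbr, limit))


if __name__ == "__main__":
    for operator, requested in [("ncrcat", 8), ("ncwa", 8), ("ncwa", 2)]:
        print(operator, requested, "->", clamp_threads(operator, requested))
```

Under these assumptions a request for 8 threads becomes 2 for ncrcat and 4 for ncwa, while a request for 2 threads passes through ncwa unchanged.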

Once the initial thr_nbr has been adjusted for any operator-specific limits, NCO asks the system to allocate a team of thr_nbr threads for the body of the code. The operating system then decides how many threads to allocate based on this request. Users may track these thread requests and allocations by running the operator with dbg_lvl > 0.

By default, threaded operators attach one global attribute, nco_openmp_thread_number, to any file they create or modify. This attribute contains the number of threads the operator used to process the input files. This information helps to verify that the answers from threaded and non-threaded operators agree to within machine precision. It is also useful for benchmarking.