Realistic finite element simulations tend to use enormous amounts of computing time and memory. Scientists and programmers have therefore long tried to use the combined power of several processors or computers to tackle these problems.
The usual approach is to use physically separated computers (e.g. clusters) or computing units (e.g. processor nodes in a parallel computer), each of which is equipped with its own memory, and to split the problem at hand into separate parts which are then solved on these computing units. Unfortunately, this approach tends to pose significant problems, both for the mathematical formulation and for the application programmer, which make the development of such programs overly difficult and expensive.
For these reasons, parallelized implementations and their mathematical background are still subject to intense research. In recent years, however, multi-processor machines have been developed that offer a reasonable alternative to small parallel computers, with the advantage of simple programming and the possibility of using the same mathematical formulation as on single-processor machines. These computers typically have between two and eight processors that can access the global memory at equal cost.
Due to this uniform memory access (UMA) architecture, communication can be performed through the global memory and is no more costly than access to any other memory location. Thus, there is no longer any need to change the mathematical formulation to reduce communication, and programs written for this architecture look very much like programs written for single-processor machines.
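As an illustration of this programming model, consider the following minimal sketch (not taken from deal.II; all names are made up for this example): two POSIX threads fill different halves of one array that lives in shared memory, so no explicit communication between the workers is necessary, and apart from starting and joining the threads the code reads just like its sequential counterpart.

    // Illustrative only: two threads write into disjoint parts of a shared
    // vector residing in global memory. Compile with -pthread.
    #include <pthread.h>
    #include <cstddef>
    #include <iostream>
    #include <vector>

    std::vector<double> values(1000);          // shared data, visible to all threads

    struct Range { std::size_t begin, end; };  // half-open index range [begin,end)

    void *fill_range(void *arg)
    {
      const Range *range = static_cast<Range *>(arg);
      for (std::size_t i = range->begin; i < range->end; ++i)
        values[i] = 2.0 * i;                   // write directly into shared memory
      return nullptr;
    }

    int main()
    {
      Range first_half  = {0, values.size() / 2};
      Range second_half = {values.size() / 2, values.size()};

      pthread_t t1, t2;
      pthread_create(&t1, nullptr, fill_range, &first_half);
      pthread_create(&t2, nullptr, fill_range, &second_half);
      pthread_join(t1, nullptr);
      pthread_join(t2, nullptr);

      std::cout << values.back() << std::endl; // prints 1998
    }

Because the two threads work on disjoint index ranges, no synchronization beyond the final join is needed in this sketch.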
The purpose of this report is to explain the techniques used in deal.II (see [1,2]) by which we try to program these computers. We will first give a brief introduction to what threads are and which problems we have to solve when we want to use multi-threading. The third section takes an in-depth look at the way in which the functionality of the operating system is represented in a C++ program so as to allow simple and robust programming; in particular, we describe the design decisions which led us to implement these parts of the library in the way they are implemented. In the fourth section, we show several examples of parallelization and explain how they work. Readers who are more interested in actually using the framework laid out in this report, rather than its internals, may skip Section 3 and go directly to the applications in Section 4.
Note (2003-01-11): In this report we frequently make reference to the ACE (Adaptive Communications Environment) library. This library was used in previous versions of deal.II to start and control threads, and to provide other multithreading features in a cross-platform way. We dropped support for ACE after version 3.4 of the deal.II library since we found that using POSIX functions instead is much simpler to support. One of the main problems with ACE was its complicated installation, whereas POSIX functions are provided by most modern systems' C libraries.
Since the features we used from ACE are limited to those that POSIX also provides for multithreading, the impact of this replacement on the underlying foundations used in this report is minor and can be ignored.
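To give an impression of the kind of POSIX multithreading primitives referred to above, here is a small, purely illustrative example (again not deal.II code; all names are hypothetical) in which several threads increment a shared counter under the protection of a pthread mutex.

    // Illustrative only: a POSIX mutex serializes access to a shared counter.
    // Compile with -pthread.
    #include <pthread.h>
    #include <iostream>

    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    int counter = 0;

    void *increment(void *)
    {
      for (int i = 0; i < 100000; ++i)
        {
          pthread_mutex_lock(&mutex);    // enter critical section
          ++counter;
          pthread_mutex_unlock(&mutex);  // leave critical section
        }
      return nullptr;
    }

    int main()
    {
      pthread_t threads[4];
      for (pthread_t &t : threads)
        pthread_create(&t, nullptr, increment, nullptr);
      for (pthread_t &t : threads)
        pthread_join(t, nullptr);

      std::cout << counter << std::endl; // prints 400000
    }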