How should I choose between using OpenMP or MPI for parallel
computing?
OpenMP and MPI are the main standards used for parallel programming for
numerically intensive computing. Each parallel programming model is tied fairly
closely to the hardware it runs on. OpenMP uses a shared
memory model which is generally only appropriate on computing hardware where the
processors see a shared memory. MPI (Message Passing Interface) is appropriate
for clusters, where the memory (and processors) are distributed and shared
memory is not a core feature of the architecture. MPI can be readily (and quite
effectively) implemented in shared memory, but emulating shared memory in
distributed memory has a large overhead and generally OpenMP cannot be effective
on a distributed memory system.
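As a rough illustration of the two programming models (not of OpenMP or MPI themselves), the following Python sketch sums the same array in both styles. Python threads stand in for processors in both functions, which is an assumption made only to keep the example self-contained; the function names are hypothetical. In the shared-memory style every worker reads the common array directly, while in the message-passing style each worker sees only the chunk that is sent to it and sends a partial result back.

```python
import queue
import threading


def shared_memory_sum(data, nworkers):
    """Shared-memory style: every worker reads the same array in place."""
    partial = [0.0] * nworkers  # shared result array, one slot per worker

    def work(rank):
        # Each worker selects its part of the *shared* data by index.
        partial[rank] = sum(data[rank::nworkers])

    threads = [threading.Thread(target=work, args=(r,)) for r in range(nworkers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)


def message_passing_sum(data, nworkers):
    """Message-passing style: workers only see what is sent to them."""
    inboxes = [queue.Queue() for _ in range(nworkers)]
    results = queue.Queue()

    def work(rank):
        chunk = inboxes[rank].get()  # "receive" a private copy of the chunk
        results.put(sum(chunk))      # "send" the partial sum back

    threads = [threading.Thread(target=work, args=(r,)) for r in range(nworkers)]
    for t in threads:
        t.start()
    for rank in range(nworkers):
        # Explicit data distribution: each worker gets its own copy.
        inboxes[rank].put(list(data[rank::nworkers]))
    for t in threads:
        t.join()
    return sum(results.get() for _ in range(nworkers))
```

The second version has to distribute the data and collect the results explicitly, which is the kind of extra design work that message passing requires.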
The effectiveness of parallel programming is critically influenced by the
relative latency and bandwidth of data transfers between processors (cache),
local memory and remote memory, which depend on the hardware available. (More
generally, performance is influenced by a hierarchy of access speeds to cache,
memory, high performance interconnect, disk and network.) The availability of
shared memory, memory-memory interconnect and performance networking hardware is
key. A particular parallel implementation of a model may perform very
differently (or need very different tuning) when ported to different hardware
or to a different OpenMP/MPI implementation.
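The role of latency and bandwidth can be sketched with a common first-order cost model, in which moving one message costs a fixed latency plus its size divided by the bandwidth. The latency and bandwidth figures below are hypothetical and chosen only for illustration; real values depend entirely on the hardware.

```python
def transfer_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time to move one message: fixed latency plus size / bandwidth."""
    return latency_s + nbytes / bandwidth_bytes_per_s


def total_time(total_bytes, nmessages, latency_s, bandwidth_bytes_per_s):
    """Time to move total_bytes split evenly into nmessages messages."""
    per_message = total_bytes / nmessages
    return nmessages * transfer_time(per_message, latency_s, bandwidth_bytes_per_s)


# Hypothetical interconnect: 10 microseconds latency, 1 GB/s bandwidth.
LATENCY = 10e-6
BANDWIDTH = 1e9

one_big = total_time(8_000_000, 1, LATENCY, BANDWIDTH)
many_small = total_time(8_000_000, 10_000, LATENCY, BANDWIDTH)
print(f"1 message:      {one_big * 1e3:.2f} ms")
print(f"10000 messages: {many_small * 1e3:.2f} ms")
```

With these assumed numbers the single 8 MB message takes about 8 ms, while the same data sent as 10000 small messages takes about 108 ms, because latency dominates. The same code can therefore behave very differently on an interconnect with a different latency/bandwidth balance.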
The overriding factor in choosing between OpenMP and MPI is the availability of
hardware. Since MPI can run on a greater variety of hardware, it may be the
best long-term choice, especially when starting from scratch on code which is
intended to have an ongoing life.
If you have hardware with shared memory, then you can choose between OpenMP
and MPI. There are a number of starting points that will profoundly influence
the choice.
- You are starting from scratch and anticipate a need to incorporate
parallelism into the application design so that it will be available when you
need to scale up. This is in some ways the ideal situation, but it is often not
achieved. The considerations in this document will all be broadly relevant.
- You have an existing serial code and need to improve the turnaround time for
your calculations. This is probably the most common (and difficult) situation.
As will be discussed, incremental parallelism is much more achievable with OpenMP.
For MPI parallelism you should probably rewrite essentially from scratch.
- You have an existing parallel code which uses OpenMP or vendor-specific
directives for shared memory parallelism. Converting from vendor-specific
directives to OpenMP, or porting the OpenMP code to work on a new platform,
should be relatively straightforward. Conversion to MPI is likely to be
little easier than for a serial code unless the parallelism is at a very high
level.
- You have an existing parallel code which uses MPI. Work with it. There may
be considerable scope for improving efficiency.
- You are porting an existing parallel code which has both OpenMP and MPI
capability. In this situation you can use a combination of educated choice and
trial and error. A procedure is outlined later in this document.
If you are developing or porting code, considerations about the difficulty of
working with each model of parallelism are important.
- Developing an MPI application is (usually) significantly more difficult and
more work than developing an OpenMP code. In particular, it is much easier to
parallelise a code incrementally with OpenMP (profiling provides statistics
about the time subroutines take, including calls to their children, which is a
great way to identify the high level subroutines to tackle). Using MPI
incrementally is usually next to impossible (or involves a lot of temporary MPI
calls to exchange information at the beginning and the end of the subroutines
being parallelised, which adds to the development time).
- OpenMP can be difficult to debug if the private/shared (local/global)
declaration of variables is wrong. So look carefully at each variable (perhaps
using a variable listing from the compiler to make sure that no variable is
missed) and declare it explicitly as private or shared. And while MPI
development is usually more complex, MPI debugging is usually easier, since the
MPI library will help to detect some errors (like wrong send/receive sizes or
unmatched global operations). 'Help' here means: abort with an error message
at runtime :)
- When using OpenMP make sure to use a top down approach, i.e. parallelise
loops as high as possible in the call tree. Don't parallelise low level
subroutines just because they show up high in the profile; use profiling with a
call tree (psperf/psuite, gprof) instead. The level at which parallelism is
implemented is one of the limiting factors in the scalability of parallel code,
and you may need a significant redesign to get good OpenMP scaling to a larger
number of processors. But if you are forced to redesign, there are other
benefits from using MPI. The need for a top down approach also applies to MPI,
but there you are more-or-less forced to use a top down approach anyway.
- Get support early in the development process, not when it is too late and
support can only tell you: redesign your program.
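The difference between a flat profile and a call-tree profile, which underlies the top down advice in the list above, can be sketched as follows. The call tree, routine names and self times are all hypothetical, and the sketch assumes each routine has a single caller so that times are not double counted; real profilers handle the general case.

```python
def inclusive_time(tree, self_time, routine):
    """Self time of a routine plus the inclusive time of everything it calls."""
    return self_time[routine] + sum(
        inclusive_time(tree, self_time, child) for child in tree.get(routine, [])
    )


# Hypothetical call tree (caller -> callees) and self times in seconds.
CALL_TREE = {
    "main": ["timestep", "write_output"],
    "timestep": ["advect", "diffuse"],
    "advect": ["flux_kernel"],
}
SELF_TIME = {
    "main": 1.0,
    "timestep": 2.0,
    "write_output": 5.0,
    "advect": 3.0,
    "diffuse": 20.0,
    "flux_kernel": 40.0,
}

# Rank routines by inclusive time, as a call-tree profile would.
ranking = sorted(
    SELF_TIME, key=lambda r: inclusive_time(CALL_TREE, SELF_TIME, r), reverse=True
)
for routine in ranking:
    total = inclusive_time(CALL_TREE, SELF_TIME, routine)
    print(f"{routine:13s} self {SELF_TIME[routine]:5.1f}  inclusive {total:5.1f}")
```

In this made-up profile flux_kernel has the largest self time, but timestep accounts for 65 of the 71 seconds once its children are included, so a loop at the timestep level is the more promising place to start.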
If you intend to run within a machine with shared memory (including within a
multi-CPU cluster node), either OpenMP or MPI should be OK in principle. Here
are a few things to consider in making your choice.
- MPI can usually scale better to a much larger number of processors (partly
because it encourages/requires top down design, but see the next point) and is
necessary if you want to use multiple nodes in one job. If you think that you
will want to run multi-node at some point in the lifetime of a code, then you
will need to commit to MPI.
- There may be significant overhead in running MPI for small numbers of
processors, i.e. the scalability might be better but there may be an extra jump
going from 1 serial process to 2 parallel processes because the code needs extra
overhead to run in parallel at all. This may be especially true if you want to
have MPI code that can scale to many processes (20-1000+), since you may need to
use a distinctly different storage model for your working data. If you only want
a moderate number of processes (2-7(-8) on an SX-6 node), OpenMP might be better
(less penalty for entry to parallelism).
- An OpenMP code developed on another platform may suffer from memory conflict
problems, and performance may be awful when ported.
- An MPI code may have been developed on another platform with a significantly
different MPI implementation and different hardware characteristics, and
performance may be awful when ported.
- The vector performance (or other hardware related performance, say cache)
might be significantly affected by the MPI/OpenMP choice.
- Hybrid OpenMP (intra-node) and MPI (inter-node) is possible but is likely
to need extra work. It would only be worth pursuing if it turned out that
intra-node OpenMP was much better than intra-node MPI (say on 7-8 SX-6 processors).
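The trade-off between entry overhead and scalability described in the list above can be made concrete with a toy run-time model. Everything here is an assumption for illustration: the model form, the parameter names, and the two hypothetical codes ("low_entry" with a small parallel penalty but a larger serial fraction, "high_entry" with a large penalty but a smaller serial fraction). They are not measurements of any OpenMP or MPI implementation.

```python
def run_time(serial_time, nproc, parallel_fraction, penalty):
    """Toy model: the parallel part costs `penalty` times more work once nproc > 1."""
    if nproc == 1:
        return serial_time  # the original serial code, no parallel overhead
    serial_part = (1.0 - parallel_fraction) * serial_time
    parallel_part = penalty * parallel_fraction * serial_time / nproc
    return serial_part + parallel_part


# Hypothetical parameters, chosen only to show a crossover.
LOW_ENTRY = {"parallel_fraction": 0.95, "penalty": 1.05}
HIGH_ENTRY = {"parallel_fraction": 0.995, "penalty": 1.5}
SERIAL_TIME = 100.0


def crossover(counts):
    """First processor count at which the high-entry code is faster, else None."""
    for n in counts:
        if run_time(SERIAL_TIME, n, **HIGH_ENTRY) < run_time(SERIAL_TIME, n, **LOW_ENTRY):
            return n
    return None


for n in (1, 2, 4, 8, 16, 32):
    low = run_time(SERIAL_TIME, n, **LOW_ENTRY)
    high = run_time(SERIAL_TIME, n, **HIGH_ENTRY)
    print(f"{n:2d} procs  low-entry {low:7.2f} s  high-entry {high:7.2f} s")
```

With these assumed parameters the low-entry code is faster up to 8 processors and the high-entry code wins from 16 onwards. Where the crossover falls for a real application can only be found by measurement.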
If you have both OpenMP and MPI implementations available in an application,
the following procedure is recommended. You can modify this procedure to apply
to performance tuning while porting an application that uses either OpenMP or
MPI.
- start in serial and profile for general (and vector) performance
- try each of MPI and OpenMP and see if either is trivial or difficult
- check the scalability on 2,4,6(,14) processors
- compare the profiles with the serial case and look for major performance
changes (including vectorization) or major parallelization overhead
- decide on the most promising option
- optimize any glaring bottlenecks (maybe with HPCCC or vendor support)
- re-check scalability
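The scalability check in this procedure amounts to computing speedup and parallel efficiency from wall-clock times. A minimal sketch, assuming the timings have already been collected by hand; the function name and the times below are hypothetical, not real measurements:

```python
def scaling_table(wall_times):
    """Return (nproc, speedup, efficiency) rows from {nproc: wall time in s}.

    Speedup is relative to the 1-processor (serial) time; efficiency is
    speedup divided by the number of processors.
    """
    serial = wall_times[1]
    rows = []
    for nproc in sorted(wall_times):
        speedup = serial / wall_times[nproc]
        rows.append((nproc, speedup, speedup / nproc))
    return rows


# Hypothetical wall-clock times in seconds for 1, 2, 4 and 6 processors.
TIMES = {1: 120.0, 2: 66.0, 4: 37.5, 6: 30.0}

for nproc, speedup, efficiency in scaling_table(TIMES):
    print(f"{nproc:2d} procs  speedup {speedup:5.2f}  efficiency {efficiency:5.2f}")
```

Running the same calculation for the OpenMP and MPI versions, and again after tuning, gives a simple basis for comparing the options.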
Note that parallel jobs that are tightly coupled are sensitive to effective
processor scheduling, that is, if the parallel job does not get the right number
of CPUs simultaneously it may 'waste' an excessive amount of time.
Running/spawning background tasks in a parallel job is to be avoided as the
extra tasks are likely to interfere with the main parallel job. If you do need
to spawn tasks you can separate them into new jobs (spawn a qsub to submit the
extra job). Submitting separate jobs gives slower startup and less coupling
between the jobs, but better and more reliable elapsed times.