
How should I choose between using OpenMP or MPI for parallel computing?

OpenMP and MPI are the main standards for parallel programming in numerically intensive computing. Each model of parallelism is closely tied to the hardware it runs on. OpenMP uses a shared memory model, which is generally only appropriate on computing hardware where the processors see a shared memory. MPI (Message Passing Interface) is appropriate for clusters, where the memory (and processors) are distributed and shared memory is not a core feature of the architecture. MPI can be implemented readily (and quite effectively) on shared memory, but emulating shared memory on distributed memory has a large overhead, so OpenMP is generally not effective on a distributed memory system.
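
The difference between the two programming models can be illustrated with a small sketch. This is only an analogy using Python's standard library, not OpenMP or MPI themselves: the function names are hypothetical, and Python threads here show the style of data access rather than real parallel speedup. In the first function every worker reads and writes the same arrays directly (shared memory style); in the second, workers only see data that has been explicitly sent to them and must send their results back (message passing style).

```python
import queue
import threading


def shared_memory_sum(data, n_workers):
    """Shared-memory style: all workers read the same list directly and
    write their partial sums into one shared result array."""
    partial = [0] * n_workers  # visible to every thread

    def work(rank):
        # Each worker handles a strided slice of the shared data.
        partial[rank] = sum(data[rank::n_workers])

    threads = [threading.Thread(target=work, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)


def message_passing_sum(data, n_workers):
    """Message-passing style: a worker only sees the chunk it receives
    as a message, and returns its partial sum as a message."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    results = queue.Queue()

    def work(rank):
        chunk = inboxes[rank].get()  # explicit receive
        results.put(sum(chunk))      # explicit send

    threads = [threading.Thread(target=work, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for rank in range(n_workers):
        # Explicitly send each worker its own copy of part of the data.
        inboxes[rank].put(list(data[rank::n_workers]))
    total = sum(results.get() for _ in range(n_workers))
    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    values = list(range(100))
    print(shared_memory_sum(values, 4), message_passing_sum(values, 4))
```

The second version needs more code for the same result, because every data movement is spelled out. That explicitness is what lets the message passing style work when the workers do not share any memory at all.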

The effectiveness of parallel programming is critically influenced by the relative latency and bandwidth of data transfers between processors (cache), local memory and remote memory, which depend on the hardware available. (More generally, performance is influenced by a hierarchy of access speeds to cache, memory, high performance interconnect, disk and network.) The availability of shared memory, memory-to-memory interconnect and high performance networking hardware is key. A particular parallel implementation of a model may perform very differently (or need very different tuning) when ported to different hardware or to a different OpenMP/MPI implementation.
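
The interplay of latency and bandwidth can be made concrete with the usual first-order cost model, in which the time to move a message is a fixed latency plus the message size divided by the bandwidth. The sketch below is a simplification: the function names are hypothetical and the latency and bandwidth figures are made-up placeholders, not measurements of any particular machine.

```python
def transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time in seconds to move n_bytes in one transfer:
    fixed latency plus size divided by bandwidth."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


def half_performance_size(latency_s, bandwidth_bytes_per_s):
    """Message size in bytes at which latency and bandwidth contribute
    equally to the transfer time. Smaller messages are latency bound."""
    return latency_s * bandwidth_bytes_per_s


if __name__ == "__main__":
    # Placeholder figures for two hypothetical data paths.
    paths = {
        "fast interconnect": (1e-6, 1e9),    # 1 microsecond, 1 GB/s
        "commodity network": (50e-6, 1e8),   # 50 microseconds, 100 MB/s
    }
    for name, (latency, bandwidth) in paths.items():
        small = transfer_time(8, latency, bandwidth)
        large = transfer_time(8_000_000, latency, bandwidth)
        print(name, small, large, half_performance_size(latency, bandwidth))
```

Under this model, a code that exchanges many small messages is dominated by latency, while one that exchanges a few large messages is dominated by bandwidth. This is one reason why the same code may need different tuning on different hardware.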

The overriding factor in choosing between OpenMP and MPI is the availability of hardware. Since MPI can run on a greater variety of hardware, it may be the best long-term choice, especially when starting from scratch on code that is intended to have an ongoing life.

If you have hardware with shared memory (including a multi-CPU cluster node), either OpenMP or MPI should work in principle, and a number of starting points will profoundly influence the choice.

If you are developing or porting code, the difficulty of working with each model of parallelism is an important consideration. Here are a few things to consider in making your choice.

  1. MPI can usually scale better to a much larger number of processors (partly because it encourages/requires top down design, but see point 2) and is necessary if you want to use multiple nodes in one job. If you expect to run multi-node at any point in the lifetime of a code, you will need to commit to MPI.
  2. There may be significant overhead in running MPI for small numbers of processors. That is, the scalability might be better, but there may be an extra jump in going from 1 serial process to 2 parallel processes, because the code needs extra overhead to run in parallel at all. This may be especially true if you want MPI code that can scale to many processes (20-1000+), where you may need to use a distinctly different storage model for your working data. If you only want a moderate number of processes (2-7, possibly 8, on an SX-6 node), OpenMP might be better (less penalty for entry to parallelism).
  3. OpenMP code developed on another platform may cause memory conflict problems, and performance may be awful when ported.
  4. MPI code may have been developed on another platform with a significantly different MPI implementation and hardware characteristics, and performance may be awful when ported.
  5. The vector performance (or other hardware related performance, say cache) might be significantly affected by the MPI/OpenMP choice.
  6. Hybrid OpenMP (intra-node) and MPI (inter-node) is possible but is likely to need extra work. It would only be worth pursuing if intra-node OpenMP turned out to be much better than intra-node MPI (say on 7-8 SX-6 processors).
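
The "extra jump" described in point 2 can be sketched with a toy run-time model: a serial fraction that does not speed up, a parallel fraction that divides across processes, and a fixed overhead that is paid as soon as the code runs in parallel at all. The model, the function name and all the numbers are illustrative assumptions. In particular, a constant overhead is a crude simplification and does not capture how either OpenMP or MPI behaves at large process counts.

```python
def modelled_time(t_serial, parallel_fraction, n_procs, overhead_s):
    """Toy estimate of elapsed time on n_procs processes.

    t_serial          -- elapsed time of the serial run, in seconds
    parallel_fraction -- fraction of the serial time that parallelizes (0 to 1)
    overhead_s        -- fixed cost paid only when running in parallel
    """
    if n_procs == 1:
        return t_serial  # the serial run pays no parallel overhead
    serial_part = t_serial * (1.0 - parallel_fraction)
    parallel_part = t_serial * parallel_fraction / n_procs
    return serial_part + parallel_part + overhead_s


if __name__ == "__main__":
    # Hypothetical code: 100 s serial, 95% parallelizable.
    for label, overhead in (("low entry cost", 2.0), ("high entry cost", 30.0)):
        times = [modelled_time(100.0, 0.95, p, overhead) for p in (1, 2, 4, 8)]
        print(label, times)
```

With a high entry cost, the step from 1 to 2 processes gains little in this model, and the benefit only appears at larger process counts. Which case applies to a real code has to be measured.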

If you have both OpenMP and MPI implementations available in an application, the following procedure is recommended. You can modify this procedure to apply to performance tuning while porting an application that uses either OpenMP or MPI.

  1. start in serial and profile for general (and vector) performance
  2. try each of MPI and OpenMP and see if either is trivial or difficult
  3. check the scalability on 2, 4, 6 (and possibly 14) processors
  4. compare the profiles with the serial case and look for major performance changes (including vectorization) or major parallelization overhead
  5. decide on the most promising option
  6. optimize any glaring bottlenecks (maybe with HPCCC or vendor support)
  7. re-check scalability
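
The scalability checks in steps 3 and 7 amount to comparing elapsed times against the serial run. A minimal sketch of that bookkeeping follows, assuming you have recorded one elapsed time per processor count. The function name and the timings are hypothetical, not results from any real code.

```python
def scaling_table(timings):
    """Return (n_procs, speedup, efficiency) rows from measured timings.

    timings -- dict mapping processor count to elapsed seconds;
               must include the serial run under key 1.
    """
    if 1 not in timings:
        raise ValueError("timings must include the serial run (key 1)")
    t_serial = timings[1]
    rows = []
    for n_procs in sorted(timings):
        speedup = t_serial / timings[n_procs]
        efficiency = speedup / n_procs  # 1.0 means perfect scaling
        rows.append((n_procs, speedup, efficiency))
    return rows


if __name__ == "__main__":
    # Made-up elapsed times in seconds for 1, 2, 4 and 6 processors.
    measured = {1: 100.0, 2: 55.0, 4: 30.0, 6: 22.0}
    for n_procs, speedup, efficiency in scaling_table(measured):
        print(f"{n_procs:2d} procs  speedup {speedup:5.2f}  efficiency {efficiency:4.2f}")
```

Producing the same table for the OpenMP and MPI versions, before and after tuning, gives a simple basis for the decision in step 5 and the re-check in step 7.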

Note that tightly coupled parallel jobs are sensitive to effective processor scheduling: if the parallel job does not get the right number of CPUs simultaneously, it may 'waste' an excessive amount of time. Avoid running or spawning background tasks in a parallel job, as the extra tasks are likely to interfere with the main parallel job. If you do need to spawn tasks, you can separate them into new jobs (spawn a qsub to submit the extra job). Submitting separate jobs gives slower startup and looser coupling between the jobs, but better and more reliable elapsed times.
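
Handing extra work to the batch system instead of spawning a background task might look like the sketch below. It assumes that a `qsub` command is on the path and that it accepts a job script as its only argument. The helper names and the script name `postprocess.sh` are hypothetical, and any queue or resource options your site requires are deliberately left out.

```python
import subprocess


def build_qsub_command(script_path):
    """Build the command line that submits script_path as a separate job."""
    return ["qsub", script_path]


def submit_job(script_path, dry_run=True):
    """Submit script_path with qsub instead of running it in the background.

    With dry_run=True (the default) nothing is executed and the command
    list is returned, so the sketch can be tried without a batch system.
    Otherwise returns whatever qsub printed on standard output.
    """
    command = build_qsub_command(script_path)
    if dry_run:
        return command
    completed = subprocess.run(command, check=True, capture_output=True, text=True)
    return completed.stdout.strip()


if __name__ == "__main__":
    # Hypothetical follow-up job; printed rather than submitted.
    print(submit_job("postprocess.sh"))
```

The extra task then waits in the queue for its own CPUs instead of competing with the main parallel job for the ones already allocated.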

Last updated: 31 Jan, 2012
Email problems, suggestions, questions to hpchelp@csiro.au
Thanks to NCI-NF for the userguide structure.