Shared Clusters Local Userguide

Before Starting

About this Guide

This document contains a list of known problems and a change log which can be checked for a summary of recent updates.

If you have loaded this guide using the index page you will have frames with a table of contents on the left and the userguide on the right. The guide(s) will also work fine without frames but will not be as easy to navigate.

The guides are intended to be introductory in nature though they will provide references to core documentation and detailed site specific information and FAQs.

More formal vendor supplied documentation is referenced here.

Quick Start

Registering to use the Altix and ASC Shared clusters

All users need to be registered to use the ASC shared systems. To use systems residing in ASC Docklands (and ASC software), accounts can be requested by completing the online application form. Only CSIRO users will be registered on the CSIRO systems by default. CAWCR staff from the BoM will be registered on CSIRO systems if they need and request access. To access other CSIRO ASC shared systems (not at Docklands, such as the GPU cluster), CSIRO staff can just ask for access at hpchelp@csiro.au.

The front-end hostnames of the computational hosts are:

cherax.hpsc.csiro.au
CSIRO Datastore Host, Large Shared Memory Multiprocessor
burnet.hpsc.csiro.au
CSIRO Capacity Compute Cluster
linuxgpu.csiro.au
CSIRO GPU cluster

CSIRO or BoM collaborators will need to get a CSIRO ident as a collaborator via their research contacts, who would then request access.

The BoM Sun Constellation system solar is described separately.

Connecting to the Altix and ASC Shared clusters

In general you must use secure shell (ssh) to connect to the Altix and ASC Shared clusters. You may also need VPN software to get into the csiro.au or bom.gov.au network first if you need to connect from outside. See our ssh FAQ item for more detail. An ssh connection can be used in conjunction with X server software for applications that require a graphical interface, or in many cases VNC can be used with an X server running on a cluster head node. VNC can provide a persistent desktop-like session which you can reconnect to after disconnecting and often performs better than other options as the X server is close to the application.

All of the ASC systems use the CSIRO NEXUS (staff) identifiers as usernames, and NEXUS authentication. Your password will be the same as on standard CSIRO systems using the NEXUS user identifier. If you change your NEXUS password on a Windows system, the change will propagate to the ASC systems, with some time delay. Please don't try changing your password on the ASC systems.

For CSIRO users the system uses group names which are typically an acronym for a CSIRO business unit.

We can create other groupnames for projects crossing boundaries, and for projects within business units.

Interactive and Batch

The operating system on all of the systems is Unix. For those who are not familiar with Unix or its concept of shells, processes, etc., you may find the University of Edinburgh Unixhelp System useful.

The systems provide users with a configurable login shell which is used to interpret commands to the system. The shell (like other programs) runs in an "environment" which is partly inherited from its parent process and (for shells) reconfigured during the shell startup via a series of files which contain shell commands. Different specifically named files and command syntax are used with different types of shell.

Your account will be set up with an initial environment via a default .cshrc file, and an equivalent .profile or .bash_profile

Most people use a combination of interactive and batch access to the systems to get their work done.

Most computationally intensive work is done as batch jobs. The work is broken up into separate tasks, and each task corresponds to a "shell script", which is a sequence of commands written in a text file. These "batch scripts" (along with a resource requirement) are submitted to the batch system, which can then manage scheduling of resources to run a mixture of jobs efficiently. Typically a batch system avoids contention between jobs by not over-allocating resources. This results in good overall system utilization, at the expense of individual jobs waiting in a queue when the system is busy.

Interactive access to the systems is also allowed, for managing and monitoring of batch jobs, file management, development and debugging. Where possible we encourage these activities to be conducted on other platforms, but recognize there are many legitimate reasons for interactive access. Limits are placed on interactive sessions to avoid uncontrolled contention disrupting everybody. Interactive access to batch sessions with dedicated resources is also possible, including use of graphical interfaces.

To customize both interactive and batch shells, edit your .cshrc or .profile files to add whatever features you prefer, but please retain the active lines from the template files in /etc/skel in these respective files. These are there to allow your environment to be kept up to date with system changes.

On each system, any process you run has limits imposed on it via the shell. These include a time limit and a memory use limit. To see what these limits are first check which shell you are using. Enter the command:

ulimit -a
for sh/ksh/bash users.
limit
for csh/tcsh users.

These limits apply to both interactive and batch processes. Batch jobs may have additional limits which are monitored by the batch system. The limits are not published here as they are liable to change, and it is also possible to vary these limits on an 'as needs' basis by project or user.
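
As a quick check, the limits most relevant to jobs can also be printed one at a time. The following is a minimal sketch for sh/ksh/bash users; the function name show_limits is only illustrative, and the values are reported in whatever units the shell's built-in ulimit uses.

```shell
#!/bin/sh
# show_limits: print the shell limits that most often affect jobs.
# Each value is either a number or the word "unlimited".
show_limits() {
    echo "CPU time (-t):       $(ulimit -t)"
    echo "virtual memory (-v): $(ulimit -v)"
    echo "core file size (-c): $(ulimit -c)"
}

show_limits
```

Running this in both an interactive session and a batch job is one way to confirm whether the two environments really have the same limits.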

Getting Help with a problem or query

Before contacting HPCCC/CSIRO ASC please confirm that your problem and workaround is not listed in the known problems.

CSIRO ASC Systems - cherax, burnet, gpu cluster

Users experiencing any problems with CSIRO ASC systems should contact the HELPLINE - 03 8601 3800 (external) / 93 3800 (CSIRO internal) - or

Problems reported via a web browser or email will be entered in our RT (Request Tracker) system, and then can be made visible to the staff most able to solve the problem. You can use the web interface to check progress of a problem and/or follow up the request by replying to email sent to you about the request (in plain text - no html email please). Please only include immediately relevant history in your reply as otherwise the information is duplicated in the system and the problem becomes difficult to follow. We have also written some further guidelines in the faq on using the RT system.

If there is an urgent query out of hours, please contact 0428 108 333 for assistance.

Getting Started

About the Shared Clusters

The Shared clusters provide CSIRO with commodity processor based systems to suit a range of compute-intensive needs, including running licensed software, capacity/ensemble computing and access to GPGPUs.

This guide applies primarily to the general access part of the "burnet" cluster but also applies in general terms to shared ASC clusters which have a similar set up. There are sections for documented variations for cmis-cm and the GPU cluster.

Cherax has a direct connection into burnet's private network and there is ip-forwarding enabled on the head node (to enable accessing externally hosted software licenses), but the nodes are not generally visible from outside the cluster.

The clusters have compilers for C, C++ and Fortran 95 and batch systems to manage workload.

The clusters have many standard Unix application and utility software packages installed. We will generally install any software on receiving user support requests.

Connecting to the Shared Clusters

For getting access to the head nodes of the clusters, see the general information on connecting.

The compute nodes are known as n001, n002 ... and gpu001, gpu002 ... Access to the compute nodes is only available by ssh from the login node and only to nodes which are currently allocated to a user for a running job.

Environment

Users have a home directory on the head node which is shared with the compute nodes. There are skeleton 'dot' files in /etc/skel which have template content for ensuring that both interactive and batch environments are customized. The .bashrc and .cshrc files also enable non-login/non-interactive environment customization for bash and (t)csh. If you have environment problems it is a good idea to set aside your 'dot' files and copy new ones from /etc/skel ("cp /etc/skel/.??* $HOME") to see if your customizations are breaking something.
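
The "set aside and copy" step described above could be scripted roughly as follows. This is a sketch only: the function name reset_dotfiles and the backup directory name are assumptions made for illustration, and only files that have a template in the skeleton directory are moved aside.

```shell
#!/bin/sh
# reset_dotfiles [SKEL_DIR] [HOME_DIR]
# Move existing dot files that have a template in SKEL_DIR into a backup
# directory, then copy the fresh templates into HOME_DIR.
# Defaults: SKEL_DIR=/etc/skel, HOME_DIR=$HOME.
reset_dotfiles() {
    skel=${1:-/etc/skel}
    home=${2:-$HOME}
    backup="$home/dotfiles.saved.$$"      # hypothetical backup location
    mkdir -p "$backup" || return 1
    for f in "$skel"/.??*; do
        [ -e "$f" ] || continue           # nothing matched the pattern
        name=${f##*/}
        if [ -e "$home/$name" ]; then
            mv "$home/$name" "$backup/" || return 1
        fi
        cp -r "$f" "$home/" || return 1
    done
    echo "previous dot files saved in $backup"
}

# Usage (not run here): reset_dotfiles
```

If a fresh login then works, your customizations can be restored from the backup directory one piece at a time to find the one that causes the problem.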

There are different amounts of shell initialization, depending on the shell and method of invocation. Full global environment setup and logout processing is only performed for login shells. This includes PBS batch jobs (with no -S option). Partial processing is done for all bash and (t)csh shells.

The scripts set up the environment variables referring to filesystems and attempt to set reasonable terminal settings and some defaults for GROUP, PBS_QUEUE and PATH.

HOST is set to `hostname -s`, HOSTNAME is set to `hostname` and HOSTTYPE is set to `uname -m`.

Software Setup

To customize your environment to use particular software packages the Environment modules utility is available.

Run module avail to see what software packages are available and module load package to set up your environment for a particular software package.

File Transfer

For transfer of files to or from the ASC Docklands machines from outside the Docklands, we recommend the use of the commands scp and sftp. These provide encryption of passwords and data.

You may need to use a combination of filesystems within the cluster to manage files used for different purposes (eg. source code and large transient data files). Within the cluster you can cp and mv files between filesystems.

burnet only: The datastore, cherax, has a connection into the private cluster network of burnet and can be accessed with the name cherax-cluster. The datastore is directly mounted by each compute node.

Related systems

Cherax hosts the CSIRO Data Store at the ASC Docklands site - see the companion guides to the Altix system and the Data Store. In general, data should be transferred in and out of cherax and not kept on the cluster.

Accounting

The command tracejob extracts usage entries for batch jobs from the logs. It usually needs the -n option to specify how many days worth of logs to consider.

Monitoring Resource Usage

Interactive and batch processes may have a time limit and a memory limit imposed.

You can get information about limits and resource usage using the following commands:

Batch Use

All of the ASC systems (except the BoM solar system) use the Torque batch system derived from Portable Batch System (OpenPBS). Other batch systems (SGE, NQSII, ...) provide similar functionality but with different specific commands and options.

Most jobs require greater resources than are available to an interactive session. A batch job is really a type of shell script containing a set of commands which are executed for you without any "terminal" interaction. Such job scripts must be submitted to the batch job system with the qsub command. The batch system manages efficient scheduling of running the submitted jobs on the available resources. The batch system also allows an interactive mode.

A shell script can be as simple as a sequence of commands written in a file or it can include more sophisticated use of flow control, variable substitution and error recovery. Here are some hints on error recovery in batch scripts.
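
One common pattern for error recovery, shown here as a sketch rather than as the site's own recommendation, is to wrap each step of the job in a small function that checks the exit status and reports which command failed. The name run_step is an assumption made for this example.

```shell
#!/bin/sh
# run_step COMMAND [ARGS...]
# Run a command; if it fails, report which command failed on standard
# error and return its exit status so the caller can stop or recover.
run_step() {
    "$@"
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "ERROR: '$*' failed with exit status $rc" >&2
        return "$rc"
    fi
}

# Typical use inside a job script (not run here):
#   run_step cd "$PBS_O_WORKDIR" || exit 1
#   run_step ./a.out             || exit 1
```

Stopping at the first failed step avoids later commands running on missing or partial output, and the message in the job's error file shows where the job stopped.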

You submit jobs to the Torque batch system using the command qsub specifying the number of CPUs, the amount of memory, and the length of time needed (and, possibly, other resources). The batch system runs the job when the resources are available, subject to constraints on maximum resource usage.

Interactive Batch Jobs

The qsub -I option will result in an interactive shell being started out on the allocated cpu once your job starts. A submission script is not used in this mode - you must provide all qsub options on the command line.

Your job is subject to all the same constraints and management as any other job in the same queue. In particular, it may be accounted for on the basis of walltime, since you may have dedicated access to the cpus reserved for your request. Don't forget to exit your interactive batch session to avoid leaving cpus idle on the machine, and unavailable to others.

Interactive batch jobs are likely to be used for testing or debugging large or parallel programs, but may also be used to run software that needs interaction to operate effectively or for work that is best done in an interactive mode. Since you want interactive response, it may be necessary to use a high priority queue (shorter jobs) to run promptly.

To use an X display in an interactive batch job, use ssh -X (or -Y) to login to the front-end machine or use a VNC session on the login node (do not change the DISPLAY variable ssh or VNC provides) and then submit your job with at least the following options to make the current DISPLAY environment variable be set in the batch job:

    % qsub -I -v DISPLAY

You will usually need to request some resources to get enough time and memory dedicated for your interactive task.

Basic commands

The basic PBS commands are:

qstat
Standard queue status command. See man qstat for details of options.
qdel jobid
Delete your unwanted jobs from the queues. The jobid is returned by qsub at job submission time, and is also displayed in the qstat output.
qsub
Submit jobs to the queues. The simplest use of the qsub command is typified by the following (PBS) example (Note that the job starts in your home directory so you must "cd" to a sensible directory. Also there is a carriage-return after ./a.out):
   % qsub -l walltime=20:00:00,vmem=300MB
   cd my_dir
   ./a.out
   ^D     (that is control-D)
or
   % qsub -l walltime=20:00,vmem=300MB jobscript
where jobscript is an ascii file containing the shell script to run your commands (not the compiled executable which is a binary file). More conveniently, the qsub options can be placed within the script to avoid typing them for each job:
   % cat jobscript
   #!/bin/sh
   #PBS -l walltime=20:00:00,vmem=300MB 
   cd $PBS_O_WORKDIR
   ./a.out
You submit this script for execution by PBS using the command:
   % qsub jobscript

Notice that the PBS commands are all at the start of the script, that there are no blank lines between them, and there are no other non-PBS commands until after all PBS resources are described. The variable PBS_O_WORKDIR will be defined in the job as the directory from which qsub was run. This may or may not be where you want to "cd" to.

qsub options of note:

-j oe
Combine the standard output and standard error from the job into one file.

Batch resources

The Shared Clusters are distributed memory cluster systems and users are allocated dedicated portions of the cluster in each batch job. The allocated resources may be on separate nodes without a shared memory space so it can be critical to understand how to request resources and what actually gets allocated.

Note: for requesting GPU resources on the GPU cluster, see GPU cluster specifics.

First, there are two alternative ways to request CPU cores; procs=P and nodes=N:ppn=M

-l procs=P
The integer P specifies the number of cores needed by the job without any restriction on which nodes the cores are allocated on. This form is suitable for parallel jobs that do not need control over co-locality of processes on nodes. It is not suitable for jobs expecting to share memory between tasks and not recommended for strongly coupled parallel jobs which usually benefit from avoiding contention with other jobs.
or
-l nodes=nodespec
The nodespec is in the form of a colon separated list, with the most common components being nodes=N:ppn=M, specifying the number and/or type of nodes needed by the job. N is an integer specifying the number of nodes (default 1). M is an integer specifying exactly how many processors to allocate per node (default 1). A distributed parallel job would normally specify all the processors on each node (ppn=12 on burnet). A shared memory (eg. OpenMP) job can only specify nodes=1 and should specify the number of cores with just the ppn=M.

Jobs that specify procs may start sooner than the equivalent jobs that specify nodes. When in doubt use nodes.

Serial jobs (using one core only) will not usually need to specify either nodes or procs and will probably share a node with other serial jobs, or with parallel jobs that specify the procs resource (subject to memory limitations).
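
To illustrate the nodes=N:ppn=M form, the sketch below turns a core count into a request that uses whole nodes. The function name whole_node_request is hypothetical, the default of 12 cores per node is the burnet figure quoted above, and the result rounds up, so a job needing 30 cores would be given 3 full nodes (36 cores).

```shell
#!/bin/sh
# whole_node_request CORES [CORES_PER_NODE]
# Print a nodes=N:ppn=M string covering at least CORES cores.
# Jobs that fit on one node ask for exactly the cores they need.
whole_node_request() {
    cores=$1
    per_node=${2:-12}       # 12 cores per node on burnet
    if [ "$cores" -le "$per_node" ]; then
        echo "nodes=1:ppn=$cores"
    else
        nodes=$(( ($cores + $per_node - 1) / $per_node ))   # round up
        echo "nodes=$nodes:ppn=$per_node"
    fi
}

whole_node_request 36      # prints nodes=3:ppn=12
whole_node_request 8       # prints nodes=1:ppn=8
```

The printed string is what would follow -l on the qsub command line or on a #PBS line in a job script.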

The most important other qsub options for shared clusters are:

-l walltime=??
The total wall time limit for the job. Time is expressed in seconds as an integer, or in the form: [[hours:]minutes:]seconds[.milliseconds]
-l vmem=??GB
The total (virtual) memory limit (across all nodes) for the job - can be specified with units of "MB" or "GB" but only integer values can be given. This is especially important for allowing serial jobs to co-exist on a node and for jobs requiring large memory to be scheduled on appropriate nodes. There is a small default value.
Your job will only run if there is sufficient free memory so making a sensible memory request will allow your jobs to run sooner. A little trial and error may be required to find how much memory your jobs are using - qstat -f lists the actual usage of jobs. Since not all nodes have the same memory size, it is important to get this right to get the right types of nodes.
-l software=PACKAGE
The software corresponding to PACKAGE is required. Some software has limited numbers of licenses available, or is only deployed on a subset of nodes so you need to tell the scheduler, so the jobs will not be started inappropriately. For jobs that need multiple licenses or software features, the more complex moab resource extension syntax is needed, eg.
"-l gres:matlab+1%statistics_toolbox+1"
Currently managed software that may need such scheduling includes:
  • matlab and various matlab toolboxes
  • matlab_distrib_comp_engine
  • cplex
  • mathematica and mathsubkernel
  • comsol and add-on modules
To see all licenses that the scheduler counts, run mdiag -n GLOBAL
Note that -l options may be combined as a comma separated list with no spaces, eg. -lvmem=500MB,cput=20:00.
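
Because the walltime format is [[hours:]minutes:]seconds, a two-field value is read as minutes and seconds: walltime=20:00:00 is 20 hours, while walltime=20:00 is 20 minutes. The sketch below converts a value to seconds so that a request can be checked before submission; the function name walltime_to_seconds is an assumption for this example, and any fractional seconds are dropped.

```shell
#!/bin/sh
# walltime_to_seconds VALUE
# Convert [[hours:]minutes:]seconds[.milliseconds] to whole seconds.
walltime_to_seconds() {
    echo "$1" | awk -F: '{
        s = $NF                               # last field is seconds
        m = (NF >= 2) ? $(NF - 1) : 0         # minutes, if present
        h = (NF >= 3) ? $(NF - 2) : 0         # hours, if present
        printf "%d\n", h * 3600 + m * 60 + s
    }'
}

walltime_to_seconds 20:00:00    # prints 72000 (20 hours)
walltime_to_seconds 20:00       # prints 1200 (20 minutes)
```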

Within the batch environment there is a variable $PBS_NODEFILE which contains the name of a file containing the names of the nodes allocated to you (with duplicates if you request multiple ppn). You can extract the names from this file and ssh to the allocated nodes during the time while the batch job is running or preferably use the pbsdsh utility or mpirun from openmpi to launch processes on the allocated nodes.
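
As an illustration of reading the node file, the following sketch lists each allocated node once, together with the number of entries (cores) it has in the file. The function name summarize_nodefile is hypothetical; the file format assumed is the one described above, one node name per line with duplicates.

```shell
#!/bin/sh
# summarize_nodefile [FILE]
# Print "nodename count" for each distinct node in the node file.
# Defaults to $PBS_NODEFILE, which is only set inside a batch job.
summarize_nodefile() {
    file=${1:-$PBS_NODEFILE}
    sort "$file" | uniq -c | awk '{ print $2, $1 }'
}

# Only meaningful inside a running batch job:
if [ -n "${PBS_NODEFILE:-}" ]; then
    summarize_nodefile
fi
```

For a job submitted with nodes=2:ppn=12 this would be expected to print two lines, each with a count of 12.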

For an overview of jobs on burnet try the command pbstop which gives a terminal based graphical summary of jobs and the nodes they are assigned to. Type 'h' for help and use arrow up/down to scroll. Please do not do more monitoring than necessary as it loads the batch server and can impact on batch system performance.

Queues and Scheduling

Queue Structure

The cluster has a default queue (world), which routes jobs to a small set of queues which indicate broad categories of jobs.

The other entry queues that you may need to use have settings which direct jobs to particular sets of nodes. In particular, the test queue is set up to use nodes which have been set aside for testing, usually after a re-build, while the queue cm is only available to CMIS CM group members. If you wish to use the opteron queue, contact the helpdesk for access and advice.

An io queue is available for running (i/o) jobs on the head node which otherwise only supports interactive access.

Jobs that are unable to be scheduled to another queue (perhaps the walltime requested is too great) will land in the seekhelp queue and you should check your resource requests or contact the helpdesk.

Batch Job Scheduler

The batch job scheduler being used is Moab. Detailed knowledge of the scheduler is not necessary to run jobs on the cluster but can help users understand what governs the order in which jobs are run.

Moab works by assigning jobs to "reservations" which consist of a block of cpus for a period of time (defined by the requested walltime). The highest priority idle, queued job will get allocated a reservation in the future (and you can tell what time it will start by) and smaller/shorter lower priority jobs may be "backfilled" around the reservations.

Moab also allows "standing" reservations which can be used to reserve a block of resources for jobs which have particular attributes (eg. short jobs, jobs in a specific queue). Some nodes may be assigned to standing reservations. This is why sometimes queued jobs will not start, even if there are idle nodes, as the jobs do not have the right attributes to run in the reservation(s). This mechanism is used to prevent the cluster being monopolised by long running jobs and to enable better turnaround for (shorter) development work.

There are a number of Moab commands that add a scheduler aware interface to the batch system. These commands have man pages, but also return a brief description with the -h help option and include:

showq
display queued jobs including priority order
showres
display reservations, use -n for a per-node view including standing reservations
mdiag
display detailed information on the state of the scheduler
checkjob
display information about a job, including why it is not running
showstart
display the earliest possible start and completion times for a specified job.

Scheduling issues

The scheduling aims are to:

From a user's perspective, it is very important that you minimize your requests for resources (i.e. walltime and memory). Otherwise your job may be queued longer than necessary. Of course, make sure you request sufficient resources - check your measured usage in the epilogue of your job.

In general, shorter duration jobs make scheduling for reasonable overall turnaround and high system utilization much easier. Also, shorter jobs are less likely to waste resources in the case of system failures. For these reasons, the scheduling is set-up to encourage shorter jobs. In particular extra resources are available to jobs which request a walltime of less than 2, 6 and 48 hours and the priority used to determine which jobs start first is boosted and increases more rapidly for shorter jobs.

Estimating memory requirements

As mentioned above, it's important that you request only as much memory as your job really needs. To do this you will need to estimate the actual memory usage.

When working with a program that you are familiar with, you often have a general idea how much memory is required for a given problem size.

If it is a program you have been working with for some time, you can have a look at the output files, i.e. yourscript.o$JOBID, that are created when your jobs terminate. Our system appends to the end of those files information on how much memory has been used, e.g.:

Resources used: cput=01:22:32,mem=23356kb,vmem=120364kb,walltime=01:52:02

"vmem=" shows that the memory used was around 120MB, so for subsequent runs you could request 200MB to be on the safe side.

If it's a new program that you are not familiar with, or a program that you know but you have changed the problem size, you can try to estimate memory consumption using a very short run. The memory consumption in many scientific programs stays constant throughout the time steps. So you could set up your experiment to simulate only, say, 1 day, so that it completes quickly, and then submit it with memory overestimated to make sure its usage won't exceed the allocated amount. Since it's going to run for a short time (don't forget to set walltime to a small value) it hopefully won't sit in the queue for long. Then once it terminates, you can take the amount of vmem actually used from the script.o$JOBID file, increase it by a 10% safety margin, set the walltime to an appropriate value and re-submit your experiment.
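
The arithmetic of this procedure can be sketched as a small helper that reads the "Resources used" line from a job output file, adds a margin and rounds up to a whole number of MB (vmem requests must be integers). The function name suggest_vmem is an assumption, as is the expectation that vmem is always reported in kb as in the example line above.

```shell
#!/bin/sh
# suggest_vmem OUTPUT_FILE [MARGIN_PERCENT]
# Print a vmem request in MB based on the last "Resources used" line
# in OUTPUT_FILE, increased by MARGIN_PERCENT (default 10).
# Prints nothing if no such line is found.
suggest_vmem() {
    margin=${2:-10}
    grep 'Resources used:' "$1" |
        sed -n 's/.*vmem=\([0-9][0-9]*\)kb.*/\1/p' |
        tail -n 1 |
        awk -v pct="$margin" '{
            need = $1 * (100 + pct) / 100     # kb including the margin
            mb = int(need / 1024)
            if (mb * 1024 < need) mb++        # round up to a whole MB
            print mb "MB"
        }'
}

# Usage (not run here): suggest_vmem yourscript.o$JOBID
```

For a reported vmem=120364kb and the default 10% margin this prints 130MB, which could then be used as -l vmem=130MB in the full-length run.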

File Systems

A number of file systems are available, each with a different purpose. Variables are defined at login to refer to the parts of the filesystems available to users.

In the table below: 'properties' denotes the Management attributes of the underlying filesystem: back-up (b), quota (q), nfs (n), local disk (l), job-temporary (j), flush (f), by arrangement (a) and/or migrated (m).

Variable name   Properties   Purpose
$HOME           q, b, n      login settings and persistent backed up large capacity storage
$FLUSHDIR       q, f, n      working files (semi-)persistent between sessions. Ensure that critical files left here are backed up elsewhere
$DATADIR        q, n         persistent files for use in multiple jobs
$TMPDIR         q, j, n      job-temporary files - automatic cleanup. Shared between nodes.
$LOCALDIR       q, j, l      job-temporary files - automatic cleanup
$STOREDIR       q, n, m      nfs mount of datastore on cherax

Flushing is implemented on the $FLUSHDIR area based on necessity but with a minimum lifetime of 7 days. Files newer than the minimum lifetime will never be (automatically) flushed.

Users may request limited space in $DATADIR to hold persistent files eg. this would be useful for user-installed software or data files which are to be used repeatedly over a period during which they would otherwise have to be repeatedly archived and restored. Ensure that critical files left here are backed up.

As well as the home filesystem there are local disks on each of the compute nodes. To use them your job must in general copy files in at the start and out at the end. It is important to check for errors in this process.

Users running i/o intensive jobs should use local disk space on the compute nodes where possible, and then copy any data back to the head node by using scp to the head node (or cherax). The environment variable $LOCALDIR will let you access scratch space on the local disk of a compute node.

e.g. to copy files from a compute node to the head node:

    scp file burnet:~

This will copy files to your home directory on the head node.

    scp file burnet:$FLUSHDIR

This will copy files to your flush directory on the head node. To copy files from the head node to the scratch space on a compute node you can do the following:

    scp burnet:~/file $LOCALDIR

burnet only: There is a direct connection set up between cherax and the compute nodes on burnet, to minimise traffic on the head node. To copy files from cherax to a compute node, you should now do the following:

    scp cherax-cluster:~/file $LOCALDIR

To copy files from the scratch space on a compute node to cherax, you should now do the following:

    scp $LOCALDIR/file cherax-cluster:~

or

    scp $LOCALDIR/file cherax-cluster:\$FLUSHDIR
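
The copy-in, run, copy-out pattern for local disk, with an error check at each stage, might be sketched as follows. The function name run_on_local_disk and its argument order are assumptions made for illustration, and plain cp is used so that the sketch does not depend on any particular host; on the cluster the copies could equally be the scp commands shown above.

```shell
#!/bin/sh
# run_on_local_disk INPUT OUTPUT_NAME DEST_DIR COMMAND [ARGS...]
# Copy INPUT to $LOCALDIR, run COMMAND with $LOCALDIR as the working
# directory, then copy OUTPUT_NAME from $LOCALDIR to DEST_DIR.
# Returns 1, 2 or 3 if stage-in, the command or stage-out fails.
run_on_local_disk() {
    input=$1
    output=$2
    dest=$3
    shift 3
    cp "$input" "$LOCALDIR/" ||
        { echo "stage-in of $input failed" >&2; return 1; }
    ( cd "$LOCALDIR" && "$@" ) ||
        { echo "command failed: $*" >&2; return 2; }
    cp "$LOCALDIR/$output" "$dest/" ||
        { echo "stage-out of $output failed" >&2; return 3; }
}

# Usage inside a job script (not run here; file names are placeholders):
#   run_on_local_disk "$PBS_O_WORKDIR/input.dat" result.dat "$FLUSHDIR" ./a.out
```

Distinct return codes make it possible for the job script to tell a failed copy from a failed run, which matters because $LOCALDIR is cleaned up automatically when the job ends.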

Use the quota command to see your usage and the limits on file systems. Note that there are quotas on both the space occupied and the number of inodes (loosely, the number of files).

On burnet $FLUSHDIR and $DATADIR live in the same filesystem and share the same quota. The only difference is that files in $DATADIR are not subject to flushing.

Code Development

Compiling

The Intel compilers are recommended for code development. In most cases, better performance is obtained using the Intel compilers.

  1. The Intel Fortran compiler is ifort.
  2. The Intel C compiler is icc.
  3. The Intel C++ compiler is icpc.

There are multiple versions of the compilers available and you can set up your environment for a particular version using environment modules, eg. module load intel-fc

Recommended Compiler Options (C/C++ and FORTRAN)

Debugging

Generating a stack trace

The ifort and icc/icpc compilers need the -traceback option to trigger a stacktrace if the executable has a floating point error or segmentation violation. This can be very useful as a first indication of where a problem is. Always include this option to aid in debugging.

Debugging with optimizations enabled

Use the -g option to debug with optimizations enabled. Ensure the optimization level (-O2 or -O3) is specified after this option, as -g makes -O0 the default, eg. -g -O2

Handling Floating point exceptions

Use the -fpe0 option for handling floating-point exceptions. By default, floating-point underflow is gradual unless you explicitly specify a compiler option that enables flush-to-zero; this default provides full IEEE support. (Also see -ftz.) The -fpe0 option will cause the code to abort on errors such as division by zero.

The compiler is very aggressive at high optimization levels. The -fp-model precise option will help avoid floating point calculation problems.

Data access problems

Two common FORTRAN bugs are due to using uninitialized variables and having array bound over-run errors (where memory outside an array is read or changed unintentionally). The ifort options -nozero, -warn and -check all can be particularly useful here. Check the man pages for usage details.

Unfortunately the Intel compilers do not have options to initialise newly allocated memory to NaN values.

Debugging Software

The GNU gdb debugger works with C, C++ and FORTRAN codes. gdb supports the debugging of simple programs and core files, and code with multiple threads.

  1. Compile and link your program using the -g option e.g.
    	% icc -g prog.c
    	
  2. Start the debugger
    	% gdb ./a.out
    	
  3. Enter commands such as
    	(gdb) run
    	(gdb) print var
    	(gdb) quit
    	
Note: To debug using a core file after a program has crashed, you will need to change the core file limit for your shell prior to running your program:
	csh syntax: limit coredumpsize unlimited
	bash syntax: ulimit -c unlimited
	

Debugging Parallel programs

For debugging parallel (OpenMP and MPI) and more complex codes we recommend using the Totalview debugger. Totalview provides a rich set of debugging features via an interactive GUI (and command-line).

To use the Totalview debugger, compile your code with -g then in an interactive batch job load the totalview module:

	% ifort -g prog.f
	% module load totalview
	% totalview a.out -a arg1 arg2
	
and for MPI codes:
	% mpif90 -g mpiprog.f90
	% module load openmpi totalview
	% mpirun -tv a.out arg1 arg2
	

Using MPI

MPI is a parallel program interface for explicitly passing messages between parallel processes - you must add message passing constructs to your program. Then to enable your programs to use MPI, you must include the MPI header file in your source when you compile and link to the MPI libraries.

There is currently one MPI implementation recommended on the cluster, OpenMPI (although Intel and PGI MPI are used internally by some software). You can use modules to set up your path to use the MPI implementation you require. From time to time there will be multiple versions available (with version specific module options).

OpenMPI documentation is available as online man pages (man mpi) and also from the OpenMPI web site.

MPI Compiling and linking

OpenMPI provides wrapper scripts to the compilers to set appropriate compile and link options. The wrappers are:

Running MPI jobs

MPI jobs under OpenMPI require the use of the mpirun (or equivalent mpiexec) command. OpenMPI will automatically detect when it is running in a batch environment and assign processes to the assigned resources. The policy for placing processes can be modified using arguments to the mpirun command. See man mpirun for details.

#!/bin/bash
#PBS -l nodes=3:ppn=12
cd $PBS_O_WORKDIR
module load openmpi
mpirun ./a.out

Using OpenMP

OpenMP is an extension to standard Fortran, C and C++ to support shared memory parallel execution. Directives have to be added to your source code to parallelize loops and specify certain properties of variables.

Compiling and linking

Fortran and C with OpenMP directives are compiled as:

    % ifort -openmp myprog.f -o myprog.exe
    % icc -openmp myprog.c -o myprog.exe

Running OpenMP jobs

To run the OpenMP job interactively, first set the OMP_NUM_THREADS environment variable then run the executable:

    % env OMP_NUM_THREADS=12 ./a.out

For larger jobs and production use, submit a job to the PBS batch system with something like

    % qsub -l nodes=1:ppn=12,walltime=30:00
    #!/bin/sh
    OMP_NUM_THREADS=12
    export OMP_NUM_THREADS
    cd $PBS_O_WORKDIR
    ./a.out
    ^D
    %

Common problems

Two of the most common problems encountered in parallelizing code in shared memory (or porting parallelized code) are stack issues due to the multi-threaded parallel execution model and data scoping issues which may manifest as uninitialized or over-written variables.
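
For the stack issue, one common remedy is to give each OpenMP thread a larger stack through the OMP_STACKSIZE environment variable. This is a minimal sketch, assuming the compiler's runtime supports OpenMP 3.0 or later (older Intel runtimes use KMP_STACKSIZE instead); run_omp is a hypothetical helper name and 64M is only an illustrative value.

```shell
#!/bin/sh
# Run an OpenMP program with an explicit thread count and per-thread stack size.
# Usage: run_omp THREADS STACKSIZE command [args...]
run_omp() (
    OMP_NUM_THREADS=$1
    OMP_STACKSIZE=$2        # e.g. 64M; read by OpenMP 3.0+ runtimes
    export OMP_NUM_THREADS OMP_STACKSIZE
    shift 2
    exec "$@"               # the settings apply only to this command
)

# Example: run_omp 12 64M ./a.out
```

OMP_STACKSIZE affects only the threads created by the OpenMP runtime; the stack of the initial thread is still governed by the shell's stack limit.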

Autoparallelizing with the compilers

The Intel compilers can automatically parallelize code at the level of loops to run in shared memory. The results can be good, but only for a relatively small class of codes. In general, parallelization is most effective when applied at the highest possible level, and a better result (than automatic parallelization) can be achieved by adding OpenMP directives and 'helping' the compiler to identify parallelizable code. You can try automatic parallelization and test whether you get any speedup. You can also use the information from the compiler as a starting point for adding OpenMP directives and to identify code that inhibits parallelism. The Intel compiler options are:

     % ifort -parallel prog.f
or
     % icc -parallel myprog.c

By default this reports to the screen which loops were parallelized. More information can be obtained by using the -par_report options. To run with 12 threads do

     % env OMP_NUM_THREADS=12 time ./a.out

The time output will show the cpu and elapsed execution time.
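
To test whether parallelization gives any speedup, it helps to run the same executable at several thread counts and compare elapsed times. The following is a rough sketch only: scaling_test is a hypothetical helper, and timing is to whole seconds, so it suits runs that last at least tens of seconds.

```shell
#!/bin/sh
# Time a command at several thread counts to see whether it speeds up.
# Usage: scaling_test "1 2 4 8 12" ./a.out
scaling_test() (
    counts=$1
    shift
    for n in $counts; do
        start=$(date +%s)
        OMP_NUM_THREADS=$n "$@" > /dev/null   # program output is discarded
        end=$(date +%s)
        echo "threads=$n elapsed=$((end - start))s"
    done
)
```

If the elapsed time stops falling as the thread count rises, the loops that were parallelized probably account for only a small part of the run time.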

See the Intel Fortran manual for more details on use.

Performance

Profiling

To find which routines are the most time-consuming and where they are called from, compile and link with -p -g to produce an instrumented version of your code. When your program executes, it will generate profiling data stored in a file called gmon.out, which can then be viewed using the gprof program.

	% ifort -p -g -o prog.exe prog.f
	% ./prog.exe
	% gprof ./prog.exe gmon.out
   

Profiling and Performance of MPI programs

MPI Profiling

MPI tracing is available through Solaris Studio using OpenMPI (module load openmpi/1.3.3 solaris_studio) and through the PGI Cluster Development Toolkit (module load pgi).

gprof

There is an undocumented environment variable for use with gprof:

          export GMON_OUT_PREFIX=gmon.blah

which causes profiling outputs (from compiling with -pg) to be named gmon.blah.$$, where $$ is the process ID. You can use this to profile each MPI process independently on Linux.
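
Once each process has written its own profile file, gprof can be run on them one at a time. The sketch below is one way to do that, not a site-provided tool: profile_reports and the report.PID.txt naming are hypothetical, and the GPROF override exists only so the loop can be dry-run.

```shell
#!/bin/sh
# Write one gprof report per profile file produced with GMON_OUT_PREFIX.
# Usage: profile_reports EXECUTABLE PREFIX
profile_reports() (
    exe=$1
    prefix=$2
    for f in "$prefix".*; do
        [ -e "$f" ] || continue            # nothing matched the prefix
        pid=${f##*.}                       # process ID suffix on the file name
        "${GPROF:-gprof}" "$exe" "$f" > "report.$pid.txt"
        echo "wrote report.$pid.txt"
    done
)

# Typical use inside a job, after building with profiling enabled:
#   export GMON_OUT_PREFIX=gmon.blah
#   mpirun -x GMON_OUT_PREFIX ./a.out
#   profile_reports ./a.out gmon.blah
```

The -x option asks OpenMPI's mpirun to export the variable to every process, so that ranks on other nodes also write their profiles under the chosen prefix.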

Other Documentation

Man pages and the GNU info system are available. The GNU version of man is often set up so that the man page search path takes into account the user's current PATH. For commands, you will usually get the man page corresponding to the command you would run, so you might need to set up your PATH for the software whose documentation you want to see. Setting MANPATH can interfere with this automatic behavior, and it is usually best left unset. Software often has documentation included in the distribution and installation. You can look in the install tree on the system, e.g. /tools/mpich/doc and /usr/share/doc.

cmis-cm Specifics

The CMIS CM group has located some of their hardware in burnet. These nodes have extra disk with the users having persistent space in /scr and /scr1. The CM group can run jobs on these nodes by submitting to the cm queue and can submit jobs to the remainder of the cluster using the general queue. The node blade110 allows interactive access via ssh from burnet.

GPU cluster Specifics

The GPU cluster is located in the ACT. The hostname of the login node is linuxgpu.csiro.au. Each node has 8 cores (2 quad-core CPUs) and access to two Tesla GPUs. PGI compilers, which support acceleration directives, are available.

Requesting GPUs in addition to CPU cores

A number of different qsub resource specifications are suitable, depending on your job's requirements.

-l nodes=N:ppn=8
Requesting entire compute nodes will give you exclusive access to both GPUs on each node, so no additional syntax is required.
-l nodes=N:ppn=1,gres=gpu
If you don't need a whole node, a single GPU can be reserved with each core by the addition of gres=gpu to your request. This mostly makes sense with serial jobs with the default nodes=1:ppn=1.
-l nodes=N:ppn=2,gres=gpu
The gres=gpu feature is applied for each CPU core requested, so it can be used with at most ppn=2 (as there are only two GPUs per compute node).
-l nodes=N:ppn=1,gres=gpu:2
You can also request multiple GPUs per CPU core. This only currently makes sense if you want to reserve both GPUs but only use one, and make sure the scheduler will not run another GPU job on the nodes.

Requesting gres=gpu:3, or gres=gpu with ppn greater than 2, will fail as it would require more than 2 GPUs per compute node.

It is not possible using the currently available syntax to request other combinations such as 2 CPU cores and 1 GPU.

In such cases where you need more CPU cores than GPUs, simply request the whole node.

Finally, as two jobs in a single node can be allocated cores on a single CPU, it is possible to get bandwidth contention to the GPUs. Poorly written codes can even use the wrong GPU, so if this is a concern use any of the options above that request both GPUs or the whole node.
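
The rules above can be condensed into a small helper that turns a per-node request into a -l string. This is a sketch of the rules as described in this section, not a site-provided tool; gpu_resource_spec is a hypothetical name, and the limits of 8 cores and 2 GPUs per node are taken from the description of the GPU nodes.

```shell
#!/bin/sh
# Map a per-node request (cores, GPUs) onto the -l syntax described above.
# Usage: gpu_resource_spec NODES CORES_PER_NODE GPUS_PER_NODE
gpu_resource_spec() {
    n=$1
    cores=$2
    gpus=$3
    if [ "$cores" -lt 1 ] || [ "$cores" -gt 8 ] ||
       [ "$gpus" -lt 1 ] || [ "$gpus" -gt 2 ]; then
        echo "unsupported request: nodes have 8 cores and 2 GPUs" >&2
        return 1
    fi
    if [ "$cores" -eq "$gpus" ]; then
        # one GPU per core: ppn=1 or ppn=2 with gres=gpu
        echo "nodes=$n:ppn=$cores,gres=gpu"
    elif [ "$cores" -eq 1 ] && [ "$gpus" -eq 2 ]; then
        # reserve both GPUs for a single core
        echo "nodes=$n:ppn=1,gres=gpu:2"
    else
        # more cores than GPUs cannot be expressed, so take the whole node
        echo "nodes=$n:ppn=8"
    fi
}

# Example: qsub -l "$(gpu_resource_spec 1 1 1)" job.sh
```

Falling back to the whole node for every other combination also avoids sharing a node's GPUs with another job.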

Known Problems

  1. My program aborts with SIGKILL

    Symptom: My program aborts with a SIGKILL signal but the batch system reports no other errors.

    Possible Cause: Some applications require more than the default stack space (currently 8 MB). Increase the stack limit using 'ulimit -s <size>' (for sh/bash/ksh) or 'limit stacksize <size>' (for csh/tcsh), where size is specified in kilobytes.

    Note: MPI applications will need a 'wrapper' script, started by mpirun/mpiexec, to set the limit and then start the application.
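
A wrapper of the kind mentioned in the note can be very small: it sets the stack limit and then replaces itself with the real program. The sketch below shows the idea as a shell function; with_stack, the stack_wrapper.sh file name and the 65536 KB value are all hypothetical, and a request above the hard limit will fail.

```shell
#!/bin/sh
# Set the soft stack limit, then replace this shell with the program.
# Usage: with_stack SIZE_KB command [args...]
with_stack() (
    ulimit -s "$1" || exit 1    # size is in kilobytes
    shift
    exec "$@"
)

# As a standalone wrapper script (for example stack_wrapper.sh) the body is:
#   ulimit -s 65536
#   exec "$@"
# and each MPI process is then started through it:
#   mpirun ./stack_wrapper.sh ./a.out arg1 arg2
```

Because mpirun starts the wrapper once per process, the new limit is set on every node, not only in the shell that submitted the job.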

Changelog

Here is a list of recent updates in this userguide for quick reference for users returning to this guide.

To Do

Here is a list of pending updates to this userguide.