Distributed Memory Parallelism

Distributed memory parallelism can be understood as parallelisation across 'machines': each process has its own independent memory, and processes communicate by sending messages to one another.

This is achieved using a protocol called the Message Passing Interface (MPI). In practice, MPI is implemented as a C library with bindings for Fortran, Python (with Boost or mpi4py), R and C++ (with Boost). MPI programmes are launched with a command-line programme, mpiexec, and vendors provide wrappers around compilers to ease compilation.

The MPI library is C, and so when passing custom types we use void pointers together with an MPI_Datatype argument describing the data, e.g. int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm). The final argument to this function is the communicator, the object which handles message passing.

To compile MPI programmes, we use a command named mpic++ (also known as mpicc for C, or mpicxx). This programme handles proper linking against the MPI library. We then run the resulting executable with mpiexec, and can specify the number of processes using the flag -n <number>. The hello_world example is shown below:

#include <mpi.h>
// Next line tells Catch we will use our own main function
#define CATCH_CONFIG_RUNNER
#include "catch.hpp"

TEST_CASE("Just test I exist") {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    CHECK(size > 0);
    CHECK(rank >= 0);
}

int main(int argc, char * argv[]) {
    MPI_Init(&argc, &argv);
    int result = Catch::Session().run(argc, argv);
    MPI_Finalize();
    return result;
}

All MPI calls must come between MPI_Init and MPI_Finalize. The communicator (MPI_Comm) handles message passing between a given group of processes; it knows both the size of the group and the rank (order) of each process within it. By convention, process 0 in a group is special and is called the root.

Point-to-Point Communication

This refers to message passing between two processes, for example to process data or report success. Common examples include:

  1. Blocking Synchronous Send: A drops off a message and waits until it receives a delivery receipt from process B. Name is MPI_Ssend.
  2. Blocking Send: A drops off the message and then carries on, while process B waits for the data. Name is MPI_Send.
  3. Non-blocking Send: Here, process A stores the data in the safebox and doesn't need to wait for the transit to begin. The data is transmitted to B, and a receipt is left in the safebox to be accessed later by process A. Name is MPI_Isend.

The questions involved are: how long do I have to wait, and when can I start modifying data used for the message? This will depend entirely on the nature of your calculation.

As a shorthand: if there is no prefix on the method name it is a blocking call, if the call looks like MPI_S<name> it is synchronous, and if it looks like MPI_I<name> it is asynchronous (non-blocking). For each send call we also need a matching receive (MPI_Recv) on process B, which takes a preallocated buffer and a pointer to a status variable.

Dead Lock

Again, it is important to be aware of accidentally creating 'deadlock' situations, where two processes are stuck waiting for each other before either can proceed. This is especially difficult to debug, as the relative 'speed' of each process will vary from run to run.

Collective Communications

In this case, we are doing more than 1-to-1 communication between our processes. There are many different schemes, of which we will review only a few:

  1. Broadcast: root sends the same data to all other processes, used e.g. in setting up a calculation.
  2. Gather: the other processes send their data to root, usually used to collect results.
  3. Gather All: like Gather, but every process ends up with a copy of all the data.
  4. Scatter: the inverse of Gather; root splits its data into pieces and sends a different piece to each process.
  5. Reduce: data is sent to root and a binary operation (e.g. a sum) is used to combine the results into an 'accumulator'.

Synchronisation is achieved using the MPI_Barrier method, which holds every process at a common point until all of them have arrived.