
Shared Memory Parallelism

Shared memory parallelism is useful for running parallelised code on a single machine or node. Here it is implemented using the OpenMP library, which can be combined with distributed memory parallelism when scaling beyond a single node. OpenMP is supported in C, C++ and Fortran, though most languages implement some kind of parallelism.

Shared Memory Parallelism is based on threads with a shared memory space. A master thread is started, which is responsible for running the main programme. When the programme reaches a parallelised chunk, a set of threads is forked and used together with the master thread. At the end of the parallelised section, these threads are either put to sleep or killed if no longer needed. These are typically used for loops with independent iterations.

The basic syntax of OpenMP is based around #pragma preprocessor instructions. You may already have seen #pragma once at the top of a header file, which guarantees the header is only included once; OpenMP adds its own #pragma omp statements. The OpenMP runtime library provides utility functions such as omp_get_num_threads() and is included with #include <omp.h>. When compiling, you then provide the flag -fopenmp.

Here's an example of a Hello World in OpenMP:

:::C++
#include <iostream>
#include <omp.h>

int main(int argc, char ** argv)
{
    #pragma omp parallel
    {
        int threadnum = 0;
        int numthreads = 0;
        threadnum = omp_get_thread_num();
        numthreads = omp_get_num_threads();
        std::cout << "Hello World, I am " << threadnum
            << " of " << numthreads << std::endl;
    }
}

though one problem here is that std::cout is NOT threadsafe in the sense that output from different threads can interleave, so the output is garbled and looks like Hello World, I am Hello World, I am Hello World, I am Hello World, I am 0132 of of of of 4444. Additionally, every single thread calls omp_get_num_threads() and keeps a local copy of the result, which is wasteful.

To make our code more sensible, we need a 'threadsafe' way of outputting the data. We also need to provide a fallback if OpenMP is not available. The latter can be achieved using the preprocessor statements #ifdef _OPENMP ... #endif: code inside this block is compiled only if OpenMP is available. We then have to protect our output block, such that only one thread can access it at a time. This is achieved with the statement #pragma omp critical. The resulting improved programme looks like:

:::C++
#include <iostream>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(int argc, char ** argv)
{
    int threadnum = 0;
    int numthreads = 1;  // sensible fallback if OpenMP is not available
    #pragma omp parallel shared(numthreads), private(threadnum)
    {
        #ifdef _OPENMP
            threadnum = omp_get_thread_num();
            #pragma omp single
            {
                numthreads = omp_get_num_threads();
            }
        #endif
        #pragma omp critical
        {
            std::cout << "Hello World, I am " << threadnum <<
                " of " << numthreads << std::endl;
        }
    }
}

Note that this also introduces the notation for declaring variables as shared between threads or private to each thread.

Now we want to look at how we can use OpenMP to improve a for-loop. The example we will look at calculates pi by performing numerical integration, which follows a map-reduce-like pattern.

We begin by defining our parallel scope with #pragma omp parallel, declaring private, shared and firstprivate variables. The first two are self-explanatory. firstprivate is new: each thread's private copy is initialised with the value the variable held just before entering the parallel region. We then 'reduce' the per-thread results down into the shared result using a #pragma omp critical block. The resulting code looks like:

:::C++
#include <iostream>

int main(int argc, char ** argv)
{
    double pi,sum,x;
    const int N = 10000000;
    const double w = 1.0/N;

    pi = 0.0;
    sum = 0.0;
    #pragma omp parallel private(x), firstprivate(sum), shared(pi)
    {
        #pragma omp for
        for (int i = 0; i < N; ++i)
        {
            x = w*(i+0.5);  // midpoint of the i-th interval, with i starting at 0
            sum = sum + 4.0/(1.0 + x*x);
        }
        #pragma omp critical
        {
            pi = pi + w*sum;
        }
    }
    std::cout << "Result is " << pi << std::endl;
}

Caveat: By default, the OpenMP scope of a variable is shared and this can lead to w e i r d b u g s.
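
One way to defend against this is the default(none) clause, which removes the implicit shared default so that every variable used in the region must be given an explicit scope. A minimal sketch (the loop and the total variable here are just for illustration, not part of the pi example):

:::C++
#include <iostream>

int main(int argc, char ** argv)
{
    int total = 0;
    // default(none) means any variable we forget to scope becomes a
    // compile error rather than a silently shared variable.
    #pragma omp parallel for default(none) shared(total)
    for (int i = 0; i < 1000; ++i)
    {
        #pragma omp critical
        {
            total += i;  // protected update of the shared accumulator
        }
    }
    std::cout << "Total is " << total << std::endl;
}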

Caveat 2: We have to be aware of 'Race Conditions', which occur when multiple threads access the same object at the same time and at least one of them writes to it, so the result depends on the order in which the threads happen to run.

The reduction pattern arises so commonly in OpenMP that we can in fact express it directly in the #pragma omp statement. Writing #pragma omp ... reduction(+:sum) tells the compiler to give each thread its own copy of sum and add the different values together into the main thread's variable sum, replacing the previous #pragma omp critical block we had above with simply pi = w*sum.
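
The pi programme rewritten with a reduction clause might then look like the sketch below (same variable names as above, with the critical block gone):

:::C++
#include <iostream>

int main(int argc, char ** argv)
{
    double pi,sum,x;
    const int N = 10000000;
    const double w = 1.0/N;

    pi = 0.0;
    sum = 0.0;
    // reduction(+:sum) gives every thread its own sum, initialised to zero,
    // and adds the per-thread values together at the end of the region.
    #pragma omp parallel private(x), reduction(+:sum)
    {
        #pragma omp for
        for (int i = 0; i < N; ++i)
        {
            x = w*(i+0.5);
            sum = sum + 4.0/(1.0 + x*x);
        }
    }
    pi = w*sum;
    std::cout << "Result is " << pi << std::endl;
}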

Going Beyond Trivially Parallelisable

In many cases, there are dependencies between the threads involved in the computation. In this case, we need to use critical sections and thread-local variables to avoid errors. There are multiple examples of these kinds of bugs, and they can be tricky to diagnose.

There are several tricks we can use to guarantee synchronisation (a short sketch combining them follows the list).

  1. #pragma omp barrier: This defines a barrier where threads wait until all threads have reached it. The end of a worksharing construct (such as omp for) or a parallel region is implicitly a barrier, though for worksharing constructs it can be disabled with nowait.
  2. #pragma omp critical: Only one thread can execute this block at a time, which allows us to protect non-threadsafe code.
  3. #pragma omp single: Exactly one thread (whichever reaches it first) executes this block; the others skip it and wait at the implied barrier.
  4. #pragma omp master: Same as single, but only the master thread executes it, and there is no implied barrier.
  5. #pragma omp atomic: Protects a single update to a shared variable by performing it in one indivisible step.
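
The sketch below is illustrative only, combining these constructs in a single parallel region; counter and numthreads are just example variables:

:::C++
#include <iostream>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(int argc, char ** argv)
{
    int counter = 0;
    int numthreads = 1;
    #pragma omp parallel shared(counter, numthreads)
    {
        #ifdef _OPENMP
        #pragma omp master
        {
            // only the master thread runs this, and there is no implied barrier
            numthreads = omp_get_num_threads();
        }
        #endif

        // wait until every thread has arrived, so numthreads is definitely set
        #pragma omp barrier

        // a single indivisible update, cheaper than a full critical block
        #pragma omp atomic
        counter += 1;

        #pragma omp barrier

        // exactly one thread prints the summary
        #pragma omp single
        {
            std::cout << counter << " of " << numthreads
                << " threads incremented the counter" << std::endl;
        }
    }
}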

Mutex Locks

Mutex (short for mutually exclusive) locks are a way to manage resources and protect non-threadsafe code in concurrent programmes. The way a lock works is that a thread tries to set the lock upon reaching the code block. If the lock is not held by another thread, then the thread sets the lock and executes the code. If another thread then tries to set the lock, it must wait until the first thread finishes and releases the lock.

It is sometimes useful to set multiple locks for multiple resources. This is fine, but you must be careful to avoid a deadlock situation, where two threads each hold one lock and are waiting for the other to be released.
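
In OpenMP, locks live in the runtime library rather than in a pragma: omp_init_lock, omp_set_lock, omp_unset_lock and omp_destroy_lock. A minimal sketch, where the results vector is just an example of a shared, non-threadsafe resource (this version assumes OpenMP is available, since the lock routines need omp.h):

:::C++
#include <iostream>
#include <vector>
#include <omp.h>

int main(int argc, char ** argv)
{
    std::vector<int> results;
    omp_lock_t lock;
    omp_init_lock(&lock);

    #pragma omp parallel for shared(results, lock)
    for (int i = 0; i < 100; ++i)
    {
        int value = i * i;         // thread-local work, no lock needed
        omp_set_lock(&lock);       // blocks until the lock is free
        results.push_back(value);  // push_back is not threadsafe
        omp_unset_lock(&lock);     // release so other threads can proceed
    }

    omp_destroy_lock(&lock);
    std::cout << "Collected " << results.size() << " results" << std::endl;
}

If you do need several locks at once, acquiring them in the same order in every thread is the usual way to avoid the deadlock described above.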

Strategies and Load Balancing

Consider the for-loop example again, with, say, 100000 iterations split over 4 cores. In this case, there are two obvious strategies.

The first would be to give each core a block of 25000 iterations and let each thread run. This has little management overhead and is fine if the cost of an iteration is independent of our position in the loop, but the threads finish at very different times if the cost depends at all on i.

Alternatively, we could pass one iteration at a time to a waiting thread. This would solve any dependence on i, but it has a significant overhead.

OpenMP defines a set of scheduling strategies for handling how loop iterations are divided between threads, chosen with the schedule clause on the loop.
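
A hedged sketch contrasting the main schedule kinds, using an illustrative loop whose per-iteration cost grows with i:

:::C++
#include <iostream>
#include <cmath>

int main(int argc, char ** argv)
{
    const int N = 1000000;
    double total = 0.0;
    // schedule(static)     splits the iterations into equal chunks up front:
    //                      almost no overhead, but assumes uniform cost.
    // schedule(dynamic, c) hands chunks of c iterations to whichever thread
    //                      is free: balances uneven costs, more overhead.
    // schedule(guided)     is like dynamic, but the chunk size shrinks as
    //                      the loop progresses.
    #pragma omp parallel for schedule(dynamic, 1000) reduction(+:total)
    for (int i = 0; i < N; ++i)
    {
        // the cost of an iteration grows with i, so equal static chunks
        // would leave the later threads doing most of the work
        for (int j = 0; j < i / 10000; ++j)
        {
            total += std::sin(0.001 * j);  // stand-in for real work
        }
    }
    std::cout << "Total is " << total << std::endl;
}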

Tasks