#pragma omp parallel [Options...]
{ ...
... Parallel region
...
... Program statements between the braces
... are executed in parallel by all threads
...
}
|
#include <omp.h>     // OpenMP pragmas and runtime (original misspelled this "opm.h")
#include <iostream>  // for std::cout / std::endl (missing in the original)
using namespace std;

int main(int argc, char *argv[])
{
    // Every thread in the team executes the parallel block once,
    // so the greeting is printed once per thread.
    #pragma omp parallel
    {
        cout << "Hello World !!!" << endl;
    }
    return 0;
}
|
export OMP_NUM_THREADS=8
a.out
You will see "Hello World !!!" printed EIGHT times !!! (Remove the #pragma line and you get ONE line)....
|
Then the master thread T0 continues with the single-threaded execution of the code following the PARALLEL section.
|
#include <omp.h>
#include <iostream>  // for std::cout / std::endl (missing in the original)
using namespace std;

int main(int argc, char *argv[])
{
    int N; // Variable defined OUTSIDE parallel region....
           // It is therefore SHARED by all threads
    N = 1001;
    cout << "Before parallel section: N = " << N << endl;

    // Every thread performs an UNSYNCHRONIZED read-modify-write on the
    // shared N, so updates can be lost. This is a deliberate demonstration
    // of a data race — the final value may be anything up to 1001 + nthreads.
    #pragma omp parallel
    {
        N = N + 1;
        cout << "Inside parallel section: N = " << N << endl;
    }
    cout << "After parallel section: N = " << N << endl;
    return 0;
}
|
You should see that the value of N at the end is not always 1009 — it can be less. You have seen this phenomenon before in threaded programs, when multiple threads update a shared variable concurrently... The same thing is happening here.
#include <omp.h>
int main(int argc, char *argv[])
{
    #pragma omp parallel
    {
        // Declared INSIDE the parallel region, so every thread gets its
        // own private copy of N — it is NOT shared between threads.
        int N = 1001;
        N = N + 1;
    }
    // ERROR if you try to do this:
    // cout << "N = " << N << endl;
    // because N is not defined in the outer scope !!!
}
|
Because N is declared inside the parallel region, each thread gets its own private copy: updates made by one thread are invisible to the others, and no race occurs. The variable also ceases to exist when the region ends, which is why the commented-out print statement would be a compile error.
| Function Name | Effect |
|---|---|
| omp_set_num_threads(int nthread) | Set size of thread team |
| INTEGER omp_get_num_threads() | return size of thread team |
| INTEGER omp_get_max_threads() | return max size of thread team (typically equal to the number of processors) |
| INTEGER omp_get_thread_num() | return thread ID of the thread that calls this function |
| INTEGER omp_get_num_procs() | return number of processors |
| LOGICAL omp_in_parallel() | return TRUE if currently in a PARALLEL segment |
| omp_init_lock(omp_lock_t *lock) | Initialize the mutex lock "lock" |
| omp_set_lock(omp_lock_t *lock) | Lock the mutex lock "lock" |
| omp_unset_lock(omp_lock_t *lock) | Unlock the mutex lock "lock" |
| omp_test_lock(omp_lock_t *lock) | Try to set the mutex lock "lock" without blocking: returns true if the lock was successfully acquired, false otherwise |
#include <iostream>  // standard header (the pre-standard <iostream.h> is obsolete)
#include <omp.h>     // Read in OpenMP function prototypes
using namespace std;

int main(int argc, char *argv[])
{
    int nthreads, myid;

    // nthreads and myid are declared OUTSIDE the region, so they must be
    // listed in private(...) — otherwise all threads would race on them.
    #pragma omp parallel private (nthreads, myid)
    {
        /* Every thread does this */
        myid = omp_get_thread_num();
        cout << "Hello I am thread " << myid << endl;

        /* Only thread 0 does this */
        if (myid == 0)
        {
            nthreads = omp_get_num_threads();
            cout << "Number of threads = " << nthreads << endl;
        }
    }
    return 0;
}
|
/* Shared Variables */
// Outline (pseudocode) of the parallel min-search program; the "..." body
// inside the parallel region is filled in by the complete version below.
double x[1000000]; // Must be SHARED (accessed by worker threads !!)
int start[100]; // Contain starting array index of each thread
double min[100]; // Contain the minimum found by each thread
int num_threads; // team size; NOTE(review): never assigned in this fragment
int main(...) // NOTE(review): pseudocode — i, my_min and MAX are not declared here
{
for (i = 0; i < MAX; i++)
x[i] = random()/(double)1147483648; // NOTE(review): likely a typo for 2147483648 (2^31, random()'s range)
// ---------------------------------------------------
// Tell each thread where to start searching for min
// ----------------------------------------------
for (i = 0; i < num_threads; i = i + 1)
start[i] = i;
#pragma omp parallel
{
... Thread i finds its minimum and
... store the result in min[i]
}
// ----------------------------------------
// Post processing: Find actual minimum
// ----------------------------------------
my_min = min[0];
for (i = 1; i < num_threads; i++)
if ( min[i] < my_min )
my_min = min[i];
}
|
// Complete parallel min-search: each thread scans its own contiguous slice
// of x[] and records its local minimum in min[id]; the master thread then
// reduces the per-thread minima after the parallel region.
#include <stdlib.h> // random()
#include <omp.h>    // omp_get_num_threads(), omp_get_thread_num()

#define MAX 1000000 // number of elements in x[] (matches the array size)

/* Shared Variables */
double x[1000000]; // Must be SHARED (accessed by worker threads !!)
int start[100];    // Contain starting array index of each thread (unused here)
double min[100];   // Contain the minimum found by each thread
int num_threads;   // team size, published by thread 0 for the post-processing loop

int main(int argc, char *argv[])
{
    int i;
    double my_min;

    // Fill x[] with random values in [0, 1).
    // random() returns a long in [0, 2^31 - 1], so divide by 2^31 = 2147483648
    // (the original divided by 1147483648, a typo).
    for (i = 0; i < MAX; i++)
        x[i] = random()/(double)2147483648;

    #pragma omp parallel
    {
        int id;            // this thread's ID: 0, 1, ..., nthreads-1
        int j, n, lo, hi;  // slice length and [lo, hi) bounds for this thread
        int nthreads;      // team size — only reliable INSIDE the region
        double local_min;

        nthreads = omp_get_num_threads();
        n = MAX/nthreads;  // slice length; last thread also takes the remainder
        id = omp_get_thread_num();
        lo = id * n;
        hi = (id == nthreads - 1) ? MAX : lo + n;

        local_min = x[lo];
        for (j = lo + 1; j < hi; j++)
        {
            if (x[j] < local_min)
                local_min = x[j];
        }
        min[id] = local_min; // Store result in min[id]

        // Publish the team size for the sequential post-processing loop.
        // (In the original, num_threads was read without ever being set.)
        if (id == 0)
            num_threads = nthreads;
    }

    // ----------------------------------------
    // Post processing: Find actual minimum
    // ----------------------------------------
    my_min = min[0];
    for (i = 1; i < num_threads; i++)
        if (min[i] < my_min)
            my_min = min[i];

    return 0;
}
|
Compile with:
Run with (on compute):
export OMP_NUM_THREADS=8
a.out
#pragma omp critical
{
...
... Update shared variables
...
}
|
double f(double a)
{
return( 2.0 / sqrt(1 - a*a) );
}
// Serial midpoint-rule approximation of pi: sums f at the midpoint of each
// of N subintervals of (0, 1). NOTE: "N = ...;" is tutorial pseudocode.
int main(int argc, char *argv[])
{
int i;
int N;
double sum;
double x, w;
N = ...; // accuracy of the approximation
w = 1.0/N; // width of each subinterval
sum = 0.0;
for (i = 1; i <= N; i = i + 1)
{
x = w*(i - 0.5); // midpoint of subinterval i
sum = sum + w*f(x); // rectangle area: width * height
}
cout << sum; // approaches pi as N grows
}
|
Compile with:
Run the program with:
(We have seen this program before, so I will not explain it again: click here )
When we parallelize, it is important to know which UPDATES must be SYNCHRONIZED:
double f(double a)
{
return( 2.0 / sqrt(1 - a*a) );
}
// Parallel pi: thread id handles iterations id, id+nthreads, id+2*nthreads, ...
// The shared accumulator "sum" is protected by a critical section on EVERY
// iteration, which serializes the updates (slow — improved in the next version).
// NOTE: "N = ...;" is tutorial pseudocode.
int main(int argc, char *argv[])
{
int N;
double sum; // Shared variable, updated !
double x, w; // NOTE(review): this outer x is shadowed inside the region and unused
N = ...; // accuracy of the approximation
w = 1.0/N;
sum = 0.0;
#pragma omp parallel
{
int i, num_threads; // Non-shared variables !!!
double x;
num_threads = omp_get_num_threads() ;
// cyclic (round-robin) distribution of the iterations over the threads
for (i = omp_get_thread_num(); i < N; i = i + num_threads)
{
x = w*(i + 0.5);
#pragma omp critical // only one thread at a time may update sum
{
sum = sum + w*f(x);
}
}
}
cout << sum;
}
|
export OMP_NUM_THREADS=8
a.out 50000000
Change OMP_NUM_THREADS and see the difference in performance
double f(double a)
{
return( 2.0 / sqrt(1 - a*a) );
}
// Improved parallel pi: each thread accumulates into its PRIVATE mypi and
// enters the critical section only ONCE at the end, instead of once per
// iteration — far less synchronization overhead than the previous version.
// NOTE: "N = ...;" is tutorial pseudocode.
int main(int argc, char *argv[])
{
int N;
double sum; // Shared variable, updated !
double x, w; // NOTE(review): this outer x is shadowed inside the region and unused
N = ...; // accuracy of the approximation
w = 1.0/N;
sum = 0.0;
#pragma omp parallel
{
int i, num_threads;
double x;
double mypi; // Private variable to reduce synchronization
num_threads = omp_get_num_threads() ;
mypi = 0.0;
// cyclic distribution: thread id takes iterations id, id+num_threads, ...
for (i = omp_get_thread_num(); i < N; i = i + num_threads)
{
x = w*(i + 0.5);
mypi = mypi + w*f(x); // no locking needed: mypi is private
}
#pragma omp critical // one combine per thread, not one per iteration
{
sum = sum + mypi;
}
}
cout << sum;
}
|
export OMP_NUM_THREADS=8
a.out 50000000
Change OMP_NUM_THREADS and see the difference in performance
It is a very simple and mechanical process to divide up the work over a number of threads simply by scheduling a different thread to work on the for-body using a different index value.
The division of labor — splitting the work of a for-loop across threads — can be done in OpenMP through a special Parallel LOOP construct.
#pragma omp parallel
{
....
#pragma omp for [parameters]
for-statement // Parallel Loop
....
}
|
Each iteration of the for-loop is executed exactly once, by one of the threads (the iterations are divided among the members of the thread team).
The loop variable used in the Parallel LOOP construct is by default PRIVATE (other variables are still by default SHARED)
double f(double a)
{
return( 2.0 / sqrt(1 - a*a) );
}
// Parallel pi using the "omp for" worksharing construct: the compiler/runtime
// divides the loop iterations among the threads automatically, so the code
// no longer hand-computes a cyclic index distribution.
// NOTE: "N = ...;" is tutorial pseudocode.
int main(int argc, char *argv[])
{
int N;
double sum; // Shared variable, updated !
double x, w; // NOTE(review): this outer x is shadowed inside the region and unused
N = ...; // accuracy of the approximation
w = 1.0/N;
sum = 0.0;
#pragma omp parallel
{
int i;
double mypi, x; // private per-thread accumulator and sample point
mypi = 0.0;
#pragma omp for
for (i = 0; i < N; i = i + 1)
{
x = w*(i + 0.5); // Save us the trouble of dividing
mypi = mypi + w*f(x); // the work up...
}
#pragma omp critical // combine each thread's partial sum exactly once
{
sum = sum + mypi;
}
}
cout << sum;
}
|
The C/C++ compiler will insert instructions that distribute the execution of each iteration of the for-loop to some thread — it is no longer your problem to "skip" index counts to accomplish load distribution!
export OMP_NUM_THREADS=8
a.out 50000000
Change OMP_NUM_THREADS and see the difference in performance
setenv STACKSIZE nBytes |