POSIX Thread (pthread) Programming

Thread Synchronization

The most important difference between parallel programming and sequential programming (i.e., using 1 thread) is synchronization among the parallel threads that concurrently access a common resource (in our case: shared variables).
Some access operations conflict with each other, and these access operations cannot be executed simultaneously.
Not all access operations are conflicting, however.
Knowing which accesses conflict and when to protect shared variables from concurrent access is the key to writing fast and correct parallel programs.
There are 2 access operations: read and write
The conflicting combinations are:
- Reading by one thread and writing by another thread
- Writing by one thread and writing by another thread
Recall that global variables can be accessed by all threads
Recall that threads execute simultaneously, meaning that the (assembler) instructions at multiple places in the program are executed at the same time (by multiple CPUs)

Recall that when 2 parallel threads try to update a shared (global) variable, some updates can be lost:

    Thread 1 on CPU 1      Thread 2 on CPU 2      Memory
    =================      =================      ========
                                                  N = 1234
    Read N  --> 1234
    Add 1   --> 1235       Read N  --> 1234
    Write N                Add 1   --> 1235       N = 1235
                           Write N                N = 1235

Clearly, we must ensure that parallel thread execution cannot produce a result that differs from the "normal" (sequential) result.
The problem that parallel thread execution can produce abnormal results is called the (thread) synchronization problem

Common synchronization methods in Shared-Memory programming

- Mutex locks (mutual exclusion locks)
- Read/Write Locks
- Binary and counting semaphores
- Barriers
- Condition variables (wait and signal operations)

Synchronization methods available in PThread
- Not every synchronization method from CS theory is implemented in PThreads
- Available synchronization methods in POSIX Threads:
- Binary semaphores, counting semaphores and barriers can all be implemented using condition variables
  (CS355 students have not seen semaphores (operating system stuff), so I will skip these topics here)
MUTEX LOCKS

Mutex Locks: Theory

A mutex lock variable has 2 values (states)
- Unlocked
- Locked

A mutex lock is a synchronization object with 2 operations:

Lock
- If the mutex lock is in the unlocked state, the lock operation completes immediately: the state of the mutex lock is changed to locked and the thread continues with the next instruction following the lock command.
- If the mutex lock is in the locked state, the thread that executes the lock command will block (it stops executing) until the state of the mutex lock becomes unlocked. When that happens, the lock command completes and changes the state of the mutex lock to locked again.
Unlock
- If the mutex lock is in the locked state, the state is changed to unlocked
- If the mutex lock is in the unlocked state, this operation has no effect in this theoretical model. (In POSIX, unlocking a default mutex that is not locked is undefined behavior, so do not rely on it.)
The mutex lock can ONLY be unlocked by the thread that previously locked the mutex !!!
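The two states can be observed from a single thread with pthread_mutex_trylock(), which behaves like the lock operation but returns EBUSY instead of blocking. The following is only a sketch: the function name mutex_states_demo is made up for illustration, and it assumes the default mutex type is non-recursive (as it is on Linux).

```cpp
#include <pthread.h>
#include <cerrno>

// Hypothetical helper (not from the notes): walks one mutex through its
// two states and reports whether every step behaved as the theory predicts.
// Assumes the default mutex type is non-recursive.
bool mutex_states_demo()
{
    pthread_mutex_t m;
    if (pthread_mutex_init(&m, NULL) != 0)   // default attributes: starts unlocked
        return false;

    bool ok = true;
    ok &= (pthread_mutex_trylock(&m) == 0);      // unlocked -> locked
    ok &= (pthread_mutex_trylock(&m) == EBUSY);  // locked: a real lock would block here
    ok &= (pthread_mutex_unlock(&m) == 0);       // locked -> unlocked
    ok &= (pthread_mutex_trylock(&m) == 0);      // can be locked again
    ok &= (pthread_mutex_unlock(&m) == 0);       // leave it unlocked

    pthread_mutex_destroy(&m);
    return ok;
}
```

Calling mutex_states_demo() should return true. The demo uses trylock only so that it cannot hang; worker threads normally use the blocking pthread_mutex_lock().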

Mutex Locks in PThreads
- Defining a mutex lock variable in Pthreads:
      pthread_mutex_t mutex;
- Initializing a mutex variable:
  After defining the mutex lock variable, you must initialize it using the following function:
      int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *attr);
  - mutex: is the mutex lock that you want to initialize (pass the address !)
  - attr: is the set of initial properties of the mutex lock.
  The most common mutex lock is one where the lock is initially in the unlocked state.
  This kind of mutex lock is created using the (default) attribute NULL:
      pthread_mutex_init(&mutex, NULL);
- Locking a mutex:
      pthread_mutex_lock(&mutex);
  - NOTE: although it is possible for pthread_mutex_lock() (and other thread locking functions) to return an error code (see "man pthread_mutex_lock"), this is so rare that we rarely check the return code...
- Unlocking a mutex: (remember: only the thread that has locked the mutex can unlock it !)
      pthread_mutex_unlock(&mutex);
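For completeness, here is a sketch of the whole life cycle (initialize, lock, unlock, destroy) that does check every return code. The helper name guarded_increment is an assumption made for this example. Pthreads functions return the error number directly (0 means success) rather than setting errno.

```cpp
#include <pthread.h>
#include <cstdio>
#include <cstring>

// Hypothetical helper (not from the notes): adds 1 to *value while holding
// a freshly created mutex. Returns 0 on success, otherwise the error number
// reported by the first Pthreads call that failed.
int guarded_increment(int *value)
{
    pthread_mutex_t mutex;

    int rc = pthread_mutex_init(&mutex, NULL);   // NULL = default attributes
    if (rc != 0) {
        std::fprintf(stderr, "pthread_mutex_init: %s\n", std::strerror(rc));
        return rc;
    }

    rc = pthread_mutex_lock(&mutex);
    if (rc == 0) {
        *value = *value + 1;                     // protected update
        rc = pthread_mutex_unlock(&mutex);
    }
    if (rc != 0)
        std::fprintf(stderr, "lock/unlock: %s\n", std::strerror(rc));

    int rc_destroy = pthread_mutex_destroy(&mutex);  // release the mutex object
    return (rc != 0) ? rc : rc_destroy;
}
```

For instance, with `int v = 41;`, the call `guarded_increment(&v)` should return 0 and leave 42 in v. A mutex that lives for the whole program can alternatively be initialized statically with PTHREAD_MUTEX_INITIALIZER.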

Using mutex lock to synchronize access of shared variables

The jargon "synchronize something" (like synchronize access) means to organize (through timing) the different things (access operations) in such a way that there is no conflict.

A common use of a mutex is to synchronize updates of shared (global) variables

Whenever a thread wants to update a shared variable, it must enclose the operation between a "lock - unlock" pair.

Example:

    int N;                     // SHARED variable
    pthread_mutex_t N_mutex;   // Mutex controlling access to N

    void *worker(void *arg)
    {
       int i;

       for (i = 0; i < 10000; i = i + 1)
       {
          pthread_mutex_lock(&N_mutex);     // Only one thread at a time gets past here
          N = N + 1;
          pthread_mutex_unlock(&N_mutex);   // Let the next waiting thread in
       }

       return(NULL);
    }

Although many threads are executing simultaneously, the statement pthread_mutex_lock(&N_mutex); ensures that exactly ONE thread at a time is successful in locking the mutex variable N_mutex.
That thread will then be the only thread that updates the variable N, thus ensuring that N is updated sequentially (one thread after another)
Example Program: (Mutex) --- click here
Compare the behavior of this program with the one that does not use MUTEX to control access to N: click here
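A self-contained sketch of the same idea is given below. It is not the linked example program: the names counter, counter_worker and run_counter are chosen here for illustration, and the mutex is initialized statically with PTHREAD_MUTEX_INITIALIZER.

```cpp
#include <pthread.h>
#include <vector>

static long counter = 0;                                           // SHARED variable
static pthread_mutex_t counter_mutex = PTHREAD_MUTEX_INITIALIZER;  // protects counter

// Each worker adds 1 to the shared counter "increments" times.
static void *counter_worker(void *arg)
{
    long increments = *static_cast<long *>(arg);
    for (long i = 0; i < increments; i++) {
        pthread_mutex_lock(&counter_mutex);
        counter = counter + 1;              // read-modify-write done by one thread at a time
        pthread_mutex_unlock(&counter_mutex);
    }
    return NULL;
}

// Hypothetical driver: starts num_threads workers, waits for them and
// returns the final counter value (or -1 if a thread could not be created).
long run_counter(int num_threads, long increments)
{
    counter = 0;
    std::vector<pthread_t> tid(num_threads);

    int created = 0;
    for (; created < num_threads; created++)
        if (pthread_create(&tid[created], NULL, counter_worker, &increments) != 0)
            break;

    for (int i = 0; i < created; i++)       // join only the threads that exist
        pthread_join(tid[i], NULL);

    return (created == num_threads) ? counter : -1;
}
```

With the mutex in place, run_counter(4, 10000) should return 40000 on every run, because no update can be lost. If the lock and unlock calls are removed, the result may come out smaller and can differ from run to run.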

NOTE:
- Make sure you unlock a mutex after you are done accessing the shared variable(s).
  A common error in parallel programs is to forget the unlock call (especially if the call is made after many statements)... the result is deadlock: your program hangs (no progress)
- Example Program: (Not unlocking a mutex) --- click here
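One way to make a forgotten unlock less likely in C++ is to let a small object perform the unlock in its destructor. This is a general C++ idiom (RAII), not something Pthreads provides; the class MutexGuard and the functions deposit, withdraw and balance_mutex_is_free below are hypothetical names used only for this sketch.

```cpp
#include <pthread.h>

// Hypothetical helper class: locks the mutex when created and unlocks it
// automatically when the object goes out of scope.
class MutexGuard
{
public:
    explicit MutexGuard(pthread_mutex_t *m) : mutex_(m) { pthread_mutex_lock(mutex_); }
    ~MutexGuard() { pthread_mutex_unlock(mutex_); }
private:
    MutexGuard(const MutexGuard &);             // copying is disabled
    MutexGuard &operator=(const MutexGuard &);
    pthread_mutex_t *mutex_;
};

static int balance = 0;                                            // SHARED variable
static pthread_mutex_t balance_mutex = PTHREAD_MUTEX_INITIALIZER;  // protects balance

void deposit(int amount)
{
    MutexGuard guard(&balance_mutex);
    balance = balance + amount;
}                                               // unlocked here

// Two return paths, but no explicit unlock call that could be forgotten.
bool withdraw(int amount)
{
    MutexGuard guard(&balance_mutex);
    if (amount > balance)
        return false;                           // guard unlocks on this path
    balance = balance - amount;
    return true;                                // ... and on this one
}

// Returns true if the mutex is currently unlocked (for checking only).
bool balance_mutex_is_free()
{
    if (pthread_mutex_trylock(&balance_mutex) != 0)
        return false;
    pthread_mutex_unlock(&balance_mutex);
    return true;
}
```

After any call to deposit() or withdraw(), including a withdrawal that fails early, balance_mutex_is_free() should return true.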

Example parallel program: parallel numerical integration (estimate Pi)

One of the ways that we can estimate the value of Pi is to compute the definite integral:

Integrate( f(x) = 2.0 / sqrt(1 - x*x) , x = 0 to x = 1 )

Maple:
    > integrate(2.0 / sqrt(1 - x*x), x=0..1);
                                 3.141592654

We can use the rectangle rule to approximate the integral; as a result, the following program will compute an approximation for Pi

(The rectangle rule was explained in some pre-requisite course such as CS170 --- See: click here)
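Written out as a formula, the approximation divides [0, 1] into N intervals of width w and evaluates f at the midpoint of each interval. The notation below is a restatement under that assumption (midpoint rectangles, index i running from 1 to N):

```latex
\pi \;=\; \int_0^1 \frac{2}{\sqrt{1-x^2}}\,dx
    \;=\; 2\,\bigl[\arcsin x\bigr]_0^1
    \;\approx\; \sum_{i=1}^{N} w \, f\bigl(w\,(i-0.5)\bigr),
\qquad w = \frac{1}{N}, \quad f(x) = \frac{2}{\sqrt{1-x^2}}
```

Each term w * f(w(i - 0.5)) is the area of one rectangle, and a larger N gives a better approximation.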

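    #include <iostream>
    #include <cmath>
    using namespace std;

    double f(double a)
    {
       return( 2.0 / sqrt(1 - a*a) );
    }

    int main(int argc, char *argv[])
    {
       int i;
       int N;
       double sum;
       double x, w;

       N = ...;                // Will determine the accuracy of the approximation
       w = 1.0/N;              // Width of one rectangle
       sum = 0.0;

       for (i = 1; i <= N; i = i + 1)
       {
          x = w*(i - 0.5);     // Midpoint of rectangle i
          sum = sum + w*f(x);  // Add the area of rectangle i
       }

       cout << sum << endl;
    }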

NOTE: although I am integrating f(x) = 2.0 / sqrt(1 - x*x) for x in [0..1], the program can easily be changed to integrate any function over any interval....
Example Program: (Sequential program for Pi) --- click here
Compile with:
Run the program with:

To obtain a parallel program, we must look for places in the program where things can be performed concurrently
The best places to look are loops
Often, a small amount of (shared) information is updated within every iteration of the loop.
The program can be sped up when non-conflicting operations are performed concurrently (in parallel), while conflicting operations on the shared information (variable) are performed sequentially

Example:

    double f(double a)
    {
       return( 2.0 / sqrt(1 - a*a) );
    }

    int main(int argc, char *argv[])
    {
       int i;
       int N;
       double sum;
       double x, w;

       N = ...;                // Will determine the accuracy of approximation
       w = 1.0/N;
       sum = 0.0;

       for (i = 0; i < N; i = i + 1)
       {
          x = w*(i + 0.5);     // We can make x non-shared..
          sum = sum + w*f(x);  // sum is SHARED !!!
       }

       cout << sum;
    }

We can perform the summation mostly in parallel, except adding the value to sum - which must be done serially

There are different ways to divide up the work... for example, using 2 threads, we can divide the summation up as follows:

Thread 1 computes the "first half" of the partial sum
w*(f(0.5w) + f(1.5w) + f(2.5w) + ... + f(0.5-0.5w) )
and thread 2 computes the "second half" of the partial sum
w*(f(0.5+0.5w) + f(0.5+1.5w) + f(0.5+2.5w) + ... + f(1-0.5w) )

Alternatively, thread 1 computes the "even stepped" partial sum

w*(f(0.5w) + f((2+0.5)w) + f((4+0.5)w) + ... )

and thread 2 computes the "odd stepped" partial sum

w*(f((1+0.5)w) + f((3+0.5)w) + f((5+0.5)w) + ... )

Pictorially:

    values added by thread 1
     |   |   |   |   |   |   |   |   |   |   |   |   |   |
     V   V   V   V   V   V   V   V   V   V   V   V   V   V
    |-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
       ^   ^   ^   ^   ^   ^   ^   ^   ^   ^   ^   ^   ^   ^
       |   |   |   |   |   |   |   |   |   |   |   |   |   |
    values added by thread 2

It turns out that the "even stepped" and "odd stepped" approach to partitioning is easier to program in many instances !!!

NOTE: We don't access any array, so the paging problem is not applicable !!!
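The two ways of dividing the loop indices among threads can be sketched as follows. The function names block_indices and strided_indices are hypothetical; thread numbers t run from 0 to num_threads - 1, so for 2 threads the "even stepped" thread is t = 0 and the "odd stepped" thread is t = 1.

```cpp
#include <vector>

// "First half / second half" style: thread t gets one contiguous block of indices.
std::vector<int> block_indices(int N, int num_threads, int t)
{
    int begin = (int)((long)t * N / num_threads);        // first index of block t
    int end   = (int)((long)(t + 1) * N / num_threads);  // one past the last index
    std::vector<int> result;
    for (int i = begin; i < end; i++)
        result.push_back(i);
    return result;
}

// "Even stepped / odd stepped" style: thread t starts at index t and
// skips num_threads indices each time.
std::vector<int> strided_indices(int N, int num_threads, int t)
{
    std::vector<int> result;
    for (int i = t; i < N; i = i + num_threads)
        result.push_back(i);
    return result;
}
```

For N = 10 and 2 threads, strided_indices gives thread 0 the indices 0, 2, 4, 6, 8 and thread 1 the indices 1, 3, 5, 7, 9, while block_indices gives thread 0 the indices 0..4 and thread 1 the indices 5..9. The strided loop needs only the start index and the skip distance, which is part of why it tends to be simpler to program.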

First parallelization attempt:

    /*** Shared variables, but not updated.... ***/
    int    N;                  // # intervals
    double w;                  // width of one interval
    int    num_threads;        // # threads

    /*** Shared variables, updated !!! ***/
    double sum;
    pthread_mutex_t sum_mutex; // Mutex to control access to sum

    void *PIworker(void *arg); // Worker thread function (given below)

    int main(int argc, char *argv[])
    {
       int       Start[100];   // Start index values for each thread
       pthread_t tid[100];     // Used for pthread_join()
       int i;

       N = ...;                // Read N in from keyboard...
       w = 1.0/N;              // "Broadcast" w
       num_threads = ...;      // Skip distance for each thread (at most 100)
       sum = 0.0;              // Initialize shared variable
       pthread_mutex_init(&sum_mutex, NULL);   // Mutex starts out unlocked

       /**** Make worker threads... ****/
       for (i = 0; i < num_threads; i = i + 1)
       {
          Start[i] = i;        // Start index for thread i
          if ( pthread_create(&tid[i], NULL, PIworker, &Start[i]) )
          {
             cout << "Cannot create thread" << endl;
             exit(1);
          }
       }

       /**** Wait for worker threads to finish... ****/
       for (i = 0; i < num_threads; i = i + 1)
          pthread_join(tid[i], NULL);

       cout << sum;
    }

Worker thread:
    void *PIworker(void *arg)
    {
       int i, myStart;
       double x;

       /*** Get the parameter (which is my starting index) ***/
       myStart = * (int *) arg;

       /*** Compute sum, skipping every "num_threads" items ***/
       for (i = myStart; i < N; i = i + num_threads)
       {
          x = w * ((double) i + 0.5);      // next x

          pthread_mutex_lock(&sum_mutex);
          sum = sum + w*f(x);              // Add to sum
          pthread_mutex_unlock(&sum_mutex);
       }

       return(NULL);     /* Thread exits (dies) */
    }

Example Program: (Parallel Pi - version 1) --- click here
- Compile the program using:
- Try running the program on compute.mathcs.emory.edu using:
- Then compare the performance numbers with the non-parallel version: click here
  Are you surprised ???

Synchronization bottleneck
- Shared variables are notorious bottlenecks in parallel programs
- Parallel programs are not always faster than sequential programs: parallel programs can be slower due to synchronization overhead (the lock and unlock operations take time to execute and also force threads to stop running)
- The key to writing fast parallel programs is to reduce synchronization to the absolute minimum

Improved parallelization:

Worker thread:
    void *PIworker(void *arg)
    {
       int i, myStart;
       double x;
       double tmp = 0.0;    // local non-shared variable (must start at 0)

       /*** Get the parameter (which is my starting index) ***/
       myStart = * (int *) arg;

       /*** Compute sum, skipping every "num_threads" items ***/
       for (i = myStart; i < N; i = i + num_threads)
       {
          x = w * ((double) i + 0.5);      // next x
          tmp = tmp + w*f(x);              // No synchr. needed
       }

       pthread_mutex_lock(&sum_mutex);
       sum = sum + tmp;                    // Synch only ONCE !!!
       pthread_mutex_unlock(&sum_mutex);

       return(NULL);     /* Thread exits (dies) */
    }

Example Program: (Parallel Pi - version 2) --- click here
- Compile the program using:
- Try running the program on compute.mathcs.emory.edu using:
- NOW compare the performance numbers with the non-parallel version: click here
  - Try: time compute_pi 50000000
  - And: time thread_compute_pi_mt2 50000000 8
  What a difference it can make where you put the synchronization points in a parallel program....
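Putting the pieces together, the improved version can be sketched as one self-contained unit. This is not the linked example program: the driver function parallel_pi and the use of std::vector and PTHREAD_MUTEX_INITIALIZER are choices made here for illustration.

```cpp
#include <pthread.h>
#include <cmath>
#include <vector>

/*** Shared variables, but not updated while the workers run ***/
static int    N;             // # intervals
static double w;             // width of one interval
static int    num_threads;   // # threads (= skip distance)

/*** Shared variable, updated ***/
static double sum;
static pthread_mutex_t sum_mutex = PTHREAD_MUTEX_INITIALIZER;

static double f(double a) { return 2.0 / std::sqrt(1 - a*a); }

static void *PIworker(void *arg)
{
    int myStart = *static_cast<int *>(arg);
    double tmp = 0.0;                          // private partial sum

    for (int i = myStart; i < N; i = i + num_threads)
        tmp = tmp + w * f(w * (i + 0.5));      // no synchronization needed

    pthread_mutex_lock(&sum_mutex);
    sum = sum + tmp;                           // one synchronized update per thread
    pthread_mutex_unlock(&sum_mutex);
    return NULL;
}

// Hypothetical driver: returns the approximation of Pi, or -1.0 if a
// thread could not be created.
double parallel_pi(int intervals, int threads)
{
    N = intervals;
    w = 1.0 / N;
    num_threads = threads;
    sum = 0.0;

    std::vector<int>       start(threads);     // start index for each thread
    std::vector<pthread_t> tid(threads);

    int created = 0;
    for (; created < threads; created++) {
        start[created] = created;
        if (pthread_create(&tid[created], NULL, PIworker, &start[created]) != 0)
            break;
    }
    for (int i = 0; i < created; i++)          // wait for the workers that exist
        pthread_join(tid[i], NULL);

    return (created == threads) ? sum : -1.0;
}
```

A call such as parallel_pi(1000000, 4) should return a value close to Pi; how close depends on N. Each thread locks the mutex exactly once, so the number of synchronized updates is num_threads rather than N.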

READ/WRITE LOCKS

Read/Write Locks: Theory
- A read/write lock variable has 3 values (states)
  - Unlocked
  - Read Locked
  - Write Locked
- A read/write lock is a synchronization object with 3 operations:
  - Read Lock
    - If the read/write lock is in the unlocked state, the read lock will complete (and the thread continues with the next instruction following the read lock command).
      The value (state) of the read/write lock is changed to read locked
    - If the read/write lock is in the read locked state, the read lock command also completes (and the thread continues with the next instruction following the read lock command).
      The value (state) of the read/write lock remains read locked, but a count is increased (so we know how many read locks are currently held)
    - If the read/write lock is in the write locked state, the thread that executes the read lock command will block (it stops execution) until the value (state) of the read/write lock becomes unlocked
      (When the state of the read/write lock does become unlocked, the read lock command will complete and change the state of the read/write lock to read locked)
  - Write Lock
    - If the read/write lock is in the unlocked state, the write lock will complete (and the thread continues with the next instruction following the write lock command).
      The value (state) of the read/write lock is changed to write locked
    - If the read/write lock is in the read locked state, the thread that executes the write lock command will block (it stops execution) until the value (state) of the read/write lock becomes unlocked
      (When the state of the read/write lock does become unlocked, the write lock command will complete and change the state of the read/write lock to write locked)
    - If the read/write lock is in the write locked state, the thread that executes the write lock command will block (it stops execution) until the value (state) of the read/write lock becomes unlocked
      (When the state of the read/write lock does become unlocked, the write lock command will complete and change the state of the read/write lock to write locked)
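  - Unlock
    - If the read/write lock is in the read locked state, the count of read locks is decreased; when the count reaches 0, the state of the read/write lock is changed to unlocked
    - If the read/write lock is in the write locked state, the state of the read/write lock is changed to unlocked
- Difference between mutex and read/write locks:
  - Read/write locks are more discriminating than mutex locks: a read/write lock allows 2 (or more) read operations on a shared variable to proceed concurrently, while a mutex lock serializes all accesses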
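The three states can be checked with a small experiment: the main thread puts the lock in a given state, and a second "probe" thread asks, without blocking, whether it could get a read lock or a write lock. The names Probe, probe_thread, other_thread_can_lock and rwlock_states_demo are made up for this sketch; pthread_rwlock_tryrdlock() and pthread_rwlock_trywrlock() are the non-blocking forms of the read lock and write lock operations.

```cpp
#include <pthread.h>

// Hypothetical helper type: what the probe thread should try and what it got.
struct Probe {
    pthread_rwlock_t *lock;
    bool want_write;     // true: try a write lock, false: try a read lock
    int  result;         // return code of the try-lock call (0 = success)
};

static void *probe_thread(void *arg)
{
    Probe *p = static_cast<Probe *>(arg);
    p->result = p->want_write ? pthread_rwlock_trywrlock(p->lock)
                              : pthread_rwlock_tryrdlock(p->lock);
    if (p->result == 0)
        pthread_rwlock_unlock(p->lock);   // undo the successful probe
    return NULL;
}

// Returns true if ANOTHER thread could acquire the lock in the given mode now.
bool other_thread_can_lock(pthread_rwlock_t *lock, bool want_write)
{
    Probe p = { lock, want_write, -1 };
    pthread_t tid;
    if (pthread_create(&tid, NULL, probe_thread, &p) != 0)
        return false;
    pthread_join(tid, NULL);
    return p.result == 0;
}

// Walks the lock through unlocked, read locked and write locked.
bool rwlock_states_demo()
{
    pthread_rwlock_t lock;
    if (pthread_rwlock_init(&lock, NULL) != 0)
        return false;
    bool ok = true;

    ok &= other_thread_can_lock(&lock, false);    // unlocked: read lock succeeds
    ok &= other_thread_can_lock(&lock, true);     // unlocked: write lock succeeds

    pthread_rwlock_rdlock(&lock);                 // state: read locked
    ok &= other_thread_can_lock(&lock, false);    // a 2nd reader is allowed in
    ok &= !other_thread_can_lock(&lock, true);    // a writer would block
    pthread_rwlock_unlock(&lock);

    pthread_rwlock_wrlock(&lock);                 // state: write locked
    ok &= !other_thread_can_lock(&lock, false);   // a reader would block
    ok &= !other_thread_can_lock(&lock, true);    // a writer would block
    pthread_rwlock_unlock(&lock);

    pthread_rwlock_destroy(&lock);
    return ok;
}
```

rwlock_states_demo() should return true. The probes run in a separate thread on purpose, so that the thread holding the lock never tries to lock it a second time.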
Read/Write Locks in PThreads
- Defining a read/write lock variable in Pthreads:
      pthread_rwlock_t rwlock;
- Initializing a read/write lock variable:
  After defining the read/write lock variable, you must initialize it using the following function:
      int pthread_rwlock_init(pthread_rwlock_t *rwlock, const pthread_rwlockattr_t *attr);
  - rwlock: is the read/write lock that you want to initialize (pass the address !)
  - attr: is the set of initial properties of the read/write lock.
  The most common read/write lock is one where the lock is initially in the unlocked state.
  This kind of read/write lock is created using the (default) attribute NULL:
      pthread_rwlock_init(&rwlock, NULL);
- Read locking a read/write lock:
      pthread_rwlock_rdlock(&rwlock);
  - NOTE: if a thread already holds a write lock on a read/write lock, and performs a pthread_rwlock_rdlock() on that lock, then the outcome is undefined (in other words: do NOT try !)
  - NOTE: A thread may hold multiple concurrent read locks on a read/write lock (that is, successfully call the pthread_rwlock_rdlock() function n times). If so, the thread must perform matching unlocks (that is, it must call the pthread_rwlock_unlock() function n times).
- Write locking a read/write lock:
      pthread_rwlock_wrlock(&rwlock);
  - NOTE: if a thread already holds a read lock or write lock on a read/write lock, and performs a pthread_rwlock_wrlock() on that lock, then the outcome is undefined (in other words: do NOT try !)
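A typical use can be sketched as follows: one writer updates two shared variables together under a write lock, and several readers inspect them under read locks. All names here (a, b, ab_lock, writer, reader, run_readers_and_writer) are hypothetical, and the lock is initialized statically with PTHREAD_RWLOCK_INITIALIZER.

```cpp
#include <pthread.h>

// Hypothetical shared data: the writer always keeps a == b.
static int a = 0, b = 0;
static pthread_rwlock_t ab_lock = PTHREAD_RWLOCK_INITIALIZER;
static const int ROUNDS = 20000;

static void *writer(void *)
{
    for (int i = 1; i <= ROUNDS; i++) {
        pthread_rwlock_wrlock(&ab_lock);   // exclusive: no readers, no other writers
        a = i;
        b = i;
        pthread_rwlock_unlock(&ab_lock);
    }
    return NULL;
}

static void *reader(void *arg)
{
    long *mismatches = static_cast<long *>(arg);
    for (int i = 0; i < ROUNDS; i++) {
        pthread_rwlock_rdlock(&ab_lock);   // shared: other readers may be inside too
        if (a != b)
            (*mismatches)++;               // would mean we saw a half-finished update
        pthread_rwlock_unlock(&ab_lock);
    }
    return NULL;
}

// Runs 1 writer and 3 readers; returns how often a reader saw a != b
// (or -1 if a thread could not be created).
long run_readers_and_writer()
{
    const int NUM_READERS = 3;
    pthread_t wtid, rtid[NUM_READERS];
    long mismatches[NUM_READERS] = { 0, 0, 0 };

    a = b = 0;
    if (pthread_create(&wtid, NULL, writer, NULL) != 0)
        return -1;

    int created = 0;
    for (; created < NUM_READERS; created++)
        if (pthread_create(&rtid[created], NULL, reader, &mismatches[created]) != 0)
            break;

    pthread_join(wtid, NULL);
    long total = 0;
    for (int i = 0; i < created; i++) {
        pthread_join(rtid[i], NULL);
        total = total + mismatches[i];
    }
    return (created == NUM_READERS) ? total : -1;
}
```

run_readers_and_writer() should return 0: readers can overlap with each other, but never with the writer. Both kinds of lock are released with the same call, pthread_rwlock_unlock().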