MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int rootID, MPI_Comm comm)
Pseudo code that describes what happens in an MPI_Bcast():

    if ( myID == rootID )
    {
        for ( every ID i != rootID in the communication set "comm" )
        {
            MPI_Send( buffer, count, datatype, i, TAG, comm );
        }
    }
    else
    {
        MPI_Recv( buffer, count, datatype, rootID, TAG, comm, &status );
    }
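In other words, every process (the root and the non-roots alike) makes the same MPI_Bcast() call, and when the call returns each process's buffer holds the root's data. A minimal sketch (the array name and its contents are just for illustration):

    #include <mpi.h>
    #include <iostream>
    using namespace std;

    int main(int argc, char **argv)
    {
        double table[4];
        int i, myid;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);

        if ( myid == 0 )                      // Only the root fills the buffer...
            for ( i = 0; i < 4; i++ )
                table[i] = 1.0 / (i + 1);

        // ...but EVERY process makes the identical MPI_Bcast() call:
        MPI_Bcast(table, 4, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        cout << "Process " << myid << ": table[3] = " << table[3] << endl;

        MPI_Finalize();
    }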
Example 1: source is node 0

Example 2: source is node 1
#include <mpi.h>
#include <iostream>
#include <cstdio>
#include <cstdlib>
using namespace std;

int main(int argc, char **argv)
{
    char buff[128];
    int secret_num;
    int numprocs;
    int myid;
    int i;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    // ------------------------------------------
    // Node 0 obtains the secret number
    // ------------------------------------------
    if ( myid == 0 )
    {
        secret_num = atoi(argv[1]);
    }

    // ------------------------------------------
    // Node 0 shares the secret with everybody
    // ------------------------------------------
    MPI_Bcast(&secret_num, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if ( myid == 0 )
    {
        for ( i = 1; i < numprocs; i++ )
        {
            MPI_Recv(buff, 128, MPI_CHAR, i, 0, MPI_COMM_WORLD, &stat);
            cout << buff << endl;
        }
    }
    else
    {
        sprintf(buff, "Processor %d knows the secret code: %d",
                myid, secret_num);
        MPI_Send(buff, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
}
MPI_Scatter(void* sendbuf,         // Distribute sendbuf evenly to recvbuf
            int sendcount,         // # items sent to EACH processor
            MPI_Datatype sendtype,
            void* recvbuf,
            int recvcount,
            MPI_Datatype recvtype,
            int rootID,            // Sending processor !
            MPI_Comm comm)
Normally, the send count/type given at the root should match the receive count/type used by every process.
However: these rules are not strictly enforced.
(Don't blame MPI for causing "funny errors" if you decide to violate these rules :-))
#include <mpi.h>
#include <iostream>
using namespace std;

int main(int argc, char **argv)
{
    int buff[100];
    int recvbuff[2];
    int numprocs;
    int myid;
    int i, k;
    int mysum;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    if ( myid == 0 )
    {
        cout << "We have " << numprocs << " processors" << endl;

        // -----------------------------------------------
        // Node 0 prepares 2 numbers for each processor:
        // [1][2] [3][4] [5][6] .... etc
        // -----------------------------------------------
        k = 1;
        for ( i = 0; i < 2*numprocs; i += 2 )
        {
            buff[i]   = k++;
            buff[i+1] = k++;
        }
    }

    // ------------------------------------------
    // Node 0 scatters the array to the processors:
    // ------------------------------------------
    // Note: the root sends 2 ints to each process, and each process
    //       (including the root) receives 2 ints into recvbuff.
    MPI_Scatter(buff, 2, MPI_INT, recvbuff, 2, MPI_INT, 0, MPI_COMM_WORLD);

    if ( myid == 0 )
    {
        // Processor 0
        mysum = recvbuff[0] + recvbuff[1];
        cout << "Processor " << myid << ": sum = " << mysum << endl;

        for ( i = 1; i < numprocs; i++ )
        {
            MPI_Recv(&mysum, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &stat);
            cout << "Processor " << i << ": sum = " << mysum << endl;
        }
    }
    else
    {
        // Other processors
        mysum = recvbuff[0] + recvbuff[1];
        MPI_Send(&mysum, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
}
MPI_Gather() does the reverse of MPI_Scatter(): every process sends the contents of its sendbuf to the rootID process, which collects them (in rank order) into recvbuf.
MPI_Gather(void* sendbuf,          // Data each process sends to the root
           int sendcount,          // # items sent by EACH processor
           MPI_Datatype sendtype,
           void* recvbuf,          // Receive buffer (significant at the rootID process only)
           int recvcount,          // # items received FROM each processor
           MPI_Datatype recvtype,
           int rootID,             // Receiving (gathering) processor
           MPI_Comm comm)
NOTE: normally sendcount = recvcount and sendtype = recvtype, and the number of items received in recvbuf of the rootID process will be equal to:

    numprocs * recvcount

(i.e., recvcount items from each of the numprocs processes in comm; e.g., with 4 processes and recvcount = 2, recvbuf must have room for 8 items).
Again: these rules are not strictly enforced.
(And again, don't blame MPI for causing "funny errors" if you decide to violate these rules....)
Example 1: the "gatherer" is node 0

Example 2: the "gatherer" is node 1
#include <mpi.h>
#include <iostream>
using namespace std;

int main(int argc, char **argv)
{
    int buff[100];
    int recvbuff[2];
    int numprocs;
    int myid;
    int i, k;
    int mysum;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    if ( myid == 0 )
    {
        cout << "We have " << numprocs << " processors" << endl;

        // -----------------------------------------------
        // Node 0 prepares 2 numbers for each processor:
        // [1][2] [3][4] [5][6] .... etc
        // -----------------------------------------------
        k = 1;
        for ( i = 0; i < 2*numprocs; i += 2 )
        {
            buff[i]   = k++;
            buff[i+1] = k++;
        }
    }

    // ------------------------------------------
    // Node 0 scatters the array to the processors:
    // ------------------------------------------
    MPI_Scatter(buff, 2, MPI_INT, recvbuff, 2, MPI_INT, 0, MPI_COMM_WORLD);

    mysum = recvbuff[0] + recvbuff[1];      // Everyone calculates a sum

    // ------------------------------------------
    // Node 0 collects the results in "buff":
    // ------------------------------------------
    MPI_Gather(&mysum, 1, MPI_INT, buff, 1, MPI_INT, 0, MPI_COMM_WORLD);

    // ------------------------------------------
    // Node 0 prints the result
    // ------------------------------------------
    if ( myid == 0 )
    {
        for ( i = 0; i < numprocs; i++ )
        {
            cout << "Processor " << i << ": sum = " << buff[i] << endl;
        }
    }

    MPI_Finalize();
}
MPI_Reduce(void* sendbuf,          // Value(s) contributed by this process
           void* recvbuf,          // Combined result (significant at the rootID process only)
           int count,              // # items contributed by EACH processor
           MPI_Datatype datatype,
           MPI_Op op,              // The reduction operation (see the table below)
           int rootID,             // Processor that receives the combined result
           MPI_Comm comm)

All processes must call MPI_Reduce() with the same count, datatype, op, rootID and comm.
NOTE: each item received will be immediately incorporated (combined using op) into the variable recvbuf !!!
Again: these rules are not strictly enforced.
(And again, don't blame MPI for causing "funny errors" if you decide to violate these rules....)
The effect is the same as computing, at the rootID process:

    recvbuf = sendbuf(0) op sendbuf(1) op ... op sendbuf(numprocs-1)
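For a concrete picture, here is a rough hand-written equivalent of a one-int MPI_SUM reduction to rootID 0 (a sketch only: the contributed values are made up, and a real MPI_Reduce() is normally implemented far more efficiently, e.g. with a reduction tree):

    #include <mpi.h>
    #include <iostream>
    using namespace std;

    int main(int argc, char **argv)
    {
        int numprocs, myid, i;
        int myvalue, result, tmp;
        MPI_Status stat;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);

        myvalue = myid + 1;                   // Each process contributes one value

        // Hand-written equivalent of:
        //   MPI_Reduce(&myvalue, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if ( myid == 0 )
        {
            result = myvalue;                 // The root starts with its own value
            for ( i = 1; i < numprocs; i++ )
            {
                MPI_Recv(&tmp, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &stat);
                result = result + tmp;        // "op" (here: +) is applied as each item arrives
            }
            cout << "Sum = " << result << endl;
        }
        else
        {
            MPI_Send(&myvalue, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
    }

The predefined reduction operations that can be passed as op are: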
| MPI Reduction Operation | Effect of the Reduction Operation |
|---|---|
| MPI_MAX | Finds the maximum value |
| MPI_MIN | Finds the minimum value |
| MPI_SUM | Computes the sum of all values |
| MPI_PROD | Computes the product of all values |
| MPI_LAND | Computes the "logical AND" of all values (0 = false, non-zero = true) |
| MPI_BAND | Computes the "bitwise AND" of all values |
| MPI_LOR | Computes the "logical OR" of all values (0 = false, non-zero = true) |
| MPI_BOR | Computes the "bitwise OR" of all values |
| MPI_LXOR | Computes the "logical XOR" of all values (0 = false, non-zero = true) |
| MPI_BXOR | Computes the "bitwise XOR" of all values |
| MPI_MAXLOC | Finds the maximum value and the processor ID (rank) that has it (you need to pass a structure with these 2 elements: (double value, int rank); see the sketch after this table) |
| MPI_MINLOC | Finds the minimum value and the processor ID (rank) that has it (you need to pass a structure with these 2 elements: (double value, int rank)) |
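A minimal sketch of how MPI_MAXLOC can be used (the per-process value below is made up just for illustration; the pair datatype MPI_DOUBLE_INT matches a structure holding a double followed by an int):

    #include <mpi.h>
    #include <iostream>
    using namespace std;

    int main(int argc, char **argv)
    {
        int myid, numprocs;

        // Pair layout expected by MPI_DOUBLE_INT: the value first, then an int (the rank)
        struct { double value; int rank; } mine, result;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);

        mine.value = (double)((7 * myid + 3) % numprocs);   // Some per-process value (made up)
        mine.rank  = myid;                                  // Tag it with my own rank

        // MPI_MAXLOC finds the largest "value" and carries along the rank that owns it
        MPI_Reduce(&mine, &result, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);

        if ( myid == 0 )
            cout << "Max value " << result.value
                 << " found on processor " << result.rank << endl;

        MPI_Finalize();
    }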
// Example: estimating Pi; each process computes a partial sum and
// MPI_Reduce() with MPI_SUM combines the partial sums at node 0.
#include <mpi.h>
#include <iostream>
#include <cstdlib>
#include <cmath>
using namespace std;

int num_procs;                                     // Number of processes

// f(x) = 2/sqrt(1 - x^2); the integral of f over [0,1) equals Pi
double f(double a)
{
    return( 2.0 / sqrt(1 - a*a) );
}

/* =======================
   MAIN
   ======================= */
int main(int argc, char *argv[])
{
    int N;
    double w, x;
    int i, myid;
    double mypi, final_pi;

    MPI_Init(&argc, &argv);                        // Initialize
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);     // Get # processors
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);          // Get my rank (id)

    if ( myid == 0 )
        N = atoi(argv[1]);                         // Number of intervals

    MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);

    w = 1.0/(double) N;

    /* ******************************************************************* */
    mypi = 0.0;
    for (i = myid; i < N; i = i + num_procs)       // Cyclic distribution of the intervals
    {
        x = w*(i + 0.5);
        mypi = mypi + w*f(x);
    }
    /* ******************************************************************* */

    MPI_Reduce(&mypi, &final_pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if ( myid == 0 )
    {
        cout << "Pi = " << final_pi << endl << endl;
    }

    MPI_Finalize();
}
Advanced material below: skipped !!! (Read on by yourself if you're interested)
MPI_Op_create(MPI_User_function *function,   // The user-defined combining function
              int commute,                   // 1 (true) if the operation is commutative
              MPI_Op *op)                    // Returns a handle for the new operation
The user-defined function must have the following prototype:
void function_name( void *in,                // Input vector
                    void *inout,             // Second input vector; also receives the combined result
                    int *len,                // Number of elements in the vectors
                    MPI_Datatype *datatype); // Datatype of the elements
// Example: the same Pi computation, but the partial sums are combined
// with a user-defined reduction operation created by MPI_Op_create().
#include <mpi.h>
#include <iostream>
#include <cstdlib>
#include <cmath>
using namespace std;

int num_procs;                                     // Number of processes

// User-defined reduction function: element-wise addition.
// MPI calls it as myAdd(in, inout, len, datatype) and expects
// the combined result to be left in "inout" (here: b).
void myAdd( void *a, void *b, int *len, MPI_Datatype *datatype)
{
    int i;

    if ( *datatype == MPI_INT )
    {
        int *x = (int *)a;                         // Turn the (void *) into an (int *)
        int *y = (int *)b;                         // Turn the (void *) into an (int *)
        for (i = 0; i < *len; i++)
        {
            *y = *x + *y;
            x++;
            y++;
        }
    }
    else if ( *datatype == MPI_DOUBLE )
    {
        double *x = (double *)a;                   // Turn the (void *) into a (double *)
        double *y = (double *)b;                   // Turn the (void *) into a (double *)
        for (i = 0; i < *len; i++)
        {
            *y = *x + *y;
            x++;
            y++;
        }
    }
}

// f(x) = 2/sqrt(1 - x^2); the integral of f over [0,1) equals Pi
double f(double a)
{
    return( 2.0 / sqrt(1 - a*a) );
}

/* =======================
   MAIN
   ======================= */
int main(int argc, char *argv[])
{
    int N;
    double w, x;
    int i, myid;
    double mypi, final_pi;
    MPI_Op myOp;

    MPI_Init(&argc, &argv);                        // Initialize
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);     // Get # processors
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);          // Get my rank (id)

    if ( myid == 0 )
        N = atoi(argv[1]);                         // Number of intervals

    MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);

    w = 1.0/(double) N;

    /* ******************************************************************* */
    mypi = 0.0;
    for (i = myid; i < N; i = i + num_procs)
    {
        x = w*(i + 0.5);
        mypi = mypi + w*f(x);
    }
    /* ******************************************************************* */

    MPI_Op_create( myAdd, 1, &myOp );              // "1" = the operation is commutative
    MPI_Reduce(&mypi, &final_pi, 1, MPI_DOUBLE, myOp, 0, MPI_COMM_WORLD);

    if ( myid == 0 )
    {
        cout << "Pi = " << final_pi << endl << endl;
    }

    MPI_Finalize();
}
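One detail not shown in the example above: a user-defined operation can be released with MPI_Op_free() once it is no longer needed (myOp, myAdd, mypi and final_pi are the variables from the listing above):

    MPI_Op_create( myAdd, 1, &myOp );                                        // Create the operation
    MPI_Reduce(&mypi, &final_pi, 1, MPI_DOUBLE, myOp, 0, MPI_COMM_WORLD);    // Use it
    MPI_Op_free( &myOp );                                                    // Release it (myOp becomes MPI_OP_NULL)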
MPI_Barrier( MPI_Comm comm )
Effect: MPI_Barrier() blocks the calling process until every process in comm has entered the call; only then do all of them continue. It is used to synchronize the processes at a common point in the program.
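MPI_Barrier() has no data arguments; it is purely a synchronization point. A minimal sketch (the two "phases" are just for illustration):

    #include <mpi.h>
    #include <iostream>
    using namespace std;

    int main(int argc, char **argv)
    {
        int myid;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);

        cout << "Process " << myid << ": phase 1 done" << endl;

        // Nobody continues past this line until ALL processes have reached it.
        // (Note: the barrier synchronizes the processes, not the order in which
        //  their output happens to appear on the screen.)
        MPI_Barrier(MPI_COMM_WORLD);

        cout << "Process " << myid << ": starting phase 2" << endl;

        MPI_Finalize();
    }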