Achieving the highest performance in parallel processing

Factors that affect program performance in parallel processing:

  • More simultaneous executions will generally improve the performance of a parallel program

    Therefore:

      • Parallel programs should use as many threads as possible

  • However:

      • More simultaneous threads will require more (thread) context switching operations

        The context switching can become a bottleneck that reduces the program's performance !!!

  • Fortunately:

    • On a GPU, the thread state (context) is stored in registers, so thread context switching is fast

Performance experiment with the matrix multiplication CUDA program

We run the CUDA matrix multiplication program (mult-matrix2 1000 T) with different values of T = #threads/thread block:

T = 1 thread/thread block
Elapsed time = 1896357 micro secs

T = 2 threads/thread block
Elapsed time =  984452 micro secs

T = 4 threads/thread block
Elapsed time =  521900 micro secs

T = 8 threads/thread block
Elapsed time =  303541 micro secs

T = 16 threads/thread block
Elapsed time =  140619 micro secs

T = 32 threads/thread block
Elapsed time =   74004 micro secs

T = 64 threads/thread block
Elapsed time =   41799 micro secs

T = 128 threads/thread block
Elapsed time =   42553 micro secs // time increases !
  

More threads do not always mean better performance !!!

How to find the best # threads/thread block for your CUDA program ??
 

Finding the "suitable" (= good performance) # threads/thread block for a CUDA application

  • It used to be a trial-and-error exercise to find a good setting for the # threads/thread block of a CUDA application

  • Nowadays, a CUDA program can use the following CUDA library function:

        cudaOccupancyMaxPotentialBlockSize( )    
      

    to compute a reasonably efficient execution configuration for a kernel

How to use cudaOccupancyMaxPotentialBlockSize( )

Documentation on cudaOccupancyMaxPotentialBlockSize( ) :

cudaOccupancyMaxPotentialBlockSize ( int* minGridSize, int* blockSize, 
          T func, size_t dynamicSMemSize = 0, int  blockSizeLimit = 0 ) 

    Returns grid and block size that achieves 
    maximum potential occupancy for a device function func.

Parameters

    minGridSize
        - Returned minimum grid size needed to achieve 
          the best potential occupancy 
    blockSize
        - Returned block size 
    func
        - Device function symbol 
 
Function description:

    Returns in *minGridSize and *blockSize a suggested grid size and
    block size that achieves the best potential occupancy  

How to use cudaOccupancyMaxPotentialBlockSize( )

  int minB;  // Min grid size
           
  int T;     // block size = # threads in a thread block

  cudaOccupancyMaxPotentialBlockSize( &minB, &T, matrixMult, 0, 0);

  printf("Computed: minB = %d, T = %d\n", minB, T);

  int B = ceil((float) N*N / T ); // Round up to an integral grid size

  // Run the kernel on the GPU using B blocks, T threads per block
  matrixMult<<<B, T>>>(a, b, c, N);

  // Wait for the GPU to finish before accessing the results on the host
  cudaDeviceSynchronize();
 

DEMO: /home/cs355001/demo/CUDA/4-mult-matrix/mult-matrix-auto.cu