Achieving the highest performance in parallel processing

Factors that affect program performance in parallel processing:

  • More simultaneous executions will generally improve the performance of a parallel program

    Therefore:

      • Parallel programs should use as many threads as possible

  • However:

      • More simultaneous threads will require more (thread) context switching operations

        The context switching can become a bottleneck that reduces the program's performance !!!

  • Fortunately:

    • On a GPU, the thread state (context) is stored in registers, so thread context switching is fast

Performance experiment with the matrix multiplication CUDA program

We run the CUDA matrix multiplication program (mult-matrix2 1000 T) with different values of T = #threads/thread block:

T = 1 thread/thread block
Elapsed time = 1896357 micro secs

T = 2 threads/thread block
Elapsed time =  984452 micro secs

T = 4 threads/thread block
Elapsed time =  521900 micro secs

T = 8 threads/thread block
Elapsed time =  303541 micro secs

T = 16 threads/thread block
Elapsed time =  140619 micro secs

T = 32 threads/thread block
Elapsed time =   74004 micro secs

T = 64 threads/thread block
Elapsed time =   41799 micro secs

T = 128 threads/thread block
Elapsed time =   42553 micro secs // time increases !
  

More threads do not always mean better performance !!!

How to find the best # threads/thread block for your CUDA program ??
 

Finding the "suitable" (= good performance) # threads/thread block for a CUDA application

  • It used to be a trial-and-error exercise to find a good setting for the # threads/thread block of a CUDA application

  • Nowadays, a CUDA program can use the following CUDA library function:

        cudaOccupancyMaxPotentialBlockSize( )    
      

    to compute a reasonably efficient execution configuration for a kernel

How to use cudaOccupancyMaxPotentialBlockSize( )

Documentation on cudaOccupancyMaxPotentialBlockSize( ) :

cudaOccupancyMaxPotentialBlockSize ( int* minGridSize, int* blockSize, 
          T func, size_t dynamicSMemSize = 0, int  blockSizeLimit = 0 ) 

    Returns grid and block size that achieves 
    maximum potential occupancy for a device function func.

Parameters

    minGridSize
        - Returned minimum grid size needed to achieve 
          the best potential occupancy 
    blockSize
        - Returned block size 
    func
        - Device function symbol 
 
Function description:

    Returns in *minGridSize and *blockSize a suggested grid size and
    block size that achieves the best potential occupancy  

How to use cudaOccupancyMaxPotentialBlockSize( )

  int minB;  // Min grid size
           
  int T;     // block size = # threads in a thread block

  cudaOccupancyMaxPotentialBlockSize( &minB, &T, matrixMult, 0, 0);

  printf("Computed: minB = %d, T = %d\n", minB, T);

  int B = ceil((float) N*N / T ); // Round up to an integral grid size

  // Run the kernel on the GPU using B blocks, T threads per block
  matrixMult<<<B, T>>>(a, b, c, N);

  // Wait for the GPU to finish before accessing the results on the host
  cudaDeviceSynchronize();
 

DEMO: /home/cs355001/demo/CUDA/4-mult-matrix/mult-matrix-auto.cu