Factors that affect program performance in parallel processing:
We run the CUDA matrix multiplication program using different #threads/thread block:
   mult-matrix2 1000 1                Elapsed time = 1896357 micro secs
   mult-matrix2 1000 2                Elapsed time =  984452 micro secs
   T =   4 threads/thread block       Elapsed time =  521900 micro secs
   T =   8 threads/thread block       Elapsed time =  303541 micro secs
   T =  16 threads/thread block       Elapsed time =  140619 micro secs
   T =  32 threads/thread block       Elapsed time =   74004 micro secs
   T =  64 threads/thread block       Elapsed time =   41799 micro secs
   T = 128 threads/thread block       Elapsed time =   42553 micro secs   // increases !
Using more threads per thread block does not always mean better performance !!!
Finding the "suitable" (= good performance) # threads/thread block for a CUDA application
Documentation on cudaOccupancyMaxPotentialBlockSize( ) :
cudaOccupancyMaxPotentialBlockSize( int*   minGridSize,
                                    int*   blockSize,
                                    T      func,
                                    size_t dynamicSMemSize = 0,
                                    int    blockSizeLimit = 0 )

   Returns grid and block size that achieves maximum potential occupancy
   for a device function func.

   Parameters:
      minGridSize - Returned minimum grid size needed to achieve the best potential occupancy
      blockSize   - Returned block size
      func        - Device function symbol

   Function description:
      Returns in *minGridSize and *blockSize a suggested grid size and
      block size that achieves the best potential occupancy
How to use cudaOccupancyMaxPotentialBlockSize( ):
int minB; // Min grid size
int T; // block size = # threads in a thread block
cudaOccupancyMaxPotentialBlockSize( &minB, &T, matrixMult, 0, 0);
printf("Computed: minB = %d, T = %d\n", minB, T);
int B = ceil((float) N*N / T ); // Round up to integral grid size
// Run kernel on the GPU using B blocks, T threads/thread block
matrixMult<<<B, T>>>(a, b, c, N);
// Wait for GPU to finish before accessing on host
cudaDeviceSynchronize();
DEMO: /home/cs355001/demo/CUDA/4-mult-matrix/mult-matrix-auto.cu