Matrices are 2-dimensional structures:
        +-           -+          +-           -+
        | A11 A12 A13 |          | B11 B12 B13 |
    A = | A21 A22 A23 |      B = | B21 B22 B23 |
        | A31 A32 A33 |          | B31 B32 B33 |
        +-           -+          +-           -+

Then:

              +-           -+
              | C11 C12 C13 |
    C = A*B = | C21 C22 C23 |
              | C31 C32 C33 |
              +-           -+

where:

    Cij = ∑(k=0..N-1) Aik*Bkj     (for i = 0, 1, ..., N-1 and j = 0, 1, ..., N-1)
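For example, the element C11 is the dot product of row 1 of A and column 1 of B:

    C11 = A11*B11 + A12*B21 + A13*B31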
|
Matrix elements are organized using a 2-dimensional index:
    int N = ...;     // Some pre-defined value
    float A[N][N];   // Matrix 1
    float B[N][N];   // Matrix 2
    float C[N][N];   // Output matrix

    // CPU matrix multiplication algorithm for static 2-dimensional arrays
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
        {
            /* ---------------------------------
               Compute the matrix element
               Cij = ∑(k=0..N-1) Aik*Bkj
               --------------------------------- */
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++)                // this inner loop computes
                C[i][j] = C[i][j] + A[i][k]*B[k][j];   // one dot (inner) product
        }
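The CUDA versions below store each matrix as a 1-dimensional array in row-major order, so element (i,j) of an N × N matrix lives at array index i*N + j. A minimal sketch of the mapping (the macro name IDX is ours, for illustration only):

    // Row-major mapping: A[i][j]  <-->  a[ i*N + j ]
    #define IDX(i,j,N)   ( (i)*(N) + (j) )

    a[ IDX(2,3,N) ] = 1.0;   // the same element as A[2][3]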
|
The previous CUDA program launches a 1-dimensional grid (as <<< B,T >>>) to perform the matrix multiplication:
int main(int argc, char *argv[])
{
    int N = ...;                      // Matrix size
    int T = ...;                      // # threads per thread block (user choice)
    int B = ceil((float) N*N / T);    // # thread blocks needed

    float *a, *b, *c;

    /* =======================================
       Allocate 3 shared matrices (as arrays)
       ======================================= */
    cudaMallocManaged(&a, N*N*sizeof(float));
    cudaMallocManaged(&b, N*N*sizeof(float));
    cudaMallocManaged(&c, N*N*sizeof(float));

    // initialize the a and b arrays (code omitted)

    matrixMult<<< B, T >>>(a, b, c, N);   // Launch kernel
}
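For reference, a 1-dimensional kernel matching this <<< B, T >>> launch could look as follows (a sketch; the actual kernel from the previous section may differ in detail):

    __global__ void matrixMult(float *A, float *B, float *C, int n)
    {
        int idx = blockIdx.x*blockDim.x + threadIdx.x;   // 1-dim thread ID

        if ( idx < n*n )             // skip the extra threads
        {
            int i = idx / n;         // row index
            int j = idx % n;         // column index

            C[idx] = 0.0;
            for ( int k = 0; k < n; k++ )
                C[idx] += A[i*n+k] * B[k*n+j];
        }
    }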
|
It's more natural to use a 2-dimensional grid to execute matrix operations !!!
The number of matrix elements in an N × N matrix is N².
We will divide the matrix up into squares: each thread block is a square containing T × T = T² threads.
Note: CUDA allows at most 1024 threads per block, so T² ≤ 1024, i.e., T ≤ 32.
The number of rows and columns of the grid is B.
Question: what value of B accommodates an N × N matrix?
Answer: we need B × B blocks, where B = ceil(N/T).
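For example, with N = 1000 and T = 32: B = ceil(1000/32) = 32, so the grid has 32 × 32 thread blocks and each grid dimension spans 32·32 = 1024 ≥ 1000 matrix elements.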
The CPU code (1) allocates the arrays and (2) then launches the threads:
int main(int argc, char *argv[])
{
    int N = ...;   // N × N matrix
}
|
N will be specified by user input.
T = the number of threads in one dimension of a thread block:
int main(int argc, char *argv[])
{
    int N = ...;   // N × N matrix
    int T = ...;   dim3 blockShape = dim3(T, T);   // T × T threads/block
}
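(dim3 is CUDA's built-in type for grid and block shapes; any unspecified dimension defaults to 1, so dim3(T, T) means a T × T × 1 block.)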
|
We will create B² thread blocks to perform the matrix multiplication:
int main(int argc, char *argv[])
{
    int N = ...;   // N × N matrix
    int T = ...;   dim3 blockShape = dim3(T, T);                  // T × T threads/block
    int B = ceil( (double)N/T );   dim3 gridShape = dim3(B, B);   // B × B blocks/grid
}
|
Each grid dimension will then have B·T ≥ N threads !!!
We declare the pointer variables that will reference the shared arrays:
int main(int argc, char *argv[])
{
    int N = ...;   // N × N matrix
    int T = ...;   dim3 blockShape = dim3(T, T);                  // T × T threads/block
    int B = ceil( (double)N/T );   dim3 gridShape = dim3(B, B);   // B × B blocks/grid

    float *a, *b, *c;
}
|
(This is CUDA syntax... i.e., how CUDA provides the dynamic shared arrays to us)
Allocate the 3 shared matrices (as 1-dim arrays):
int main(int argc, char *argv[])
{
    int N = ...;   // N × N matrix
    int T = ...;   dim3 blockShape = dim3(T, T);                  // T × T threads/block
    int B = ceil( (double)N/T );   dim3 gridShape = dim3(B, B);   // B × B blocks/grid

    float *a, *b, *c;

    /* =======================================
       Allocate 3 shared matrices (as arrays)
       ======================================= */
    cudaMallocManaged(&a, N*N*sizeof(float));
    cudaMallocManaged(&b, N*N*sizeof(float));
    cudaMallocManaged(&c, N*N*sizeof(float));

    // initialize the a and b arrays (code omitted)
}
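The initialization itself is omitted in the notes; a minimal sketch (the test values 1.0 and 2.0 are arbitrary choices for illustration):

    for (int i = 0; i < N*N; i++)
    {
        a[i] = 1.0;   // arbitrary test value
        b[i] = 2.0;   // arbitrary test value
    }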
|
(This is CUDA syntax... i.e., how CUDA provides the dynamic shared arrays to us)
Launch (at least) N² threads as <<< gridShape, blockShape >>> to perform the matrix multiplication:
int main(int argc, char *argv[])
{
    int N = ...;   // N × N matrix
    int T = ...;   dim3 blockShape = dim3(T, T);                  // T × T threads/block
    int B = ceil( (double)N/T );   dim3 gridShape = dim3(B, B);   // B × B blocks/grid

    float *a, *b, *c;

    /* =======================================
       Allocate 3 shared matrices (as arrays)
       ======================================= */
    cudaMallocManaged(&a, N*N*sizeof(float));
    cudaMallocManaged(&b, N*N*sizeof(float));
    cudaMallocManaged(&c, N*N*sizeof(float));

    // initialize the a and b arrays (code omitted)

    matrixMult<<< gridShape, blockShape >>>(a, b, c, N);   // 2-dim grid
}
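In a complete program we would also wait for the kernel to finish and release the managed memory before exiting, e.g.:

    cudaDeviceSynchronize();                  // wait for the GPU to finish
    // ... use/print the result in c (code omitted) ...
    cudaFree(a);  cudaFree(b);  cudaFree(c);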
|
We will write the kernel code for matrixMult( ) next...
First, compute a unique ID (i,j) for each thread within the <<< gridShape, blockShape >>> configuration:
__global__ void matrixMult(float *A, float *B, float *C, int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;   // unique row index
    int j = blockIdx.y*blockDim.y + threadIdx.y;   // unique column index
}
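For example, with T = 32 (so blockDim.x = 32), the thread with blockIdx.x = 2 and threadIdx.x = 5 gets the row index i = 2*32 + 5 = 69.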
|
The thread (i,j) on the GPU must compute the element C[i][j] in the output matrix:
__global__ void matrixMult(float *A, float *B, float *C, int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;

    C[i*n+j] = 0.0;   // C[i*n+j] = C[i][j]

    /* ---------------------------------
       Compute the matrix element
       Cij = ∑(k=0..N-1) Aik*Bkj
       --------------------------------- */
}
|
The algorithm to compute C[i][j] is:
__global__ void matrixMult(float *A, float *B, float *C, int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;

    C[i*n+j] = 0.0;   // C[i*n+j] = C[i][j]

    /* ---------------------------------
       Compute the matrix element
       Cij = ∑(k=0..N-1) Aik*Bkj
       --------------------------------- */
    for ( int k = 0; k < n; k++ )
        C[i*n+j] += A[i*n+k] * B[k*n+j];
}
|
Notice: we may have created more than N² threads with <<< gridShape, blockShape >>> !!!
To prevent the extra threads from computing non-existent matrix elements, we add a bounds check:
__global__ void matrixMult(float *A, float *B, float *C, int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;

    if ( i < n && j < n )   // only threads inside the matrix do work
    {
        C[i*n+j] = 0.0;   // C[i*n+j] = C[i][j]

        /* ---------------------------------
           Compute the matrix element
           Cij = ∑(k=0..N-1) Aik*Bkj
           --------------------------------- */
        for ( int k = 0; k < n; k++ )
            C[i*n+j] += A[i*n+k] * B[k*n+j];
    }
}
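A common refinement (not used in the notes' version) is to accumulate the sum in a register and write the result to global memory once, instead of updating C[i*n+j] on every iteration:

    __global__ void matrixMult(float *A, float *B, float *C, int n)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        int j = blockIdx.y*blockDim.y + threadIdx.y;

        if ( i < n && j < n )
        {
            float sum = 0.0;                   // accumulate in a register
            for ( int k = 0; k < n; k++ )
                sum += A[i*n+k] * B[k*n+j];
            C[i*n+j] = sum;                    // one write to global memory
        }
    }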
|
DEMO: /home/cs355001/demo/CUDA/4-mult-matrix/2d-mult-matrix.cu
|