Matrices are 2-dimensional structures:
        +-           -+          +-           -+
        | A11 A12 A13 |          | B11 B12 B13 |
    A = | A21 A22 A23 |      B = | B21 B22 B23 |
        | A31 A32 A33 |          | B31 B32 B33 |
        +-           -+          +-           -+

Then:

              +-           -+
              | C11 C12 C13 |
    C = A*B = | C21 C22 C23 |
              | C31 C32 C33 |
              +-           -+

where:

    Cij = ∑(k=0..N-1) Aik*Bkj     (for i = 0, 1, ..., N-1 and j = 0, 1, ..., N-1)
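For example, the element C11 is the dot product of row 1 of A and column 1 of B:

    C11 = A11*B11 + A12*B21 + A13*B31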
|
Matrix elements are organized using a 2-dimensional index:
    int N = ...;     // Some pre-defined value
    float A[N][N];   // Matrix 1
    float B[N][N];   // Matrix 2
    float C[N][N];   // Output matrix

    // CPU matrix multiplication algorithm for static 2-dimensional arrays
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
        {
            /* ---------------------------------
               Compute the matrix element
               Cij = ∑(k=0..N-1) Aik*Bkj
               --------------------------------- */
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++)                // this inner loop computes
                C[i][j] = C[i][j] + A[i][k]*B[k][j];   // one dot (inner) product
        }
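The CUDA versions below store each matrix as a 1-dimensional array in row-major order, so element (i,j) of an N × N matrix lives at array index i*N + j. A minimal sketch of the mapping (the macro name IDX is ours, for illustration only):

    // Row-major mapping: A[i][j]  <-->  a[ i*N + j ]
    #define IDX(i,j,N)   ( (i)*(N) + (j) )

    a[ IDX(2,3,N) ] = 1.0;   // the same element as A[2][3]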
|
The previous CUDA program launches a 1-dimensional grid (as <<< B,T >>>) to perform the matrix multiplication:
int main(int argc, char *argv[])
{
    int N = ...;                      // Matrix size
    int T = ...;                      // # threads per thread block (user choice)
    int B = ceil((float) N*N / T);    // # thread blocks needed

    float *a, *b, *c;

    /* =======================================
       Allocate 3 shared matrices (as arrays)
       ======================================= */
    cudaMallocManaged(&a, N*N*sizeof(float));
    cudaMallocManaged(&b, N*N*sizeof(float));
    cudaMallocManaged(&c, N*N*sizeof(float));

    // initialize the a and b arrays (code omitted)

    matrixMult<<< B, T >>>(a, b, c, N);   // Launch kernel
}
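For reference, a 1-dimensional kernel matching this <<< B, T >>> launch could look as follows (a sketch; the actual kernel from the previous section may differ in detail):

    __global__ void matrixMult(float *A, float *B, float *C, int n)
    {
        int idx = blockIdx.x*blockDim.x + threadIdx.x;   // 1-dim thread ID

        if ( idx < n*n )             // skip the extra threads
        {
            int i = idx / n;         // row index
            int j = idx % n;         // column index

            C[idx] = 0.0;
            for ( int k = 0; k < n; k++ )
                C[idx] += A[i*n+k] * B[k*n+j];
        }
    }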
|
It's more natural to use a 2-dimensional grid to execute matrix operations !!!
The number of matrix elements in an N × N matrix is N².
We will divide the matrix up into squares: each thread block is a square containing T × T = T² threads.
Note: CUDA allows at most 1024 threads per block, so T² ≤ 1024, i.e., T ≤ 32.
The number of rows and columns of the grid is B.
Question: what value of B accommodates an N × N matrix?
Answer: we need B × B blocks, where B = ceil(N/T).
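For example, with N = 1000 and T = 32: B = ceil(1000/32) = 32, so the grid has 32 × 32 thread blocks and each grid dimension spans 32·32 = 1024 ≥ 1000 matrix elements.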
The CPU code (1) allocates the arrays and (2) then launches the threads:
int main(int argc, char *argv[])
{
    int N = ...;   // N × N matrix
}
|
N will be specified by user input.
T = the number of threads in one dimension of a thread block:
int main(int argc, char *argv[])
{
    int N = ...;   // N × N matrix
    int T = ...;   dim3 blockShape = dim3(T, T);   // T × T threads/block
}
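(dim3 is CUDA's built-in type for grid and block shapes; any unspecified dimension defaults to 1, so dim3(T, T) means a T × T × 1 block.)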
|
We will create B² thread blocks to perform the matrix multiplication:
int main(int argc, char *argv[])
{
    int N = ...;   // N × N matrix
    int T = ...;   dim3 blockShape = dim3(T, T);                  // T × T threads/block
    int B = ceil( (double)N/T );   dim3 gridShape = dim3(B, B);   // B × B blocks/grid
}
|
Each grid dimension will then have B·T ≥ N threads !!!
We declare the pointer variables that will reference the shared arrays:
int main(int argc, char *argv[])
{
    int N = ...;   // N × N matrix
    int T = ...;   dim3 blockShape = dim3(T, T);                  // T × T threads/block
    int B = ceil( (double)N/T );   dim3 gridShape = dim3(B, B);   // B × B blocks/grid

    float *a, *b, *c;
}
|
(This is CUDA syntax... i.e., how CUDA provides the dynamic shared arrays to us)
Allocate the 3 shared matrices (as 1-dim arrays):
int main(int argc, char *argv[])
{
    int N = ...;   // N × N matrix
    int T = ...;   dim3 blockShape = dim3(T, T);                  // T × T threads/block
    int B = ceil( (double)N/T );   dim3 gridShape = dim3(B, B);   // B × B blocks/grid

    float *a, *b, *c;

    /* =======================================
       Allocate 3 shared matrices (as arrays)
       ======================================= */
    cudaMallocManaged(&a, N*N*sizeof(float));
    cudaMallocManaged(&b, N*N*sizeof(float));
    cudaMallocManaged(&c, N*N*sizeof(float));

    // initialize the a and b arrays (code omitted)
}
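The initialization itself is omitted in the notes; a minimal sketch (the test values 1.0 and 2.0 are arbitrary choices for illustration):

    for (int i = 0; i < N*N; i++)
    {
        a[i] = 1.0;   // arbitrary test value
        b[i] = 2.0;   // arbitrary test value
    }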
|
(This is CUDA syntax... i.e., how CUDA provides the dynamic shared arrays to us)
Launch (at least) N² threads as <<< gridShape, blockShape >>> to perform the matrix multiplication:
int main(int argc, char *argv[])
{
    int N = ...;   // N × N matrix
    int T = ...;   dim3 blockShape = dim3(T, T);                  // T × T threads/block
    int B = ceil( (double)N/T );   dim3 gridShape = dim3(B, B);   // B × B blocks/grid

    float *a, *b, *c;

    /* =======================================
       Allocate 3 shared matrices (as arrays)
       ======================================= */
    cudaMallocManaged(&a, N*N*sizeof(float));
    cudaMallocManaged(&b, N*N*sizeof(float));
    cudaMallocManaged(&c, N*N*sizeof(float));

    // initialize the a and b arrays (code omitted)

    matrixMult<<< gridShape, blockShape >>>(a, b, c, N);   // 2-dim grid
}
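In a complete program we would also wait for the kernel to finish and release the managed memory before exiting, e.g.:

    cudaDeviceSynchronize();                  // wait for the GPU to finish
    // ... use/print the result in c (code omitted) ...
    cudaFree(a);  cudaFree(b);  cudaFree(c);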
|
We will write the kernel code for matrixMult( ) next...
First, compute a unique ID (i,j) for each thread within the <<< gridShape, blockShape >>> configuration:
__global__ void matrixMult(float *A, float *B, float *C, int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;   // unique row index
    int j = blockIdx.y*blockDim.y + threadIdx.y;   // unique column index
}
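For example, with T = 32 (so blockDim.x = 32), the thread with blockIdx.x = 2 and threadIdx.x = 5 gets the row index i = 2*32 + 5 = 69.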
|
The thread (i,j) on the GPU must compute the element C[i][j] in the output matrix:
__global__ void matrixMult(float *A, float *B, float *C, int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;

    C[i*n+j] = 0.0;   // C[i*n+j] = C[i][j]

    /* ---------------------------------
       Compute the matrix element
       Cij = ∑(k=0..N-1) Aik*Bkj
       --------------------------------- */
}
|
The algorithm to compute C[i][j] is:
__global__ void matrixMult(float *A, float *B, float *C, int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;

    C[i*n+j] = 0.0;   // C[i*n+j] = C[i][j]

    /* ---------------------------------
       Compute the matrix element
       Cij = ∑(k=0..N-1) Aik*Bkj
       --------------------------------- */
    for ( int k = 0; k < n; k++ )
        C[i*n+j] += A[i*n+k] * B[k*n+j];
}
|
Notice: we may have created more than N² threads with <<< gridShape, blockShape >>> !!!
To prevent the extra threads from computing non-existent matrix elements, we add a bounds check:
__global__ void matrixMult(float *A, float *B, float *C, int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;

    if ( i < n && j < n )   // only threads inside the matrix do work
    {
        C[i*n+j] = 0.0;   // C[i*n+j] = C[i][j]

        /* ---------------------------------
           Compute the matrix element
           Cij = ∑(k=0..N-1) Aik*Bkj
           --------------------------------- */
        for ( int k = 0; k < n; k++ )
            C[i*n+j] += A[i*n+k] * B[k*n+j];
    }
}
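A common refinement (not used in the notes' version) is to accumulate the sum in a register and write the result to global memory once, instead of updating C[i*n+j] on every iteration:

    __global__ void matrixMult(float *A, float *B, float *C, int n)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        int j = blockIdx.y*blockDim.y + threadIdx.y;

        if ( i < n && j < n )
        {
            float sum = 0.0;                   // accumulate in a register
            for ( int k = 0; k < n; k++ )
                sum += A[i*n+k] * B[k*n+j];
            C[i*n+j] = sum;                    // one write to global memory
        }
    }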
|
DEMO: /home/cs355001/demo/CUDA/4-mult-matrix/2d-mult-matrix.cu
|