Threads, Blocks and Grids

Thread

Thread = an execution unit that executes the instructions of a kernel function
Many threads execute inside a GPU at the same time !

Block

Block = a group of threads that are executed concurrently
Each thread in a block executes the same series of statements (= the same program), but each thread works on its own variables

Threads in the same block can be synchronized

(But we will not learn about synchronization in this short course on CUDA)

Grid

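Grid = a group of blocks
All threads in the grid execute the same code (the same kernel function)
But: threads in different blocks may run concurrently or sequentially, in no guaranteed order

Threads in different thread blocks of a grid cannot be synchronized with each other (block-level synchronization only covers the threads of one block)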

Execution configuration

The expression <<< 1, 4 >>> in:

    hello<<< 1, 4 >>>( );
         ^^^^^^^^^^^^
         <<< 1, 4 >>> means: a grid of 1 thread block, each thread block has 4 threads

is called an execution configuration in CUDA

The execution configuration tells the CUDA runtime system how many thread blocks, and how many threads in each block, to use to run/execute the given kernel function.

Meaning of the execution configuration:

    <<< NBlocks, NThreads >>>

        NBlocks  = # blocks used
        NThreads = # threads in each block

Example:

    <<< 1, 4 >>>: use 1 block,  with 4 (parallel) threads in each block
    <<< 3, 4 >>>: use 3 blocks, with 4 (parallel) threads in each block

Note: the execution configuration expression shown here is highly simplified

I used only integers to specify the grid size and the (thread) block size
You can also use a dim3 (a 3-dimensional quantity with components x, y and z) to specify the grid size and the (thread) block size
This gives you multi-dimensional grids and thread blocks
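To make the dim3 remark concrete, here is a minimal sketch of a launch with a 2-dimensional grid and 2-dimensional blocks. The kernel name hello2D and the helper totalThreads are made up for this illustration; dim3 itself and the .y components of the built-in variables are standard CUDA. Components that are not given in a dim3 default to 1.

```cuda
#include <stdio.h>

// Hypothetical kernel: each thread reports its 2-D block and thread coordinates.
__global__ void hello2D()
{
    printf("block (%d,%d) of (%d,%d), thread (%d,%d) of (%d,%d)\n",
           blockIdx.x, blockIdx.y, gridDim.x, gridDim.y,
           threadIdx.x, threadIdx.y, blockDim.x, blockDim.y);
}

// Hypothetical host helper: total number of threads created by a launch.
unsigned int totalThreads(dim3 grid, dim3 block)
{
    return grid.x * grid.y * grid.z * block.x * block.y * block.z;
}

int main()
{
    dim3 grid(2, 2);    // 2 x 2 x 1 = 4 blocks   (z is not given, so it is 1)
    dim3 block(4, 2);   // 4 x 2 x 1 = 8 threads in each block

    printf("Launching %u threads\n", totalThreads(grid, block));

    hello2D<<< grid, block >>>();   // dim3 values take the place of the two integers
    cudaDeviceSynchronize();        // wait for the GPU and show its printf output
    return 0;
}
```

Writing <<< 2, 4 >>> with plain integers is the same as writing <<< dim3(2), dim3(4) >>>: the y and z sizes are then 1.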

Running a kernel function (launching a kernel)

(Simplified) syntax (each launch creates 1 grid):

    kernelFuncName <<< NBlocks, NThreads >>> ( params ) ;

    Effect: launch the kernel function kernelFuncName using NBlocks blocks of threads, with NThreads threads in each block

Example:

    hello <<< 2, 4 >>> ( ) ;

    Launch (run) the "hello( )" kernel using a <<< 2, 4 >>> CUDA grid of threads
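A kernel launch does not return a value, so a bad execution configuration (for example, too many threads in a block) fails silently unless the program asks the runtime about it. The sketch below shows one common way to check a launch. The helper name launchHello is made up for this illustration; cudaGetLastError, cudaDeviceSynchronize and cudaGetErrorString are CUDA runtime functions.

```cuda
#include <stdio.h>

__global__ void hello()
{
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

// Hypothetical helper: launch hello with the given execution configuration
// and return the first error that the CUDA runtime reports (or cudaSuccess).
cudaError_t launchHello(int nBlocks, int nThreads)
{
    hello<<< nBlocks, nThreads >>>();

    cudaError_t err = cudaGetLastError();   // errors detected at launch time
    if (err != cudaSuccess)
        return err;

    return cudaDeviceSynchronize();         // wait for the kernel; reports errors that occur while it runs
}

int main()
{
    cudaError_t err = launchHello(2, 4);    // same configuration as hello<<< 2, 4 >>>

    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```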

How do different threads distinguish themselves from one another?

Each thread is assigned a unique identifier, made up of its block index and its thread index within that block:

Kernel thread program variables: blockIdx[.x, .y, .z] and threadIdx[.x, .y, .z]

Built-in program variables in each thread:

blockIdx.x = index of the block that contains the (current) thread
threadIdx.x = index of the (current) thread within its block

Example program that shows that different threads are assigned different blockIdx.x and threadIdx.x values:

    #include <stdio.h>

    // Kernel: every thread prints the block it belongs to and its index in that block
    __global__ void hello( )
    {
       printf("I am in block #%d and thread #%d: Hello World !\n",
              blockIdx.x, threadIdx.x);
    }

    int main()
    {
       hello<<< 2, 4 >>>( );                      // 2 blocks, 4 threads in each block
       printf("I am the CPU: Hello World ! \n");  // the launch returns immediately, so the CPU continues at once
       cudaDeviceSynchronize();                   // wait for the GPU; its printf output is shown here
       return 0;
    }

Example Program: (Demo above code)

/home/cs355001/demo/CUDA/1-intro/hello-thrIndex

Output:

    I am the CPU: Hello World !
    I am in block #0 and thread #0: Hello World !
    I am in block #0 and thread #1: Hello World !
    I am in block #0 and thread #2: Hello World !
    I am in block #0 and thread #3: Hello World !
    I am in block #1 and thread #0: Hello World !
    I am in block #1 and thread #1: Hello World !
    I am in block #1 and thread #2: Hello World !
    I am in block #1 and thread #3: Hello World !

Block Dimension (= # threads in a block)

Important dimension variables defined in each thread:

blockDim.x = # threads in each block
gridDim.x = # blocks in the grid

Example program

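    #include <stdio.h>

    // Kernel: every thread prints its indexes together with the grid and block sizes
    __global__ void hello( )
    {
       printf("blockIdx.x=%d/%d blocks, threadIdx.x=%d/%d threads\n",
              blockIdx.x, gridDim.x, threadIdx.x, blockDim.x);
    }

    int main()
    {
       hello<<< 2, 4 >>>( );                      // gridDim.x = 2, blockDim.x = 4
       printf("I am the CPU: Hello World ! \n");
       cudaDeviceSynchronize();                   // wait for the GPU before the program ends
       return 0;
    }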

Example Program: (Demo above code)

/home/cs355001/demo/CUDA/1-intro/hello-dim

Output:

    blockIdx.x=0/2 blocks, threadIdx.x=0/4 threads  -+
    blockIdx.x=0/2 blocks, threadIdx.x=1/4 threads   | block #0
    blockIdx.x=0/2 blocks, threadIdx.x=2/4 threads   |
    blockIdx.x=0/2 blocks, threadIdx.x=3/4 threads  -+
    blockIdx.x=1/2 blocks, threadIdx.x=0/4 threads  -+
    blockIdx.x=1/2 blocks, threadIdx.x=1/4 threads   | block #1
    blockIdx.x=1/2 blocks, threadIdx.x=2/4 threads   |
    blockIdx.x=1/2 blocks, threadIdx.x=3/4 threads  -+

Assigning a unique (thread) ID to each thread in a 1-dimensional grid

Suppose we want to assign the thread IDs as follows:

    blockIdx.x=0/2 blocks, threadIdx.x=0/4 threads  -->  thread #0
    blockIdx.x=0/2 blocks, threadIdx.x=1/4 threads  -->  thread #1
    blockIdx.x=0/2 blocks, threadIdx.x=2/4 threads  -->  thread #2
    blockIdx.x=0/2 blocks, threadIdx.x=3/4 threads  -->  thread #3
    blockIdx.x=1/2 blocks, threadIdx.x=0/4 threads  -->  thread #4
    blockIdx.x=1/2 blocks, threadIdx.x=1/4 threads  -->  thread #5
    blockIdx.x=1/2 blocks, threadIdx.x=2/4 threads  -->  thread #6
    blockIdx.x=1/2 blocks, threadIdx.x=3/4 threads  -->  thread #7

Then:

threadID = blockDim.x*blockIdx.x + threadIdx.x

Example Program: (Demo above code)

/home/cs355001/demo/CUDA/1-intro/hello-thrIndex2

Output:

    gridDim.x=2, blockIdx.x=#1, blockDim.x=4, threadIdx.x=#0 -> ID=4
    gridDim.x=2, blockIdx.x=#1, blockDim.x=4, threadIdx.x=#1 -> ID=5
    gridDim.x=2, blockIdx.x=#1, blockDim.x=4, threadIdx.x=#2 -> ID=6
    gridDim.x=2, blockIdx.x=#1, blockDim.x=4, threadIdx.x=#3 -> ID=7
    gridDim.x=2, blockIdx.x=#0, blockDim.x=4, threadIdx.x=#0 -> ID=0
    gridDim.x=2, blockIdx.x=#0, blockDim.x=4, threadIdx.x=#1 -> ID=1
    gridDim.x=2, blockIdx.x=#0, blockDim.x=4, threadIdx.x=#2 -> ID=2
    gridDim.x=2, blockIdx.x=#0, blockDim.x=4, threadIdx.x=#3 -> ID=3
    I am the CPU: Hello World !
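A kernel along the following lines would print lines in the format shown above. This is a reconstruction from the output and the threadID formula, not necessarily the source of the demo program. It assumes that the demo synchronizes before the CPU prints, because the CPU line comes last in the output.

```cuda
#include <stdio.h>

__global__ void hello()
{
    // Unique thread ID in a 1-dimensional grid
    int threadID = blockDim.x * blockIdx.x + threadIdx.x;

    printf("gridDim.x=%d, blockIdx.x=#%d, blockDim.x=%d, threadIdx.x=#%d -> ID=%d\n",
           gridDim.x, blockIdx.x, blockDim.x, threadIdx.x, threadID);
}

int main()
{
    hello<<< 2, 4 >>>();                       // 2 blocks of 4 threads: IDs 0..7
    cudaDeviceSynchronize();                   // wait first, so the GPU lines are shown before the CPU line
    printf("I am the CPU: Hello World ! \n");
    return 0;
}
```

Note that block #1 printed before block #0 in the output above: the order in which blocks run is not guaranteed, so the lines of the two blocks may appear in either order.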

The unique threadID will be very important when we write CUDA programs where different threads perform different computations (for example, work on different elements of an array) !
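As an illustration of how the thread ID is typically used, the sketch below lets each thread fill in one element of an array. The kernel name squareElements, the helper blocksNeeded and the size N are made up for this example. It also uses cudaMallocManaged (memory that both the CPU and the GPU can access), which has not been covered in these notes.

```cuda
#include <stdio.h>

#define N 10   // hypothetical array size, deliberately not a multiple of the block size

// Each thread computes the one array element selected by its unique thread ID.
__global__ void squareElements(int *a, int n)
{
    int threadID = blockDim.x * blockIdx.x + threadIdx.x;

    if (threadID < n)                      // the last block may contain surplus threads
        a[threadID] = threadID * threadID;
}

// Hypothetical host helper: number of blocks needed to cover n elements (rounds up).
int blocksNeeded(int n, int nThreads)
{
    return (n + nThreads - 1) / nThreads;
}

int main()
{
    int *a;
    cudaMallocManaged(&a, N * sizeof(int));    // memory visible to both the CPU and the GPU

    int nThreads = 4;
    int nBlocks  = blocksNeeded(N, nThreads);  // 3 blocks = 12 threads >= 10 elements

    squareElements<<< nBlocks, nThreads >>>(a, N);
    cudaDeviceSynchronize();                   // wait before the CPU reads the results

    for (int i = 0; i < N; i++)
        printf("a[%d] = %d\n", i, a[i]);

    cudaFree(a);
    return 0;
}
```

The test threadID < n matters here: with 3 blocks of 4 threads there are 12 threads for 10 elements, and threads #10 and #11 must not write past the end of the array.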