CUDA program: kernel

Kernel = the program (function) that is executed by the GPU

Example:

#include <stdio.h>

__global__ void hello( )
{
    printf("Hello World\n");   // CUDA C code: uses printf( ) in the CUDA C library
}

A kernel (= a GPU function/program) is executed by a grid (of threads)

Note:

All threads in a grid run the same kernel code, but different threads will (typically) use different operands (= data)
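The note above can be illustrated with a small sketch in which each thread uses its own index to pick its operands. The kernel name addPairs, the array names and the size N = 4 are made up for this example; the built-in variable threadIdx.x and the <<< >>> launch syntax (explained later in these notes) are standard CUDA C. The sketch assumes a GPU that supports managed memory (cudaMallocManaged).

```cuda
#include <stdio.h>

// Hypothetical kernel: every thread runs the same code,
// but threadIdx.x makes each thread use different operands.
__global__ void addPairs(const int *a, const int *b, int *c)
{
    int i = threadIdx.x;      // index of this thread within its thread block
    c[i] = a[i] + b[i];       // thread i adds the i-th pair of elements
}

int main()
{
    const int N = 4;          // illustrative size (assumption)
    int *a, *b, *c;

    // Managed memory can be accessed by both the host and the GPU
    cudaMallocManaged(&a, N * sizeof(int));
    cudaMallocManaged(&b, N * sizeof(int));
    cudaMallocManaged(&c, N * sizeof(int));

    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 10 * i;
    }

    addPairs<<<1, N>>>(a, b, c);   // 1 thread block with N threads
    cudaDeviceSynchronize();       // wait until the GPU has finished

    for (int i = 0; i < N; i++)
        printf("%d + %d = %d\n", a[i], b[i], c[i]);

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return 0;
}
```

The 4 threads execute the identical statement c[i] = a[i] + b[i], but each one with a different value of i.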

Threads (terminology)

Thread = a single execution unit that runs CUDA code ("kernel") on the GPU

Each thread is executed by 1 CUDA core (= processor)
Multiple threads can be assigned to the same CUDA core
(A CUDA core will switch execution between different threads!)

Thread organization: thread block

Multiple threads are organized (= grouped) into a "thread block"
Organization:

A (thread) block has 3 dimensions:

x ≤ 1024
y ≤ 1024      and    x * y * z ≤ 1024
z ≤ 64

Thread organization: grid

Multiple thread blocks are organized (= grouped) into a "grid"
Organization:

A grid also has 3 dimensions:

x ≤ 2³¹-1
y ≤ 65535
z ≤ 65535
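The block and grid limits listed above depend on the GPU, so it can be useful to query them at run time. The following is a minimal sketch, assuming that device 0 is the GPU of interest; it uses the CUDA runtime call cudaGetDeviceProperties and the fields maxThreadsPerBlock, maxThreadsDim, maxGridSize and multiProcessorCount of cudaDeviceProp.

```cuda
#include <stdio.h>

int main()
{
    cudaDeviceProp prop;

    // Query the properties of device 0 (assumption: a GPU is installed)
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dimensions : %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dimensions  : %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Multiprocessors      : %d\n", prop.multiProcessorCount);

    // Block and grid shapes are written with the dim3 type:
    dim3 block(16, 8, 4);     // 16 * 8 * 4 = 512 threads, within the 1024 limit
    dim3 grid(4, 2, 1);       // 4 * 2 * 1 = 8 thread blocks
    printf("Example: %u blocks of %u threads each\n",
           grid.x * grid.y * grid.z, block.x * block.y * block.z);

    return 0;
}
```

The values printed by the first four printf calls vary per GPU model; only the last line is fixed by the example shapes chosen here.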

CUDA program execution: "launching" the kernel on a grid

Grid = all the threads that execute the same CUDA kernel function

A grid can have (almost) any number of threads, up to the grid dimension limits
A grid will (therefore) consist of 1 or more thread blocks
(A thread block contains up to 1024 threads)

A grid is created by the host program when it "launches" (= calls) a kernel function

Kernel launching syntax:

KernelFunction <<<NBlocks, NThreads>>> (params);

Run KernelFunction on the GPU using a grid that consists of:

NBlocks thread blocks with
NThreads threads in each thread block
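As an illustration of the launch syntax, here is a minimal sketch of a complete program that launches the hello kernel on a grid of 2 thread blocks with 3 threads each. The values 2 and 3 are chosen only for this example; blockIdx.x and threadIdx.x are CUDA built-in variables that identify the thread block and the thread within its block.

```cuda
#include <stdio.h>

__global__ void hello()
{
    // Every thread in the grid executes this same statement
    printf("Hello World from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main()
{
    hello<<<2, 3>>>();          // grid = 2 thread blocks x 3 threads = 6 threads

    // A kernel launch is asynchronous: wait for the GPU to finish,
    // otherwise the program may exit before the output is printed
    cudaDeviceSynchronize();

    return 0;
}
```

The program prints 6 lines, one per thread. The order of the lines from different thread blocks is not guaranteed.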

Mapping between grid and (thread) blocks and threads on a GPU computer

Recall: a GPU computer consists of a number (N) of Multiprocessors:

A thread is executed on a core:

A thread block is executed on one streaming multiprocessor (SM):

Threads can be switched (context switching) during execution!

A grid (same kernel) is executed on multiple streaming multiprocessors:

Thread organization: communication between threads

Threads within the same thread block can communicate with each other using the "shared" memory:

For this reason: only threads in the same thread block can synchronize (= "wait on") with each other
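The following sketch shows threads in one thread block communicating through shared memory and synchronizing with __syncthreads( ). The kernel name reverseInBlock, the buffer name buf and the size N = 8 are made up for this illustration; __shared__ and __syncthreads( ) are standard CUDA C. The sketch assumes a GPU that supports managed memory.

```cuda
#include <stdio.h>

#define N 8   // illustrative block size (assumption)

// Hypothetical kernel: reverses an array of N elements using shared memory
__global__ void reverseInBlock(int *data)
{
    __shared__ int buf[N];       // visible to all threads in this thread block

    int t = threadIdx.x;

    buf[t] = data[t];            // each thread stores one element
    __syncthreads();             // wait until ALL threads in the block have stored

    data[t] = buf[N - 1 - t];    // read an element written by another thread
}

int main()
{
    int *data;
    cudaMallocManaged(&data, N * sizeof(int));

    for (int i = 0; i < N; i++)
        data[i] = i;

    reverseInBlock<<<1, N>>>(data);   // 1 thread block, so all threads share buf
    cudaDeviceSynchronize();

    for (int i = 0; i < N; i++)
        printf("%d ", data[i]);
    printf("\n");

    cudaFree(data);
    return 0;
}
```

Without the __syncthreads( ) call, a thread could read buf[N - 1 - t] before the other thread has written it. Because __syncthreads( ) only waits for threads of the same thread block, this technique does not work across thread blocks.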

Note: CUDA 9 introduced the concept of cooperative groups, which allow you to synchronize all threads in a grid.