CUDA program: kernel
Kernel = the program (function) that is executed by the GPU
Example:
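#include <cstdio>
__global__ void hello()            // CUDA C code: a kernel runs on the GPU
{
    printf("Hello World\n");       // uses printf() from the CUDA C library
}

int main()
{
    hello<<<1, 1>>>();             // launch the kernel: 1 block of 1 thread
    cudaDeviceSynchronize();       // wait for the GPU so its output is flushed
    return 0;
}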
A kernel (= a GPU function/program) is executed by a grid (of threads)
Note:
- Different threads will use different operands (e.g., each thread works on a different array element)
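A minimal sketch of "different threads use different operands", assuming a simple vector addition (the names vecAdd and vecAddOnGpu are illustrative, not from the original). Every thread runs the same kernel code, but the built-in variables blockIdx.x, blockDim.x and threadIdx.x give each thread a unique global index, and that index selects the array elements it adds.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Each thread computes one element: its global index selects its operands.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique per thread
    if (i < n)                                      // grid may overshoot n
        c[i] = a[i] + b[i];
}

// Hypothetical helper: copies a and b to the GPU, runs vecAdd, copies c back.
// Returns 0 on success, -1 on any CUDA error.
int vecAddOnGpu(const float *a, const float *b, float *c, int n)
{
    assert(n > 0);                                  // a 0-block launch is invalid
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    size_t bytes = (size_t)n * sizeof(float);
    cudaError_t err = cudaMalloc((void **)&dA, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&dB, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&dC, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dA, a, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, b, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 256;                          // threads per block (assumed)
        int blocks = (n + threads - 1) / threads;   // round up to whole blocks
        vecAdd<<<blocks, threads>>>(dA, dB, dC, n);
        err = cudaGetLastError();                   // reports launch errors
    }
    if (err == cudaSuccess)                         // this copy waits for the kernel
        err = cudaMemcpy(c, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return err == cudaSuccess ? 0 : -1;
}
```

The `if (i < n)` guard matters because the grid is rounded up to whole blocks, so some threads in the last block may have no element to work on.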
Threads (terminology)
Thread = a single execution unit that runs CUDA code (the "kernel") on the GPU
Each thread is executed by 1 CUDA core (= processor)
Multiple threads can be assigned to the same CUDA core
(a CUDA core will switch execution between different threads!)
Thread organization: thread block
- Multiple threads are organized (= grouped) into a "thread block"
- Threads in the same thread block are run on the same streaming multiprocessor (SM)
- Organization:
- A (thread) block has 3 dimensions:
  x ≤ 1024
  y ≤ 1024
  z ≤ 64
  and in total x * y * z ≤ 1024 threads per block
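A sketch that checks a block shape against the limits listed above and then launches one 3-D block in which each thread computes its flattened position inside the block. The constants simply copy the limits from this slide (an assumption; real code would rather read them from cudaDeviceProp for the actual device). isValidBlockDim, writeFlatIdx and runOneBlock are hypothetical names.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// ASSUMPTION: per-block limits copied from the list above. For a real device
// see cudaDeviceProp::maxThreadsDim[] and cudaDeviceProp::maxThreadsPerBlock.
const unsigned MAX_BLOCK_X = 1024, MAX_BLOCK_Y = 1024, MAX_BLOCK_Z = 64;
const unsigned MAX_THREADS_PER_BLOCK = 1024;

// True if block shape b respects each per-dimension limit AND x*y*z <= 1024.
bool isValidBlockDim(dim3 b)
{
    if (b.x == 0 || b.y == 0 || b.z == 0) return false;
    if (b.x > MAX_BLOCK_X || b.y > MAX_BLOCK_Y || b.z > MAX_BLOCK_Z) return false;
    return (unsigned long long)b.x * b.y * b.z <= MAX_THREADS_PER_BLOCK;
}

// Each thread flattens its 3-D position in the block to 0 .. x*y*z-1
// and stores that index at the same position in out.
__global__ void writeFlatIdx(unsigned *out)
{
    unsigned flat = threadIdx.x
                  + threadIdx.y * blockDim.x
                  + threadIdx.z * blockDim.x * blockDim.y;
    out[flat] = flat;
}

// Hypothetical helper: launches ONE block of the given shape and copies the
// results into hostOut (room for x*y*z values). Returns 0 on success.
int runOneBlock(dim3 shape, unsigned *hostOut)
{
    assert(isValidBlockDim(shape));           // an invalid shape would fail to launch
    size_t bytes = (size_t)shape.x * shape.y * shape.z * sizeof(unsigned);
    unsigned *dOut = nullptr;
    cudaError_t err = cudaMalloc((void **)&dOut, bytes);
    if (err == cudaSuccess) {
        writeFlatIdx<<<1, shape>>>(dOut);     // grid of 1 block, 3-D block shape
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaMemcpy(hostOut, dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    return err == cudaSuccess ? 0 : -1;
}
```

For example, dim3(16, 16, 4) is accepted (exactly 1024 threads), while dim3(32, 32, 2) is rejected because it would need 2048 threads even though each dimension is within its own limit.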
Thread organization: grid
- Multiple thread blocks are organized (= grouped) into a "grid"
- Threads in the same grid are run on the same GPU
- Organization:
- A grid also has 3 dimensions (counted in blocks):
  x ≤ 2^31 - 1
  y ≤ 65535
  z ≤ 65535
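Grids are often 2-D when the data is a matrix. The sketch below (assumed names fillMatrix and fillMatrixOnGpu; the 16 x 16 block shape is an arbitrary choice) sizes the grid by ceiling division so that the blocks cover every element, and lets each thread derive its (row, col) from blockIdx, blockDim and threadIdx.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Each thread handles one (row, col) element of a rows x cols row-major
// matrix and writes that element's linear index as a marker.
__global__ void fillMatrix(int *m, int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // x dimension covers columns
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // y dimension covers rows
    if (row < rows && col < cols)                     // edge blocks overshoot
        m[row * cols + col] = row * cols + col;
}

// Hypothetical helper: 16 x 16 blocks, and just enough blocks in x and y to
// cover the whole matrix. Returns 0 on success.
int fillMatrixOnGpu(int *hostM, int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    dim3 block(16, 16);                               // 256 threads per block
    dim3 grid((cols + block.x - 1) / block.x,         // blocks across (ceiling)
              (rows + block.y - 1) / block.y);        // blocks down (ceiling)
    size_t bytes = (size_t)rows * cols * sizeof(int);
    int *dM = nullptr;
    cudaError_t err = cudaMalloc((void **)&dM, bytes);
    if (err == cudaSuccess) {
        fillMatrix<<<grid, block>>>(dM, rows, cols);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaMemcpy(hostM, dM, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dM);
    return err == cudaSuccess ? 0 : -1;
}
```

For a 37 x 50 matrix this launches a 4 x 3 grid of blocks; the threads that fall outside the matrix simply do nothing.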
CUDA program execution: "launching" the kernel on a grid
- Grid = all the threads that execute the same CUDA kernel function
- A grid is created by the host program when it "launches" (= calls) a kernel function
- Kernel launching syntax:
  KernelFunction<<<NBlocks, NThreads>>>(params);
  This runs KernelFunction on the GPU using a grid that consists of NBlocks thread blocks, with NThreads threads in each thread block
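A small sketch of the launch syntax, assuming a hypothetical kernel whoAmI and helper launchGrid. The launch KernelFunction<<<NBlocks, NThreads>>> creates a grid of NBlocks * NThreads threads; inside the kernel, gridDim.x equals NBlocks and blockDim.x equals NThreads. The launch returns to the host immediately, so the host checks cudaGetLastError() for configuration errors and waits with cudaDeviceSynchronize() before using the results.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Every thread of the grid stores its global thread number.
// Inside the kernel: gridDim.x == NBlocks and blockDim.x == NThreads.
__global__ void whoAmI(int *ids)
{
    int global = blockIdx.x * blockDim.x + threadIdx.x;  // 0 .. NBlocks*NThreads-1
    ids[global] = global;
}

// Hypothetical helper: launches NBlocks blocks of NThreads threads each and
// copies the NBlocks * NThreads ids into hostIds. Returns 0 on success.
int launchGrid(int NBlocks, int NThreads, int *hostIds)
{
    assert(NBlocks > 0 && NThreads > 0 && NThreads <= 1024);
    size_t bytes = (size_t)NBlocks * NThreads * sizeof(int);
    int *dIds = nullptr;
    cudaError_t err = cudaMalloc((void **)&dIds, bytes);
    if (err == cudaSuccess) {
        whoAmI<<<NBlocks, NThreads>>>(dIds);    // the launch creates the grid
        err = cudaGetLastError();               // e.g. invalid configuration
    }
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // launches are asynchronous
    if (err == cudaSuccess) err = cudaMemcpy(hostIds, dIds, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIds);
    return err == cudaSuccess ? 0 : -1;
}
```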
Mapping between grid, (thread) blocks and threads on a GPU computer
Recall: a GPU computer consists of a number (N) of streaming multiprocessors (SMs)
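The number N of streaming multiprocessors differs between GPUs and can be queried at run time. A minimal sketch using cudaGetDeviceProperties, assuming device 0 is the GPU of interest (smCount is an illustrative name):

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Returns the number of streaming multiprocessors (SMs) of device 0 and
// prints a short summary, or returns -1 if the query fails (e.g. no GPU).
int smCount()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;
    printf("%s: %d SMs, at most %d threads per block\n",
           prop.name, prop.multiProcessorCount, prop.maxThreadsPerBlock);
    return prop.multiProcessorCount;
}
```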
A thread is executed on a CUDA core
A thread block is executed on one streaming multiprocessor (SM)
Threads can be switched (context switching) during execution!
A grid (same kernel) is executed on multiple streaming multiprocessors
Thread organization: communication between threads
- Threads within the same thread block can communicate with each other using the "shared" memory (the on-chip memory of the SM that runs the block)
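A sketch of communication through shared memory, assuming a single block of 64 threads (reverseInBlock and reverseOnGpu are illustrative names). Each thread writes one value into a __shared__ array and then reads a value that a different thread wrote; the __syncthreads() barrier between the two steps makes sure every write has happened before any thread reads.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define REV_N 64   // array length = number of threads in the single block

// Reverses d[0..REV_N-1] in place using one thread block.
__global__ void reverseInBlock(int *d)
{
    __shared__ int s[REV_N];        // shared by all threads of this block
    int t = threadIdx.x;
    s[t] = d[t];                    // each thread writes one element
    __syncthreads();                // wait until all writes are done
    d[t] = s[REV_N - 1 - t];        // read the element written by thread REV_N-1-t
}

// Hypothetical helper: copies hostD to the GPU, reverses it, copies it back.
int reverseOnGpu(int *hostD)
{
    size_t bytes = REV_N * sizeof(int);
    int *dD = nullptr;
    cudaError_t err = cudaMalloc((void **)&dD, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dD, hostD, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        reverseInBlock<<<1, REV_N>>>(dD);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaMemcpy(hostD, dD, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dD);
    return err == cudaSuccess ? 0 : -1;
}
```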
- For this reason, only threads in the same thread block can synchronize (= "wait on") each other, e.g., with the __syncthreads() barrier
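Because __syncthreads() only waits for the threads of one block, a common pattern is to let each block combine its own values through shared memory and then combine the per-block results in a separate step. The sketch below (hypothetical names blockSum and sumOnGpu; 256 threads per block, a power of two, assumed) does that final step on the host.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define THREADS 256   // threads per block (a power of two for the tree below)

// Each block sums its THREADS elements of in[] into partial[blockIdx.x].
// __syncthreads() is a barrier for the threads of ONE block only, so the
// partial sums of different blocks are combined afterwards on the host.
__global__ void blockSum(const int *in, int *partial, int n)
{
    __shared__ int s[THREADS];
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    s[t] = (i < n) ? in[i] : 0;            // pad the last block with zeros
    __syncthreads();                       // all loads done before summing
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride) s[t] += s[t + stride];
        __syncthreads();                   // finish this level before the next
    }
    if (t == 0) partial[blockIdx.x] = s[0];
}

// Hypothetical helper: stores the sum of in[0..n-1] in *result; 0 on success.
int sumOnGpu(const int *in, int n, long long *result)
{
    assert(n > 0);
    int blocks = (n + THREADS - 1) / THREADS;
    int *dIn = nullptr, *dPartial = nullptr;
    int *hPartial = new int[blocks];
    cudaError_t err = cudaMalloc((void **)&dIn, (size_t)n * sizeof(int));
    if (err == cudaSuccess) err = cudaMalloc((void **)&dPartial, (size_t)blocks * sizeof(int));
    if (err == cudaSuccess) err = cudaMemcpy(dIn, in, (size_t)n * sizeof(int), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        blockSum<<<blocks, THREADS>>>(dIn, dPartial, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(hPartial, dPartial, (size_t)blocks * sizeof(int), cudaMemcpyDeviceToHost);
    long long total = 0;                   // combine per-block results on the host
    for (int b = 0; b < blocks && err == cudaSuccess; ++b) total += hPartial[b];
    *result = total;
    delete[] hPartial;
    cudaFree(dIn); cudaFree(dPartial);
    return err == cudaSuccess ? 0 : -1;
}
```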
Note:
CUDA 9 introduced cooperative groups, which allow you to synchronize all threads in a grid.
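A minimal sketch of the cooperative groups API at block level, where cg::this_thread_block() names the group and block.sync() acts like __syncthreads(); rotateLeft and rotateOnGpu are illustrative names. Synchronizing a whole grid uses cg::this_grid().sync() instead, but that additionally requires launching the kernel with cudaLaunchCooperativeKernel on a device that supports cooperative launch, so it is not shown here.

```cuda
#include <cassert>
#include <cooperative_groups.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

// Each thread of one block replaces its element with its right neighbour's
// (wrapping around), using dynamically sized shared memory.
__global__ void rotateLeft(int *d)
{
    extern __shared__ int s[];                      // size given at launch
    cg::thread_block block = cg::this_thread_block();
    unsigned t = block.thread_rank();               // equals threadIdx.x for a 1-D block
    s[t] = d[t];
    block.sync();                                   // barrier for this block only
    d[t] = s[(t + 1) % blockDim.x];
}

// Hypothetical helper: rotates hostD[0..n-1] left by one using a single block.
int rotateOnGpu(int *hostD, int n)
{
    assert(n > 0 && n <= 1024);                     // must fit in one block
    size_t bytes = (size_t)n * sizeof(int);
    int *dD = nullptr;
    cudaError_t err = cudaMalloc((void **)&dD, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dD, hostD, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        rotateLeft<<<1, n, bytes>>>(dD);            // 3rd parameter: shared memory bytes
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaMemcpy(hostD, dD, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dD);
    return err == cudaSuccess ? 0 : -1;
}
```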