How do I choose grid and block dim in CUDA kernels

=============================================================

Trick: add a limit check

__global__ void mAdd(float* A, float* B, float* C, int n)
{
    int k = threadIdx.x + blockIdx.x * blockDim.x;

    if (k < n)
        C[k] = A[k] + B[k];
}

I have added a limit check in the code so that in cases 
where more threads are launched than required operations, 
no buffer overflow can occur. 

=============================================================

If the kernel is then launched like this:

   const int n = 128 * 1024;
   int blocksize = 512; 
                     // value usually chosen by tuning and hardware constraints
   int nblocks = ceil(n / nthreads);  // = 256
                     // value determine by block size and total work

   mAdd<<<nblocks,blocksize>>>(A,B,C,n);

Then 256 blocks, each containing 512 threads, will be launched onto 
the GPU hardware to perform the array addition operation in parallel. 

Note that if the input data size was not expressible as a nice round 
multiple of the block size, the number of blocks would need to be rounded 
up to cover the full input data set.

=====================================================================

Hardware Constraints:

This is the easy to quantify part. 
Appendix F of the current CUDA programming guide lists a number of 
hard limits which limit how many threads per block a kernel launch can have. 
If you exceed any of these, your kernel will never run. They can be 
roughly summarized as:

  1. Each block cannot have more than 512/1024 threads in total 
     (Compute Capability 1.x or 2.x and later respectively)

  2. The maximum dimensions of each block are limited to 
     [512,512,64]/[1024,1024,64] (Compute 1.x/2.x or later)

  3. Each block cannot consume more than 
     8k/16k/32k/64k/32k/64k/32k/64k/32k/64k registers total 
     (Compute 1.0,1.1/1.2,1.3/2.x-/3.0/3.2/3.5-5.2/5.3/6-6.1/6.2/7.0)

  4. Each block cannot consume more than 16kb/48kb/96kb of shared memory 
     (Compute 1.x/2.x-6.2/7.0)

If you stay within those limits, any kernel you can successfully compile 
will launch without error.


Performance Tuning:

This is the empirical part. 
The number of threads per block you choose within the hardware constraints 
outlined above can and does effect the performance of code running 
on the hardware. 
How each code behaves will be different and the only real way to quantify 
it is by careful benchmarking and profiling. 
But again, very roughly summarized:

  1. The number of threads per block should be a round multiple of 
     the warp size, which is 32 on all current hardware.

  2. Each streaming multiprocessor unit on the GPU must have enough active 
     warps to sufficiently hide all of the different memory and instruction 
     pipeline latency of the architecture and achieve maximum throughput. 
     The orthodox approach here is to try achieving optimal hardware occupancy 
     (what Roger Dahl's answer is referring to).

The second point is a huge topic which I doubt anyone is going to try and 
cover it in a single StackOverflow answer. There are people writing PhD theses 
around the quantitative analysis of aspects of the problem 
(see this presentation by Vasily Volkov from UC Berkley and this paper 
by Henry Wong from the University of Toronto for examples of how complex 
the question really is).

At the entry level, you should mostly be aware that the block size you 
choose (within the range of legal block sizes defined by the constraints 
above) can and does have a impact on how fast your code will run, but 
it depends on the hardware you have and the code you are running. 
By benchmarking, you will probably find that most non-trivial code has a 
"sweet spot" in the 128-512 threads per block range, but it will require 
some analysis on your part to find where that is. The good news is that 
because you are working in multiples of the warp size, the search space 
is very finite and the best configuration for a given piece of code 
relatively easy to find.



=====

"The number of threads per block must be a round multiple of the warp size" 
this is not a must but you waste resources if it is not. 


=====
https://www.nvidia.com/content/gtc-2010/pdfs/2238_gtc2010.pdf

https://bitbucket.org/rvuduc/volkov-gtc10
=====

CUDA tip:

https://devblogs.nvidia.com/cuda-pro-tip-occupancy-api-simplifies-launch-configuration/


One of the useful functions is cudaOccupancyMaxPotentialBlockSize which 
heuristically calculates a block size that achieves the maximum occupancy. 
The values provided by that function could be then used as the starting point 
of a manual optimization of the launch parameters. Below is a little example.

	tuning.cu

