From:

  https://devblogs.nvidia.com/gpu-pro-tip-cuda-7-streams-simplify-concurrency/

Summary:

Asynchronous commands return control to the calling host thread before 
the device has finished the requested task (they are non-blocking). 
The asynchronous commands are:

    1. Kernel launches;
    2. Memory copies between two addresses within the same device memory;
    3. Memory copies from host to device of a memory block of 64 KB or less;
    4. Memory copies performed by functions with the Async suffix;
    5. Memory set function calls.
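
A minimal sketch of the non-blocking behaviour listed above. The kernel 
scale and the helper run_async_demo are made-up names for illustration, 
not from the blog post. The sketch assumes pinned host memory 
(cudaMallocHost), so that the Async copies do not fall back to staged, 
partly blocking transfers.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: multiply n floats by a.
__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Returns 0 if every element ends up as 2.0f, nonzero otherwise.
int run_async_demo()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h = nullptr, *d = nullptr;

    // Pinned host buffer (assumption: needed for truly asynchronous copies).
    if (cudaMallocHost(&h, bytes) != cudaSuccess) return 1;
    if (cudaMalloc(&d, bytes) != cudaSuccess) { cudaFreeHost(h); return 1; }
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    // These three commands are queued on the default stream (stream 0)
    // and return to the host before the device has finished them.
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, 0);
    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
    cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, 0);

    // ... unrelated host work could run here ...

    // Block until the queued work is done before reading h.
    cudaError_t err = cudaStreamSynchronize(0);

    int bad = (err != cudaSuccess);
    for (int i = 0; i < n && !bad; ++i) bad = (h[i] != 2.0f);

    cudaFree(d);
    cudaFreeHost(h);
    return bad;
}

int main()
{
    printf("%s\n", run_async_demo() == 0 ? "ok" : "failed");
    return 0;
}
```

Commands in the same stream still execute in issue order on the device; 
only the host is released early.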


How to launch a kernel on the default stream:

  kernel<<< blocks, threads, bytes >>>();    // default stream (implicit)
  kernel<<< blocks, threads, bytes, 0 >>>(); // default stream (explicit: stream 0)

  (bytes is the dynamic shared memory size per block.)
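
A small sketch of both launch forms on a complete program. The kernel 
reverse_block and the helper run_default_stream_demo are hypothetical 
names chosen for this example. The third launch parameter is used here 
as the size in bytes of dynamic shared memory per block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: reverses each block's slice of data,
// staging it in dynamic shared memory sized by the launch.
__global__ void reverse_block(int *data)
{
    extern __shared__ int tile[];
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tile[t] = data[base + t];
    __syncthreads();
    data[base + t] = tile[blockDim.x - 1 - t];
}

// Returns the number of mismatches found (0 means success).
int run_default_stream_demo()
{
    const int blocks = 4, threads = 64, n = blocks * threads;
    const size_t bytes = threads * sizeof(int);   // shared memory per block
    int h[n], *d = nullptr;
    int bad = 0;

    for (int i = 0; i < n; ++i) h[i] = i;
    if (cudaMalloc(&d, n * sizeof(int)) != cudaSuccess) return -1;
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    // Form 1: no stream argument, so the default stream is used.
    reverse_block<<<blocks, threads, bytes>>>(d);
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) {
        int base = (i / threads) * threads, t = i % threads;
        bad += (h[i] != base + threads - 1 - t);  // each block reversed
    }

    // Form 2: stream 0 passed explicitly, which is the same stream.
    reverse_block<<<blocks, threads, bytes, 0>>>(d);
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) bad += (h[i] != i); // original order again

    cudaFree(d);
    return bad;
}

int main()
{
    printf("mismatches: %d\n", run_default_stream_demo());
    return 0;
}
```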

CUDA 7 introduces a new option, the per-thread default stream, 
that has two effects. 
  First, it gives each host thread its own default stream. 
  This means that commands issued to the default stream by 
  different host threads can run concurrently. 

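  Second, these default streams are regular streams. 
  Unlike the legacy default stream, which implicitly synchronizes with 
  all other (blocking) streams on the device, commands in a per-thread 
  default stream may run concurrently with commands in non-default streams.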
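
A sketch of the first effect with several host threads, assuming C++11 
std::thread is available. The names fill, worker and run_threads are 
made up for this example. The result is the same under either 
default-stream mode; only the amount of overlap between the kernels is 
expected to change when the per-thread default stream is enabled.

```cuda
// Assumed build line (file name is hypothetical; some toolchains may
// also need -Xcompiler -pthread):
//   nvcc --default-stream per-thread -std=c++11 threads.cu -o threads
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: write v into n floats.
__global__ void fill(float *x, int n, float v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = v;
}

// Each host thread launches into "the default stream". With the
// per-thread option this is the calling thread's own stream.
void worker(float *d, int n, float v, int *status)
{
    fill<<<(n + 255) / 256, 256>>>(d, n, v);
    *status = (cudaStreamSynchronize(0) == cudaSuccess) ? 0 : 1;
}

// Returns the number of threads whose work failed (0 means success),
// or -1 if a device allocation failed.
int run_threads(int num_threads)
{
    const int n = 1 << 20;
    std::vector<float *> bufs(num_threads, nullptr);
    std::vector<int> status(num_threads, 1);
    std::vector<std::thread> pool;

    for (int t = 0; t < num_threads; ++t) {
        if (cudaMalloc(&bufs[t], n * sizeof(float)) != cudaSuccess) {
            for (float *p : bufs) cudaFree(p);   // cudaFree(nullptr) is a no-op
            return -1;
        }
    }

    for (int t = 0; t < num_threads; ++t)
        pool.emplace_back(worker, bufs[t], n, float(t), &status[t]);
    for (std::thread &th : pool) th.join();

    int failures = 0;
    for (int t = 0; t < num_threads; ++t) {
        float first = -1.0f;
        cudaMemcpy(&first, bufs[t], sizeof(float), cudaMemcpyDeviceToHost);
        failures += (status[t] != 0 || first != float(t));
        cudaFree(bufs[t]);
    }
    return failures;
}

int main()
{
    printf("failures: %d\n", run_threads(8));
    return 0;
}
```

Whether the kernels actually overlap is best checked in a profiler 
timeline rather than inferred from the code.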

To enable per-thread default streams in CUDA 7 and later, you can either 

    1. compile with the nvcc command-line option --default-stream per-thread, 

 or 2. #define the CUDA_API_PER_THREAD_DEFAULT_STREAM preprocessor macro 
       before including CUDA headers (cuda.h or cuda_runtime.h).

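DO NOT USE 2 in a .cu file compiled by nvcc:

   nvcc implicitly includes cuda_runtime.h at the top of the translation 
   unit, so a #define written in the source comes after the header and 
   has no effect. In that case use option 1, or define the macro on the 
   command line with -DCUDA_API_PER_THREAD_DEFAULT_STREAM=1.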
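
A sketch of a way to get a per-thread stream in a .cu file without 
relying on either option: pass the cudaStreamPerThread handle 
explicitly. This is an addition to the notes, not something taken from 
the blog post. It assumes the handle is provided by the CUDA runtime 
headers (CUDA 7 or later). The names add_one and run_explicit_handle 
are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: increment n ints.
__global__ void add_one(int *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1;
}

// Returns the number of wrong elements (0 means success), -1 on error.
int run_explicit_handle()
{
    const int n = 1024;
    const size_t bytes = n * sizeof(int);
    int h[n], *d = nullptr;

    if (cudaMalloc(&d, bytes) != cudaSuccess) return -1;

    // Everything is issued to the calling thread's own stream, named
    // explicitly, so the compile-time default-stream mode does not matter.
    cudaMemsetAsync(d, 0, bytes, cudaStreamPerThread);
    add_one<<<(n + 255) / 256, 256, 0, cudaStreamPerThread>>>(d, n);
    cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, cudaStreamPerThread);

    // Wait for this thread's stream only.
    if (cudaStreamSynchronize(cudaStreamPerThread) != cudaSuccess) {
        cudaFree(d);
        return -1;
    }

    int bad = 0;
    for (int i = 0; i < n; ++i) bad += (h[i] != 1);

    cudaFree(d);
    return bad;
}

int main()
{
    printf("wrong elements: %d\n", run_explicit_handle());
    return 0;
}
```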

=============================================================================

Visual (stream timeline in the NVIDIA Visual Profiler):

	nvvp  a.out