The CUDA architecture has a powerful feature to hide its worst bottleneck, the memory transfers: devices can run three different flows at the same time, namely host-to-device transfers, kernel computations, and device-to-host transfers. In this article I show how we can speed up execution by up to 3.5x (for this example) without modifying a single line of the kernel code.



We have two kernels with two inputs and one output each:

  •  operation1(d1_a, d1_b, d1_c)
  •  operation2(d2_a, d2_b, d2_c)
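The kernels themselves are not the point of this article, so their bodies are not shown; as an illustrative sketch (the element-wise addition is an assumption of mine, only the signature comes from the original), operation1 could look like:

```cuda
// Hypothetical kernel body: a simple element-wise operation on n floats.
// Only the parameter list matches the article; the computation is a guess.
__global__ void operation1(const float *a, const float *b, float *c, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)                  // guard threads past the end of the arrays
        c[idx] = a[idx] + b[idx];
}
```

Any kernel with this shape works with the streaming pattern below; the technique does not depend on what the kernel computes.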

We also have six streams and four events:

  • stream1 ... stream6
  • e1 ... e4
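The original listing does not show the setup, so as a minimal sketch, the streams and events would be created before the loop like this (plain cudaEventCreate works equally well; disabling timing is just a small optimization for events used only for ordering):

```cuda
cudaStream_t stream1, stream2, stream3, stream4, stream5, stream6;
cudaEvent_t  e1, e2, e3, e4;

cudaStreamCreate(&stream1);  cudaStreamCreate(&stream2);
cudaStreamCreate(&stream3);  cudaStreamCreate(&stream4);
cudaStreamCreate(&stream5);  cudaStreamCreate(&stream6);

// These events only enforce cross-stream ordering, never timing,
// so we can skip recording timestamps.
cudaEventCreateWithFlags(&e1, cudaEventDisableTiming);
cudaEventCreateWithFlags(&e2, cudaEventDisableTiming);
cudaEventCreateWithFlags(&e3, cudaEventDisableTiming);
cudaEventCreateWithFlags(&e4, cudaEventDisableTiming);
```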

Finally, we want a flow similar to this:

                           time ──────────────────────────────────────▶
Host To Device transfer    [d1_a, d1_b][d2_a, d2_b]
Kernel computation                     [operation1][operation2]
Device To Host transfer                            [d1_c]      [d2_c]



The code

Sometimes it is not necessary to use events to synchronize streams, but in this case we need them to avoid data races. Let's look at the computational loop of the code:

  for (int i = 0; i < 5; i++) {
   // MemCpy H2D of Kernel 1
   cudaMemcpyAsync(d1_a, h1_a, bytes, cudaMemcpyHostToDevice, stream1);
   cudaMemcpyAsync(d1_b, h1_b, bytes, cudaMemcpyHostToDevice, stream1);
   cudaEventRecord(e1, stream1);
   // MemCpy H2D of Kernel 2
   cudaMemcpyAsync(d2_a, h2_a, bytes, cudaMemcpyHostToDevice, stream4);
   cudaMemcpyAsync(d2_b, h2_b, bytes, cudaMemcpyHostToDevice, stream4);
   cudaEventRecord(e3, stream4);
   // Computation Kernel 1
   cudaStreamWaitEvent(stream2, e1, 0);
   operation1<<<gridSize, blockSize, 0, stream2>>>(d1_a, d1_b, d1_c, n);
   cudaEventRecord(e2, stream2);
   // Computation Kernel 2
   cudaStreamWaitEvent(stream5, e3, 0);
   operation2<<<gridSize, blockSize, 0, stream5>>>(d2_a, d2_b, d2_c, n);
   cudaEventRecord(e4, stream5);
   // MemCpy D2H of Kernel 1
   cudaStreamWaitEvent(stream3, e2, 0);
   cudaMemcpyAsync(h1_c, d1_c, bytes, cudaMemcpyDeviceToHost, stream3);
   // MemCpy D2H of Kernel 2
   cudaStreamWaitEvent(stream6, e4, 0);
   cudaMemcpyAsync(h2_c, d2_c, bytes, cudaMemcpyDeviceToHost, stream6);
  }
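One detail the loop relies on, although the article does not show the allocations: cudaMemcpyAsync only overlaps with computation when the host buffers are page-locked (pinned). A sketch of the setup under that assumption, for n floats per buffer:

```cuda
size_t bytes = n * sizeof(float);

// Pinned host memory: required for truly asynchronous H2D/D2H copies.
// Buffers allocated with plain malloc would make cudaMemcpyAsync
// fall back to synchronous behavior with respect to the host.
float *h1_a, *h1_b, *h1_c;
cudaMallocHost((void **)&h1_a, bytes);
cudaMallocHost((void **)&h1_b, bytes);
cudaMallocHost((void **)&h1_c, bytes);
// ... same for h2_a, h2_b, h2_c; device buffers use cudaMalloc as usual:
float *d1_a;
cudaMalloc((void **)&d1_a, bytes);
```

After the loop, a cudaDeviceSynchronize() (or cudaStreamSynchronize on stream3 and stream6) is needed before reading h1_c and h2_c on the host.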


Therefore the dependencies defined are:

stream   | items in the stream | waits for
stream1  | d1_a, d1_b          | –
stream2  | operation1          | e1 (recorded on stream1)
stream3  | d1_c                | e2 (recorded on stream2)
stream4  | d2_a, d2_b          | –
stream5  | operation2          | e3 (recorded on stream4)
stream6  | d2_c                | e4 (recorded on stream5)




Although this application does not involve heavy computational work, it is enough to illustrate the technique, and with it we obtained speed-ups of 2.5x and 3.2x on the Fermi and Kepler architectures respectively:

        | Streams   | No Streams | Speed-up
Fermi   | 16.558 ms | 41.917 ms  | 2.5x
Kepler  | 16.271 ms | 52.19 ms   | 3.2x


The NVIDIA Visual Profiler allows us to profile and visualize the execution of our application. I used this tool to see what happens in each of these situations on the different architectures.

Fermi profile without streams:


The last row shows the default stream, stream zero. This stream is responsible for all the work, both computations and transfers, so it takes more time than when we split the work across several streams.
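For comparison, the baseline version profiled here would issue everything on the default stream; a sketch (the article does not show this code, so the exact listing is an assumption):

```cuda
// Everything on stream 0: each call blocks the next, so transfers and
// kernels of the two operations serialize into one long timeline.
cudaMemcpy(d1_a, h1_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d1_b, h1_b, bytes, cudaMemcpyHostToDevice);
operation1<<<gridSize, blockSize>>>(d1_a, d1_b, d1_c, n);
cudaMemcpy(h1_c, d1_c, bytes, cudaMemcpyDeviceToHost);

cudaMemcpy(d2_a, h2_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d2_b, h2_b, bytes, cudaMemcpyHostToDevice);
operation2<<<gridSize, blockSize>>>(d2_a, d2_b, d2_c, n);
cudaMemcpy(h2_c, d2_c, bytes, cudaMemcpyDeviceToHost);
```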


Fermi profile with streams:


This image shows the six streams in the last six rows. This way we achieve the overlapping we wanted.


I provide the code and the profiling data in this file, so you can check it yourself. I also recommend this article from Parallel Forall, NVIDIA's official blog.