OpenCL can benefit from the hardware resources that some architectures provide. The overlap between computation and transfers explained in the previous article can also be achieved with this language. This means managing three different flows at the same time: host-to-device transfers, kernel computations, and device-to-host transfers. In this article I show how, for this example, we can speed up execution by up to 3.5x without modifying a single line of the OpenCL kernel code.


We have one kernel with two inputs and one output (a matrix multiplication):

  •  operation(d_a, d_b, d_c)

We have declared three command queues, one per flow, and two event variables, one of which is an array of two:

  • inQueue, outQueue, kernelQueue
  • evTransIn[2], evKernel
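A minimal sketch of how these queues and events might be declared (the `context`, `device`, and `clError` variables are assumed to exist; this fragment is not in the original listing):

```c
// Three in-order command queues, one per flow. Separate queues let the
// runtime overlap transfers and kernel execution on capable hardware.
cl_command_queue inQueue     = clCreateCommandQueue(context, device, 0, &clError);
cl_command_queue kernelQueue = clCreateCommandQueue(context, device, 0, &clError);
cl_command_queue outQueue    = clCreateCommandQueue(context, device, 0, &clError);

// Events used to chain the flows together.
cl_event evTransIn[2];  // signalled by the two input transfers
cl_event evKernel;      // signalled by the kernel launch
```

Because each queue is in-order, ordering within a flow is automatic; the events only express the dependencies between flows.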

Finally, we want a flow similar to this:

                          step 1       step 2       step 3       step 4
Host To Device transfer   d1_a, d1_b   d2_a, d2_b
Kernel computation                     operation    operation
Device To Host transfer                             d1_c         d2_c

The code

First of all we need to keep our objects' memory pinned in RAM, that is, we map the memory:

// A buffer, to work on the GPU (no host pointer is passed, since we do not
// use CL_MEM_USE_HOST_PTR or CL_MEM_COPY_HOST_PTR)
memObjects[0] = clCreateBuffer(context, CL_MEM_READ_ONLY, totalBytes,
	NULL, &clError);
// A buffer, to work on the CPU (pinned host memory)
memObjects[3] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
	totalBytes, NULL, &clError);
// Map memObjects[3] so the buffer is usable from the CPU as mat_A
mat_A = (float *) clEnqueueMapBuffer(inQueue, memObjects[3], CL_TRUE, CL_MAP_WRITE,
	0, totalBytes, 0, NULL, NULL, &clError);
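When the pinned buffer is no longer needed, for instance at teardown, the mapping has to be released before the buffer itself. A sketch of that cleanup, assuming the names from the listing above (this call is not shown in the original):

```c
// Release the CPU-side mapping of the pinned buffer, then the buffer itself.
clCheckError(clEnqueueUnmapMemObject(inQueue, memObjects[3], mat_A,
	0, NULL, NULL), "Error when unmapping A buffer");
clReleaseMemObject(memObjects[3]);
```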


This is the loop where we do the computation:

	/* blockFlag must be CL_FALSE (non-blocking) for the three flows to overlap */
	for (unsigned int k = 0; k < iterations; k++) {
		/************** Write input buffer **************/
		clCheckError(clEnqueueWriteBuffer(inQueue, memObjects[0], blockFlag, 0,
			totalBytes, mat_A, 0, NULL, &evTransIn[0]), "Error when writing A buffer");
		clCheckError(clEnqueueWriteBuffer(inQueue, memObjects[1], blockFlag, 0,
			totalBytes, mat_B, 0, NULL, &evTransIn[1]), "Error when writing B buffer");
		/************** Launch Kernel **************/
		size_t localWorkSize[2] = { BLOCK_SIZE, BLOCK_SIZE };
		size_t globalWorkSize[2] = { dim, dim };
		clError = clEnqueueNDRangeKernel(kernelQueue, kernel, 2, NULL,
			globalWorkSize, localWorkSize, 2, evTransIn, &evKernel);
		clCheckError(clError, "Failed to launch kernel");
		/************** Read output buffer **************/
		clCheckError(clEnqueueReadBuffer(outQueue, memObjects[2], blockFlag, 0,
			totalBytes, d_mat_C, 1, &evKernel, NULL), "Failed to read result");
	}

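Once the loop has been enqueued, the host has to wait for the three queues to drain before touching the results. A minimal sketch, assuming the queues and events named above (the release calls are my addition, not part of the original listing):

```c
/* Wait until every queued transfer and kernel has finished. */
clCheckError(clFinish(inQueue),     "Error finishing inQueue");
clCheckError(clFinish(kernelQueue), "Error finishing kernelQueue");
clCheckError(clFinish(outQueue),    "Error finishing outQueue");

/* Release the events of the last iteration. */
clReleaseEvent(evTransIn[0]);
clReleaseEvent(evTransIn[1]);
clReleaseEvent(evKernel);
```

Note that the enqueue calls return as soon as the commands are queued; it is `clFinish` (or a blocking final read) that actually synchronizes the host with the device.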

Therefore the dependencies are:

command queue   task         waits for
inQueue         write d1_a   nothing (signals evTransIn[0])
inQueue         write d1_b   nothing (signals evTransIn[1])
kernelQueue     operation    the 2 events in evTransIn (signals evKernel)
outQueue        read d1_c    the 1 event in evKernel


The full code is available on GitHub.