Introduction

In this article I'll discuss the high performance that a graphics card or coprocessor can give us. Peak performance on parallel problems is much higher than on a CPU, so we can solve the same problem with less energy. OpenCL C is the standard language for parallelizing algorithms on the GPU, although getting the best results can make the code noticeably more complex. To illustrate this, I'll solve a typical parallel-programming example, the iterative calculation of PI, on both devices. The straightforward C++ code is the following:

 

// Approximate PI by integrating 4 / (1 + x^2) over [0, 1] with the
// midpoint rule, using niter segments of width h.
float pi_cpu(int niter) {
        const float h = 1.0f / (float) niter;
        float sum = 0.0f;
        for (int i = 1; i <= niter; i++) {
            float x = h * ((float) i - 0.5f);   // midpoint of the i-th segment
            sum += 4.0f / (1.0f + x * x);
        }
        return h * sum;
}

where niter is the number of iterations, i.e. the number of segments into which the integration interval [0, 1] is divided. When niter is big enough, the result is approximately PI.
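For reference, the loop is just a midpoint-rule approximation of a standard integral (written here in LaTeX):

    \pi = \int_0^1 \frac{4}{1+x^2}\,dx
        \approx h \sum_{i=1}^{n} \frac{4}{1 + x_i^2},
    \qquad x_i = h\left(i - \tfrac{1}{2}\right), \quad h = \frac{1}{n}.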

 

Solving the problem

I have written two different kernels to solve this problem on GPU platforms. Neither of them is the best possible solution, although both outperform the CPU implementation. You can download the full code from GitHub.

Common tasks in both kernels (a sketch follows the list):

  • Compute a stride of the total iterations
  • Save the per-thread partial sum in shared (local) memory
  • After the kernel finishes, the CPU sums the results of all workgroups
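
A minimal sketch of that shared structure in OpenCL C could look like the following. The names (pi_kernel, partial, scratch) and the exact indexing are my own assumptions for illustration, not necessarily those of the code on GitHub:

__kernel void pi_kernel(const int niter,
                        __global float *partial,   // one slot per workgroup
                        __local  float *scratch)   // one slot per thread
{
        const int lid    = get_local_id(0);
        const int gid    = get_global_id(0);
        const int stride = get_global_size(0);
        const float h    = 1.0f / (float) niter;

        // Each thread accumulates its own stride of the total iterations.
        float sum = 0.0f;
        for (int i = gid + 1; i <= niter; i += stride) {
            float x = h * ((float) i - 0.5f);
            sum += 4.0f / (1.0f + x * x);
        }

        // Save the per-thread partial sum in shared (local) memory.
        scratch[lid] = sum;
        barrier(CLK_LOCAL_MEM_FENCE);

        // ... the reduction over scratch[] goes here; it is what differs
        // between the two kernels. The workgroup result is written to
        // partial[get_group_id(0)], and the CPU adds up those per-workgroup
        // values afterwards.
}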

In the first kernel, the thread whose local id is zero sums all the elements of the local array and then stores the result in global memory. This is a performance leak, because during the reduction only one thread per workgroup is doing any work.
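
Continuing the sketch above (same assumed names), the reduction step of kernel 1 would be something like the fragment below, replacing the placeholder comment:

        // Kernel 1 style: a single thread serially sums the whole local array.
        if (lid == 0) {
            float acc = 0.0f;
            for (int j = 0; j < (int) get_local_size(0); j++)
                acc += scratch[j];
            partial[get_group_id(0)] = acc;   // one partial result per workgroup
        }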

The second one behaves somewhat differently. Each thread iterates log2(workgroup-size) times through a loop that performs a tree reduction over the shared array within the workgroup. This still leaks performance, because on the first iteration only workgroup-size/2 threads are active, on the second workgroup-size/4, on the third workgroup-size/8, and so on.
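
A sketch of that logarithmic reduction, again with the same assumed names and plugged into the same placeholder:

        // Kernel 2 style: log2(workgroup-size) steps of a tree reduction.
        // At each step only half of the remaining threads do useful work.
        for (int offset = (int) get_local_size(0) / 2; offset > 0; offset >>= 1) {
            if (lid < offset)
                scratch[lid] += scratch[lid + offset];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)
            partial[get_group_id(0)] = scratch[0];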

 

Results

As you might expect, heterogeneous computing wins the challenge. My GPU has only eight multiprocessors with eight processing units each, yet that is enough to show the GPU's power. Let's look at the timings. When we execute the program we see something similar to this:

[Screenshot: sample execution of the program]

The output tells us how long the calculation took on each device and how accurate the result is, as well as device info and settings. Testing with different numbers of iterations:

Number of iterations                  Kernel 2 elapsed time (µs)   CPU elapsed time (µs)
2⁹  × 64 = 32768,   chunksize=64                           169                      539
2¹⁰ × 64 = 65536,   chunksize=64                           147                     1118
2¹² × 64 = 262144,  chunksize=64                           151                     4316
2¹³ × 64 = 524288,  chunksize=64                           217                     8944
2¹⁴ × 64 = 1048576, chunksize=64                           304                    17723

 

It is clearer with a graph:

[Figure: CPU vs GPU elapsed time]

 

To conclude, I compared the two kernels against each other: the first with its simple serial reduction and the second with the logarithmic reduction, which is still not optimal. The best implementation would keep all the hardware busy on every cycle.

Number of iterations                     Kernel 1 elapsed time (µs)   Kernel 2 elapsed time (µs)
2²¹ × 64 = 134217728,  chunksize=64                           28219                       26252
2²² × 64 = 268435456,  chunksize=64                           56479                       51758
2²³ × 64 = 536870912,  chunksize=64                          112669                      103251
2²⁴ × 64 = 1073741824, chunksize=64                          225098                      206182

 

The respective graph:

[Figure: Kernel 1 vs Kernel 2 elapsed time]

There are several ways to tune this kind of code further; one of them is improving the reduction operations.