Introduction
In this article I'll discuss the high performance that a graphics card or coprocessor can give us. For parallel problems, peak performance is much higher than on a CPU, so we can solve the same problem with less energy. OpenCL C is the standard language for parallelizing algorithms on the GPU, although squeezing out the best results can make the code more complex. To illustrate this, I'll solve a typical parallel-programming example on both devices: the iterative calculation of PI. The straightforward C++ code is the following:
```cpp
float pi_cpu(int niter) {
    const float h = 1.0f / (float) niter;  // width of each segment
    float pi = 0.0f;
    for (int i = 0; i < niter; i++) {
        float x = h * ((float) i + 0.5f);  // midpoint of segment i
        pi += 4.0f / (1.0f + x * x);
    }
    return h * pi;
}
```
Here niter is both the number of iterations and the number of segments into which the area is divided. When niter is large enough, the result approximates PI.
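For reference, the loop is the midpoint-rule approximation of a standard integral identity for PI:

```latex
\pi = \int_0^1 \frac{4}{1+x^2}\,dx
    \;\approx\; h \sum_{i=0}^{n-1} \frac{4}{1 + x_i^2},
    \qquad x_i = \left(i + \tfrac{1}{2}\right) h,
    \quad h = \frac{1}{n}
```

Each term of the sum is the height of the curve at the midpoint of segment i, and multiplying by h gives the area of that segment.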
Solving the problem
I have written two different kernels to solve this problem on GPU platforms. Neither is the best possible solution, although both get better performance than the CPU implementation. You can download the full code from GitHub.
Both kernels share these common tasks:

- Each thread computes a strided subset of the total iterations
- Each thread saves its partial sum in shared (local) memory
- After the kernel finishes, the CPU adds up the per-workgroup results
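The steps above can be sketched as a CPU model of the kernels' common structure. This is not the actual OpenCL code from the repository, just a sequential emulation with hypothetical names (`pi_strided`, `group_sum`) that mimics how threads stride over the iterations and pool their partial sums per workgroup:

```cpp
#include <vector>
#include <cmath>

// CPU emulation of the kernels' common structure: each "thread" accumulates
// a strided subset of the iterations, threads in a workgroup pool their
// partial sums (the role of shared/local memory), and the host performs the
// final sum over the per-workgroup results.
float pi_strided(int niter, int num_groups, int group_size) {
    const float h = 1.0f / (float) niter;
    const int total_threads = num_groups * group_size;
    std::vector<float> group_sum(num_groups, 0.0f);  // stands in for global memory

    for (int g = 0; g < num_groups; ++g) {
        for (int l = 0; l < group_size; ++l) {
            int tid = g * group_size + l;            // global thread id
            float acc = 0.0f;
            // strided loop: thread tid handles iterations tid, tid+T, tid+2T, ...
            for (int i = tid; i < niter; i += total_threads) {
                float x = h * ((float) i + 0.5f);    // midpoint of segment i
                acc += 4.0f / (1.0f + x * x);
            }
            group_sum[g] += acc;                     // pooled in "shared memory"
        }
    }

    float pi = 0.0f;                                 // host-side final reduction
    for (float s : group_sum) pi += s;
    return h * pi;
}
```

On the GPU the outer two loops run in parallel, so the only sequential parts left are the in-workgroup reduction and the small host-side sum over workgroups.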
In the first kernel, the thread whose local id is zero adds up all the elements in the shared array and then saves the result in global memory. This is a performance leak: during the reduction, only one thread per workgroup is doing any work.
The second kernel behaves somewhat differently. Each thread iterates log₂(workgroup-size) times through a loop that performs a tree reduction over the shared array within the workgroup. This still leaks performance, because on the first iteration only workgroup-size/2 threads are active, on the second workgroup-size/4, on the third workgroup-size/8, and so on.
Results
As you might expect, heterogeneous computing wins the challenge. My GPU has only eight multiprocessors with eight processing units each, yet that is enough to show the GPU's power. Let's look at the timings. When we execute the program we see something similar to this:
The output tells us how long the calculation took on each device and how accurate the result is, as well as the device info and settings. Testing with different numbers of iterations:
| Number of iterations | Kernel 2 elapsed time (µs) | CPU elapsed time (µs) |
|---|---|---|
| 2⁹·64 = 32768, chunksize = 64 | 169 | 539 |
| 2¹⁰·64 = 65536, chunksize = 64 | 147 | 1118 |
| 2¹²·64 = 262144, chunksize = 64 | 151 | 4316 |
| 2¹³·64 = 524288, chunksize = 64 | 217 | 8944 |
| 2¹⁴·64 = 1048576, chunksize = 64 | 304 | 17723 |
A graph makes it clearer:
To conclude, I compared the two kernels: the first with its simple single-thread reduction and the second with a logarithmic reduction, which is still not optimal. A better implementation would keep all the hardware busy in every cycle.
| Number of iterations | Kernel 1 elapsed time (µs) | Kernel 2 elapsed time (µs) |
|---|---|---|
| 2²¹·64 = 134217728, chunksize = 64 | 28219 | 26252 |
| 2²²·64 = 268435456, chunksize = 64 | 56479 | 51758 |
| 2²³·64 = 536870912, chunksize = 64 | 112669 | 103251 |
| 2²⁴·64 = 1073741824, chunksize = 64 | 225098 | 206182 |
The corresponding graph:
There are several ways to tune this code further; one of them is improving the reduction operation.