# Blog

Threads have long been used in programming to overlap operations that do not depend on each other. Nowadays, with multi-core processors, they are essential for any application that needs good performance. They also introduce problems, such as several threads accessing shared variables at the same time or the need to synchronize tasks in the code. I will try to cover how to deal with these topics in C++11, analyzing the executions with Helgrind, a Valgrind tool.
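To make the shared-variable problem concrete, here is a minimal sketch in C++11. The function name `count_with_threads` and its parameters are hypothetical, chosen only for illustration. Several threads increment one counter, and a `std::mutex` serializes the increments so that no two threads touch the variable at the same time.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper (not from the original post): starts `num_threads`
// threads, each adding 1 to a shared counter `increments` times.
long count_with_threads(int num_threads, int increments) {
    assert(num_threads >= 0 && increments >= 0);

    long counter = 0;          // shared variable
    std::mutex counter_mutex;  // protects `counter`

    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, &counter_mutex, increments] {
            for (int i = 0; i < increments; ++i) {
                // The lock is released when `lock` goes out of scope.
                std::lock_guard<std::mutex> lock(counter_mutex);
                ++counter;
            }
        });
    }

    // Wait for every worker before reading the result.
    for (auto& w : workers) {
        w.join();
    }
    return counter;
}
```

Calling `count_with_threads(4, 10000)` from `main` should return 40000. The code can be built with something like `g++ -std=c++11 -pthread -g` and run under `valgrind --tool=helgrind`. Without the `std::lock_guard` line the increments would become a data race, which is the kind of error Helgrind is designed to report.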

LU factorization is a popular method for decomposing a matrix into the product of a lower triangular matrix and an upper triangular matrix. Many simulations need to solve a system of linear equations. In general, solving these systems directly is expensive, so matrix decomposition is frequently used to split the problem into simpler ones. Because a triangular system is easier to solve than a general one, this is usually the first step in many simulations.
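As a rough illustration of the idea, the sketch below uses the Doolittle variant of LU factorization without pivoting, which is an assumption of this example and not necessarily the variant used in the post. The names `Matrix`, `lu_decompose` and `lu_solve` are hypothetical. It assumes a square matrix whose pivots are never zero. Once `L` and `U` are known, the system is solved with one forward and one backward substitution.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Doolittle LU without pivoting: A = L * U, where L has ones on its diagonal.
// Assumes `a` is square and that no zero pivot appears.
void lu_decompose(const Matrix& a, Matrix& l, Matrix& u) {
    const std::size_t n = a.size();
    for (std::size_t i = 0; i < n; ++i) {
        assert(a[i].size() == n);  // square matrix expected
    }
    l.assign(n, std::vector<double>(n, 0.0));
    u.assign(n, std::vector<double>(n, 0.0));

    for (std::size_t i = 0; i < n; ++i) {
        // Row i of U.
        for (std::size_t j = i; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < i; ++k) sum += l[i][k] * u[k][j];
            u[i][j] = a[i][j] - sum;
        }
        assert(u[i][i] != 0.0);  // a zero pivot would require pivoting
        l[i][i] = 1.0;
        // Column i of L, below the diagonal.
        for (std::size_t j = i + 1; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < i; ++k) sum += l[j][k] * u[k][i];
            l[j][i] = (a[j][i] - sum) / u[i][i];
        }
    }
}

// Solves L * U * x = b: forward substitution (L y = b),
// then backward substitution (U x = y).
std::vector<double> lu_solve(const Matrix& l, const Matrix& u,
                             const std::vector<double>& b) {
    const std::size_t n = b.size();
    std::vector<double> y(n), x(n);
    for (std::size_t i = 0; i < n; ++i) {
        double sum = b[i];
        for (std::size_t k = 0; k < i; ++k) sum -= l[i][k] * y[k];
        y[i] = sum;  // the diagonal of L is 1
    }
    for (std::size_t i = n; i-- > 0;) {
        double sum = y[i];
        for (std::size_t k = i + 1; k < n; ++k) sum -= u[i][k] * x[k];
        x[i] = sum / u[i][i];
    }
    return x;
}
```

For the matrix with rows (4, 3) and (6, 3) and the right-hand side (10, 12), the factorization gives `l[1][0] = 1.5` and `u[1][1] = -1.5`, and `lu_solve` returns x = (1, 2). Production code would normally add partial pivoting for numerical stability.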

# Introduction

OpenCL can take advantage of the hardware resources that some architectures provide. The overlap between computations and transfers explained in the previous article can also be achieved with this language. This means managing three different flows at the same time: host-to-device transfers, kernel computations and device-to-host transfers. In this article I show how we can speed up execution by up to 3.5x (for this example) without modifying a single line of the OpenCL kernel code.