Working With CUDA

We’ve recently been working with a cool technology that is rapidly penetrating scientific and engineering computing but seems little known otherwise. It’s called CUDA. In a nutshell, it is an SDK that lets you run parallelizable, compute-intensive applications on your Nvidia graphics card instead of serially on your CPU.

CUDA is one of a number of emerging methods that all enable more or less the same thing. A similar approach is behind OpenCL, which is backed by Apple and purports to be more cross-platform, in that it will eventually let developers write code for a range of GPUs, not just those from Nvidia. At the moment, however, OpenCL is limited to Macs running Apple’s new Snow Leopard, so I’m not sure that’s more open than CUDA.

Our use of CUDA involves running a moderately intensive frame-based image-processing algorithm on extremely high-definition images. The images are 3840×1080 monsters, and the goal is to process a total of 24 such images per second, so it’s definitely not something you can accomplish on a CPU alone.

Our client in this case had originally implemented the image-processing algorithm in MATLAB, so our first task was to convert it from MATLAB to C. The resulting C code was the basis for using Nvidia’s CUDA SDK to get the algorithm running on the Nvidia board. We also had to tie the entire framework into Microsoft’s DirectShow architecture, because the data is delivered to us as H.264-encoded images.

The parallelization comes through the GPU chip’s many stream processors (also called thread processors). For example, the GeForce 8 GPU has 128 stream processors. Each stream processor comprises 1,024 registers and a 32-bit floating-point unit. The stream processors are grouped on the chip into clusters of eight. Each cluster shares a common 16 KB memory and is referred to as a “core.”
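If you want to see these figures for your own board, the CUDA runtime will report them. Below is a minimal sketch using cudaGetDeviceProperties; it is not specific to our project, and the exact fields available depend on your SDK version. Compile it with nvcc as a .cu file.

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        /* Print the per-"core" (multiprocessor) resources discussed above. */
        printf("Device %d: %s\n", dev, prop.name);
        printf("  Multiprocessors (cores):  %d\n", prop.multiProcessorCount);
        printf("  Registers per block:      %d\n", prop.regsPerBlock);
        printf("  Shared memory per block:  %zu bytes\n", prop.sharedMemPerBlock);
        printf("  Max threads per block:    %d\n", prop.maxThreadsPerBlock);
    }
    return 0;
}
```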

From a software perspective, each stream processor executes multiple threads, each of which has its own memory, stack, and register file. On the GeForce 8 GPU, each stream processor can handle 96 concurrent threads (for a total of 12,288), although this number is seldom reached in practice. Fortunately, programmers do not need to write explicitly threaded code, since a hardware thread manager handles the threading automatically. The biggest challenge for the programmer is to decompose the data properly so that all 128 stream processors stay busy.
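To give a feel for the numbers, here is a minimal sketch of one possible decomposition of our 3840×1080 frames, assuming one thread per pixel, grouped into hypothetical 16×16-pixel tiles (CUDA’s “thread blocks,” described below); the tile size is purely illustrative.

```c
#include <stdio.h>
#include <cuda_runtime.h>

#define TILE 16   /* hypothetical 16x16-pixel tile, 256 threads per block */

int main(void)
{
    int width = 3840, height = 1080;

    /* One thread per pixel: the frame decomposes into 240 x 68 = 16,320
     * independent thread blocks, which gives the hardware thread manager
     * plenty of work to spread across 128 stream processors. */
    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);

    printf("%u x %u blocks of %u threads = %u threads per frame\n",
           grid.x, grid.y, block.x * block.y,
           grid.x * grid.y * block.x * block.y);
    return 0;
}
```

Roughly four million fine-grained threads per frame is exactly the sort of workload the hardware thread manager is built to juggle.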

In an efficient decomposition, the data is subdivided into chunks that can be allocated to a core. Algorithms execute most efficiently when the data for the threads executing on a core can all be stored in that core’s local memory. Data is transferred back and forth between the host CPU and the GPU’s global memory (i.e., the graphics device’s memory) via DMA. Data is then transferred between device memory and core memory as needed. An efficient implementation minimizes the number of device-to-core data transfers.
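The host side of that traffic looks roughly like the sketch below. The function name process_frame is ours, and for brevity it allocates the device buffers on every call and omits error checking; in a real 24-frames-per-second pipeline you would allocate the buffers once and reuse them.

```c
#include <cuda_runtime.h>

/* Illustrative per-frame driver: copy one frame from host memory to the
 * GPU's global (device) memory, run the processing kernel, and copy the
 * result back to the host. */
void process_frame(const unsigned char *h_in, unsigned char *h_out,
                   size_t frame_bytes)
{
    unsigned char *d_in = NULL, *d_out = NULL;

    /* Allocate buffers in device (global) memory. */
    cudaMalloc((void **)&d_in, frame_bytes);
    cudaMalloc((void **)&d_out, frame_bytes);

    /* DMA the frame from the host to device memory. */
    cudaMemcpy(d_in, h_in, frame_bytes, cudaMemcpyHostToDevice);

    /* Kernel launch goes here; see the kernel sketch below.
     * edge_kernel<<<grid, block>>>(d_in, d_out, width, height); */

    /* DMA the processed frame back to the host. */
    cudaMemcpy(h_out, d_out, frame_bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
}
```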

In CUDA terminology, a block of threads runs on each core. While one thread block is processing its data, other thread blocks (running on other cores) perform identical operations in parallel on other data. Thread blocks are required to execute independently, and it must be possible to execute them in any order. To specify the operations of a thread block, programmers define C functions called kernels. The kernel function is executed by every thread in a thread block. Fortunately, CUDA provides specific support for synchronizing the threads within a block.
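As a concrete, if toy, example of the kernel-plus-synchronization pattern (it is not our client’s algorithm): each 16×16 thread block below stages its tile of the frame in the core’s shared memory, calls __syncthreads() so every thread waits until the tile is fully loaded, and only then computes its output pixel from the staged data.

```c
#define TILE 16

/* Toy kernel: each thread block loads a 16x16 tile of the frame into
 * shared memory, synchronizes, and then each thread writes the absolute
 * difference between its pixel and its left-hand neighbor within the
 * tile (a crude horizontal edge detector, purely for illustration). */
__global__ void edge_kernel(const unsigned char *in, unsigned char *out,
                            int width, int height)
{
    __shared__ unsigned char tile[TILE][TILE];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    /* Clamp out-of-range threads to the frame edge so that every thread
     * in the block reaches the barrier below. */
    int cx = min(x, width - 1);
    int cy = min(y, height - 1);
    tile[threadIdx.y][threadIdx.x] = in[cy * width + cx];

    /* Wait until the whole tile is staged in shared memory before any
     * thread reads a neighbor's element. */
    __syncthreads();

    int left = (threadIdx.x > 0) ? threadIdx.x - 1 : threadIdx.x;
    unsigned char diff = (unsigned char)abs(
        (int)tile[threadIdx.y][threadIdx.x] - (int)tile[threadIdx.y][left]);

    if (x < width && y < height)
        out[y * width + x] = diff;
}

/* Launch, one thread per pixel:
 *   dim3 block(TILE, TILE);
 *   dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
 *   edge_kernel<<<grid, block>>>(d_in, d_out, width, height);           */
```

Note that the clamping before the barrier matters: every thread in a block must reach __syncthreads(), including those that map past the edge of the frame.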

We have enjoyed this project immensely and are excited by CUDA’s prospects. We look forward to using CUDA on other projects in the future. More information on CUDA can be found on Nvidia’s website. I found this article particularly useful as an introduction.

Mike Perkins, Ph.D., is a managing partner of Cardinal Peak and an expert in algorithm development for video and signal processing applications.