Saturday, May 14, 2011

Some Fragments of CUDA, from Johan's slides


1. Multiple levels of parallelism.
The first level is the grid, which consists of a grid of thread blocks. Each thread block contains a number of threads, up to 512 threads per block.
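A minimal sketch of how this hierarchy is typically used (the kernel name and sizes are my own illustration, not from the slides): each thread combines its block index and thread index into a global element index.

__global__ void scale(float *data, float factor, int n)
{
    // global index = block offset + thread offset within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// launched as a 1D grid of 1D blocks, e.g. 256 threads per block (at most 512):
// int threads = 256;
// int blocks  = (n + threads - 1) / threads;
// scale<<<blocks, threads>>>(d_data, 2.0f, n);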


2. Programming Interface.
CUDA can be seen as a minimal set of C extensions plus a runtime library, which provides built-in types and a subset of the C standard library. A host component, which executes on the CPU, controls and accesses the GPU to run the specified kernel code.
  • Function qualifiers specify where a function may be called from and where it executes;
  • Variable type qualifiers define where space for the variable is allocated;
  • The kernel execution directive defines the size of the grid and the thread block, as in the line below:

function<<<grid_dimension, block_dimension>>>( ... )
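As a small illustration of these qualifiers together (my own sketch, not from the slides):

__device__ float square(float x)      // __device__: runs on the GPU, callable from device code
{
    return x * x;
}

__global__ void fill(float *out)      // __global__: a kernel, launched from the host
{
    __shared__ float buffer[256];     // __shared__: allocated in per-block shared memory
    int i = threadIdx.x;              // assumes the block has 256 threads
    buffer[i] = square((float)i);
    __syncthreads();
    out[blockIdx.x * blockDim.x + i] = buffer[i];
}

// fill<<<grid_dimension, 256>>>(d_out);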
3. NVCC compiler
nvcc automatically handles include files and linking. However, most STL/standard-library components, such as string and iostream, and exceptions are not supported in device code. CUDA source files need the ".cu" extension.
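For example, compiling a single source file might look like this (the file name is just a placeholder):

nvcc -o vector_add vector_add.cu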

Operations on GPU:
  • fast, less accurate functions, such as __sinf(x)
  • __syncthreads()
  • type conversion and type casting functions
  • texture functions
  • atomic functions

Operations on CPU:
  • device management
  • memory management
  • texture management
  • OpenGL inter-operation
  • asynchronous concurrent execution
  • low-level API
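A small sketch (mine, not from the slides) that combines two of the device-side operations listed above, __syncthreads() and an atomic function:

__global__ void blockSum(const int *in, int *total, int n)
{
    __shared__ int partial;           // shared-memory atomics need a sufficiently recent GPU
    if (threadIdx.x == 0)
        partial = 0;
    __syncthreads();                  // wait until the shared counter is initialized

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&partial, in[i]);   // atomic update of the per-block sum
    __syncthreads();                  // wait until every thread has added its value

    if (threadIdx.x == 0)
        atomicAdd(total, partial);    // one atomic per block into the global total
}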

4. A GPU consists of N multiprocessors (MPs), and each MP has M scalar processors. Each MP processes batches of blocks. Each block is split into SIMD groups of threads called "warps". During execution, the scheduler switches between warps to hide latency.
5. Memory optimization
 For best performance, global memory accesses should be coalesced. Memory allocations should be contiguous and aligned: the warp base address (WBA) must be a multiple of 16 * sizeof(type), and the k-th thread should access the element at WBA + k.
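A hedged sketch of the access pattern this describes (identifiers are mine): consecutive threads read consecutive elements, so the warp's accesses can be combined into a few memory transactions.

__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];               // thread k touches element WBA + k: coalesced
}

// By contrast, a strided pattern such as out[i * 16] = in[i * 16]
// breaks coalescing, since neighbouring threads touch addresses far apart.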
