1. Multiple levels of parallelism.
The first level is the grid, which consists of thread blocks. Each thread block contains a number of threads, up to 512.
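As a minimal sketch (the kernel name and launch sizes are illustrative, not from the notes), the hierarchy shows up in code through the built-in blockIdx, blockDim and threadIdx variables:

__global__ void scale(float *data, float factor, int n)
{
    // Global thread index: block offset plus thread offset within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Host side: launch a grid of 8 blocks, each with 256 threads
// (well under the 512-threads-per-block limit mentioned above).
// scale<<<8, 256>>>(d_data, 2.0f, 2048);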
2. Programming Interface.
CUDA can be seen as a minimal set of C extensions plus a runtime library, which provide built-in types and a subset of the C standard library. A host component, which executes on the CPU, controls and accesses the GPU to run the specified kernel code. The main language extensions are:
- Function qualifiers specify where a function can be called from and where it executes;
- Variable type qualifiers define which memory space a variable is allocated in;
- The kernel execution configuration (the <<<...>>> directive) defines the size of the grid and of each thread block:
function<<<grid_dimension, block_dimension>>>( ... )
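A small sketch putting these pieces together; __device__, __global__ and __shared__ are the actual CUDA qualifiers, while the function names and launch sizes below are made up for illustration:

// __device__: callable from and executed on the GPU only.
__device__ float add_bias(float x, float bias)
{
    return x + bias;
}

// __global__: a kernel, called from the host, executed on the GPU.
__global__ void bias_kernel(float *data, float bias, int n)
{
    // __shared__: this variable lives in the block's on-chip shared memory.
    __shared__ float block_bias;
    if (threadIdx.x == 0)
        block_bias = bias;
    __syncthreads();  // make block_bias visible to all threads of the block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = add_bias(data[i], block_bias);
}

// Execution configuration: <<<grid_dimension, block_dimension>>>.
// bias_kernel<<<16, 128>>>(d_data, 0.5f, 2048);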
3. NVCC compiler
nvcc automatically handles include files and linking. However, device code does not support most of the C++ standard library (such as string and iostream) or exceptions, and CUDA source files need the ".cu" extension. Roughly, the runtime operations split between the GPU (device) side and the CPU (host) side:

Operations on GPU | Operations on CPU
fast, less accurate intrinsic functions, such as __sinf(x) | device management
__syncthreads() | memory management
type conversion and type casting functions | texture management
texture functions | OpenGL interoperability
atomic functions | asynchronous concurrent execution
 | low-level API
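As a rough illustration of this split (error handling omitted; the kernel, names and sizes are arbitrary examples, not from the notes), the host side does device and memory management while the device side uses intrinsics, synchronization and atomics:

#include <cuda_runtime.h>

__global__ void histogram_kernel(const float *in, int *bins, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Device-side operations: fast intrinsic math and atomics.
        float s = __sinf(in[i]);
        int b = (int)((s + 1.0f) * 0.5f * 10.0f);  // map [-1, 1] to bins 0..9
        if (b > 9) b = 9;
        atomicAdd(&bins[b], 1);
    }
}

int main()
{
    const int n = 1 << 20;
    // Host-side operations: device management and memory management.
    cudaSetDevice(0);
    float *d_in; int *d_bins;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_bins, 10 * sizeof(int));
    cudaMemset(d_bins, 0, 10 * sizeof(int));
    // ... fill d_in via cudaMemcpy from host data ...
    histogram_kernel<<<(n + 255) / 256, 256>>>(d_in, d_bins, n);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_bins);
    return 0;
}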
4. A GPU consists of N multiprocessors (MPs), and each MP has M scalar processors. Each MP processes batches of thread blocks, and each block is split into SIMD groups of threads called "warps". During execution, the scheduler switches between warps to hide latency.

5. Memory optimization
For best performance, global memory accesses should be coalesced. All memory allocations should be contiguous and properly aligned: the base address accessed by a half-warp (WBA) must be a multiple of 16 * sizeof(type), and the k-th thread should access the element at WBA + k.
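A small sketch of what this access pattern means in a kernel (the kernel names and the stride parameter are just for illustration): consecutive threads reading consecutive elements coalesce into few memory transactions, while strided accesses do not.

// Coalesced: thread k of a (half-)warp reads element base + k.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];          // neighbouring threads touch contiguous addresses
}

// Not coalesced: neighbouring threads touch addresses 'stride' elements apart.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i * stride] = in[i * stride];
}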