1. Multiple levels of parallelism.
The first level is the grid, which consists of thread blocks. Each thread block contains a number of threads, up to 512.
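As a minimal sketch (the kernel name and launch sizes are illustrative, not from the notes), the hierarchy shows up in code through the built-in blockIdx, blockDim and threadIdx variables:

__global__ void scale(float *data, float factor, int n)
{
    // Global thread index: block offset plus thread offset within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Host side: launch a grid of 8 blocks, each with 256 threads
// (well under the 512-threads-per-block limit mentioned above).
// scale<<<8, 256>>>(d_data, 2.0f, 2048);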
2. Programming Interface.
CUDA can be seen as a minimal set of C extensions plus a runtime library, which provide built-in types and a subset of the C standard library. A host component, which executes on the CPU, controls and accesses the GPU to run the specified kernel code. The main language extensions are:
- Function qualifiers specify where a function can be called from and where it executes;
- Variable type qualifiers define which memory space a variable is allocated in;
- The kernel execution configuration (the <<<...>>> directive) defines the size of the grid and of each thread block:
function<<<grid_dimension, block_dimension>>>( ... )
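A small sketch putting these pieces together; __device__, __global__ and __shared__ are the actual CUDA qualifiers, while the function names and launch sizes below are made up for illustration:

// __device__: callable from and executed on the GPU only.
__device__ float add_bias(float x, float bias)
{
    return x + bias;
}

// __global__: a kernel, called from the host, executed on the GPU.
__global__ void bias_kernel(float *data, float bias, int n)
{
    // __shared__: this variable lives in the block's on-chip shared memory.
    __shared__ float block_bias;
    if (threadIdx.x == 0)
        block_bias = bias;
    __syncthreads();  // make block_bias visible to all threads of the block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = add_bias(data[i], block_bias);
}

// Execution configuration: <<<grid_dimension, block_dimension>>>.
// bias_kernel<<<16, 128>>>(d_data, 0.5f, 2048);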
3. NVCC compiler
nvcc automatically handles include files and linking. However, device code does not support most of the C++ standard library (such as string and iostream) or exceptions, and CUDA source files need the ".cu" extension. Roughly, the runtime operations split between the GPU (device) side and the CPU (host) side:

Operations on GPU | Operations on CPU
fast, less accurate intrinsic functions, such as __sinf(x) | device management
__syncthreads() | memory management
type conversion and type casting functions | texture management
texture functions | OpenGL interoperability
atomic functions | asynchronous concurrent execution
 | low-level API
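As a rough illustration of this split (error handling omitted; the kernel, names and sizes are arbitrary examples, not from the notes), the host side does device and memory management while the device side uses intrinsics, synchronization and atomics:

#include <cuda_runtime.h>

__global__ void histogram_kernel(const float *in, int *bins, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Device-side operations: fast intrinsic math and atomics.
        float s = __sinf(in[i]);
        int b = (int)((s + 1.0f) * 0.5f * 10.0f);  // map [-1, 1] to bins 0..9
        if (b > 9) b = 9;
        atomicAdd(&bins[b], 1);
    }
}

int main()
{
    const int n = 1 << 20;
    // Host-side operations: device management and memory management.
    cudaSetDevice(0);
    float *d_in; int *d_bins;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_bins, 10 * sizeof(int));
    cudaMemset(d_bins, 0, 10 * sizeof(int));
    // ... fill d_in via cudaMemcpy from host data ...
    histogram_kernel<<<(n + 255) / 256, 256>>>(d_in, d_bins, n);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_bins);
    return 0;
}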
4. A GPU consists of N multiprocessors (MPs), and each MP has M scalar processors. Each MP processes batches of thread blocks, and each block is split into SIMD groups of threads called "warps". During execution, the scheduler switches between warps to hide latency.

5. Memory optimization
For best performance, global memory accesses should be coalesced. All memory allocations should be contiguous and properly aligned: the base address accessed by a half-warp (WBA) must be a multiple of 16 * sizeof(type), and the k-th thread should access the element at WBA + k.
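A small sketch of what this access pattern means in a kernel (the kernel names and the stride parameter are just for illustration): consecutive threads reading consecutive elements coalesce into few memory transactions, while strided accesses do not.

// Coalesced: thread k of a (half-)warp reads element base + k.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];          // neighbouring threads touch contiguous addresses
}

// Not coalesced: neighbouring threads touch addresses 'stride' elements apart.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i * stride] = in[i * stride];
}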