4. A GPU consists of N multiprocessors (MPs), and each MP contains M scalar processors. Each MP processes batches of blocks. Each block is split into SIMD groups of threads called "warps". During execution, the scheduler switches between warps to hide memory latency.
kernel_name<<<grid_dimension, block_dimension>>>( ... )
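The launch syntax above can be sketched as a complete example; the kernel name `add_one`, the array size, and the block size of 256 are illustrative choices, not from the original notes:

```cuda
#include <cstdio>

// Hypothetical kernel: each thread increments one array element.
__global__ void add_one(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] += 1.0f;
}

int main() {
    const int N = 1024;
    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));
    cudaMemset(d_data, 0, N * sizeof(float));

    dim3 block_dimension(256);      // threads per block
    dim3 grid_dimension(N / 256);   // blocks in the grid
    add_one<<<grid_dimension, block_dimension>>>(d_data, N);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```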
3. NVCC compiler
| Operations on GPU | Operations on CPU |
|---|---|
| fast, less accurate functions such as `__sinf(x)` | OpenGL inter-operation |
| type conversion and type casting functions | asynchronous concurrent execution |
| texture functions | low-level API |
| atomic functions | |
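Two of the device-side features in the table, the fast math intrinsic `__sinf` and the atomic function `atomicAdd`, can be sketched together in one kernel (the kernel name is illustrative; `atomicAdd` on `float` requires compute capability 2.0 or later):

```cuda
// Sketch: each thread computes a fast, less-accurate sine and
// accumulates the result into a single sum with an atomic add.
__global__ void sum_of_sines(const float *x, float *sum, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float s = __sinf(x[i]);   // fast intrinsic, reduced accuracy vs. sinf()
        atomicAdd(sum, s);        // atomic function: safe concurrent update
    }
}
```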
5. Memory optimization
For best performance, global memory accesses should be coalesced: the memory accessed by a warp should be contiguous and aligned. The warp base address (WBA) must be a multiple of 16 * sizeof(type), and the k-th thread should access the element at WBA + k.
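A minimal sketch of the access pattern described above: thread k reads element WBA + k, so consecutive threads touch consecutive, aligned addresses. The kernel names and the stride-2 counter-example are illustrative:

```cuda
// Coalesced: thread k of each warp reads element base + k.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];           // consecutive threads -> consecutive addresses
}

// Uncoalesced: a stride-2 pattern scatters each warp's accesses,
// so more memory transactions are needed per warp.
__global__ void copy_strided(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (2 * i < n)
        out[2 * i] = in[2 * i];
}
```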