Install Nvidia Driver. Be able to run nvcc compiler. The name of the file matters - the file extension needs to be ".cu" to be compiled by the cvcc compiler. Install C++ compiler, too.
Run in command line:
nvcc hello.cu -o hello
How to define a cuda function:
__global__ void myFunc(float* output, float* input) {
}
How to allocate CUDA memory
How to free CUDA memory
How to copy data from memory to GPU device
cudaMemcpy(pDestination, pSource, byte_size, cudaMemcpyHostToDevice)
How to copy data from GPU device to memory
cudaMemcpy(pDestination, pSource, byte_size, cudaMemcpyDeviceToHost)
How to let CPU wait for GPU execution result:
cudaDeviceSynchronize();
How to invoke the coda function:
myFunc<<<1,1>>>(output, input);
What does <<<1,1>>> mean?
1 block, and 1 thread per block
How to run more in parallel with GPU?
Pass <<<1, 50>>>
This will run 50 threads. Each thread has a thread local variable: threadIdx.x, whose value is an int from [0, 49], local to the thread.
What does <<<1,50>>> mean?
1 block, and 50 threads per block
Is there threadIdx.y in a CUDA function?
Yes. In fact, threadId.x, threadId.y, threadId.z forms a 3D thread block. Likewise, the number of blocks can be in 3D as well. Each block is a collection of a set of 3D threads. In the coda function, it can be referred as blockIdx.x, blockIdx.y, blockIdx.z.
Is it possible to know the thread block size inside a CUDA function?
Yes. Block size is identified by thread local variable: blockDim.x, blockDim.y, blockDim.z. In addition, grid size (collection of blocks) can also be detected by gridDim.x, gridDim.y, gridDim.z.
How to pass 3D grid of 3D thread blocks?
Define dim grid(4,5,6). Define dim3 block(3,2,1). Pass these in the CUDA function call: myFunc<<<grid, block>>>(...)
Can I have just 1 huge block and 1 million threads?
No. Nvidia GPU will limit it at 1024 threads per block.
I want to synchronize threads in a block?
__syncthreads()
If the number of threads is less than 32, then the threads are always in sync.
Keep if-else to value only:
Given that the parallel threads always executes the same instruction, when an if-else state branches the logic, the branched logic must run in different thread-divergence groups. The group with unsatisfied condition will run in no-op, waiting for its turns. But, if-else that chooses constant can be converted to a clover instruction. Ex: if(vector_size < 100) constant1; else constant 2;
Math Trick
Use right shift to detect if an integer is 0 or positive: 1 + x >> 31, because
0 >> 31 = 0
-1 >> 31 = -1
-99 >> 31 = -1
CUDA Memory Model
Global Memory:
- Long latency (400-800 cycles)
- Throughput 200 GBPS
- Shared by all thread blocks
Texture Memory
- Read-only (12KB)
- ~800 GBPS
- Optimized for 2D spatial locality
- Store meta data. Optimized on access to neighboring cell.
Constant Memory
- Read-only (64 KB)
L2 Cache
- 768KB
- Shared among thread blocks
- Fast Atomics
(Useful for synchronize threads as atomic variables)
L1 & Shared Memory
- Local to a thread block, shared within one thread block.
- 64 KB per thread block
- 16 KB shared, and 48KB L1.
- L1 cannot be programmed. Shared memory can be programmed
- Low latency (20-30 cycles)
- 1 TBPS
Register
- Local to one thread, 21 registers per thread
- 32KB, per thread block
- Handled by compiler
- 8 TBPS
Improve bandwidth:
- share / reuse data
- compression
- Recompute than store + fetch
Latency
- I/O causes time to read/write to GPU
- In practice, memory I/O can become bottleneck many times.
- Latency hiding is done by exploiting multi-threading
Why do we need multiple thread blocks?
- Because a thread block can halt when on I/O operations. When the halt happens, other threadBlock can take the GPU resource and run, achieving better throughput. This is called super-linear speedup.
Locality
- Spatial and Temporal locality
- You will want a page to be loaded and used often before switching to another page. So, it is faster to iterate a 2D array row by row, than iterating column by column.
Memory Coalescing
- Store data in memory within close spatial distance to avoid frequent loading to memory. (Since each memory load takes 32 cycles)
- Coalesced vs. strided: strided data have chunks of data for each object. When data is accessed in strided stored, it will require more memory load. Unrelated fields are loaded without being accessed.
- Random access is the worst. This happens especially when the index to the array depends on a variable.
- In CPU context, each thread should access consecutive element in memory chunks (strided). Array of structures is preferred.
- In GPU context, memory coalescing means a chunk should be accessed by consecutive threads (coalesced). Thus, structure of Arrays is preferred. (Ex: each field of a list of records is stored in 1 array. 1 structure represents 1 table.)
- Tip: in the cuda function, use (blockIdx.x * blockDim.x + threadIdx.x) as index to each field.
Shared Memory:
How to define in __global__ function?
__shared__ float a[N];
__shared__ unsigned s;
How to initialize their value:
if (threadId.x == 0) s = 0;
Shared memory is accessible in a thread block. Shared memory is organized into 32 banks. Access in the same bank are sequential (blocking). Exception case: access the
same address of the same bank in the same warp doesn't block. It is called broadcast. (Note that if 2 threads access 2
different addresses of the same bank will cause blocking.)
Use this command to allocate the 64KB between L1 cache and shared memory:
- cudaDeviceSetCacheConfig(kernelFuncName, cudaFuncCachePreferShared)
- Other options:
- to give more L1 cache: cudaFuncCachePreferL1
- to give equal L1 and shared memory: cudaFuncCachePreferEqual
- no preference of L1 and shared memory: cudaFuncCachePreferNone
- When your shared memory size is larger than the max allowed shared memory, compilation error will occur
- The larger of the shared memory, the less of the L1 cache.
Dynamic Shared Memory:
To vary the shared memory size in a function, pass the shared memory size via calling with 3 parameters: <<<numBlocks, numThreadsPerBlock, sizeOfSharedMemory>>>
Inside the kernel function, define an array pointer for the dynamic shared memory:
__global__ void kernelWithDynamicSharedMemory() {
extern __shared int s[];
Can I define multiple arrays? Yes, but there is only one extern __shared__ array. Split the array by pointer inside the method.
Tex2D memory
Define variable: texture<float, 2 cudaReadModeElementType> texRef;
In main: cudaBindTextureToArray(texRef, cuArray, ...);
In kernel: tex2D(texRef, x, y);
Constant memory
It is read-only, and 64KB per SM. Define
Define variable: __const__ unsigned meta[1];
In main: cudaMemcpyToSymbol(meta, &hmeta, 1*size(unsigned));
In kernel: a = meta[0]
Parallel in block vs in threads:
- Each block is assigned to a different StreamingProcessor (SM). The blocks do not communicate with each other.
- Each thread runs at different cores of a StreamingProcessor. Threads are bundled into warps. Each warp has 32 threads.
Host Memory: the memory accessed by CPU
Device Memory: the memory accessed by GPU
Host Memory is allocated by malloc. Host memory can be mapped to OS memory paging. Use cudaMallocHost to avoid OS memory paging.
To transfer data from Host Memory to Device, use Asynchronous memory transfer, and ideally copy all memory to device at once:
cudaStreamCreate(&stream2)
cudaMemcpyAsync(dest2, src2, size, dir, stream2)
Each SM has 16KB of shared memory. The shared memory is essentially a user managed cache. The latency of the shared memory is comparable to registers.
Each thread has its dedicated registers. A block of threads shares shared memory. Meanwhile, blocks shares device memory.
Use structure of arrays, avoid structure of structures (always use separate arrays, even for x,y,z). For x,y,z, they can have the same float3 pointer to dereference into 3 adjacent values (to avoid dereferencing 3 times).
Memory is divided into banks. Each bank can service 1 thread per cycle. When accesses a blank is in conflict, serialize the accesses. (Here sounds like a lock)
We said shared memory is as fast as register, but that's true only when bank conflict is avoided. When a bank is accessed by 8 threads, the linear access stride = 8. Conflict can be detected by visual profiler (to look for warp_serialize events).
- The fast case: if all t threads of a half-warp access different banks, there is no bank conflict. If all threads of a half-wrap read the identical address, there is no bank conflict. (It is called broadcast.)
- The slow case: Bank conflict happened - multiple threads in the same half-warp access the same bank. Accesses must be serialized. Cost = max $ of simultaneous access on a single bank.
- (What is half-war access? 1 warp is 32 threads. Half-warp is 16 threads. Different warps can take different branches. 1 SM can run many warps of a block. Warps in a block can synchronize.
A Single-instruction can run at multiple threads at the same time. A GPU has thousands of Registers and Registered are partitioned over threads.
Each SM has 16kb of shared memory. The blocks that require less than 8kb of shared memory can be handled in the same SM. Thus, warps can be from different blocks.
Threads per block should be multiple of 32 to have maximum threads per block. More warps per block creates deeper pipeline. And in this way it hide latency. However, then umber of threads is limited by the number of Registers. Use less than 8kb of shared memory to have multiple warps on the same SM. (So, this is a trade-off to the size of shared memory.) Recompute is often faster than loading from memory. Applying recompute is an optimization direction.
Regarding cudaMemCpy (device memory): Every successive 128 bytes (or 32 single precision values) can be access by a warp. All 32 reads should be done in one step. When memory is not in successive 128 blocks of memory, it would take twice of the time to read.
Regarding cudaMemcpytoSymbol (constant memory): copy to __constant__ float var, the cache works the best when a warp reads the same value. (Cuda 6 has a new device memory model)
Latency: in cycles
Throughput: cores / sm
The needed Parallelism: operations / SM = latency * throughput. This is called arithmetic. parallelism.
Consider Instruction-level parallelism (ILP).
- ILP is a trick to let a CUDA function to calculate and return multiple values. It is often done by duplicating the logic and duplicate the variable names. It is a tradeoff by using more Registers per thread, and use less threads. ILP is the number of independent instructions in a loop.
- As of seen in the "Better Performance at Lower Occupancy" article, larger ILP level requires less threads to achieve 100% utilization. With ILP=3, it only needs half of the threads (from 576 to 256 threads) to achieve 100% utilization. However ILP doesn't scale up to 4.
Another trick was to hide memory latency, by keeping 100KB in the flight. An example is to issue multiple independent reads by reading values in an array into multiple local variables:
__global__ void memcpy( float *dst, float *src )
{
int iblock= blockIdx.x + blockIdx.y * gridDim.x;
int index = threadIdx.x + 2 * iblock * blockDim.x;
float a0 = src[index];
//no latency stall
float a1 = src[index+blockDim.x];
//stall
dst[index] = a0;
dst[index+blockDim.x] = a1;
}
By copying 14 float4 values per thread, the app runs with 4% of occupancy to hide 84 of utilization.
CUDA Function Declarations and their meanings:
__device__ float deviceFunc() // call device from a device;
__global__ float kernelFunc() // call device from host; must return void
__host__ float hostFunc() // call host from host
A function can have multiple function types. Example:
__host__ __device__ void dhfun() { ...
cudaHostAlloc : allocates memory in the Host in page-locked form (which won't be written to the disk). It can be transferred faster between host and device. This memory can also be accessed by GPU via DMA, avoiding a memory copy.
A __global__ function can be called from both device and host.
Note: Variables cannot be declared as __host__. Only __device__ variable. Calling a __device__ variable from host will pass compilation but will reach error in runtime.
Thrust is a parallel algorithms library for CUDA, similar to STL on CPU
NVIDIA/thrust: [ARCHIVED] The C++ parallel algorithms library. See https://github.com/NVIDIA/cccl
- Supports vectors and associated transformers.
- It is black of where code executes.
Mutual Exclusion Algorithm:
Bakery Algorithm
- By assigned token number to each thread
- If there exists another thread waiting and holding a token less than my token, wait.
- After the processing in my thread is done, register that my thread is done.
- N-thread mutually exclusive
Atomic Operations
- cuda operations: atomicCAS, atomicMin, atomicAdd, atomicCAS, etc.
- atomicCAS: takes in 3 variables: address, compare value, new value.
- If compare value == value in the address, new the new value. Otherwise do nothing.
- Return old value in the address
- (if old value == compare value, that means new value has been taken. Otherwise, it means the new value wasn't taken. Use this to create critical section for a single thread to enter.)
- Works on GPU across warps
- But it hangs for threads belonging to the same wrap, because 1 warp-thread acquires the lock and waits for other warp threads to reach the instruction, while other threads in the warp await this successful thread in the do-while.
- The correct way to is use atomicCAS in do-while, and check with an if statement after entering the critical section. and upon exit of the critical section, unlock by setting the compared address to a value that can match comparison in a different thread.
- Example:
- do {
- old = atomicCAS(&lockvar, 0, 1);
- if(old == 0) {
- // Do your task in the critical section
- *lockvar = 0; // unlock by setting it to 0 so that other thread can pick it up.
- }
- } while(old != 0)
- Note that with CPU, you can unlock outside of the while loop. But with GPU, the unlock needs to happen in the while loop.
- Can also be used to enforce to a single run from many threads. (Just one if statement with atomicCAS)
Barriers
- It is a program point where all threads need to reach before any thread can proceed.
- End of kernel is an implicit barrier for all GPU threads (global barrier)
- CUDA 9 supports grid.sync() to make explicit global barrier.
- Threads in a thread-block can synchronize using __synchthreads()
- __synchthreads also creates memory barrier so that write (of shared memory) in one thread is visible to other threads in the block.
- __threadfence() ensures that writes to global and shared memory are visible to all other threads on the device.
- __threadfence_block(): write visibility within the same block
- __threadfence_system(): write visibility across both the device and the host (CPU)
- __threadfence(): write visibility across the device
- Note that a threadfence only ensures visibility of memory write. It does not block threads.
Reductions
- It is process of converting a set of values into fewer values
- Reducible operation must satisfy associativity property (which means apply order doesn't change the outcome)
- Min, Max, Sum, XOR
- Can be implemented with atomics, but that adds sequentiality.
- A better approach with improved parallelism in GPU:
- Example: sum the adjacent pair of values in different threads. So they don't block each other. Keep doing this until there is only 1 value.
- Complexity measurement:
- Takes log(n) steps. First step runs n threads. Second step runs n/2 threads, ... , the last step runs 1 thread.
- Prefix Sum:
- Algorithm 1: for each value in the list, sum it with the previous value. Repeat the process to get the previous value from 2 cells back, 4 cells back, etc.
- datarace: a thread is reading a value in memory while another thread is writing to it, there's a datarace. __synchthreads() should be placed in the middle of the read state and write statement.
Debugging:
- Use Cuda-gdb https://docs.nvidia.com/cuda/cuda-gdb/index.html
- compile with: nvcc -g -G file.cu
- Run the output from the previous step with: cuda-gdb a.out
- info cuda kernels - shows device status
- info threads - shows execution of all threads.
- info cuda sms - show streaming multi processors
- info cuda warps - show warps
- info cuda lanes - show information related to each thread.
- set break point by:
- break main - set break point in main function
- break file.cu:223 - set break point in file at line number
- set cuda break_on_launch - kernel entry breakpoint
- break file.cu:23 if threadIdx.x == 1 && i < 5 - conditional break point
Profiling:
- Time taken by kernels
- Memory utilization
- Cache misses
- Divergence
- Coalescing
Use nvprof to run the app. It tells the % of time used by each kernel. And % of time of each cuda command.
Dynamic Parallelism should be unrolled when possible (dynamic parallelism was supported until architecture 35. It can be specified during compilation.
nvcc -arch=sm_35 dynpar.cu
A global function can invoke another global function for parallel processing in a device.
Parent kernel is associated with a parent grid. Child kernels are associated with child grids. Parent and child kernel shares the global and constant memory, but they have distinct local and shared memories. Global memory operation in the parent is visible to the child. All global memory operation in child is visible to parent when parent calls cudaDeviceSynchronize().
Multi-GPU
- One host (CPU) controls multiple GPUs
- Use cudaSetDevice(i) before performing any cuda operation (ex: cudaMemcpy)
- cudaGetDeviceCount
- cudaDeviceCanAccessPeer - a device accesses another device
Warp Voting
- __ballot - wrap which warp threads satisfy the predicate.
- Return the test of the predicate in a warp as a 32-bit number.
- __all - all warp threads satisfy the predicate
- __any - any warp threads satisfy the predicate
- Application:
- Example: warp voting for atomics: on if(condition) atomicInc a counter. This can be easily replaced by counting the bits of 1's in the ballot.
- Use __popc(mask) to return the number of set bits.
- This would allow atomicAdd of 32 threads in 1 operation, reducing the blocking from atomic operation.
References:
https://www.youtube.com/watch?v=cvo3gnInQ7M&list=PL1ysOEBe5977vlocXuRt6KBCYu_sdu1Ru&index=1
https://www.olcf.ornl.gov/wp-content/uploads/2013/02/Intro_to_CUDA_C-TS.pdf
https://www.nvidia.com/content/cudazone/download/Advanced_CUDA_Training_NVISION08.pdf
https://www.nvidia.com/content/gtc-2010/pdfs/2238_gtc2010.pdf
https://www.ce.jhu.edu/dalrymple/classes/602/Class13.pdf