Week 5 -> Optimizing GPU Programs (continued)

Goal: Maximize useful computation/second

After walking through APOD [Analyze, parallelize, optimize, deploy] the lecture turned to memory bandwidth. Using the CUDA utility deviceQuery to calculate memory bandwidth using memory clock rate and memory bus width.

After determining the maximum theoretical bandwidth at 40 Gb/s practical goals were set:

  • 40-60% -> OK
  • 60-75% -> good
  • >75% -> excellent
Using 12.5 GB/s of memory bandwidth is under utilizing!
Using 12.5 GB/s of memory bandwidth is under utilizing!

The algorithm being analyzed is a transpose. When DRAM utilization is low the first guess should be lack of coalescing.

In our example code we find that the reading of memory is well coalesced.. but the writing phase is strided by N elements (1024). This was described as ‘bad’.

So considering that most GPU codes are memory limited checking memory bandwidth utilization is very important.

Enter a tool – nSightm nvpp [nvidia visual profiler]  – confirms that the write to memory operations are utilizing very little memory bandwidth whilst the read  operations are at 100% utilization.

The solution for this algorithm is ’tiling’. Tiling utilizes shared memory, taking a tile of the input copying and transposing into output. The code for this solution can be found here: https://github.com/udacity/cs344/blob/master/Unit5%20Code%20Snippets/transpose.cu

Occupancy – Each SM has a limited number of:

  • Thread blocks -> 8
  • Threadss -> 1536/2048
  • registers for all threads -> 65536
  • bytes of shared memory 16K-48K

Maximizing number of threads on the SM [streaming multi processor] will maximize occupancy. The limits for specific hardware can be found via deviceQuery.

The transpose code was further optimized, better versions can be seen in the link above.

Shared memory bank conflicts ->organized into banks and depending on how threads access memory in tile, replays of shared memory accesses can occur. (ie: striping shared memory usage across banks).

Referring back to the goal of maximizing effective computation we have just address one factor:

  • minimize time waiting at barriers

The next to be address was:

  • minimize thread divergence

Some important definitions:

  • warp – set of threads that execute the same instruction at a time
  • SIMD – Single instruction, multiple data (CPU
  • SIMT – Single instruction, multiple thread

Thread divergence is can result in up to 32x slower code. Warps on nvidia hardware have 32 threads which apply single instruction to multiple threads.

Next topic was Streams, launching kernels in separate streams allows for concurrent execution:

Benefits of specifying CUDA streams

To create streams:

cudaSteam_t s1;



asynchronous memory transfer  – cudaMemcpyAsync – called on pinned memory.

Week 5 lectures

Week 5 code – failed to get shared memory working even with atomicAdd 🙁