Week 5 -> Optimizing GPU Programs.

Parallelizing and porting programs to run on CUDA is generally done to either solve bigger problems or to solve more problems. So optimizing programs to require less time may be beneficial.

It is important to note that optimization should be completed with reference to the goals of the program and the execution time of each part of the program. Once a function is no longer a performance bottle neck the returns on further optimization are likely to be diminished.

– decrease time spent on memory operations
– coalesce global memory access
– avoid thread divergence

These basic principles do have exceptions. For instance, the transfer of data from global to shared memory may increase time on memory operations but decrease overall execution time.

The lecture highlighted some types of optimization:

  1. Selecting the right algorithm -> Likely to have the largest impact
  2. Applying the basic principles for efficiency
  3. Architecture specific optimizations
  4. Micro optimizations (instruction level)

A methodology for the development process of parallel applications was subsequently suggested:

  1. Analyze -> Profile the applications identifying bottlenecks/hotspots
  2. Parallelize -> using approaches such as libraries/OpenMP/CUDA, also selecting algorithms
  3. Optimize -> Measurement focus
  4. Deploy -> Optimization should not be completed in a vacuum it is too difficult to predict and emulate real usage


A simple working example on optimizing the transposing matrices followed.

Timing the function from a standard serial implementation to moderately parallel example and finally implementing our own fully parallel code. The code for the example: week5_example1.cu

Focusing too much on a single function will generally yield diminishing returns
Focusing too much on a single function will generally yield diminishing returns