Introduction to Parallel Programming – Week 5 Part 2

Week 5 -> Optimizing GPU Programs (continued)

Goal: Maximize useful computation/second

After walking through APOD [Analyze, Parallelize, Optimize, Deploy], the lecture turned to memory bandwidth: using the CUDA utility deviceQuery to calculate memory bandwidth from the memory clock rate and memory bus width.
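The deviceQuery numbers can be turned into a theoretical bandwidth figure. A minimal sketch (the 1250 MHz memory clock and 128-bit bus are hypothetical values chosen to produce the 40 GB/s figure below; DDR memory transfers twice per clock):

```cpp
#include <cassert>

// Theoretical bandwidth in GB/s from deviceQuery values:
// memory clock (MHz) and bus width (bits). DDR transfers twice per clock.
double theoreticalBandwidthGBs(double memClockMHz, int busWidthBits) {
    double clockHz = memClockMHz * 1e6;
    double bytesPerTransfer = busWidthBits / 8.0;
    return clockHz * bytesPerTransfer * 2.0 / 1e9; // x2 for DDR
}
```

For example, a 1250 MHz clock on a 128-bit bus gives 1250e6 × 16 × 2 ≈ 40 GB/s.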

After determining the maximum theoretical bandwidth to be 40 GB/s, practical goals were set:

  • 40-60% -> OK
  • 60-75% -> good
  • >75% -> excellent
Using 12.5 GB/s of memory bandwidth is under utilizing!

The algorithm being analyzed is a transpose. When DRAM utilization is low, the first guess should be a lack of coalescing.

In our example code we find that the reading of memory is well coalesced, but the writing phase is strided by N elements (1024). This was described as 'bad'.
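The access pattern is easy to see in a plain CPU sketch of the transpose (not the lecture's CUDA code): on the GPU, adjacent threads would read adjacent input elements (coalesced) but write output elements N apart (strided):

```cpp
#include <cassert>
#include <vector>

// Transpose an N x N matrix stored in row-major order.
// Reads of in[] walk memory sequentially (coalesced on a GPU);
// writes to out[] jump N elements at a time (strided -> poor bandwidth).
std::vector<int> transpose(const std::vector<int>& in, int N) {
    std::vector<int> out(N * N);
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            out[j * N + i] = in[i * N + j]; // write stride is N
    return out;
}
```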

So, considering that most GPU codes are memory limited, checking memory bandwidth utilization is very important.

Enter a tool – NSight/nvvp [the NVIDIA Visual Profiler] – which confirms that the write-to-memory operations are utilizing very little memory bandwidth whilst the read operations are at 100% utilization.

The solution for this algorithm is 'tiling'. Tiling utilizes shared memory: take a tile of the input, copy it and transpose it into the output. The code for this solution can be found here:
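A CPU sketch of the same tiling idea (on the GPU the K×K tile would live in shared memory and a thread block would do the copy; the point is that both the reads and the writes to global memory stay contiguous within a tile):

```cpp
#include <cassert>
#include <vector>

// Tiled transpose of an N x N row-major matrix (K must divide N here).
// On the GPU the K x K tile sits in shared memory: the block reads a tile
// with coalesced loads, transposes it in shared memory, then writes the
// transposed tile back with coalesced stores.
std::vector<int> transposeTiled(const std::vector<int>& in, int N, int K) {
    std::vector<int> out(N * N);
    std::vector<int> tile(K * K);
    for (int bi = 0; bi < N; bi += K)
        for (int bj = 0; bj < N; bj += K) {
            // load one tile, transposing as we go
            for (int i = 0; i < K; ++i)
                for (int j = 0; j < K; ++j)
                    tile[j * K + i] = in[(bi + i) * N + (bj + j)];
            // write the transposed tile to its mirrored position
            for (int i = 0; i < K; ++i)
                for (int j = 0; j < K; ++j)
                    out[(bj + i) * N + (bi + j)] = tile[i * K + j];
        }
    return out;
}
```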

Occupancy – Each SM has a limited number of:

  • Thread blocks -> 8
  • Threads -> 1536/2048
  • Registers for all threads -> 65536
  • Bytes of shared memory -> 16K-48K

Maximizing the number of threads on the SM [streaming multiprocessor] will maximize occupancy. The limits for specific hardware can be found via deviceQuery.

The transpose code was further optimized, better versions can be seen in the link above.

Shared memory bank conflicts -> shared memory is organized into banks, and depending on how threads access memory in a tile, replays of shared memory accesses can occur (the fix is to stripe shared memory usage across the banks).
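Why striping (e.g. padding a tile row) helps can be seen from the bank index arithmetic. A sketch, assuming 32 banks of 4-byte words (the common NVIDIA layout): with a 32-word-wide tile a column walk hits a single bank, while widening each row to 33 words spreads the column across all banks:

```cpp
#include <cassert>

// Bank of a 4-byte shared-memory word, assuming 32 banks (NVIDIA layout).
int bankOf(int wordIndex) { return wordIndex % 32; }

// Bank hit by thread t reading column 0 of a tile with rowWidth words per row.
int columnBank(int t, int rowWidth) { return bankOf(t * rowWidth); }
```

With rowWidth 32 every thread lands in bank 0 (a 32-way conflict, replayed 32 times); with rowWidth 33 thread t lands in bank t % 32, so there is no conflict.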

Referring back to the goal of maximizing effective computation, we have just addressed one factor:

  • minimize time waiting at barriers

The next to be addressed was:

  • minimize thread divergence

Some important definitions:

  • warp – set of threads that execute the same instruction at a time
  • SIMD – Single instruction, multiple data (CPUs)
  • SIMT – Single instruction, multiple thread

Thread divergence can result in code up to 32x slower. Warps on NVIDIA hardware have 32 threads, which apply a single instruction to multiple threads.

The next topic was streams: launching kernels in separate streams allows for concurrent execution.

Benefits of specifying CUDA streams

To create streams:

cudaStream_t s1;
cudaStreamCreate(&s1);



Asynchronous memory transfer – cudaMemcpyAsync – must be called on pinned memory.

Week 5 lectures

Week 5 code – failed to get shared memory working even with atomicAdd 🙁

Introduction to Parallel Programming – Week 5 Part 1

Week 5 -> Optimizing GPU Programs.

Parallelizing and porting programs to run on CUDA is generally done either to solve bigger problems or to solve more problems, so optimizing programs to require less time may be beneficial.

It is important to note that optimization should be completed with reference to the goals of the program and the execution time of each part of the program. Once a function is no longer a performance bottleneck, the returns on further optimization are likely to diminish.

– decrease time spent on memory operations
– coalesce global memory access
– avoid thread divergence

These basic principles do have exceptions. For instance, the transfer of data from global to shared memory may increase time on memory operations but decrease overall execution time.

The lecture highlighted some types of optimization:

  1. Selecting the right algorithm -> Likely to have the largest impact
  2. Applying the basic principles for efficiency
  3. Architecture specific optimizations
  4. Micro optimizations (instruction level)

A methodology for the development process of parallel applications was subsequently suggested:

  1. Analyze -> Profile the applications identifying bottlenecks/hotspots
  2. Parallelize -> using approaches such as libraries/OpenMP/CUDA, also selecting algorithms
  3. Optimize -> Measurement focus
  4. Deploy -> Optimization should not be completed in a vacuum; it is too difficult to predict and emulate real usage


A simple working example on optimizing matrix transposition followed.

We timed the function from a standard serial implementation, to a moderately parallel example, and finally to our own fully parallel code. The code for the example:

Focusing too much on a single function will generally yield diminishing returns


Introduction to Parallel Programming – Week 4 (Part 2)

Part 2 of week 4’s lectures delved into Sort algorithms.

Sort is difficult in parallel programming because things get more complex when adding the following objectives:

  • keeping the hardware busy 
  • Limiting branch divergence
  • Coalescing memory access

Odd-Even Sort (brick sort), parallel version of bubble sort:

Start with an array of elements comparing odds and evens with different polarity each iteration. The elements are swapped if they are out of order.

Step complexity: O(n), Work Complexity: O(n^2)

Merge Sort

Divide and conquer approach, ideal for GPUs.

Repeatedly merge two sorted lists together: n sorted lists, then n/2, then n/4…

Step complexity: O(log n), Work complexity: O(n log n)

General implementations use a different sort to get to sorted chunks of 1024 elements.

Merging 2 sorted lists can be done in parallel, remembering compact.

How parallel merge works:

  1. With 2 sorted arrays, launch a thread for every element in each array. The goal of each thread is to calculate the position of its element in the final list.
  2. Each thread already knows its position in its own list: its index.
  3. Each thread calculates its position in the other array -> binary search 🙁 every thread does a binary search on the other array.
The different stages of merge sort require different algorithms to get the most out of hardware
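The per-thread position calculation in steps 2-3 can be sketched serially: each element's final index is its own index plus the number of elements smaller than it in the other array, found by binary search:

```cpp
#include <cassert>
#include <algorithm>
#include <vector>

// Merge two sorted arrays by computing each element's final position:
// own index + count of smaller elements in the other array (binary
// search). Every element is independent, so on a GPU one thread would
// handle one element. lower_bound/upper_bound break ties stably.
std::vector<int> parallelStyleMerge(const std::vector<int>& a,
                                    const std::vector<int>& b) {
    std::vector<int> out(a.size() + b.size());
    for (size_t i = 0; i < a.size(); ++i) {
        size_t pos = std::lower_bound(b.begin(), b.end(), a[i]) - b.begin();
        out[i + pos] = a[i];
    }
    for (size_t j = 0; j < b.size(); ++j) {
        size_t pos = std::upper_bound(a.begin(), a.end(), b[j]) - a.begin();
        out[j + pos] = b[j];
    }
    return out;
}
```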

When we get to the top of the merge tree, allocating one huge task to only 1 streaming multiprocessor [SM] means that there will be however many other SMs we have sitting idle. So we want one merge to be spread across multiple SMs.

How do we make sub-tasks? Splitters. The logic does not seem too complex, but I will review the section on it below:


Moving away from merge sort, a new approach was explored – sorting networks. This is a form of oblivious algorithm, i.e. an algorithm that does the same thing all the time regardless of input. This enables a high level of parallelism.

Bitonic sort work complexity: O(n log² n); step complexity: O(log² n).

All scenarios take the same amount of time; the algorithm is oblivious! If input sets are small enough to fit into shared memory, then sorting networks are generally good performers.

The best performing algorithm on GPUs is radix sort – the steps for radix sort on integers are:

  1. Start with the least significant bit of the integer
  2. split input into 2 sets, 1s and 0s – (Compact – see part 1)
  3. Proceed to the next LSB and repeat

Work complexity is O(kn), where k is the number of bits in the representation and n is the number of elements to sort.
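The steps above can be sketched serially; the split into 0s and 1s on each bit is exactly the compact from part 1 (predicate + scan + scatter) when done on the GPU:

```cpp
#include <cassert>
#include <vector>

// LSB-first radix sort on unsigned ints: for each bit, stably split the
// input into the elements with a 0 in that bit followed by those with a
// 1. On a GPU the split is a compact: predicate, exclusive scan, scatter.
void radixSort(std::vector<unsigned>& a, int bits = 32) {
    for (int b = 0; b < bits; ++b) {
        std::vector<unsigned> zeros, ones;
        for (unsigned v : a)
            ((v >> b) & 1u ? ones : zeros).push_back(v);
        a = zeros;
        a.insert(a.end(), ones.begin(), ones.end()); // 0s then 1s, stable
    }
}
```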

The final sort algorithms discussed were Quicksort and Key-Value sort. Quicksort requires recursion, and the control structure is quite difficult to implement.

Week 4 Lectures

Week 4 Assignment


Introduction to Parallel Programming – Week 4 (Part 1)

The topic of week 4’s lectures and assignment was Sort and Scan. These are slightly more complex due to the many-to-many, all-to-all communication patterns.

Part 1 focused on Scan.

Important properties of Scan, work complexity: O(n),  step complexity: O(log n).

Variations on the Scan algorithm:

  • Compact/Filter – Gathering a subset that meets a certain criteria (input, predicate, output), i.e. only keep input elements where the predicate is true. Outputs can be dense or sparse. Sparse is easy, as elements map to the same locations. Dense results in a contiguous array of the filtered elements (much faster for subsequent reading and writing). 
why is using compact more efficient?

In the card deck example, where we want only the diamonds as output, using compact will be much faster if the computecard() function is above a minimal computational cost. So compact is very useful when the number of elements filtered out is large and the computation on each surviving element is high.

Summary of how to Compact:

  1. Predicate
  2. Scan-In Array: 1 / 0
  3. Exclusive sum scan (scatter addresses for the compacted array)
  4. Scatter input elements into the output array using the scatter addresses.

The predicate and scan parts of Compact will have the same run time regardless of the number of elements filtered. If very few items are to be written to the output then the scatter step will be faster.

A brief side note in the lecture introduced Allocate, a generalized version of Compact which can be used for clipping.

Next came Segmented Scan: we want to keep our kernels busy and independent. Segmented scan uses a second array to keep track of segment heads. This allows independent computation of scan values that can be reconciled when there is less work to do.

Sparse Matrix/Dense Vector multiplication [SpMV] was a practical application for the techniques we had just learned. SpMV is a good practical example as it is common in many problem domains, e.g. page ranking by search engines.

The traditional way to represent a sparse matrix is Compressed Sparse Row [CSR] – Value (non-zero data element values), Column (the column of each element's origin), RowPtr (the element index that starts each new row).

Nice and simple example of CSR calculation


Now that we have the CSR representation of the sparse matrix we can conduct the multiplication.

  1. Segment the Value array using the RowPtr values as demarcation points
  2. Gather the vector values for each element in the array using the Column array
  3. Use a Map function to multiply the Value array with the gathered vector array.
  4. Conduct an exclusive segmented sum scan to get the final answer
Example of the steps above

Week 4 Lectures

Week 4 Assignment

Introduction to Parallel Programming – Week 3

Unit 3 looked at analysing the performance of CUDA programs along with some key algorithms: Reduce, Scan and Histogram. Parallelism is easiest with one-to-one or many-to-one communication patterns; week 3 looked into how we can still get great performance gains from parallelism when we need to do all-to-all and many-to-many communication patterns between threads. Even when ideal scaling is not possible for an algorithm, parallelism is still likely to result in more efficient computing.

Step and work complexity are the first steps to establishing the truthfulness of this claim.

A parallel algorithm is considered 'work efficient' if its work complexity is asymptotically the same as its serial counterpart.

Comparing the step and work complexity of parallel algorithms is a good start to understanding their efficiency.

If we can reduce the step complexity of our parallel implementation compared to the serial implementation, whilst maintaining work efficiency, we will have faster execution.


Reduce has 2 inputs:

  1. Set of elements
  2. Reduction operation (ie: +)

Reduction operators must be binary, operate on two inputs and generate one output.

Reduction operators must be associative: (a op b) op c == a op (b op c)

eg: (1 + 2) + 3 == 1 + (2 + 3), whereas (1 – 2) – 3 != 1 – (2 – 3)

So, some operators that are both binary and associative are: multiplication, minimum, logical OR, bitwise AND

After running through some examples of reduce, including an implementation using shared memory in CUDA, we looked at Scan.

Inclusive vs exclusive scans, and Hillis Steele vs Blelloch implementations, were covered. Finally, histograms were covered along with a quick touch on sort.

For the assignment I used a diagram to implement the solution.

Understanding this process is much easier with a diagram!

Week 3 Lectures

Week 3 Assignment

Introduction to Parallel Programming – Week 2

After learning the GPU programming model and writing a basic CUDA program Week 2 introduces some concepts for efficient algorithms.

Communication is the first issue: how do threads communicate efficiently? This is easier in some problems (e.g. Map) than others (e.g. Scan/Sort). The memory access and write patterns of some of the key algorithm types were discussed. Input-to-output relationships:

  • Map: one-to-one
  • Gather: many-to-one
  • Stencil: several-to-one
  • Reduce: all-to-one
  • Scatter: one-to-many
  • Scan: all-to-all
  • Sort: all-to-all

The basic problem of concurrent memory access was illustrated via a scatter example. With one input trying to write a third of its value to each of the neighboring elements, we can see that trouble will result from independent thread execution.

one-to-many memory writes will pose a problem with independent threads writing over the same memory locations

To overcome these barriers, threads must communicate in some way. Shared memory and synchronization points were described as tools for this job.

Some important information about CUDA and what it guarantees (and doesn’t guarantee) about thread execution was also touched on:

  • No guarantees about when and where thread blocks will run – this is an enabler for massive parallelism. There are also some limitations due to this: no assumptions can be made about which blocks will run on which SM, and there can be no direct communication between blocks.
  • Does guarantee – all threads in a block are guaranteed to run on the same SM at the same time.
  • Does guarantee – all blocks in a kernel finish before blocks from the next kernel run

A good source for some of the hardware diagrams and acronyms:

Finally coalescing memory access was discussed as a very good practice for efficiency.

The good, the not so good and the bad types of memory access

Week 2 lectures

Week 2 Assignment

Introduction to Parallel Programming – Week 1

Introduction to Parallel Programming is the second MOOC course that I signed up for. The emergence of parallel and distributed computing is not slowing down, and it seems that most developers are not accustomed to the very different train of thought that parallelism invokes. Most recent GPUs have 1024 simple compute units, each of which can run parallel threads. The general course overview by the course instructor, John Owens:

The first week started off pretty simply, focusing on why GPUs, parallel programming and CUDA. I found the pace of the videos just right and much more engaging than other courses I have looked at.

The basics of CUDA:


CUDA programs are controlled by the host CPU and memory and the libraries enable interaction with the GPU/s.

Week 1 lectures

Week 1 Assignment
On review, the assignment solution is sub-optimal but enough to do the job.