Part 2 of week 4’s lectures delved into Sort algorithms.

Sort is difficult in parallel programming because things get more complex when adding the following objectives:

  • keeping the hardware busy 
  • Limiting branch divergence
  • Coalescing memory access

Odd-Even Sort (brick sort), parallel version of bubble sort:

Start with an array of elements comparing odds and evens with different polarity each iteration. The elements are swapped if they are out of order.

Step complexity: O(n), Work Complexity: O(n^2)

Merge Sort

Divide and conquer approach, ideal for GPUs.

Merging two sort lists together, n sorted, n/2 sorted, n/4 sorted…

Step complexity: O(log n), Work complexity: O(n log n)

General implementations use a different sort to get to 1024 chunk sorted lists.

Merging 2 sorted can be done in parallel, remembering compact.

How parallel merge works:

  1. With 2 sorted arrays launch a thread for every element in each array. The goal of each thread is to calculate the position of its element in the final list.
  2. Thread calculates it position in its own list, its index…
  3. Calculate its position in the other array -> Binary search 🙁 every thread does a binary search on the other array.
The different stages of merge sort require different algorithms to get the most out of hardware
The different stages of merge sort require different algorithms to get the most out of hardware

When we get to the top of the merge tree, allocating one huge task to only 1 streaming multiprocessor [SM] means that there will be however many other SMs we have sitting idle. So we want one merge to be spread across multiple SMs.

How do we make sub tasks? Spliters. The logic does not seem to complex but I will review the section on it below:


Moving away from the merge sort, a new approach was explored – Sorting networks. This is a form of oblivious algorithms, ie an algorithm that does the same thing all the time. This enables high level of parallelism.

Bitonic sequences Work complexity: (n log n^2), however step complexity

All scenarios take the same amount of time, the algorithm is oblivious! If input set are small enough to fit into shared memory then sorting networks are generally good performers.

The best performing algorithm on GPUs is Radix sort – the steps for radix sort on integers is:

  1. Start with the least significant bit of the integer
  2. split input into 2 sets, 1s and 0s – (Compact – see part 1)
  3. Proceed to next LSB and repeat’

Step complexity O(kn) where k is bits in representation and n is number of elements to sort.

The final sort algorithms discussed were Quicksort and Key Value sort. Quicksort requires recursion and the control structure is quite difficult to implement.

Week 4 Lectures

Week 4 Assignment