Unit 3 covered analysing the performance of CUDA programs along with some key algorithms: Reduce, Scan and Histogram. Parallelism is easiest with one-to-one or many-to-one communication patterns; this unit looked at how we can still get great performance gains from parallelism when threads need all-to-all and many-to-many communication. Even when ideal scaling is not possible for an algorithm, parallelism is still likely to result in more efficient computing.
Step complexity and work complexity are the first tools for establishing this claim: comparing the step and work complexity of a parallel algorithm against its serial counterpart is a good start to understanding its efficiency.
If our parallel implementation reduces the step complexity compared to the serial implementation whilst maintaining work efficiency, we will have a faster execution.
Reduce has 2 inputs:
- Set of elements
- Reduction operation (ie: +)
Reduction operators must be binary: they take two inputs and generate one output.
Reduction operators must be associative: (a op b) op c == a op (b op c), so the grouping of the operations does not change the result.
eg: (1 + 2) + 3 == 1 + (2 + 3), whereas (1 – 2) – 3 != 1 – (2 – 3)
So, some operators that are both binary and associative are: multiplication, minimum, logical OR, bitwise AND
After running through some examples of reduce, including an implementation using shared memory in CUDA, we looked at Scan.
Inclusive vs exclusive scans and the Hillis-Steele vs Blelloch algorithms were covered. Finally, histograms were covered, along with a quick touch on sort.
For the assignment I based my implementation on a diagram from GPU Gems 3, chapter 39 (http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html).