
Introduction to Parallel Programming – Week 4 (Part 2)

Part 2 of week 4’s lectures delved into Sort algorithms.

Sort is difficult in parallel programming because things get more complex when adding the following objectives:

  • Keeping the hardware busy 
  • Limiting branch divergence
  • Coalescing memory access

Odd-Even Sort (brick sort), the parallel version of bubble sort:

Start with an array of elements and compare odd/even and even/odd pairs of neighbours on alternating iterations, swapping any pair that is out of order.

Step complexity: O(n), Work Complexity: O(n^2)
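
As a rough illustration, here is a minimal single-block CUDA sketch of odd-even sort (my own code, not the course's reference implementation; it assumes the whole array fits in one block with one thread per element):

#include <cstdio>

// Minimal single-block odd-even (brick) sort. Assumes n <= blockDim.x so
// that __syncthreads() can act as the barrier between phases.
__global__ void oddEvenSort(int *data, int n)
{
    int tid = threadIdx.x;
    for (int phase = 0; phase < n; ++phase) {
        // Even phases compare pairs (0,1),(2,3),...; odd phases (1,2),(3,4),...
        int i = 2 * tid + (phase & 1);
        if (i + 1 < n && data[i] > data[i + 1]) {
            int t = data[i]; data[i] = data[i + 1]; data[i + 1] = t;
        }
        __syncthreads();   // finish every comparison of this phase before the next
    }
}

int main()
{
    int h[8] = {7, 3, 5, 1, 8, 2, 6, 4}, *d;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    oddEvenSort<<<1, 8>>>(d, 8);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 8; ++i) printf("%d ", h[i]);
    cudaFree(d);
    return 0;
}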

Merge Sort

Divide and conquer approach, ideal for GPUs.

Merging pairs of sorted lists together: n sorted lists become n/2 sorted lists, then n/4, and so on until one sorted list remains.

Step complexity: O(log n), Work complexity: O(n log n)

General implementations use a different sort to produce the initial sorted chunks (e.g. 1024-element sorted lists) before the merging begins.

Merging two sorted lists can itself be done in parallel, recalling the compact pattern from Part 1.

How parallel merge works:

  1. With 2 sorted arrays, launch a thread for every element in each array. The goal of each thread is to calculate the position of its element in the final list.
  2. Each thread already knows the position of its element in its own list: its index.
  3. Each thread calculates its position in the other array via binary search 🙁 (every thread does a binary search on the other array); the final position is the sum of the two.
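
A minimal CUDA sketch of that per-thread merge (illustrative code with names of my own choosing, launched with at least nA + nB threads):

// Number of elements in arr that should come before key in the merged output.
// Using '<' for one input and '<=' for the other keeps duplicate keys from
// colliding and makes the merge stable.
__device__ int countBefore(const int *arr, int len, int key, bool includeEqual)
{
    int lo = 0, hi = len;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        bool before = includeEqual ? (arr[mid] <= key) : (arr[mid] < key);
        if (before) lo = mid + 1; else hi = mid;
    }
    return lo;
}

__global__ void mergeSorted(const int *A, int nA, const int *B, int nB, int *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < nA) {
        // final position = own index + count of B elements that precede A[tid]
        out[tid + countBefore(B, nB, A[tid], false)] = A[tid];
    } else if (tid < nA + nB) {
        int i = tid - nA;
        out[i + countBefore(A, nA, B[i], true)] = B[i];
    }
}
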
The different stages of merge sort require different algorithms to get the most out of the hardware.

When we get to the top of the merge tree, allocating one huge merge task to a single streaming multiprocessor [SM] leaves all of the other SMs sitting idle. So we want one merge to be spread across multiple SMs.

How do we make sub-tasks? Splitters. The logic does not seem too complex, but I will review the section on it below:

 

Moving away from merge sort, a new approach was explored – sorting networks. These are oblivious algorithms, i.e. algorithms that perform the same operations regardless of the input, which enables a high level of parallelism.

Bitonic sorting networks have work complexity O(n (log n)^2); their step complexity is O((log n)^2), and it is the same for every input.

All scenarios take the same amount of time – the algorithm is oblivious! If the input sets are small enough to fit into shared memory, then sorting networks are generally good performers.
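
For example, here is a minimal sketch of a bitonic sorting network for a single block whose data fits in shared memory (n a power of two, one thread per element, launched as bitonicSortBlock<<<1, n, n * sizeof(int)>>>(data, n); again my own illustrative code, not the course's):

__global__ void bitonicSortBlock(int *data, int n)
{
    extern __shared__ int s[];
    int tid = threadIdx.x;
    s[tid] = data[tid];
    __syncthreads();

    for (int k = 2; k <= n; k <<= 1) {          // size of the bitonic sequences being built
        for (int j = k >> 1; j > 0; j >>= 1) {  // distance between compared elements
            int partner = tid ^ j;
            if (partner > tid) {
                bool ascending = ((tid & k) == 0);
                if ((s[tid] > s[partner]) == ascending) {
                    int t = s[tid]; s[tid] = s[partner]; s[partner] = t;
                }
            }
            __syncthreads();   // same compare/swap schedule no matter what the input is
        }
    }
    data[tid] = s[tid];
}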

The best performing sort algorithm on GPUs is radix sort – the steps for radix sort on integers are:

  1. Start with the least significant bit of the integer
  2. Split the input into 2 sets, 0s then 1s, preserving relative order (compact – see Part 1)
  3. Proceed to the next least significant bit and repeat

Work complexity is O(kn), where k is the number of bits in the representation and n is the number of elements to sort.
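
A serial C++ reference of those per-bit passes (the parallel version performs each split with compact: predicate, exclusive scan, scatter); purely illustrative:

#include <cstdio>
#include <vector>

// LSD binary radix sort: split on one bit at a time, keeping 0s before 1s.
// The split is stable, which is what makes the bit-by-bit passes work.
void radixSortBits(std::vector<unsigned int> &v)
{
    std::vector<unsigned int> zeros, ones;
    for (int bit = 0; bit < 32; ++bit) {                      // 1. start at the LSB
        zeros.clear(); ones.clear();
        for (unsigned int x : v)                              // 2. split into 0s and 1s
            (((x >> bit) & 1u) ? ones : zeros).push_back(x);
        v = zeros;                                            // 0s first, order preserved
        v.insert(v.end(), ones.begin(), ones.end());
    }                                                         // 3. next bit and repeat
}

int main()
{
    std::vector<unsigned int> v = {170, 45, 75, 90, 2, 24, 802, 66};
    radixSortBits(v);
    for (unsigned int x : v) printf("%u ", x);
    return 0;
}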

The final sort algorithms discussed were Quicksort and Key Value sort. Quicksort requires recursion and the control structure is quite difficult to implement.

Week 4 Lectures

Week 4 Assignment

 


Introduction to Parallel Programming – Week 4 (Part 1)

The topic of week 4’s lectures and assignment was Sort and Scan. These are slightly more complex due to the many-to-many, all-to-all communication patterns.

Part 1 focused on Scan.

Important properties of scan: work complexity O(n), step complexity O(log n).

Variations on the Scan algorithm:

  • Compact/Filter – gathering a subset that meets a certain criterion (input, predicate, output), i.e. only keep input elements where the predicate is true. Outputs can be dense or sparse. Sparse is easy, as elements map to the same locations. Dense results in a contiguous array of the filtered elements (much faster for subsequent reading and writing). 
Why is using compact more efficient?

In the card deck example, where we want only the diamonds in the output, using compact will be much faster if the computecard() function has more than a minimal computational cost. So compact is very useful when the number of elements filtered out is large and the computation on each surviving element is high.

Summary of how to Compact:

  1. Predicate
  2. Scan-in array: 1 where the predicate is true, 0 otherwise
  3. Exclusive sum scan (produces the scatter addresses for the compacted array)
  4. Scatter the input elements into the output array using the scatter addresses – see the sketch below

The predicate and scan parts of Compact will have the same run time regardless of the number of elements filtered. If very few items are to be written to the output then the scatter step will be faster.
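
A minimal single-block CUDA sketch of those four steps, using a keep-the-even-numbers predicate and a simple (Hillis-Steele style, not work-efficient) scan; the names and predicate are mine, purely for illustration:

// Launch with one block, blockDim.x >= n, and blockDim.x * sizeof(int)
// bytes of shared memory.
__global__ void compactEven(const int *in, int *out, int *outCount, int n)
{
    extern __shared__ int addr[];
    int tid = threadIdx.x;

    // Steps 1-2: predicate -> 1/0 scan-in array
    int flag = (tid < n && in[tid] % 2 == 0) ? 1 : 0;
    addr[tid] = flag;
    __syncthreads();

    // Step 3: inclusive Hillis-Steele sum scan, then shift to exclusive
    for (int offset = 1; offset < blockDim.x; offset <<= 1) {
        int val = (tid >= offset) ? addr[tid - offset] : 0;
        __syncthreads();
        addr[tid] += val;
        __syncthreads();
    }
    int scatterAddr = addr[tid] - flag;   // exclusive scan value for this thread

    // Step 4: scatter surviving elements to their compacted addresses
    if (flag) out[scatterAddr] = in[tid];
    if (tid == n - 1) *outCount = addr[tid];   // total number of survivors
}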

A brief side note in the lecture introduced Allocate, a generalized version of compact which can be used for clipping.

Next came segmented scan: we want to keep our kernels busy and independent. Segmented scan uses a second array to keep track of segment heads. This allows independent computation of scan values that can be reconciled later, when there is less work to do.

Sparse matrix/dense vector multiplication [SpMV] was a practical application of the techniques we had just learned. SpMV is a good practical example as sparse matrices are common in many problem domains, e.g. page ranking by search engines.

The traditional way to represent a sparse matrix is Compressed Sparse Row [CSR] format – Value (the non-zero element values), Column (the column each element came from), RowPtr (the element index that starts each new row).

Nice and simple example of CSR calculation
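
As a small made-up example (not necessarily the one from the lecture figure), a 4x4 matrix and its three CSR arrays:

//         | 5 0 0 1 |
//     M = | 0 2 0 0 |
//         | 0 0 3 0 |
//         | 4 0 0 6 |
int value[]  = {5, 1, 2, 3, 4, 6};   // non-zero values, read row by row
int column[] = {0, 3, 1, 2, 0, 3};   // column each value came from
int rowptr[] = {0, 2, 3, 4};         // index in value[] where each row starts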

 

Now that we have the CSR representation of the sparse matrix we can conduct the multiplication.

  1. Segment the Value array using the RowPtr values as demarcation points
  2. Gather the vector values for each element in the array using the Column array
  3. Use a map function to multiply the value array with the gathered vector array
  4. Conduct an exclusive segmented sum scan to get the final answer
Example of the steps above
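
The lecture builds SpMV from segmented scan as above. As a simpler illustration of how the three CSR arrays are consumed, here is a sketch that instead assigns one thread per row (a common alternative formulation, not the segmented-scan version from the lecture):

__global__ void spmvCsr(int numRows, int numNonZero, const int *rowptr,
                        const int *column, const float *value,
                        const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < numRows) {
        // rowptr[row] .. the next row's start (or the end) bounds this row's slice
        int begin = rowptr[row];
        int end   = (row + 1 < numRows) ? rowptr[row + 1] : numNonZero;
        float sum = 0.0f;
        for (int j = begin; j < end; ++j)
            sum += value[j] * x[column[j]];   // gather the vector entry and multiply
        y[row] = sum;                          // per-row sum instead of a segmented scan
    }
}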

Week 4 Lectures

Week 4 Assignment


Introduction to Parallel Programming – Week 3

Unit 3 worked on analysing the performance of CUDA programs along with some key algorithms: reduce, scan and histogram. Parallelism is easiest with one-to-one or many-to-one communication patterns; week 3 looked into how we can still get great performance gains from parallelism when we need all-to-all and many-to-many communication patterns between threads. Even when ideal scaling is not possible for an algorithm, parallelism is still likely to result in more efficient computation.

Step and work complexity are the first steps towards establishing the truth of this claim.

A parallel algorithm will be considered ‘work efficient’ if its work complexity is asymptotically the same as that of its serial counterpart.

Comparing the step and work complexity of parallel algorithms is a good start to understanding their efficiency.

If we can reduce the step complexity of our parallel implementation compared to the serial implementation, whilst maintaining work efficiency, we will have a faster execution.

REDUCE

Reduce has 2 inputs:

  1. Set of elements
  2. Reduction operation (ie: +)

Reduction operators must be binary, operate on two inputs and generate one output.

Reduction operators must be associative: (a op b) op c == a op (b op c)

eg: (1 + 2) + 3 == 1 + (2 + 3), whereas (1 - 2) - 3 != 1 - (2 - 3)

So, some operators that are both binary and associative are: multiplication, minimum, logical OR, bitwise AND

After running through some examples of reduce, including an implementation using shared memory in CUDA, we looked at scan.
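
Something along these lines (a minimal sketch of a shared-memory reduce, not the lecture's exact code): each block reduces its chunk in shared memory and writes one partial sum, which a second launch or the host then combines.

// Launch with blockDim.x a power of two and blockDim.x * sizeof(float)
// bytes of shared memory; blockSums has one entry per block.
__global__ void reduceSum(const float *in, float *blockSums, int n)
{
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    s[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) blockSums[blockIdx.x] = s[0];
}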

Inclusive vs exclusive scan and the Hillis Steele vs Blelloch implementations were covered. Finally, histograms were covered along with a quick touch on sort.

For the assignment I used a diagram from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html to implement the solution.

Understanding this process is much easier with a diagram!
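
For reference, here is a single-block exclusive (Blelloch) scan kernel modelled on that GPU Gems chapter, simplified to leave out the bank-conflict handling (n a power of two, launched with n/2 threads and n * sizeof(int) bytes of shared memory):

__global__ void blellochScan(int *out, const int *in, int n)
{
    extern __shared__ int temp[];
    int tid = threadIdx.x;
    temp[2 * tid]     = in[2 * tid];
    temp[2 * tid + 1] = in[2 * tid + 1];

    int offset = 1;
    for (int d = n >> 1; d > 0; d >>= 1) {      // up-sweep (reduce) phase
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset <<= 1;
    }

    if (tid == 0) temp[n - 1] = 0;              // clear the root for an exclusive scan

    for (int d = 1; d < n; d <<= 1) {           // down-sweep phase
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            int t = temp[ai];
            temp[ai]  = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();
    out[2 * tid]     = temp[2 * tid];
    out[2 * tid + 1] = temp[2 * tid + 1];
}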

Week 3 Lectures

Week 3 Assignment


Introduction to Parallel Programming – Week 2

After learning the GPU programming model and writing a basic CUDA program, week 2 introduces some concepts for efficient algorithms.

Communication is the first issue: how do threads communicate efficiently? This is easier in some problems (e.g. Map) than in others (e.g. Scan/Sort). The memory read and write patterns of some of the key algorithm types are discussed. Input-to-output relationships:

  • Map: one-to-one
  • Gather: many-to-one
  • Stencil: several-to-one
  • Reduce: all-to-one
  • Scatter: one-to-many
  • Scan: all-to-all
  • Sort: all-to-all

The basic problem of concurrent memory access was illustrated via a scatter example. With an input element trying to write a third of its value to each of its neighbouring elements, we can see that trouble will result from independent thread execution.

One-to-many memory writes will pose a problem, with independent threads writing over the same memory locations.

To overcome these problems threads must communicate in some way. Shared memory and synchronization points (barriers) were described as tools for this job.
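
A small sketch of those two tools (shared memory plus a barrier), in which each thread averages its element with its neighbours without reading a value another thread is still writing; illustrative code only:

// Single-block example: stage data in shared memory, barrier, then compute.
__global__ void neighbourAverage(float *data, int n)
{
    extern __shared__ float s[];
    int i = threadIdx.x;
    if (i < n) s[i] = data[i];   // stage the block's data in shared memory
    __syncthreads();             // everyone has loaded before anyone reads

    if (i > 0 && i < n - 1)
        data[i] = (s[i - 1] + s[i] + s[i + 1]) / 3.0f;
}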

Some important information about CUDA and what it guarantees (and doesn’t guarantee) about thread execution was also touched on:

  • No guarantees about when and where thread blocks will run – this is an enabler for massive parallelism. There are also some limitations due to this: no assumptions can be made about which blocks will run on which SM, and there can be no direct communication between blocks.
  • Does guarantee – all threads in a block are guaranteed to run on the same SM at the same time.
  • Does guarantee – all blocks in a kernel finish before blocks from the next kernel run

A good source for some of the hardware diagrams and acronyms: www.cs.nyu.edu/manycores/cuda_many_cores.pdf

Finally coalescing memory access was discussed as a very good practice for efficiency.

The good, the not so good and the bad types of memory access
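
As a rough illustration (my own example): in the first kernel adjacent threads read adjacent addresses, so a warp's loads coalesce into a few memory transactions; in the second, adjacent threads read addresses far apart and coalescing is lost.

__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                 // thread i -> address i: coalesced
}

__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // neighbouring threads hit distant addresses (assumes i * stride fits in an int)
    if (i < n) out[i] = in[(i * stride) % n];
}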

Week 2 lectures

Week 2 Assignment


Introduction to Parallel Programming – Week 1

Introduction to Parallel Programming is the second MOOC course that I signed up for. The emergence of parallel and distributed computing is not slowing down, and it seems that most developers are not accustomed to the very different train of thought that parallelism invokes. Most recent GPUs have 1024 simple compute units, each of which can run parallel threads. The general course overview by the course instructor, John Owens:
 

 
The first week started off pretty simply, focussing on why GPUs, parallel programming and CUDA. I found the pace of the videos just right and much more engaging than other courses I have looked at.

The basics of CUDA:

 

CUDA programs are controlled by the host CPU and its memory, and the CUDA libraries enable interaction with the GPU(s).
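
A minimal sketch of that host-controlled flow (allocate on the device, copy in, launch a kernel, copy the result back); my own toy example, not from the course:

#include <cstdio>

__global__ void square(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];
}

int main()
{
    const int n = 8;
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d;
    cudaMalloc(&d, n * sizeof(float));                            // device allocation
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // host -> device
    square<<<1, n>>>(d, n);                                       // kernel launch from the host
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(d);

    for (int i = 0; i < n; ++i) printf("%.0f ", h[i]);
    return 0;
}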

Week 1 lectures

Week 1 Assignment
On review, the assignment solution is sub-optimal but good enough to do the job.


Functional Programming in Scala – Week 7

The final week of the course brought together all of the previous concepts and capabilities to leverage lazy evaluation.

Structural induction on trees can be conducted in a number of ways. Obviously they are all bound by logic, and I was somewhat confused by the level of detail at which this topic was covered, perhaps because this concept is not readily available in imperative languages.

Streams was the next topic and one that was used extensively in the week 7 assignment.

((1000 to 10000) filter isPrime)(1)

The above is very inefficient, as it finds all prime numbers between 1000 and 10,000 while only the second element of the result is used.

A good solution that avoids evaluating all of the numbers in the range from 1000 to 10,000 is the .toStream function.

((1000 to 10000).toStream filter isPrime)(1)
Example of a stream implementation in Scala

Lazy evaluation was demonstrated next; its importance parallels that of streams.

Week 7 lectures

Week 7 assignment source – first pass, lots of problems


Functional Programming in Scala – Week 6

Week 6 was the second last week in the course and focused on collections. For collections that are not generally going to be accessed in a linear fashion, Vectors are a more apt collection type.

Determining the right collection type to use is not so difficult when they are arranged logically.

Using the Sequence classes allows useful operations to be inherited:

Sequence sub classes inherit some useful operations

Week 6 Lectures

Week 6 assignment source (again this was a quick and dirty job and is far from full marks)

 


Functional Programming in Scala – Week 5

Week 5 – Lists – investigated a ‘fundamental’ data structure in functional programming. The recursive nature of lists in Scala makes them quite different from arrays. This, tied in with the power of pattern matching and the recursive nature of functional programming, gives credibility to the ‘fundamental’ label.

Pattern matching on lists is a powerful tool

Sorting of lists was covered in good detail and in addition to list1.head and list1.tail more functions were revealed:

Some key functions for lists

Week 5 Lectures

The week 4 and week 6 assignments were considered a higher workload than the others, thus there was no assignment for week 5.


Functional Programming in Scala – Week 4

Week 4 continued to delve into polymorphism with a focus on Types and Pattern Matching. Generalizing functions to accept multiple parameter types is a clear highlight of Scala.

Generic functions in Scala

Pattern matching was a topic of week 4 that was used in every subsequent week. Case statements with pattern matching appear to be another staple of Scala.

Number(1) match {
  case Number(n) => n
  case Sum(e1, e2) => eval(e1) + eval(e2)
} + eval(Number(2))

Week 4 lectures

Week 4’s assignment of implementing the Huffman coding compression algorithm was a good practical example of Scala’s power. Note my source has a number of errors in it.

Week 4 Assignment source


Functional Programming in Scala – Week 3

Week 3 looked at the object oriented aspects of Scala. The evaluation of encapsulated data was the most interesting part of week 3.

The lectures went into details about some more of the ‘syntactic sugar’ of Scala.

Abstraction of classes in Scala was explained clearly in the lectures too.

The assignment for week 3 demonstrated Scala’s ability to implement sort, filtering, and union operations in just a few lines.

Polymorphism in Scala was also described. The use of types and traits was shown to enable different ‘flavors’ of functions.

Week 3 lectures

Week 3 assignment source

Class hierarchy in Scala