Online Courses Scalable Microservices with Kubernetes

Intro to Kubernetes

OK- now we are getting to the interesting stuff. Given we have a microservices architecture using Docker, how do we effectively operate our service. The services must include production environments, testing, monitoring, scaling etc.

 Problems/Challenges with microservices – organisational structure, automation requirements, discovery requirements.

We have seen how to package up a single service but that is a small part of the operating microservices problem. Kubernetes is suggested as a solution for:

  • App configuration
  • Service Discovery
  • Managing updates/Deployments
  • Monitoring

Create a cluster (ie: CoreOS cluster) and treat is as a single machine.

Into a practical example.

# Initate kubernetes cluster on GCE
gcloud container clusters create k0
# Launch a single instance
kubectl run nginx --image=nginx:1.10.0
# List pods
kubectl get pods
# Expose nginx to the world via a load balancer provisioned by GCE
kubectl expose deployment nginx --port 80 --type LoadBalancer
# List services
kubectl get services

Kubernetes cheat sheet

Next was a discussion of the Kubernetes components:

  • Pods (Containers, volumes, namespace, single ip)
  • Monitoring, readiness/health checks
  • Configmaps and Secrets
  • Services
  • Lables

Creating secrets:

# Initate kubernetes cluster on GCE
# create secrets for all files in dir
kubectl create secret generic tls-certs --from-file=tls/
# describe secrets you have just created
kubectl describe secrets tls-certs
# create a configmap
kubectl create configmap nginx-proxy-conf --from-file=nginx/proxy.conf
# describe the configmap just created
kubectl describe configmap nginx-proxy-conf

Now that we have our tls-secrets and nginx-proxy-conf defined in the kubernetes cluster, they must be exposed to the correct pods. This is accomplished within the pod yaml definition:

    - name: "tls-certs"
        secretName: "tls-certs"
    - name: "nginx-proxy-conf"
        name: "nginx-proxy-conf"
          - key: "proxy.conf"
            path: "proxy.conf"

In production you will want expose pods using services. Sevices are a persistent endpoint for pods. If pods has a specific label then they will automatically be added to the correct service pool when confirmed alive. There are currently 3 service types:

    • cluster ip – internal only
    • NodePort – each node gets an external ip that is accessible
    • LoadBalance – A load balancer from the cloud service provider (GCE and AWS(?) only)

Accessing a service using NodePort:

# Initate kubernetes cluster on GCE
# create a service
kubectl create -f ./services/monolith.yaml
kind: Service
apiVersion: v1
  name: "monolith"
    app: "monolith"
    secure: "enabled"
    - protocol: "TCP"
      port: 443
      targetPort: 443
      nodePort: 31000
  type: NodePort
# open the nodePort port to the world on all cluster nodes
gcloud compute firewall-rules create allow-monolith-nodeport --allow=tcp:31000
# list external ip of compute nodes
gcloud compute instances list
NAME                               ZONE          MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP      STATUS
gke-k0-default-pool-0bcbb955-32j6  asia-east1-c  n1-standard-1       RUNNING
gke-k0-default-pool-0bcbb955-7ebn  asia-east1-c  n1-standard-1        RUNNING
gke-k0-default-pool-0bcbb955-h7ss  asia-east1-c  n1-standard-1        RUNNING

Now any request to those EXTERNAL_IPs on port 31000 will be routed to pods that have label “app=monolith,secure=enabled” (as defined in the service yaml)

# get pods meeting service label definition
kubectl get pods -l "app=monolith,secure=enabled"
kubectl describe pods secure-monolith

Okay – so that, like the unguided demo I worked through previously was very light on. I am still not clear on how I would many a microservices application using the kubernetes tool. How do I do deployments, how to I monitor and alert, how do I load balance (if not in google cloud), how to I do service discovery/enrollment. Theres one more lesson to go in the course, so hopefully “Deploying Microservices” is more illuminating.

Intro to DevOps Online Courses

Intro to DevOps – Lesson 3


  • Continuous integration/Delivery (Jenkins)
    • Automate from commits to repo to build to test to deploy
  • Monitoring (Graphite)

At minimum there will be 6 environments

  1. local (dev workstations)
  2. dev (sandbox)
  3. integration (test build and side effects)
  4. test (UAT, Performance, QA may be many environments)
  5. staging (live data? – replication of production)
  6. production

From coding to prod

If the hand over of each of these steps was manuals there are too many opportunities for delays and errors. Sooo:

  • Continuous integration
    • Maintain a code repository (git)
    • Automate the build (Jenkins, TravisCI, CircleCI)
    • Test the build (Jenkins, TravisCI, CircleCI)
    • Commit changes often (manual)
    • Build each commit (Jenkins)
    • Fix bugs immediately (manual)
    • Test in a clone environment (test suites)

Some practical work setting up Jenkins…


Intro to DevOps Online Courses

Intro to DevOps – Lesson 2


  • How to get Dev and ITOps working together
  • Looking at some tools to enable that integration

Started of looking at the basic conflict between ITOps and Dev. In essence the risk adversity of ITOps, why its there and why its good and bad. First hand experience of this with being woken up many nights after a ‘great’ new release makes this quite pertinent.

  • ITOps needs to run systems that are tested and tightly controlled – This means that when a release is coming that requires new or significantly changed components ITOps needs to be included in discussing and aware so they can ensure stability in Production
  • Dev needs to adopts, trial and use new technologies to create software solutions that meet user and business requirements in a effective manner
  • Performance testing needs to be conducted throughout the development iterations and this is impossible if the development environments are significantly different to production
These improvements would make remove most of the real world issues we experience when conducting release deployments.

Performance tests:

How can performance tests be conducted throughout the development of new releases, particularly if these release become more regular?

Proposed answer 1 – is a ‘Golden Image’ – A set image that is used for developing, testing and operating services. This includes Apps, Libs and OS. Docker makes this more practical with containers.

Proposed answer 2 – is to apply configuration management to all machines (not sure how practical this could be).

Practical lab:

Installed Virtualbox, Vagrant, git, ssh, Packer.

Vagrant configures VMs, Packer enabled building of ‘golden images’.

Packer template Variables:

  • Builders take source image.
  • Provisions install and configure software within running machine (shell, chef and puppet scripts).
  • Post processors conduct tasks on images output by builders ie, compress ( These post processors can create VMs for AWS, DigitalOcean, Hyper-V, Parallels, QEMU, VirtualBox, VMware

Lab instructions were pretty straight forward. On to lesson 3.

Intro to DevOps

Intro to DevOps – Lesson 1

With all of the emerging technology solutions and paradigms emerging in the IT space, it can be difficult to get a full understanding of everything; particularly before developing biases. So… from the perspective of an infosec and ops guy I will list out some notes on my own review of current direction of devops. This review is based primarily on Udacity – Intro to DevOps and assorted blogs.

Why do it?

Reduce wastage in software development and operation workflows. Simply, more value, less pain.

What is is?

Most of the definitions out there boil down to communication and collaboration between Developers, QA and IT Ops throughout all stages of the development lifecycle.

  • No more passing the release from Dev to IT Ops
  • No more clear boundaries between Dev and IT Ops people/environments/processes and tools
  • No more inconsistency between Dev and Prod environments
  • No more deciding who’s problem bugs are
  • No more 7 day release deployments
  • No more separate tool sets


Agile development + Continuous Monitoring + Delivery + Automation + Feedback loops =  DevOps?

  • Create shared view on goals, responsibilities, priorities and benefits
  • Learn from failures (feedback mechanisms include devs and operators)
  • Reduce risk and size of changes
  • Drive automation
  • Drive feedback loops
  • Validate ideas as quickly and cheaply (cost + risk) as possible

What DevOps is not:

  • Developers overtaking operations
  • Just tools (though it really is enabled and perhaps dependent on tools)

How do you apply it?

CAMS – Culture, Automation, Measurement and Sharing

  • Culture -> Agile like (People>Process>Tools) + Lean (don’t do what’s not valuable)
  • Automation -> Deployment, Unit Testing, CI -> These come together with DevOps?
  • Measurement -> Infrastructure, usage, release, performance, business metrics, processes, trends
  • Sharing -> Without the functional separation, feedback loops are tighter; particularly between code and operate

What technologies enable it?

Coming in next lesson – good resource for tools –

Other thoughts

Not much so far – looking forward to testing some tools. Particularly how patching and vulnerability management can be applied to docker images.



Introduction to Parallel Programming Online Courses

Introduction to Parallel Programming – Week 5 Part 2

Week 5 -> Optimizing GPU Programs (continued)

Goal: Maximize useful computation/second

After walking through APOD [Analyze, parallelize, optimize, deploy] the lecture turned to memory bandwidth. Using the CUDA utility deviceQuery to calculate memory bandwidth using memory clock rate and memory bus width.

After determining the maximum theoretical bandwidth at 40 Gb/s practical goals were set:

  • 40-60% -> OK
  • 60-75% -> good
  • >75% -> excellent
Using 12.5 GB/s of memory bandwidth is under utilizing!
Using 12.5 GB/s of memory bandwidth is under utilizing!

The algorithm being analyzed is a transpose. When DRAM utilization is low the first guess should be lack of coalescing.

In our example code we find that the reading of memory is well coalesced.. but the writing phase is strided by N elements (1024). This was described as ‘bad’.

So considering that most GPU codes are memory limited checking memory bandwidth utilization is very important.

Enter a tool – nSightm nvpp [nvidia visual profiler]  – confirms that the write to memory operations are utilizing very little memory bandwidth whilst the read  operations are at 100% utilization.

The solution for this algorithm is ’tiling’. Tiling utilizes shared memory, taking a tile of the input copying and transposing into output. The code for this solution can be found here:

Occupancy – Each SM has a limited number of:

  • Thread blocks -> 8
  • Threadss -> 1536/2048
  • registers for all threads -> 65536
  • bytes of shared memory 16K-48K

Maximizing number of threads on the SM [streaming multi processor] will maximize occupancy. The limits for specific hardware can be found via deviceQuery.

The transpose code was further optimized, better versions can be seen in the link above.

Shared memory bank conflicts ->organized into banks and depending on how threads access memory in tile, replays of shared memory accesses can occur. (ie: striping shared memory usage across banks).

Referring back to the goal of maximizing effective computation we have just address one factor:

  • minimize time waiting at barriers

The next to be address was:

  • minimize thread divergence

Some important definitions:

  • warp – set of threads that execute the same instruction at a time
  • SIMD – Single instruction, multiple data (CPU
  • SIMT – Single instruction, multiple thread

Thread divergence is can result in up to 32x slower code. Warps on nvidia hardware have 32 threads which apply single instruction to multiple threads.

Next topic was Streams, launching kernels in separate streams allows for concurrent execution:

Benefits of specifying CUDA streams

To create streams:

cudaSteam_t s1;



asynchronous memory transfer  – cudaMemcpyAsync – called on pinned memory.

Week 5 lectures

Week 5 code – failed to get shared memory working even with atomicAdd 🙁

Functional Programming - Scala Random

Virtual Enigma Machine Implementation

After watching a video on numberphile thought it would be interesting to implement a virtual version on the Enigma cipher machine.

Step 1: Install Scala and Eclipse IDE for Scala.

Step 2: Review the numberphile video and how the Enigma machine works

Step 3: Implement each function of the machine

  1. Rotors
    • Rotor choices, rotors have different circuitry (0-24) – static value for each message
    • Initial rotor combination (0-25) – static value for each message
    • Rotor increment – Faster rotor, middle rotor, slow rotor
  2. Plug board
    • 10 wires
    • swapping 2 letters
    • static value for each message

Step 4: Write an interface between the functions and a basic interface (text?)

Step 5: Test!

Conclusion: Using Scala (trying anyway) the implementation is not very complex. What is more difficult is replicating the original machines where the output character could not equal the input character. The limitation Turing used to crack the machines. Might see if I can implement that later then test out the cracking method. Not sure how easy it would be to break the current implementation…

CODE (EnigmaMachine.scala, EnigmaMachine.class):



// Virtual reconstruction of what the enigma machines did
// The replication is of the improved version where the input could result in the same output

object EnigmaMachine {
//////////////// ENCRYPTION/DECRYPTION FUNCTIONS \\\\\\\\\\\\\\\\\\\\\
    //to replicate the simple machine circuit chars are converter to 0 - 25
    def convertMessageToSimpleInt(rawInput: String):List[Int] = {
        val userMessage = rawInput.toUpperCase
      (for (x <- userMessage) yield if (x != 32) (x.toInt - 65) else 26).toList
    //for all integers in the list, convert
    def convertMessageToString(input: List[Int]): String = {
      (for (x <- input) yield if (x == 26) " " else (x + 65).toChar).mkString
    def enigmaGo(rotorChoices: List[Int], rotorPositions: List[Int], inputMessage: List[Int], encrypt: Boolean): List[Int] = {    
      // incrementing the rotors is dont based on the number of previous chars.. instead of the previous position
      def incrementRots(nth: Int): List[Int] = {
        val rot0 = (rotorPositions(0) + nth) % 26 //fast rotor
        val rot1 = (rotorPositions(1) + (nth / 26)) % 26 //medium rotor
        val rot2 = (rotorPositions(2) + ((nth / 26) /26)) % 26 //slow rotor
        List(rot0, rot1, rot2)
      //applies the rotor encryption then the scramble board
      def transmuteElement(element: Int, rotPosApp: List[Int]): Int = {
        val rotorTotal = rotorList(rotorChoices(0))(rotPosApp(0)) +
              rotorList(rotorChoices(1))(rotPosApp(1)) +
      def scramble(inputVal:Int):Int = plugMap1.find(x => (x._1 == inputVal) || (x._2 == inputVal)) match {
            case Some(y) => if (y._1 == inputVal) y._2 else y._1 
          case None => inputVal
        if (element == 26) element
        else if (encrypt) (scramble(element) + rotorTotal) % 26 else { 
            //decrypt is a bit messy atm :( 
            if (element - (rotorTotal % 26) < 0) scramble(element - (rotorTotal % 26) + 26) else scramble(element - (rotorTotal % 26))
      //Accumulator hold the input and output lists, when input list is empty, return output.. also tail recursive
      def enigmaAccu(remInput: List[Int], rotPos: List[Int], accumOutput: List[Int]): List[Int] = {
              if (remInput.isEmpty) accumOutput
            else enigmaAccu(remInput.tail, incrementRots(accumOutput.length), accumOutput:+(transmuteElement(remInput.head, rotPos)))
      enigmaAccu(inputMessage, rotorPositions, Nil) //start the process
////////////////////   \\\\\\\\\\\\\\\\\\\\\\\\\
//////////////// UI STUFF \\\\\\\\\\\\\\\\\\\\\
    def useMachine(rotorChoices: List[Int], rotorStartVals: List[Int], activity: String) = {
            val inputMessage = convertMessageToSimpleInt(askInput("Please enter a message to be " + activity + ", use only a-z: "))
            val resultingMessage = enigmaGo(rotorChoices, rotorStartVals, inputMessage, if (activity == "encrypted") true else false)
            println("Your message has been " + activity + ": ")

    def getRotorChoices(s1:String, p: String => Boolean):List[Int] = {
        val slot0 = getUserInput(s1 + "fast slot: ", "Entry Invalid.", p).mkString
        val slot1 = getUserInput(s1 + "medium slot: ", "Entry Invalid.", p).mkString
        val slot2 = getUserInput(s1 + "slow slot: ", "Entry Invalid.", p).mkString
    def spamLine = println("##############################################")
    def askInput(statement: String) = { 
      val input = readLine(statement)
      if (input == "") "99" else input
    def askAgain(statement: String, f: => String) = {
    def inputVals(f1: => String, f2: => String) = Stream.cons( f1, Stream.continually(f2))
    def getUserInput(s1: String, s2: String, p:String => Boolean) = inputVals(askInput(s1), askAgain(s2, askInput(s1))).find(p)
////////////////////   \\\\\\\\\\\\\\\\\\\\\\\\\
//////////////// MAIN \\\\\\\\\\\\\\\\\\\\\ 
        def main(args: Array[String]) {
            println("### WELCOME TO THE ENIGMA MACHINE! ###" )
            val rotorChoices = getRotorChoices("Choose rotor (0 - 4) for ", x => (x.toInt >= 0 && x.toInt < 5)) 
            // ROTOR START VALS 
            val rotorStartVals =  getRotorChoices("Choose rotor START VAL (0 - 25) for ", x => (x.toInt >= 0 && x.toInt < 25))
            // PLUG MAP
            println("There is only 1 plug map option at present:") 
            //ENCRYPT OF DECRYPT - ONLY RELEVENT FOR GUI TEXT.. bleh not quite need to reverse circuit manually
            val userChoice = getUserInput("Press 1 to encrypt, 2 to decrypt: ", "Entry Invalid.", x => (x == "2" || x == "1"))
            userChoice.mkString match {
              case "1" => useMachine(rotorChoices, rotorStartVals, "encrypted") 
              case "2" => useMachine(rotorChoices, rotorStartVals, "decrypted")
              case default => exit(0)
////////////////////   \\\\\\\\\\\\\\\\\\\\\\\\\

//////////////// CONSTANTS \\\\\\\\\\\\\\\\\\\\\
    val rotorList = List(List(6, 17, 4, 3, 23, 22, 21, 5, 0, 19, 20, 11, 16, 9, 13, 24, 18, 1, 8, 2, 7, 10, 12, 14, 25, 15)
    ,List(2, 7, 23, 12, 13, 15, 18, 1, 8, 11, 0, 3, 10, 5, 24, 4, 17, 19, 6, 9, 20, 14, 25, 21, 16, 22)
    ,List(23, 1, 19, 18, 0, 10, 3, 8, 9, 21, 2, 12, 13, 14, 24, 6, 20, 16, 11, 17, 15, 7, 4, 25, 22, 5)
    ,List(21, 22, 16, 2, 18, 20, 15, 9, 5, 1, 17, 10, 3, 6, 0, 13, 11, 23, 4, 12, 19, 8, 7, 25, 24, 14)
    ,List(25, 3, 11, 7, 5, 15, 24, 10, 21, 6, 12, 17, 8, 23, 2, 18, 19, 14, 4, 0, 20, 9, 22, 1, 16, 13))
    val plugMap1 = List((3,7),(11,16),(18,24),(17,9),(21,19),(14,13),(0,12),(1,25),(22,20),(4,6))
////////////////////   \\\\\\\\\\\\\\\\\\\\\\\\\
Introduction to Parallel Programming Online Courses

Introduction to Parallel Programming – Week 5 Part 1

Week 5 -> Optimizing GPU Programs.

Parallelizing and porting programs to run on CUDA is generally done to either solve bigger problems or to solve more problems. So optimizing programs to require less time may be beneficial.

It is important to note that optimization should be completed with reference to the goals of the program and the execution time of each part of the program. Once a function is no longer a performance bottle neck the returns on further optimization are likely to be diminished.

– decrease time spent on memory operations
– coalesce global memory access
– avoid thread divergence

These basic principles do have exceptions. For instance, the transfer of data from global to shared memory may increase time on memory operations but decrease overall execution time.

The lecture highlighted some types of optimization:

  1. Selecting the right algorithm -> Likely to have the largest impact
  2. Applying the basic principles for efficiency
  3. Architecture specific optimizations
  4. Micro optimizations (instruction level)

A methodology for the development process of parallel applications was subsequently suggested:

  1. Analyze -> Profile the applications identifying bottlenecks/hotspots
  2. Parallelize -> using approaches such as libraries/OpenMP/CUDA, also selecting algorithms
  3. Optimize -> Measurement focus
  4. Deploy -> Optimization should not be completed in a vacuum it is too difficult to predict and emulate real usage


A simple working example on optimizing the transposing matrices followed.

Timing the function from a standard serial implementation to moderately parallel example and finally implementing our own fully parallel code. The code for the example:

Focusing too much on a single function will generally yield diminishing returns
Focusing too much on a single function will generally yield diminishing returns


Introduction to Parallel Programming Online Courses

Introduction to Parallel Programming – Week 4 (Part 2)

Part 2 of week 4’s lectures delved into Sort algorithms.

Sort is difficult in parallel programming because things get more complex when adding the following objectives:

  • keeping the hardware busy 
  • Limiting branch divergence
  • Coalescing memory access

Odd-Even Sort (brick sort), parallel version of bubble sort:

Start with an array of elements comparing odds and evens with different polarity each iteration. The elements are swapped if they are out of order.

Step complexity: O(n), Work Complexity: O(n^2)

Merge Sort

Divide and conquer approach, ideal for GPUs.

Merging two sort lists together, n sorted, n/2 sorted, n/4 sorted…

Step complexity: O(log n), Work complexity: O(n log n)

General implementations use a different sort to get to 1024 chunk sorted lists.

Merging 2 sorted can be done in parallel, remembering compact.

How parallel merge works:

  1. With 2 sorted arrays launch a thread for every element in each array. The goal of each thread is to calculate the position of its element in the final list.
  2. Thread calculates it position in its own list, its index…
  3. Calculate its position in the other array -> Binary search 🙁 every thread does a binary search on the other array.
The different stages of merge sort require different algorithms to get the most out of hardware
The different stages of merge sort require different algorithms to get the most out of hardware

When we get to the top of the merge tree, allocating one huge task to only 1 streaming multiprocessor [SM] means that there will be however many other SMs we have sitting idle. So we want one merge to be spread across multiple SMs.

How do we make sub tasks? Spliters. The logic does not seem to complex but I will review the section on it below:


Moving away from the merge sort, a new approach was explored – Sorting networks. This is a form of oblivious algorithms, ie an algorithm that does the same thing all the time. This enables high level of parallelism.

Bitonic sequences Work complexity: (n log n^2), however step complexity

All scenarios take the same amount of time, the algorithm is oblivious! If input set are small enough to fit into shared memory then sorting networks are generally good performers.

The best performing algorithm on GPUs is Radix sort – the steps for radix sort on integers is:

  1. Start with the least significant bit of the integer
  2. split input into 2 sets, 1s and 0s – (Compact – see part 1)
  3. Proceed to next LSB and repeat’

Step complexity O(kn) where k is bits in representation and n is number of elements to sort.

The final sort algorithms discussed were Quicksort and Key Value sort. Quicksort requires recursion and the control structure is quite difficult to implement.

Week 4 Lectures

Week 4 Assignment


Introduction to Parallel Programming Online Courses

Introduction to Parallel Programming – Week 4 (Part 1)

The topic of week 4’s lectures and assignment was Sort and Scan. These are slightly more complex due to the many-to-many, all-to-all communication patterns.

Part 1 focused on Scan.

Important properties of Scan, work complexity: O(n),  step complexity: O(log n).

Variations on the Scan algorithm:

  • Compact/Filter – Gathering a subset that meet a certain criteria (input, predicate, output). Ie only keep input elements where the predicate is true. Outputs can be dense or sparse. Sparse is easy, as elements map to the same locations. Dense will result in contiguous array of the filtered elements (much faster for subsequent reading and writing). 
why is using compact more efficient?
why is using compact more efficient?

In the card deck example where we want the output of only diamonds using compact will be much faster if the computecard() function is above a minimal computational cost. So compact is very useful when the number of elelments filtered out is large and the computation on each surviving element  is high.

Summary of how to Compact:

  1. Predicate
  2. Scan-In Array: 1 / 0
  3. Exclusive sum scan (scatter addresses for the compacted array)
  4. Scatter input input element into output array using the scatter address addresses.

The predicate and scan parts of Compact will have the same run time regardless of the number of elements filtered. If very few items are to be written to the output then the scatter step will be faster.

A brief side not in the lecture introduced Allocate is a generalized version of Compact which can be used for clipping.

Next came Segmented Scan, we want to keep our kernels busy and independent. Segmented scan uses a second array to keep track of segment heads. This allows independent computation of scan values that can be reconciled when there is less work to do.

Sparse Matrix/Dense Vector multiplication [SpMv]  was a practical application for the techniques we had just learned. SpMv is a good practical example as they are common in many problems domains, ie: (Page Ranking by search engines).

Traditional way to represent a spare matrix is a Compressed Sparse Row [CSR] – Value (non-zero data element values), Column (column of elements origin), RowPtr (element index that starts a new row)

Nice and simple example of CSR calculation
Nice and simple example of CSR calculation


Now that we have the CSR representation of the sparse matrix we can conduct the multiplication.

  1. Segment the Value array using the  rowptr values as demarkation points
  2. Gather the vector values for each element in the array using the column array
  3. Use a Map function to multiply the value array with the gathered vector array.
  4. Conduct and exclusve segmented sum scan to get the final answer
Example of the steps above
Example of the steps above

Week 4 Lectures

Week 4 Assignment

Introduction to Parallel Programming Online Courses

Introduction to Parallel Programming – Week 3

Unit 3 worked on analyse the performance of CUDA programs along with some key algorithms: Reduce, Scan and Histogram. Parallelism is easiest with one-to-one or many-to-one communication patterns week 3 looked into how we still get great performance gains from parallelism when we need to do all-to-all and many-to-many communication patterns between threads. Just because ideal scaling may not be possible for an algorithm parallelism is likely still going to result in more efficient computing.

Step and work complexity are the first steps to establishing the truthfulness of this claim.

A parallel algorithm will be considered 'work efficient' if its work complexity is asymptotically to same as its serial counterpart
A parallel algorithm will be considered ‘work efficient’ if its work complexity is asymptotically to same as its serial counterpart

Comparing the step and work complexity of parallel algorithms is a good start to understanding their efficiency.

If we can reduce the step complexity in out parallel implementation compared to the serial implementation whilst maintaining work efficency we will have a faster execution.


Reduce has 2 inputs:

  1. Set of elements
  2. Reduction operation (ie: +)

Reduction operators must be binary, operate on two inputs and generate one output.

Reduction operators must be associative, a op b op c == c op b op a

eg: 1 + 2 + 3 == 3 + 2 + 1, whereas 1 – 2 – 3 != 3 – 1 – 2

So, some operators that are both binary and associative are: multiplication, minimum, logical OR, bitwise AND

After running through some examples of reduce including an implementation using shared memory in CUDA we  looked at Scan.

Inclusive vs Exclusive, Hillis Steele vs Blelloch where covered. Finally histograms was covered along with a quick touch on sort.

For the assignment I used a diagram from to implement the solution.

Understanding this process is much easier with a diagram!
Understanding this process is much easier with a diagram!

Week 3 Lectures

Week 3 Assignment