Categories
Functional Programming - Scala

Functional Programming in Scala – Week 2

Week 2 delved deeper into tail recursion but, more broadly, reviewed higher-order functions. Looking at perhaps the key characteristic of functional programming languages, we found that functions are treated as first-class values: like any other value, a function can be passed as a parameter and returned as a result. This becomes important later in the course, and in real-world applications where the resolution of values can be deferred until the optimal time.
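As a minimal sketch of what first-class and higher-order functions look like in Scala (my own illustrative names, not code from the course):

object FirstClassFunctions {

  // A function stored in a value, just like any other value
  val square: Int => Int = x => x * x

  // A higher-order function: takes another function as a parameter
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // A function returned as a result
  def multiplier(factor: Int): Int => Int = x => x * factor

  def main(args: Array[String]): Unit = {
    println(applyTwice(square, 3)) // 81
    println(multiplier(5)(4))      // 20
  }
}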

Another key point of functional programming was introduced: immutable values. How immutable values relate to distributed and parallel programming was touched on briefly. Odersky elaborates on this in the following presentation:

 

The concept of currying was also explained: in essence, rewriting a function that takes multiple arguments as a chain of functions that each take a single argument, in the interest of simplifying and generalizing code. These concepts and some of the different function types are details that have not stuck with me very well since the course. I guess that happens when you don't write a lot of code in a language and then leave it for a few months.
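As a refresher, a small illustration of currying in Scala (my own example, not the course code):

object CurryingExample {

  // Uncurried: a single parameter list
  def add(x: Int, y: Int): Int = x + y

  // Curried: a chain of single-parameter lists
  def addCurried(x: Int)(y: Int): Int = x + y

  def main(args: Array[String]): Unit = {
    // Supplying only the first argument list yields a new function
    val addFive: Int => Int = addCurried(5) _
    println(addFive(3))        // 8
    println(addCurried(2)(10)) // 12
  }
}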

Week 2’s assignment was relatively straightforward and followed the content of the week’s lectures.

Week 2 Assignment code

Week 2 Lectures 

 

Categories
Functional Programming - Scala

Functional Programming in Scala – Week 1

Started and completed this course in the second half of 2012. Thought revisiting the material and uploading the weekly assignments would be a good idea. Week 1 looked at basic functions and evaluations. The recursive nature of functional programming was alluded to, particularly in the assignment.

The basics of (x <- 0 to y) and other Scala language specifics such as scoping and structure can all be reviewed in the week’s source code.
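As a quick reminder of that range syntax, a tiny snippet of my own (not from the assignment):

object RangeBasics {
  def main(args: Array[String]): Unit = {
    val y = 5

    // x takes each value in the inclusive range 0 to y
    for (x <- 0 to y) print(x + " ") // 0 1 2 3 4 5
    println()

    // Adding yield turns the loop into an expression that builds a collection
    val squares = for (x <- 0 to y) yield x * x
    println(squares)                 // Vector(0, 1, 4, 9, 16, 25)
  }
}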

I signed up for this course after watching a presentation by Rich Hickey, the creator of Clojure (another functional language for the JVM).

http://www.infoq.com/presentations/Are-We-There-Yet-Rich-Hickey

Week 1 homework code

Week 1 lecture videos: https://class.coursera.org/progfun-2012-001/lecture/8

One of the most important concepts I took from week 1 was that of tail recursion:

  /**
   * Exercise 1
   */
  def pascal(c: Int, r: Int): Int = {
    // Recursive helper: calculate a factorial with tail recursion
    def tFactorial(number: Int): Int = {
      def tfactorialWithAccumulator(accumulator: Int, number: Int): Int = {
        if (number == 1) accumulator
        else tfactorialWithAccumulator(accumulator * number, number - 1)
      }
      // start from the start!
      tfactorialWithAccumulator(1, number)
    }

    // element value is calculated using r!/(c!(r-c)!)
    if (c == 0 || c == r || r == 0 || r == 1) 1
    else tFactorial(r) / (tFactorial(c) * tFactorial(r - c))
  }
Categories
ITOps

Scala tail recursion vs non-tail recursion performance testing with dynaTrace

The Scala compiler has partial tail-call optimization (http://blog.richdougherty.com/2009/04/tail-calls-tailrec-and-trampolines.html).

Running equivalent methods with and without allowing the Scala compiler to optimize demonstrated the performance improvement gained by this optimization.

First up, simple factorial functions (source: http://myadventuresincoding.wordpress.com/2010/06/20/tail-recursion-in-scala-a-simple-example/):

// This function will be tail-recursion optimized
def tFactorial(number: Long): Long = {
  def tfactorialWithAccumulator(accumulator: Long, number: Long): Long = {
    if (number == 1) accumulator
    else tfactorialWithAccumulator(accumulator * number, number - 1)
  }
  tfactorialWithAccumulator(1, number)
}

// Non-tail recursive function
def ntFactorial(number: Long): Long = {
  if (number == 1) 1
  else number * ntFactorial(number - 1)
}

For an explanation of these functions and why they are (or are not) tail recursive, see the source linked above.
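Related to this, the @tailrec annotation (one of the topics of the post linked above) asks the compiler to verify that a method really can be optimized and to fail compilation if it cannot. A minimal sketch:

import scala.annotation.tailrec

object TailRecCheck {

  // Compiles: the recursive call is the last thing the function does
  @tailrec
  def tFactorial(accumulator: Long, number: Long): Long =
    if (number == 1) accumulator
    else tFactorial(accumulator * number, number - 1)

  // Annotating this with @tailrec would fail compilation
  // ("could not optimize @tailrec annotated method") because the
  // multiplication happens after the recursive call returns
  def ntFactorial(number: Long): Long =
    if (number == 1) 1
    else number * ntFactorial(number - 1)

  def main(args: Array[String]): Unit = {
    println(tFactorial(1, 10)) // 3628800
    println(ntFactorial(10))   // 3628800
  }
}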

Results:

Non-Tail recursive average response time: approx 7 ms

Tail recursive average response time: approx 0.7ms

The number of tests conducted was limited to about 10 each and the test environment was an Ubuntu VM, so the results will not be highly accurate, but with a difference this significant it is clear that tail-optimized recursion is very different from normal recursion (on a JVM). Looking at the PurePath for each method reveals tail-recursion optimization working as expected and saving a lot of execution time:

tail-recursion
Comparing the two PurePaths, the 10x increase in response time is clearly caused by the lack of tail-call optimization in the version on the left.

The next example looks at what happens when there is extra code within a tail-optimized function:

// Tail recursive version
def execTest3Tail(loopCount: Int): String = {
  def test2WithAccumulator(accumulator: Int, number: Int): String = {
    println("Hello World " + accumulator + "\n")
    if (accumulator >= loopCount) "Completed"
    // Note: the recursive call passes loopCount as the accumulator,
    // so the recursion terminates after the second iteration
    else test2WithAccumulator(loopCount, (accumulator + 1))
  }
  test2WithAccumulator(1, loopCount)
}

// Non-tail recursive version
def execTest3nt(loopCount: Int, cur: Int): String = {
  println("Hello World " + cur + "\n")
  if (cur >= loopCount) "Completed"
  else execTest3nt(loopCount, (cur + 1))
}

The output of the non-tail recursive function was as expected, printing every line of the loop. The tail-recursive version, however, did not behave as (perhaps incorrectly) expected: because the recursive call passes loopCount rather than the incremented accumulator, the recursion ends after the second iteration. The output was:

enter num of iterations:1000
Hello World 1
Hello World 1000
Completed

As the function never runs through all of the iterations, most of the println calls simply do not occur; tail-call optimization itself does not remove side effects such as printing. The PurePath comparison:

tail-recursion2
Results are similar to the first test. The right (tail-optimized) PurePath call hierarchy shows why there were only two screen outputs.
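For reference, a variant that threads the incremented accumulator through the recursive call prints every iteration while remaining tail recursive (my own sketch, not the code used in the test above):

def execTest3TailFixed(loopCount: Int): String = {
  def loop(accumulator: Int): String = {
    println("Hello World " + accumulator)
    if (accumulator >= loopCount) "Completed"
    else loop(accumulator + 1) // tail call with the incremented accumulator
  }
  loop(1)
}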

 

Categories
ITOps

Scala vs Java performance – dynaTrace Analysis – Test 1 Fibonacci Sequence cont’

After some initial JVM monitoring yesterday, I looked at response times today. First thing I had to do was modify the respective classes slightly to get a non-constructor method doing all the work. This made it much easier to gather and chart the results within dynaTrace (though I did later find that there was a check box in dynaTrace to monitor constructors).

The tests today were again on 3 implementations of the Fibonacci sequence:

Java – see source

Scala Simple – see source

Scala Advanced – see source

Test environment info

I spent most of the day playing with dynaTrace’s monitoring and reporting tools so did not actually do that much testing. The tests I did run were quite short: the first executed the Fibonacci sequence up to the 10,000th value, the second stretched to the 100,000th. The charts below show the response times for each test; if more than one test was executed for an implementation in a 30-second cycle, the chart shows the average.

*Update 16-JUL: Retested on the host machine (not running in a VM). Results were much more consistent despite a number of other services operating at the same time. There are some clear results this time:

Java vs Scala vs Scala Adv - Fibonacci
Running on the host machine, the Java implementation has a slight performance edge over the simple Scala implementation, while the advanced Scala implementation is slower. These tests ran to the 100,000th Fibonacci value.

 

Test1Loop.NoAdv_JMVmon_1.000.000_HOST
Stretching the test to the 1,000,000th Fibonacci value meant that Scala Adv overran the heap, but the difference between Scala simple and Java was negligible.

From the results it appears that the variations in response times are most likely caused by external factors (i.e. the OS or other services taking resources, garbage collections, etc.). I will really need to increase the sample size to get a better understanding of what is influencing response times.

Test1Loop.All3_ResponseTime_v10000
Fibonacci sequence to 10,000th value
Test1Loop.All3_ResponseTime_v100000
Fibonacci sequence to 100,000th value
Categories
ITOps

Scala vs Java performance – dynaTrace Analysis – Test 1 Fibonacci Sequence

Trying to make some comparisons between Scala and Java. Using dynaTrace I can get very useful and granular insight into the behaviour of each. To keep things as comparable as possible, the tests will be on simple, short algorithms. Large numbers of iterations will be run so garbage collection and space complexity can be analysed.

Thanks to Julio Faerman for his post and the implementations I have used from that post – see: http://www.infoq.com/articles/scala-java-myths-facts

With dynaTrace I can get information about JVM behaviour but also about the behaviour of specific classes and methods. This will not only allow me to make some comparisons between the Java and Scala implementations but also to understand more about what happens when Scala is compiled to bytecode.

To start with, a nice and simple test – the Fibonacci sequence.

Three implementations will be tested:

Java Fibonacci  – see code

Scala Simple Fibonacci  – see code

Scala Advanced Fibonacci  – see code
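The linked sources contain the exact implementations; as a rough sketch only, assuming the “simple” version is an accumulator loop and the “advanced” version builds a lazy, memoized Stream (both over BigInt, since values at these positions far exceed a Long), they might look something like:

object FibSketch {

  // "Simple": a tail-recursive accumulator loop over BigInt
  def fibSimple(n: Int): BigInt = {
    @scala.annotation.tailrec
    def loop(i: Int, prev: BigInt, curr: BigInt): BigInt =
      if (i == 0) prev else loop(i - 1, curr, prev + curr)
    loop(n, BigInt(0), BigInt(1))
  }

  // "Advanced": a lazily evaluated, memoized Stream of the whole sequence.
  // Holding a reference to the head keeps every computed value reachable,
  // which is one plausible way to exhaust the heap around the 1,000,000th value.
  lazy val fibStream: Stream[BigInt] =
    BigInt(0) #:: BigInt(1) #:: fibStream.zip(fibStream.tail).map { case (a, b) => a + b }

  def main(args: Array[String]): Unit = {
    println(fibSimple(100)) // 354224848179261915075
    println(fibStream(100)) // same value via the stream
  }
}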

I am running the tests from a controller class with a main method. From there I can instantiate the classes above. Read more about dynaTrace to see how it uses sensors in the JVM to grab a great deal of useful data about the JVM and about classes and methods.

Just to get things rolling, I checked out how running the Fibonacci sequence to the 10 millionth number would affect the JVM. These tests would have taken a long time to complete, so once I saw stable behaviour (i.e. two full sequences) I killed the JVM and moved on to the next test.

The charts below have been modified for readability; for example, the garbage collection series has been scaled up, and the charts are stacked. Within the dynaTrace client this is much easier to read with mouse-over pop-ups.

What can be taken from this fairly simple example is that code may work and appear to perform well while in reality wreaking havoc on the JVM. The next post will look at the response times of these implementations.

Test1Loop.All3_memory_v1
JVM Heap usage and Garbage Collection utilization
Test1Loop.NoAdv_memory_v1
Removing the Scala Adv Sequence allows for better comparison of Java and Scala Simple
Test1Loop.All3_GC_v1
Garbage Collection utilization and Garbage collection caused suspensions (time)
Test1Loop.NoAdv_GC_v1
GC utilization, CPU Utilization and Suspension time cause by GC – without Scala Advanced

 

Test details:

Environment:

Virtual Machine: Ubuntu 12.04 LTS 32 bit - 3072MB, 2 Processors
Physical Host: Windows 7 64-bit - 8GB, i7-2620M @ 2.70 GHz

Versions:

java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.1) (6b24-1.11.1-4ubuntu3)
OpenJDK Server VM (build 20.0-b12, mixed mode)
Scala compiler version 2.9.2

Run script:

#!/bin/bash
scala_path=/usr/share/java/scala-library.jar
java -Xmx2048m -cp $scala_path:. -agentpath:/home/mark/Desktop/performance_testing/dynatrace/agent/lib/libdtagent.so=name=TestJVM_scala_vs_java,server=192.168.242.60:9998 PerformanceController
Categories
Random

MapReduce Programming Model and Hadoop

After hearing the praises of Hadoop I had a brief look at it and the MapReduce paradigm. Most info was read from Wikipedia and references in wiki articles, in particular this paper from Jeffrey Dean and Sanjay Ghemawat.

Hadoop is an open-source software framework that implements MapReduce and a distributed file system modelled on the Google File System. The aim is to enable easy and robust deployment of highly parallel data-set processing programs. It is important to note that the MapReduce model is applicable to embarrassingly parallel problems. Processing can occur on data stored in a database or a filesystem.

Map/Reduce refers to the steps in the model:

Map: A master node takes an input problem and divides it into sub-problems, passing them to worker nodes. Worker nodes can further divide sub-problems. For example, in the problem of counting word occurrences in a document, the Map function outputs a key/value pair every time it sees a specified word – i.e. (“searchterm”, 1).

Reduce: The reduce function takes the list of (word, count) pairs, with the word as the key, and sums the occurrences for each word:

(foo, 1)

(bar, 1)

(foo, 1)

(bar, 1)

(foo, 1)

The output of reduce in this case could be: (foo, 3).
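To make the model concrete, here is a small sketch of the word-count example using plain Scala collections to stand in for the map, shuffle and reduce phases (an illustration of the model only, not the Hadoop API):

object WordCountSketch {

  // Map phase: emit a (word, 1) pair for every word in every document
  def mapPhase(documents: Seq[String]): Seq[(String, Int)] =
    documents.flatMap(_.split("\\s+")).map(word => (word, 1))

  // Shuffle and reduce phase: group the pairs by word, then sum the counts
  def reducePhase(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, occurrences) =>
      (word, occurrences.map(_._2).sum)
    }

  def main(args: Array[String]): Unit = {
    val docs = Seq("foo bar foo", "bar foo")
    println(reducePhase(mapPhase(docs))) // Map(foo -> 3, bar -> 2)
  }
}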

The MapReduce model becomes powerful when you consider that giant datasets can be processed very efficiently across large clusters using this model.

The Hadoop run-time takes care of:

  • Parallelization
  • Partitioning of input data
  • Scheduling programs execution across machines
  • Handling machine failures
  • Managing inter-machine communication

Hadoop aims to enable developers with little distributed programming experience to utilize compute resources such as EC2.

With the emergence of ‘big data’ and the apparent value that can be extracted from massive databases/datastores, many organisations have found the limits of traditional relational databases. Hadoop has generated such a buzz because it can push past the processing boundaries of relational database software and enable that value to be extracted. The video below is a decent explanation of this point by data scientists at Salesforce.com.


Categories
Random

Parsing HTML with Python

Came across an interesting scenario this week where I decided to try using Python for parsing HTML. It turns out to be quite straightforward and something I could imagine using more often in the future.

We were trying to grab a dynamic link from this page at Twitch: http://www.twitch.tv/directory/StarCraft%20II:%20Wings%20of%20Liberty.

The link we were after was the most viewed channel at any particular time:

The most viewed channel is always in this location/element

The first attempt was to load that page in an <iframe>, allowing our own JavaScript to grab the relevant URL and then forward. Twitch.tv, however, forces redirection when iframes are detected, presumably for proprietary reasons, so JavaScript was not really an option.

After experimenting with Python’s lxml (http://lxml.de/) it was really straightforward and effective. It is apparently quite efficient too, thanks to its C core (http://stackoverflow.com/a/6494811/692180). The module I used is documented quite well here: http://lxml.de/lxmlhtml.html. With a little bit of trial and error, a brief Python script successfully grabs the relevant information.

Then, using PHP’s popen() function, I can simply call the Python script and use its output as a PHP variable for a header redirect.

Links to source:

get_link.py – Python script

get_link.php – PHP calling the Python script and using its output:

 

error_reporting(E_ALL);
// Run the Python script and read its output (the channel URL)
$handle = popen('python -i ./get_link.py 2>&1', 'r');
$read = fread($handle, 2096);
pclose($handle);
// Redirect to the URL returned by the script
header('Location: '.$read);

 

Python source:

import urllib
from lxml import html

def getElement():
    # Fetch the directory page and store its contents in 's'
    f = urllib.urlopen("http://www.twitch.tv/directory/StarCraft%20II:%20Wings%20of%20Liberty")
    s = f.read()
    f.close()
    # Parse the HTML and narrow down to the channel directory element
    doc = html.document_fromstring(s)
    doc = doc.get_element_by_id('directory_channels')
    # The first element with class 'thumb' is the most viewed channel
    doc = doc.find_class('thumb')
    rVal = html.tostring(doc[0]).split()[2]
    return makeUrl(rVal)

def makeUrl(thumbString):
    # Strip the surrounding href markup to leave just the channel path
    return "http://twitch.tv" + thumbString[6:-2]

if __name__ == "__main__":
    print getElement()

 

Categories
Data Mining Project

Data Mining Project – Week 1

Started off with a bunch of data in CSV format from the World Bank. I did some initial exploration with Weka but found that everything was too complex to just plug in data and run searches. Despite doing some uni subjects on the topic, I still do not have a strong understanding of (or have forgotten) the low-level details of pre-processing and structuring data in the best way for Weka’s algorithms.

Using Python scripts is an easy way to work programmatically with CSV files, particularly with the Python csv library.

An example of a very simple script for dealing with missing values is here: http://mchost/sourcecode/python/datamining/csv_nn_missing.py
Note that in that implementation the replacement value is just zero. That can be changed to nearest neighbour or other preferred approximations.

I will use the ARFF file format for my own implementations as it seems to be a good standard and will mean I won’t have to keep modifying things if I want to use Weka in place of my own code.

 

So I have started working through the textbook written by Weka’s creators:

Ian H. Witten, Eibe Frank, and Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition, 2011.

Text Cover

I am breezing through the initial chapters which introduce the basic data mining concepts. There was a particularly interesting section on how difficult it is to maintain anonymity of data.

over 85% of Americans can be identified from publicly available records using just three pieces of information: five-digit zip code, birth date, and sex. Over half of Americans can be identified from just city, birth date, and sex.

In 2006, an Internet services company released to the research community the records of 20 million user searches. The New York Times were able to identify the actual person corresponding to user number 4417749 (they sought her permission before exposing her). They did so by analyzing the search terms she used, which included queries for landscapers in her hometown and for several people with the same last name as hers, which reporters correlated with public databases.

Netflix released 100 million records of movie ratings (from 1 to 5) with their dates. To their surprise, it turned out to be quite easy to identify people in the database and thus discover all the movies they had rated. For example, if you know approximately when (give or take two weeks) a person in the database rated six movies and you know the ratings, you can identify 99% of the people in the database.

Trying to get through the first of the book’s three sections this week so I can start implementing the practical algorithms. Hoping to do my own implementations rather than just using Weka, as it will probably end up being quicker and I will definitely learn more.

Categories
Data Mining Project

Data Mining Project – Plan

Decided to try to apply the data mining techniques learnt in the Intelligent Systems course to publicly available economic data. The two sources of data that I will start off with are:

I will run a mix of supervised and unsupervised techniques. When conducting supervised analysis I will look for relationships between economic indicators to provide inference on discussion topics such as:

  • The value of high equality in an economy
  • Benefits of non-livestock or livestock agriculture
  • Gains through geographic decentralization of population
  • Estimates on sustainable housing price ranges
  • The value of debt
  • Productivity of information technology
  • Costs and benefits of lax/tight immigration policies
  • Costs and benefits of free-market/regulated/centralized economic governance

Techniques used for quantitative analysis will be varied and dependent on subsequent success. To start with I plan on using the raw data sources in conjunction with some simple Python and Java scripts. If that turns out to be ineffective I will work with MATLAB, Weka and Netica. Google and the World Bank also have numerous interfaces for exploring data. This will be an ongoing project, so I will use these posts to keep track of progress.

 

gdp

Categories
Advanced Network Security

FIT5037 – Advanced Network Security Week 9

‘Network security and performance’ marked the ninth week of FIT5037, a logical extension of the previous week’s lecture on organizational-level network security. There has traditionally been a trade-off between speed and security, which is most definitely a sore spot for many organizations, particularly when they find a degradation in performance after investing money! The lecture looked at common techniques that should be used to ensure convenience is not disproportionately affected by security efforts. The notes outlined four key topics for the week:

  • Load balancing and firewalls
  • VPN and network performance
  • Network address translation [NAT] and load balancing
  • Network security architecture

Key awareness issues that recurred throughout the lecture:

  • Security! – Does a software/hardware/architecture solution, or a combination of these, provide sufficient security?
  • Speed and availability – Do security solutions allow for the required level of service availability for operational requirements? Is service speed affected to an unacceptable extent?
  • Robustness – If one component fails, what are the repercussions for the rest of the network in terms of the previous issues?
robustnetwork
Example of adjustments to design in consideration of organisational concerns (source: notes10)

The diagram above illustrates how the adoption of load balancers and multiple parallel firewalls satisfies speed and robustness requirements.

The lecture went on to introduce the topics of protocol security and certain VPN solutions.