Getting started with ModSecurity

XSS, CSRF and similar types of web application attacks have overtaken SQL injections as the most commonly seen attacks on the internet ( A very large number of web application were written and deployed prior to the trend up in likelihood and awareness of XSS attacks. Thus, it is extremely important to have an effective method of testing for XSS vulnerabilities and mitigating them.

Changes to production code bases can be slow, costly and can miss unreported vulnerabilities quite easily. The use of application firewalls such as ModSecurity ( become an increasingly attractive solution when faced with a decision on how to mitigate current and future XSS vulnerabilities.

Mod Security can be embedded with Apache, NGINX and IIS which is relativity straight forward. In cases where alternative web severs are being used ModSecurity can still be a viable option by creating a reverse proxy (using Apache of NGINX).

How can ModSecurity be used?

  • Alerting
  • Transforming
  • Blocking
  • and more

These functions can be enacted by rules.

A default action can be created for a group of rules using the configuration directive “SecDefaultAction

Using the following SecDefaultAction at the top of rule set that we want enable blocking and transforming on is a blunt method of protection. Redirection can also be used as a method of blocking.

A powerful web application firewall - free software!
A powerful web application firewall – free software!

Example of a default action to be applied by ruleset (note defaults cascade through the ruleset files):

SecDefaultAction “phase:2,log,auditlog,deny,status:403,tag:’Unspecified usage'”

Rulesets have been created by OWASP.

Using optional rulesets,  modsecurity_crs_16_session_hijacking.conf and modsecurity_crs_43_csrf_protection.conf ModSecurity can provide protection against Cross Site Request Forgeries [CSRF]. The @rsub operators can inject a token on every form (and/or other html elements). ModSecurity can store the expected token value as a variable which is compared to the value posted via forms or other html elements. ModSecurity rules can be based on request methods and URIs etc – alongside the ability to chain rules there are a huge number of options for mitigating XSS and CSRF without impacting normal applicatioin usage.



  • SecRuleEngine On
  • SecRequestBodyAccess On
  • SecResponseBodyAccess On

## To enable @rsub

  • SecStreamOutBodyInspection On
  • SecStreamInBodyInspection On
  • SecContentInjection On

Injecting unique request id from mod_unique_id into forms:

Some simple rules:


  • Wide capabilities for logging, alerts, blocking, redirecting, transforming
  • Parses everything coming into your web server over HTTP
  • Virtual patching – if a vulnerability is made public that affects your web application you can write and deploy a rule to mitigate the vulnerability much faster than re-release of application code patched
  • Extended uses – the capabilities of ModSecurity can be applied to applications outside the scope of application security


  • Added complexity to your application delivery chain – another point for maintenance and failure
  • Performance costs? – Though I have not had the opportunity to test the performance costs holding session information in memory and inspecting every byte of HTTP traffic can’t be free from performance cost
  • Hardware costs – Particularly if using ModSecurity’s BodyAccess and BodyInspection features, memory usage will be significant

Improving deployments:

  • Starting off being aggressive on warnings and very light on action is a necessity to ensure no impact on normal application usage
  • From this point rules and actions need to be refined
  • Understanding how the applications works allows the use of ModSecuirtys header and body inspection in effective ways

Some other notes extracted from the ModSecurity Handbook – If you decide to use ModSecurity I strongly recommend buying the handbook. It is not expensive and saves a lot of time.


SSL Review part 1

Most of us use and rely on SSL everyday. The mathematical workings of the RSA [Rivest, Shamir, Adleman] algorithm are not overly complex but mapping everything back to what happens in reality requires detailed understanding. Skipping over the need for SSL (for confidential and authenticated exchange of a symmetric key over and insecure medium) I will review the mathematical workings then how they are applied in real world examples.

There are also details in previous posts – RSA1, RSA2


1. public key – e
standard practice to choose: 65537 
2. random primes p,q
Let’s use:



3. key modulus – n
 n = pq:


4. φ(n) = (p – 1)(q – 1)


5. find e that is co-prime with φ(n)
 already using a prime,which will be co- prime.. – e = 65537 (in binary 10000000000000001)
6. (ed) mod φ(n) = 1d = e–1 mod φ(n) –  Modular multiplicative inverseMore than one answer
using Extended Euclidean algorithm:



..+ 2φ(n), +3φ(n))

7. private key – d
 de–1 mod φ(n):


With a public key (e), a key modulus (n) and a private key (d) we can apply the RSA algorithm.

Message (mess) = 911

RSA encrypt -> mess ^ e mod n  = 911 ^65537 mod 4052729130775091849638047446256554071699019514021047339267026030072286291982163

RSA encrypted message (ciph) = 3095021178047041558314072884014000324030086129008597834642883051983162360819331

RSA decrypt -> ciph ^ d mod n = 3095021178047041558314072884014000324030086129008597834642883051983162360819331 ^ 944402082567056818708092537028397604145319798848072425038015030084640082599681 mod


= 911

How is that secure?

When Alice encrypts using Bob’s public key (e) along with the key modulus (n) the output is a protected cipher.

An eavesdropper does not know the private key so decryption is very difficult:

Attacker must solve:

(unknown val, x) ^ e mod n = ciph

x ^65537 mod 4052729130775091849638047446256554071699019514021047339267026030072286291982163 = 3095021178047041558314072884014000324030086129008597834642883051983162360819331

OR, easier – try to determine the private key:

The attacker knows e and n (which = pq). When we created the private key (step 6 above) we conducted:  e–1 mod φ(n) – Modular multiplicative inverse which is relatively fast for us to calculate.

The attacked does not know  e–1 mod φ(n) though. φ(n) = (p – 1)(q – 1). The attacker knows that n is a composite prime = pq (where p and q are both primes).

So… if the attacker can solve p * q = n (where they know n) then RSA is insecure.

Thankfully the process of Integer factorization is so much harder than the process of creating p,q,nφ(n), e and d that online business and confidentiality can be maintained to acceptable levels.

Threats to RSA

It would be extremely valuable to malicious individuals/groups and  (more importantly) intelligence organizations make large integer factorization efficient enough to break RSA.

However the theoretical aspects of RSA are not generally recognized as the main source of vulnerability

Week 3 – Introduction to R. Cont’

Coming back to R after closing, a session can be restored by simply running R in the workspace directory.

A history file can be specified via:

RData can also be saved and loaded via:

Describing data:

Subsets of data is a logical next step:

Grouping data is also fairly intuitive:

Using histograms to plot variable distributions:

Lets look at some more ways to understand the data set:

Extending visualizations:

Analysis of categories can be conducted with frequency tables:

Finally lets have a look at some bivatiate (pairwise) correlations. If ther is no missing data, cor function can be users, else use can remove items:


Week 2 – Introduction to R.

Why use R?

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. –

This is a valid question considering that most languages/frameworks, including CUDA have statistical analysis libraries built in. Hopefully running through some introductory exercises will reveal the benefits.

Associated GUI’s and extensions:

  • Weka – Specific for machine learning algorithms
  • R Commander – Data analysis GUI

Install on Ubuntu 12.04:

  1. Then to enter the R command line interface, $ R

    For starters, will run through an intro from UCLA:

    Within the R command line interface if a package is to be used it must first be installed:

    • foreign – package to read data files from other stats packages
    • xlsx – package (requires Java to be installed, same architecture as your R version, also the rJava package and xlsxjars package)
    • reshape2 – package to easily melt data to long form
    • ggplot2 – package for elegant data visualization using the Grammar of Graphics
    • GGally – package for scatter plot matrices
    • vcd – package for visualizing and analyzing categorical data


    Preparing session:

    After installing R and the packages needed for a task if these packages are needed in the current session they must be included:

    After attaching all of the required packages to the current session, confirmation can be completed via:

    R code can be entered into the command line directly or saved to a script which can be run inside a session using the ‘source’ function.

    Help can be attained using ? preceding a function name.

    Entering Data:

    R is most compatible with datasets stored as text files, ie: csv.

    Base R contains functions read.table and read.csv see the help files on these functions for many options.

    Datasets from other statistical analysis software can be imported using the foreign package:

    If converting excel spreadsheets to CSV is too much of a hassle the xlxs package we imported will do the job:

    Viewing Data:

    Datasets that have been read in are stored as data frames which have a matrix structure. The most common method of indexing is object[row,column] but many others are available.

    Variables can also be accessed via their names:

    The c function is used to combine values of common type together to form a vector:

    Creating colnames:

    Saving data:

downloading a youtube playlist and converting to mp3 (osx/linux(ubuntu))

Downloading youtube audio and converting mp3 is very simple now thanks to these guys:



Usage: sh [youtube playlist URL] [output_directory]


To get the youtube playlist URL view the playlist by clicking on its name, no videos will be playing, then click on share and the url will be highlighted:

click on share and the URL will appear

Introduction to Parallel Programming – Week 5 Part 2

Week 5 -> Optimizing GPU Programs (continued)

Goal: Maximize useful computation/second

After walking through APOD [Analyze, parallelize, optimize, deploy] the lecture turned to memory bandwidth. Using the CUDA utility deviceQuery to calculate memory bandwidth using memory clock rate and memory bus width.

After determining the maximum theoretical bandwidth at 40 Gb/s practical goals were set:

  • 40-60% -> OK
  • 60-75% -> good
  • >75% -> excellent
Using 12.5 GB/s of memory bandwidth is under utilizing!
Using 12.5 GB/s of memory bandwidth is under utilizing!

The algorithm being analyzed is a transpose. When DRAM utilization is low the first guess should be lack of coalescing.

In our example code we find that the reading of memory is well coalesced.. but the writing phase is strided by N elements (1024). This was described as ‘bad’.

So considering that most GPU codes are memory limited checking memory bandwidth utilization is very important.

Enter a tool – nSightm nvpp [nvidia visual profiler]  – confirms that the write to memory operations are utilizing very little memory bandwidth whilst the read  operations are at 100% utilization.

The solution for this algorithm is ’tiling’. Tiling utilizes shared memory, taking a tile of the input copying and transposing into output. The code for this solution can be found here:

Occupancy – Each SM has a limited number of:

  • Thread blocks -> 8
  • Threadss -> 1536/2048
  • registers for all threads -> 65536
  • bytes of shared memory 16K-48K

Maximizing number of threads on the SM [streaming multi processor] will maximize occupancy. The limits for specific hardware can be found via deviceQuery.

The transpose code was further optimized, better versions can be seen in the link above.

Shared memory bank conflicts ->organized into banks and depending on how threads access memory in tile, replays of shared memory accesses can occur. (ie: striping shared memory usage across banks).

Referring back to the goal of maximizing effective computation we have just address one factor:

  • minimize time waiting at barriers

The next to be address was:

  • minimize thread divergence

Some important definitions:

  • warp – set of threads that execute the same instruction at a time
  • SIMD – Single instruction, multiple data (CPU
  • SIMT – Single instruction, multiple thread

Thread divergence is can result in up to 32x slower code. Warps on nvidia hardware have 32 threads which apply single instruction to multiple threads.

Next topic was Streams, launching kernels in separate streams allows for concurrent execution:

Benefits of specifying CUDA streams

To create streams:

cudaSteam_t s1;



asynchronous memory transfer  – cudaMemcpyAsync – called on pinned memory.

Week 5 lectures

Week 5 code – failed to get shared memory working even with atomicAdd 🙁

Virtual Enigma Machine Implementation

After watching a video on numberphile thought it would be interesting to implement a virtual version on the Enigma cipher machine.

Step 1: Install Scala and Eclipse IDE for Scala.

Step 2: Review the numberphile video and how the Enigma machine works

Step 3: Implement each function of the machine

  1. Rotors
    • Rotor choices, rotors have different circuitry (0-24) – static value for each message
    • Initial rotor combination (0-25) – static value for each message
    • Rotor increment – Faster rotor, middle rotor, slow rotor
  2. Plug board
    • 10 wires
    • swapping 2 letters
    • static value for each message

Step 4: Write an interface between the functions and a basic interface (text?)

Step 5: Test!

Conclusion: Using Scala (trying anyway) the implementation is not very complex. What is more difficult is replicating the original machines where the output character could not equal the input character. The limitation Turing used to crack the machines. Might see if I can implement that later then test out the cracking method. Not sure how easy it would be to break the current implementation…

CODE (EnigmaMachine.scala, EnigmaMachine.class):



Introduction to Parallel Programming – Week 5 Part 1

Week 5 -> Optimizing GPU Programs.

Parallelizing and porting programs to run on CUDA is generally done to either solve bigger problems or to solve more problems. So optimizing programs to require less time may be beneficial.

It is important to note that optimization should be completed with reference to the goals of the program and the execution time of each part of the program. Once a function is no longer a performance bottle neck the returns on further optimization are likely to be diminished.

– decrease time spent on memory operations
– coalesce global memory access
– avoid thread divergence

These basic principles do have exceptions. For instance, the transfer of data from global to shared memory may increase time on memory operations but decrease overall execution time.

The lecture highlighted some types of optimization:

  1. Selecting the right algorithm -> Likely to have the largest impact
  2. Applying the basic principles for efficiency
  3. Architecture specific optimizations
  4. Micro optimizations (instruction level)

A methodology for the development process of parallel applications was subsequently suggested:

  1. Analyze -> Profile the applications identifying bottlenecks/hotspots
  2. Parallelize -> using approaches such as libraries/OpenMP/CUDA, also selecting algorithms
  3. Optimize -> Measurement focus
  4. Deploy -> Optimization should not be completed in a vacuum it is too difficult to predict and emulate real usage


A simple working example on optimizing the transposing matrices followed.

Timing the function from a standard serial implementation to moderately parallel example and finally implementing our own fully parallel code. The code for the example:

Focusing too much on a single function will generally yield diminishing returns
Focusing too much on a single function will generally yield diminishing returns


Introduction to Parallel Programming – Week 4 (Part 2)

Part 2 of week 4’s lectures delved into Sort algorithms.

Sort is difficult in parallel programming because things get more complex when adding the following objectives:

  • keeping the hardware busy 
  • Limiting branch divergence
  • Coalescing memory access

Odd-Even Sort (brick sort), parallel version of bubble sort:

Start with an array of elements comparing odds and evens with different polarity each iteration. The elements are swapped if they are out of order.

Step complexity: O(n), Work Complexity: O(n^2)

Merge Sort

Divide and conquer approach, ideal for GPUs.

Merging two sort lists together, n sorted, n/2 sorted, n/4 sorted…

Step complexity: O(log n), Work complexity: O(n log n)

General implementations use a different sort to get to 1024 chunk sorted lists.

Merging 2 sorted can be done in parallel, remembering compact.

How parallel merge works:

  1. With 2 sorted arrays launch a thread for every element in each array. The goal of each thread is to calculate the position of its element in the final list.
  2. Thread calculates it position in its own list, its index…
  3. Calculate its position in the other array -> Binary search 🙁 every thread does a binary search on the other array.
The different stages of merge sort require different algorithms to get the most out of hardware
The different stages of merge sort require different algorithms to get the most out of hardware

When we get to the top of the merge tree, allocating one huge task to only 1 streaming multiprocessor [SM] means that there will be however many other SMs we have sitting idle. So we want one merge to be spread across multiple SMs.

How do we make sub tasks? Spliters. The logic does not seem to complex but I will review the section on it below:


Moving away from the merge sort, a new approach was explored – Sorting networks. This is a form of oblivious algorithms, ie an algorithm that does the same thing all the time. This enables high level of parallelism.

Bitonic sequences Work complexity: (n log n^2), however step complexity

All scenarios take the same amount of time, the algorithm is oblivious! If input set are small enough to fit into shared memory then sorting networks are generally good performers.

The best performing algorithm on GPUs is Radix sort – the steps for radix sort on integers is:

  1. Start with the least significant bit of the integer
  2. split input into 2 sets, 1s and 0s – (Compact – see part 1)
  3. Proceed to next LSB and repeat’

Step complexity O(kn) where k is bits in representation and n is number of elements to sort.

The final sort algorithms discussed were Quicksort and Key Value sort. Quicksort requires recursion and the control structure is quite difficult to implement.

Week 4 Lectures

Week 4 Assignment