Week 3 – Introduction to R, Cont'd

After closing R, a session can be restored by simply starting R again in the workspace directory, provided the workspace was saved on exit.

A history file can be specified via:
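A minimal sketch of how this might look, assuming the base functions savehistory() and loadhistory() are what is meant. The file name is hypothetical, and the calls are guarded because command history only exists in interactive consoles.

```r
# Hypothetical history file name, placed in a temporary directory for this sketch.
hist_file <- file.path(tempdir(), "week3.Rhistory")

if (interactive()) {
  # try() because not every R front end supports history files.
  try(savehistory(file = hist_file))  # write the commands entered so far
  try(loadhistory(file = hist_file))  # read them back in a later session
}
```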

RData can also be saved and loaded via:
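As a rough illustration, the base functions save() and load() can be used like this. The data frame and file name are made up for the example.

```r
# Small example data frame (made up for illustration).
scores <- data.frame(id = 1:3, math = c(52, 61, 47))

# Hypothetical file name in a temporary directory.
rdata_file <- file.path(tempdir(), "week3.RData")

save(scores, file = rdata_file)  # save selected objects
rm(scores)                       # remove it to show that load() brings it back
load(rdata_file)                 # restores 'scores' under its original name

# save.image(file = rdata_file) would save every object in the workspace instead.
print(scores)
```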

Describing data:
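A short sketch of the usual descriptive functions. The built-in mtcars dataset is used here as a stand-in, since the dataset from the original walkthrough is not shown.

```r
data(mtcars)            # built-in stand-in dataset

dims <- dim(mtcars)     # number of rows and columns
names(mtcars)           # variable names
str(mtcars)             # type and first values of each variable
summary(mtcars)         # min, quartiles, mean and max per variable
head(mtcars, n = 5)     # first five rows

mean_mpg <- mean(mtcars$mpg)  # mean of a single variable
sd_mpg <- sd(mtcars$mpg)      # standard deviation of a single variable
```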

Subsetting the data is a logical next step:
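Two common ways to subset, sketched on the built-in mtcars dataset (a stand-in for the original data): logical indexing and the subset() function.

```r
# Rows where a condition holds, keeping all columns.
four_cyl <- mtcars[mtcars$cyl == 4, ]

# subset() with a row condition and a selection of columns.
efficient <- subset(mtcars, mpg > 25, select = c(mpg, cyl, wt))

print(nrow(four_cyl))
print(efficient)
```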

Grouping data is also fairly intuitive:
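One possible way to group, again using mtcars as a stand-in: tapply() for a quick summary of one variable, and aggregate() when a data frame result is preferred.

```r
# Mean mpg for each number of cylinders, as a named array.
mean_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)

# The same summary as a data frame, using the formula interface.
agg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

print(mean_by_cyl)
print(agg)
```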

Using histograms to plot variable distributions:
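A minimal base-graphics sketch, assuming mtcars as stand-in data. The plot is written to a PDF in a temporary directory (hypothetical file name) so the example also runs outside an interactive session; in an interactive session the pdf() and dev.off() lines can be dropped.

```r
pdf(file.path(tempdir(), "mpg_hist.pdf"))  # hypothetical output file

# hist() draws the plot and returns the bin counts invisibly.
h <- hist(mtcars$mpg,
          breaks = 10,
          main = "Distribution of mpg",
          xlab = "Miles per gallon",
          col = "grey")

invisible(dev.off())
print(h$counts)
```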

Let's look at some more ways to understand the data set:
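A few further options, sketched with base R on mtcars (a stand-in dataset): quartiles, a density plot and grouped boxplots. The output file name is hypothetical.

```r
# Quartiles of a single variable.
mpg_quartiles <- quantile(mtcars$mpg)
print(mpg_quartiles)

pdf(file.path(tempdir(), "mpg_overview.pdf"))  # hypothetical output file

# Smoothed distribution of mpg.
mpg_density <- density(mtcars$mpg)
plot(mpg_density, main = "Density of mpg")

# Distribution of mpg within each cylinder group.
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Cylinders", ylab = "Miles per gallon")

invisible(dev.off())
```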

Extending visualizations:
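One way a basic plot might be extended, sketched with base graphics rather than ggplot2 so that it runs without extra packages. It colours points by group, adds a fitted line and a legend. Dataset and file name are stand-ins.

```r
pdf(file.path(tempdir(), "mpg_vs_wt.pdf"))  # hypothetical output file

cyl_factor <- factor(mtcars$cyl)

# Scatter plot coloured by number of cylinders.
plot(mtcars$wt, mtcars$mpg,
     col = as.integer(cyl_factor), pch = 19,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Overlay a simple linear fit.
fit <- lm(mpg ~ wt, data = mtcars)
abline(fit, lty = 2)

legend("topright", legend = levels(cyl_factor),
       col = seq_along(levels(cyl_factor)), pch = 19, title = "Cylinders")

invisible(dev.off())
```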

Analysis of categories can be conducted with frequency tables:
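A short sketch using table() and prop.table(), with mtcars standing in for the original data.

```r
# One-way frequency table.
cyl_counts <- table(mtcars$cyl)
print(cyl_counts)

# Two-way table: cylinders by gears.
cyl_by_gear <- table(cyl = mtcars$cyl, gear = mtcars$gear)
print(cyl_by_gear)

# Row proportions: share of each gear count within a cylinder group.
row_props <- prop.table(cyl_by_gear, margin = 1)
print(round(row_props, 2))
```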

Finally, let's have a look at some bivariate (pairwise) correlations. If there is no missing data, the cor function can be used directly; otherwise, incomplete observations can be removed first:
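A sketch of both cases on mtcars (stand-in data). The missing values are introduced artificially for the example; the `use` argument of cor() controls how they are handled.

```r
vars <- mtcars[, c("mpg", "wt", "hp")]

# No missing data: cor() works directly.
cor_full <- cor(vars)

# Introduce two artificial missing values for illustration.
vars_na <- vars
vars_na$wt[c(2, 5)] <- NA

# Drop every row that has any missing value.
cor_complete <- cor(vars_na, use = "complete.obs")

# Drop rows only for the pairs of variables affected.
cor_pairwise <- cor(vars_na, use = "pairwise.complete.obs")

print(round(cor_complete, 3))
```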


Week 2 – Introduction to R.

Why use R?

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. – http://cran.csiro.au/doc/manuals/r-release/R-intro.html

This is a valid question, considering that most languages and frameworks, including CUDA, have statistical analysis libraries available. Hopefully running through some introductory exercises will reveal the benefits.

Associated GUIs and extensions:

  • Weka – Specific for machine learning algorithms
  • R Commander – Data analysis GUI

Install on Ubuntu 12.04:

  1. Install the r-base package (for example, $ sudo apt-get install r-base), then enter the R command line interface with $ R

    For starters, I will run through an intro from UCLA: http://www.ats.ucla.edu/stat/r/seminars/intro.htm

    Within the R command line interface, a package must be installed before it can be used:

    • foreign – package to read data files from other stats packages
    • xlsx – package to read and write Excel files (requires Java of the same architecture as your R version, plus the rJava and xlsxjars packages)
    • reshape2 – package to easily melt data to long form
    • ggplot2 – package for elegant data visualization using the Grammar of Graphics
    • GGally – package for scatter plot matrices
    • vcd – package for visualizing and analyzing categorical data
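A possible way to install the packages listed above in one go. The check against installed.packages() and the interactive() guard are additions for this sketch, so that nothing is downloaded when the code runs unattended.

```r
pkgs <- c("foreign", "xlsx", "reshape2", "ggplot2", "GGally", "vcd")

# Only the packages that are not yet installed.
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)  # downloads from the configured CRAN mirror
}
```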


    Preparing session:

    After installing R and the packages needed for a task, each package must also be attached to the current session before it can be used:
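Attaching is normally done with one library() call per package, e.g. library(foreign). The loop below is a sketch that attaches the packages listed earlier and skips any that are not installed, so it runs on any machine.

```r
pkgs <- c("foreign", "xlsx", "reshape2", "ggplot2", "GGally", "vcd")

for (p in pkgs) {
  if (requireNamespace(p, quietly = TRUE)) {
    library(p, character.only = TRUE)  # same effect as library(foreign), etc.
  } else {
    message("Package not installed, skipping: ", p)
  }
}

# Which of the requested packages ended up attached.
loaded <- pkgs[paste0("package:", pkgs) %in% search()]
print(loaded)
```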

    After attaching all of the required packages to the current session, this can be confirmed via:
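A sketch of two base functions that list what is attached, search() and sessionInfo(). The tools package, which ships with R, is used as a stand-in for the packages above.

```r
library(tools)  # stand-in: a package that ships with every R installation

attached <- search()           # search path, including attached packages
print(attached)
print("package:tools" %in% attached)

print(sessionInfo())           # R version plus attached and loaded packages
```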

    R code can be entered into the command line directly or saved to a script which can be run inside a session using the ‘source’ function.
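A small sketch of the source() workflow. The script name and its contents are made up; the script is written to a temporary directory first so the example is self-contained.

```r
# Hypothetical script file with two lines of R code.
script_path <- file.path(tempdir(), "hello.R")
writeLines(c("x <- 1:10",
             "x_mean <- mean(x)"),
           script_path)

# Run the script; objects it creates become available in the session.
source(script_path)
print(x_mean)
```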

    Help can be obtained by preceding a function name with ?.
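For example, using read.csv as the function of interest. help() is the long form of ?, and args() is a quick way to see just the argument list.

```r
# ?read.csv              # short form, typed at the prompt
topic <- help("read.csv")  # long form; printing 'topic' displays the help page

# Show only the function's arguments and defaults.
print(args(read.csv))

# help.search("csv") (or ??csv) searches the help system by keyword.
```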

    Entering Data:

    R works most readily with datasets stored as text files, e.g. CSV.

    Base R contains the functions read.table and read.csv; see the help files for these functions for their many options.
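A minimal sketch of both functions. The file and its contents are made up and written to a temporary directory so the example can be run as is.

```r
# Create a small CSV file to read (hypothetical data).
csv_path <- file.path(tempdir(), "students.csv")
writeLines(c("id,female,math",
             "1,0,41",
             "2,1,53",
             "3,0,54"),
           csv_path)

# read.csv assumes a header row and comma separators.
students <- read.csv(csv_path, header = TRUE)

# read.table is the general form; the separator must be given explicitly.
students_tab <- read.table(csv_path, header = TRUE, sep = ",")

print(students)
```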

    Datasets from other statistical analysis software can be imported using the foreign package:
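A sketch of a Stata round trip with foreign::write.dta and foreign::read.dta. The data and file name are made up, and the code is skipped if the foreign package is not installed.

```r
from_stata <- NULL

if (requireNamespace("foreign", quietly = TRUE)) {
  dta_path <- file.path(tempdir(), "students.dta")  # hypothetical file
  students <- data.frame(id = 1:3, math = c(41, 53, 54))

  foreign::write.dta(students, dta_path)     # create a Stata file to read
  from_stata <- foreign::read.dta(dta_path)  # import it as a data frame
  print(from_stata)

  # For SPSS files the analogous call is
  # foreign::read.spss("file.sav", to.data.frame = TRUE),
  # where "file.sav" is a placeholder name.
} else {
  message("foreign is not installed; skipping")
}
```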

    If converting Excel spreadsheets to CSV is too much of a hassle, the xlsx package installed earlier will do the job:
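A sketch using xlsx::write.xlsx and xlsx::read.xlsx. The data and file name are made up, and the code is skipped when the xlsx package or its Java dependency is unavailable.

```r
from_excel <- NULL

if (requireNamespace("xlsx", quietly = TRUE)) {
  xlsx_path <- file.path(tempdir(), "students.xlsx")  # hypothetical file
  students <- data.frame(id = 1:3, math = c(41, 53, 54))

  xlsx::write.xlsx(students, xlsx_path, row.names = FALSE)  # create a workbook
  from_excel <- xlsx::read.xlsx(xlsx_path, sheetIndex = 1)  # read the first sheet
  print(from_excel)
} else {
  message("xlsx (or its Java dependency) is not available; skipping")
}
```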

    Viewing Data:

    Datasets that have been read in are stored as data frames, which have a matrix-like structure. The most common method of indexing is object[row,column], but many others are available.
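Some common indexing forms, sketched on a small made-up data frame.

```r
df <- data.frame(id   = 1:4,
                 math = c(41, 53, 54, 47),
                 read = c(57, 68, 44, 63))

cell <- df[2, 3]                   # single cell: row 2, column 3
second_row <- df[2, ]              # a whole row
math_col <- df[, 2]                # a whole column, returned as a vector
block <- df[1:2, c("id", "math")]  # rows 1-2 of two named columns
high_math <- df[df$math > 50, ]    # rows that satisfy a condition
without_first <- df[-1, ]          # everything except row 1

print(cell)
print(high_math)
```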

    Variables can also be accessed via their names:
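A short sketch on a made-up data frame, showing the `$` operator alongside two equivalent forms.

```r
df <- data.frame(id = 1:4, math = c(41, 53, 54, 47))

by_dollar <- df$math       # most common form
by_name <- df[["math"]]    # same result, works with a name held in a variable
by_column <- df[, "math"]  # matrix-style indexing by column name

# with() lets variables be referred to directly by name.
math_mean <- with(df, mean(math))
print(math_mean)
```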

    The c function is used to combine values of common type together to form a vector:
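For instance (values made up). When the inputs are of different types, c() coerces them to a single common type.

```r
ages <- c(21, 34, 29)    # numeric vector
more <- c(ages, 40, 18)  # vectors can be extended with c() as well

# Mixed inputs are coerced to one type, here character.
mixed <- c(1, "two", TRUE)

print(more)
print(mixed)
```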

    Creating colnames:
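A minimal sketch, assuming a data frame that came in without meaningful column names. The names assigned here are made up.

```r
# Data frame with default column names X1, X2, X3.
m <- data.frame(matrix(1:6, nrow = 2))
print(colnames(m))

# Assign new names with a character vector built by c().
colnames(m) <- c("id", "math", "read")
print(m)
```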

    Saving data:
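A sketch of writing a data frame back to text files with write.csv and write.table. The data and file names are made up.

```r
df <- data.frame(id = 1:3, math = c(41, 53, 54))

out_csv <- file.path(tempdir(), "students_out.csv")  # hypothetical file
out_txt <- file.path(tempdir(), "students_out.txt")  # hypothetical file

# Comma-separated, without the row-name column.
write.csv(df, out_csv, row.names = FALSE)

# Tab-separated text via the more general write.table.
write.table(df, out_txt, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the CSV back to check the round trip.
round_trip <- read.csv(out_csv)
print(round_trip)
```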

Data Mining Project – Week 1

Started off with a bunch of data in CSV format from the World Bank. I did some initial exploration with Weka but found that everything was too complex to just plug in data and run searches. Despite doing some uni subjects on the topic, I still do not have a strong understanding of (or have forgotten) the low-level details of pre-processing and structuring data in the best way for Weka's algorithms.

Using Python scripts is an easy way to work programmatically with CSV files, particularly with the Python CSV library.

An example of a very simple script for dealing with missing values is here: http://mchost/sourcecode/python/datamining/csv_nn_missing.py
Note that in that implementation the replacement value is just zero. That can be changed to nearest neighbor or another preferred approximation.

I will use the ARFF file format for my own implementations, as it seems to be a good standard and will mean I won't have to keep modifying things if I want to use Weka in place of my own code.


So I have started working through the textbook written by Weka's creators:

Ian H. Witten, Eibe Frank, Mark A. Hall, 2011, Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition


I am breezing through the initial chapters which introduce the basic data mining concepts. There was a particularly interesting section on how difficult it is to maintain anonymity of data.

over 85% of Americans can be identified from publicly available records using just three pieces of information: five-digit zip code, birth date, and sex. Over half of Americans can be identified from just city, birth date, and sex.

In 2006, an Internet services company released to the research community the records of 20 million user searches. The New York Times were able to identify the actual person corresponding to user number 4417749 (they sought her permission before exposing her). They did so by analyzing the search terms she used, which included queries for landscapers in her hometown and for several people with the same last name as hers, which reporters correlated with public databases.

Netflix released 100 million records of movie ratings (from 1 to 5) with their dates. To their surprise, it turned out to be quite easy to identify people in the database and thus discover all the movies they had rated. For example, if you know approximately when (give or take two weeks) a person in the database rated six movies and you know the ratings, you can identify 99% of the people in the database.

Trying to get through the first of the book's three sections this week so I can start implementing the practical algorithms. Hoping to do my own implementations rather than just using Weka, as it will probably end up being quicker and I will definitely learn more.

Data Mining Project – Plan

Decided to try to apply the data mining techniques learnt in the Intelligent Systems course to publicly available economic data. The two sources of data that I will start off with are:

I will run a mix of supervised and unsupervised techniques. When conducting supervised analysis, I will look for relationships between economic indicators to inform discussion topics such as:

  • The value of high equality in an economy
  • Benefits of non-livestock or livestock agriculture
  • Gains through geographic decentralization of population
  • Estimates on sustainable housing price ranges
  • The value of debt
  • Productivity of information technology
  • Costs and benefits of lax/tight immigration policies
  • Costs and benefits of free-market/regulated/centralized economic governance

Techniques used for quantitative analysis will be varied and dependent on subsequent success. To start with, I plan on using the raw data sources in conjunction with some simple Python and Java scripts. If that turns out to be ineffective, I will work with MATLAB, Weka and Netica. Google and the World Bank also have numerous interfaces for exploring data. This will be an ongoing project, so I will make these posts to help myself keep track of progress.