Data Mining Project – Week 1

Started off with a bunch of data in CSV format from the World Bank. I did some initial exploration with Weka but found that everything was too complex to just plug in data and run searches. Despite doing some uni subjects on the topic, I still do not have a strong understanding of (or have forgotten) the low-level details of pre-processing and structuring data in the best way for Weka’s algorithms.

Using Python scripts is an easy way to work programmatically with CSV files, particularly with Python’s built-in csv module.

An example of a very simple script for dealing with missing values is here: http://mchost/sourcecode/python/datamining/csv_nn_missing.py
Note that in that implementation the replacement value is just zero. That can be changed to nearest neighbor or another preferred approximation.
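
To give a feel for the zero replacement idea, here is a minimal sketch using the csv module. It is not the linked script; the function name `fill_missing`, the sample data, and the set of markers treated as "missing" (empty string, `..`, `NA`) are my own assumptions for illustration.

```python
import csv
import io

# Markers treated as "missing" here are an assumption; adjust for the real data.
MISSING_MARKERS = {"", "..", "NA"}


def fill_missing(rows, fill="0", missing=MISSING_MARKERS):
    """Return a copy of rows with every missing cell replaced by `fill`."""
    return [
        [fill if cell.strip() in missing else cell for cell in row]
        for row in rows
    ]


# Hypothetical sample standing in for a World Bank CSV export.
sample = io.StringIO(
    "country,1990,1991\n"
    "Australia,1.5,\n"
    "Brazil,..,2.0\n"
)

reader = csv.reader(sample)
header = next(reader)          # keep the header row untouched
filled = fill_missing(list(reader))

# Write the cleaned rows back out as CSV text.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(header)
writer.writerows(filled)
```

Swapping zero for something smarter (a column mean, or a nearest neighbor estimate) would only mean changing how the replacement for each cell is chosen; the reading and writing stay the same.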

I will use the ARFF file format for my own implementations as it seems to be a good standard and will mean I won’t have to keep modifying things if I want to use Weka in place of my own code.
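
For reference, an ARFF file is just a header (`@relation`, one `@attribute` line per column) followed by an `@data` section of comma-separated rows, with `?` marking a missing value. A rough sketch of writing one from Python follows; the helper names `format_value` and `to_arff`, the relation name, and the attribute names are all made up for illustration, and it only handles simple string and numeric attributes.

```python
def format_value(value):
    """Format one cell for the @data section."""
    if value is None:
        return "?"  # ARFF marker for a missing value
    if isinstance(value, str):
        # Quote strings so values containing spaces or commas stay intact.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return f"'{escaped}'"
    return str(value)


def to_arff(relation, attributes, rows):
    """Build ARFF text from (name, type) attribute pairs and data rows."""
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(format_value(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical attributes and rows, not the real World Bank columns.
attributes = [
    ("country", "string"),
    ("gdp_1990", "numeric"),
    ("gdp_1991", "numeric"),
]
rows = [
    ("Australia", 1.5, None),
    ("Brazil", None, 2.0),
]

arff_text = to_arff("worldbank", attributes, rows)
print(arff_text)
```

Attribute names containing spaces would also need quoting, which this sketch skips.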

So I have started working through the textbook written by Weka’s creators:

Ian H. Witten, Eibe Frank, Mark A. Hall, 2011, Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition

I am breezing through the initial chapters which introduce the basic data mining concepts. There was a particularly interesting section on how difficult it is to maintain anonymity of data.

Over 85% of Americans can be identified from publicly available records using just three pieces of information: five-digit zip code, birth date, and sex. Over half of Americans can be identified from just city, birth date, and sex.

In 2006, an Internet services company released to the research community the records of 20 million user searches. The New York Times were able to identify the actual person corresponding to user number 4417749 (they sought her permission before exposing her). They did so by analyzing the search terms she used, which included queries for landscapers in her hometown and for several people with the same last name as hers, which reporters correlated with public databases.

Netflix released 100 million records of movie ratings (from 1 to 5) with their dates. To their surprise, it turned out to be quite easy to identify people in the database and thus discover all the movies they had rated. For example, if you know approximately when (give or take two weeks) a person in the database rated six movies and you know the ratings, you can identify 99% of the people in the database.
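
The idea behind these examples is that a handful of ordinary fields, taken together, can act like a fingerprint. A toy way to check this on a dataset is to count how many records have a combination of fields that nobody else shares. The records and the function name `unique_fraction` below are invented purely for illustration and have nothing to do with the real figures quoted above.

```python
from collections import Counter

# Made-up (zip code, birth date, sex) records; not real data.
records = [
    ("90210", "1980-01-01", "F"),
    ("90210", "1980-01-01", "F"),
    ("90210", "1975-06-15", "M"),
    ("10001", "1990-03-22", "F"),
]


def unique_fraction(rows):
    """Fraction of rows whose field combination appears exactly once."""
    counts = Counter(rows)
    unique = sum(1 for row in rows if counts[row] == 1)
    return unique / len(rows)


fraction = unique_fraction(records)
print(f"{fraction:.0%} of records are uniquely identifiable")
```

Here two of the four records share the same combination, so only the other two could be pinned to a single person from these fields alone.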

Trying to get through the first of the book’s three sections this week so I can start implementing the practical algorithms. Hoping to do my own implementations rather than just using Weka, as it will probably end up being quicker and I will definitely learn more.
