Week 2 – Introduction to R.

Why use R?

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. – http://cran.csiro.au/doc/manuals/r-release/R-intro.html

This is a valid question, considering that most languages and frameworks, including CUDA, have statistical analysis libraries available. Hopefully, running through some introductory exercises will reveal the benefits of R.

Associated GUIs and extensions:

  • Weka – Specific for machine learning algorithms
  • R Commander – Data analysis GUI

Install on Ubuntu 12.04:

  1. echo "deb http://cran.csiro.au/bin/linux/ubuntu precise/" | sudo tee -a /etc/apt/sources.list
    sudo apt-get update
    sudo apt-get install r-base

    Then, to enter the R command line interface, run: $ R

    For starters, we will run through an intro from UCLA: http://www.ats.ucla.edu/stat/r/seminars/intro.htm

    Within the R command line interface, a package must be installed with install.packages() before it can be used. The packages used here are:
    • foreign – package to read data files from other stats packages
    • xlsx – package to read and write Excel files (requires Java of the same architecture as your R installation, plus the rJava and xlsxjars packages)
    • reshape2 – package to easily melt data to long form
    • ggplot2 – package for elegant data visualization using the Grammar of Graphics
    • GGally – package for scatter plot matrices
    • vcd – package for visualizing and analyzing categorical data
    # foreign is a recommended package that normally ships with R, so it is not installed here
    install.packages("xlsx")
    install.packages("reshape2")
    install.packages("ggplot2")
    install.packages("GGally")
    install.packages("vcd")
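    Re-running install.packages() downloads each package again even when it is already present. A minimal sketch of installing only what is missing, assuming a helper named missing_packages (a hypothetical name, not part of R), could look like this:

```r
# Hypothetical helper: returns the subset of pkgs that is not yet installed.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

wanted <- c("xlsx", "reshape2", "ggplot2", "GGally", "vcd")
to_install <- missing_packages(wanted)

# Only report (or install) when something is actually missing.
if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to download from CRAN
}
```

    The install call is left commented out so the sketch can be run without network access.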

    Prerequisites (Java, needed by xlsx and rJava; run these in the shell, not in R, before installing xlsx):

    sudo apt-get install openjdk-7-*
    sudo ln -s /usr/lib/jvm/java-7-openjdk-amd64/bin/java /etc/alternatives/java
    sudo R CMD javareconf

    Preparing session:

    Once R and the packages needed for a task are installed, each package must still be attached to the current session before it can be used:

    require(foreign)
    require(xlsx)

    Once all of the required packages are attached to the current session, this can be confirmed with:

    sessionInfo()
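    One detail worth knowing about the attach step: require() returns FALSE for a missing package, while library() stops with an error. The sketch below illustrates the difference using the base package stats and a deliberately made-up package name (notARealPackage12345 is an assumption for illustration only):

```r
# require() returns TRUE or FALSE instead of stopping.
ok <- suppressWarnings(
  require("stats", character.only = TRUE, quietly = TRUE)
)
missing_ok <- suppressWarnings(
  require("notARealPackage12345", character.only = TRUE, quietly = TRUE)
)

# library() raises an error for a missing package; tryCatch turns it into a value.
lib_result <- tryCatch({
  library("notARealPackage12345", character.only = TRUE)
  "attached"
}, error = function(e) "error")

# Attached packages appear on the search path as "package:<name>".
attached <- "package:stats" %in% search()
```

    For scripts that should fail early when a dependency is missing, library() is often the safer choice.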

    R code can be entered into the command line directly or saved to a script which can be run inside a session using the ‘source’ function.
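    As a small illustration of running a saved script, the sketch below writes a two-line script to a temporary file and runs it with source(). The script contents and variable names are invented for the example:

```r
# Write a tiny script to a temporary file (made-up content).
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "x <- 1:10",
  "x_mean <- mean(x)"
), script_path)

# source() evaluates the file; by default the objects it creates
# end up in the user's workspace.
source(script_path)
x_mean
```

    In practice the path would point at a script saved by hand, for example source("analysis.R") for a file in the working directory.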

    Help can be obtained by preceding a function name with ?, for example ?read.csv.

    Entering Data:

    R works most readily with datasets stored as text files, e.g. CSV.

    Base R contains the functions read.table and read.csv. See the help files for these functions for the many options they accept.

    # comma separated values
    dat.csv <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")
    # tab separated values
    dat.tab <- read.table("http://www.ats.ucla.edu/stat/data/hsb2.txt", header=TRUE, sep = "\t")
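    The same two functions work on local files. The sketch below does not depend on the remote files: it writes a small comma-separated and a small tab-separated file to a temporary location and reads both back. The values are toy data invented for the example, not the real hsb2 dataset:

```r
# Create a small CSV and a tab-separated file in a temporary location.
csv_path <- tempfile(fileext = ".csv")
tab_path <- tempfile(fileext = ".txt")
writeLines(c("id,female,score", "1,0,57", "2,1,68", "3,0,44"), csv_path)
writeLines(c("id\tfemale\tscore", "1\t0\t57", "2\t1\t68", "3\t0\t44"), tab_path)

# read.csv assumes a header row and a comma separator.
toy.csv <- read.csv(csv_path)
# read.table needs both stated explicitly.
toy.tab <- read.table(tab_path, header = TRUE, sep = "\t")

# Compact overview of column names and types.
str(toy.csv)
```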
    

    Datasets from other statistical analysis software can be imported using the foreign package:

    require(foreign)
    # SPSS files
    dat.spss <- read.spss("http://www.ats.ucla.edu/stat/data/hsb2.sav", to.data.frame=TRUE)
    # Stata files
    dat.dta <- read.dta("http://www.ats.ucla.edu/stat/data/hsb2.dta")

    If converting Excel spreadsheets to CSV is too much of a hassle, the xlsx package we installed will do the job:

    # these two steps only needed to read excel files from the internet
    f <- tempfile("hsb2", fileext=".xls")
    download.file("http://www.ats.ucla.edu/stat/data/hsb2.xls", f, mode="wb")
    dat.xls <- read.xlsx(f, sheetIndex=1)

    Viewing Data:

    # first few rows
    head(dat.csv)
    # last few rows
    tail(dat.csv)
    # variable names
    colnames(dat.csv)
    # pop-up view of entire data set (uncomment to run)
    # View(dat.csv)
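    Beyond head and tail, a few other base R functions give a quick summary of a data frame. A short sketch on a toy data frame (values invented for the example):

```r
toy <- data.frame(id = c(1, 2, 3), female = c(0, 1, 0))

str(toy)          # column names, types and first values
dim(toy)          # rows, then columns
n <- nrow(toy)    # number of rows only
summary(toy$id)   # minimum, quartiles, mean and maximum of one column
```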
    

    Datasets that have been read in are stored as data frames, which have a matrix-like structure of rows and columns. The most common method of indexing is object[row, column], but many others are available.

    # single cell value
    dat.csv[2, 3]
    # omitting row value implies all rows; here all rows in column 3
    dat.csv[, 3]
    # omitting column values implies all columns; here all columns in row 2
    dat.csv[2, ]
    # can also use ranges - rows 2 and 3, columns 2 and 3
    dat.csv[2:3, 2:3]
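    The indexing forms above return different kinds of objects, which matters when the result is passed on to other functions. A self-contained sketch on a toy data frame (values invented for the example):

```r
toy <- data.frame(
  id     = c(1, 2, 3),
  female = c(0, 1, 0),
  score  = c(57, 68, 44)
)

cell  <- toy[2, 3]      # single value: row 2, column 3
col3  <- toy[, 3]       # one column comes back as a plain vector
row2  <- toy[2, ]       # one row is still a data frame
block <- toy[2:3, 2:3]  # rows 2-3, columns 2-3 as a smaller data frame

# drop = FALSE keeps a single column as a data frame.
col3_df <- toy[, 3, drop = FALSE]
```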

    Variables can also be accessed via their names:

    # get first 10 rows of variable female using two methods
    dat.csv[1:10, "female"]
    dat.csv$female[1:10]

    The c function combines values of a common type into a vector:

    # get column 1 for rows 1, 3 and 5
    dat.csv[c(1, 3, 5), 1]
    ## [1]  70  86 172
    # get row 1 values for variables female, prog and socst
    dat.csv[1, c("female", "prog", "socst")]
    ##   female prog socst
    ## 1      0    1    57
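    Two things follow from how c works: values of mixed types are coerced to a single common type, and a logical vector can be used for indexing in the same position as a vector of row numbers. A small sketch on toy data (values invented for the example):

```r
toy <- data.frame(id     = c(10, 20, 30, 40, 50),
                  female = c(0, 1, 0, 0, 1))

# Numeric index vector: rows 1, 3 and 5 of column "id".
picked <- toy[c(1, 3, 5), "id"]

# Mixed types are coerced to one common type, here character.
mixed <- c(1, "a", TRUE)
class(mixed)

# Logical index vector: ids of the rows where female equals 1.
female_ids <- toy[toy$female == 1, "id"]
```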

    Renaming columns:

    colnames(dat.csv) <- c("ID", "Sex", "Ethnicity", "SES", "SchoolType", "Program", 
        "Reading", "Writing", "Math", "Science", "SocialStudies")
    
    # to change one variable name, just use indexing
    colnames(dat.csv)[1] <- "ID2"
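    Renaming by position breaks silently if the column order changes. One alternative, sketched here on a toy data frame with invented names, is to look the column up by its current name:

```r
toy <- data.frame(id = 1:3, female = c(0, 1, 0))

# Rename whichever column is currently called "female".
colnames(toy)[colnames(toy) == "female"] <- "Sex"
colnames(toy)
```

    If no column has the old name, the assignment changes nothing.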

    Saving data:

    #write.csv(dat.csv, file = "path/to/save/filename.csv")
    #write.table(dat.csv, file = "path/to/save/filename.txt", sep = "\t", na=".")
    #write.dta(dat.csv, file = "path/to/save/filename.dta")
    #write.xlsx(dat.csv, file = "path/to/save/filename.xlsx", sheetName="hsb2")
    # save to binary R format (can save multiple datasets and R objects)
    #save(dat.csv, dat.dta, dat.spss, dat.tab, file = "path/to/save/filename.RData")
    # change the working directory
    setwd("/home/a/Desktop/R/testspace1")
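    To check that saved files can be read back, the sketch below does a round trip through a temporary directory using only base R. The data and file names are invented for the example:

```r
toy <- data.frame(id = c(1, 2, 3), score = c(57, 68, 44))

out_dir    <- tempdir()
csv_path   <- file.path(out_dir, "toy.csv")
rdata_path <- file.path(out_dir, "toy.RData")

# row.names = FALSE avoids writing the row numbers as an extra first column.
write.csv(toy, file = csv_path, row.names = FALSE)
toy_back <- read.csv(csv_path)

# save() stores the object under its name; load() restores it
# and returns the names of the restored objects.
save(toy, file = rdata_path)
rm(toy)
loaded <- load(rdata_path)
```

    Using file.path with an explicit directory avoids depending on the current working directory, so setwd is not needed for this kind of check.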