July 28, 2013 – Mark's IT Blog

Why use R?

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. – http://cran.csiro.au/doc/manuals/r-release/R-intro.html

This is a valid question, considering that most languages and frameworks, including CUDA, have statistical analysis libraries available. Hopefully, running through some introductory exercises will reveal the benefits.

Associated GUIs and extensions:

Weka – machine learning workbench (a separate Java tool, usable from R via the RWeka package)
R Commander – Data analysis GUI

Install on Ubuntu 12.04:

# sudo does not apply to the >> redirection, so append via tee instead
echo "deb http://cran.csiro.au/bin/linux/ubuntu precise/" | sudo tee -a /etc/apt/sources.list
sudo apt-get update
sudo apt-get install r-base

Then, to enter the R command line interface, run: $ R

For starters, I will run through an intro from UCLA: http://www.ats.ucla.edu/stat/r/seminars/intro.htm

Within the R command line interface, a package must be installed before it can be used:

install.packages()

foreign – package to read data files from other stats packages
xlsx – package to read and write Excel files (requires Java of the same architecture as your R version, plus the rJava and xlsxjars packages)
reshape2 – package to easily melt data to long form
ggplot2 – package for elegant data visualization using the Grammar of Graphics
GGally – package for scatter plot matrices
vcd – package for visualizing and analyzing categorical data

install.packages("xlsx")
install.packages("reshape2")
install.packages("ggplot2")
install.packages("GGally")
install.packages("vcd")
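Before installing, it can help to check which of these packages are already present. A minimal sketch, assuming the package list above; the variable names wanted, installed and to_install are made up for illustration:

```r
# Packages used in this post (foreign normally ships with R itself).
wanted <- c("foreign", "xlsx", "reshape2", "ggplot2", "GGally", "vcd")

# installed.packages() returns a matrix whose row names are package names.
installed <- rownames(installed.packages())

# Keep only the packages that are not installed yet.
to_install <- wanted[!wanted %in% installed]
print(to_install)

# install.packages() accepts a character vector, so one call is enough.
# Uncomment to actually install (needs network access):
# if (length(to_install) > 0) install.packages(to_install)
```

The install call is left commented out so the snippet only reports what is missing.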

Pre-requisites (shell commands that set up Java for rJava and xlsx; run them before installing xlsx):

sudo apt-get install openjdk-7-*
sudo ln -s /usr/lib/jvm/java-7-openjdk-amd64/bin/java /etc/alternatives/java
sudo R CMD javareconf

Preparing session:

After installing R and the packages needed for a task, each package must be attached to the current session before it can be used:

require(foreign)
require(xlsx)
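Both require and library attach a package, but they behave differently when the package is missing. A small sketch of the difference, using a deliberately made-up package name (noSuchPackage123) that is assumed not to be installed:

```r
# "noSuchPackage123" is a made-up name that should not be installed anywhere.
pkg <- "noSuchPackage123"

# require() returns FALSE (normally with a warning) when the package is missing ...
has_pkg <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
print(has_pkg)

# ... whereas library() stops with an error, caught here for demonstration.
result <- tryCatch(
  library(pkg, character.only = TRUE),
  error = function(e) "library() raised an error"
)
print(result)
```

In a script, library is often the safer choice because a missing package fails immediately rather than later, when one of its functions is first called.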

After attaching all of the required packages, you can confirm that they are loaded in the current session with:

sessionInfo()

R code can be entered into the command line directly or saved to a script which can be run inside a session using the ‘source’ function.
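As a rough illustration of running a saved script, the sketch below writes a two-line script to a temporary file and runs it with source. The script contents and the names x and x_mean are invented for the example:

```r
# Write a tiny script to a temporary file (hypothetical contents).
script <- tempfile(fileext = ".R")
writeLines(c("x <- 1:10", "x_mean <- mean(x)"), script)

# Run the script; by default the objects it creates land in the workspace.
source(script)
print(x_mean)
```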

Help can be obtained by typing ? before a function name, e.g. ?read.csv.

Entering Data:

R works most easily with datasets stored as text files, e.g. CSV.

Base R contains the functions read.table and read.csv. See the help files for these functions for their many options.

# comma separated values
dat.csv <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")
# tab separated values
dat.tab <- read.table("http://www.ats.ucla.edu/stat/data/hsb2.txt", header=TRUE, sep = "\t")
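If the URLs above are unreachable, the same functions can be tried on inline text through their text argument. This is only a sketch: the three rows below are invented sample data, not the real hsb2 file.

```r
# Invented sample data, inlined so the example needs no network access.
csv_text <- "id,female,read
70,0,57
121,1,68
86,0,44"

# read.csv assumes a header row and comma separators by default.
small.csv <- read.csv(text = csv_text)
print(small.csv)

# read.table needs the header and separator spelled out for tab-separated input.
tab_text <- "id\tfemale\tread\n70\t0\t57\n121\t1\t68"
small.tab <- read.table(text = tab_text, header = TRUE, sep = "\t")
print(small.tab)
```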

Datasets from other statistical analysis software can be imported using the foreign package:

require(foreign)
# SPSS files
dat.spss <- read.spss("http://www.ats.ucla.edu/stat/data/hsb2.sav", to.data.frame=TRUE)
# Stata files
dat.dta <- read.dta("http://www.ats.ucla.edu/stat/data/hsb2.dta")

If converting Excel spreadsheets to CSV is too much of a hassle, the xlsx package installed earlier will do the job:

# these two steps only needed to read excel files from the internet
f <- tempfile("hsb2", fileext=".xls")
download.file("http://www.ats.ucla.edu/stat/data/hsb2.xls", f, mode="wb")
dat.xls <- read.xlsx(f, sheetIndex=1)

Viewing Data:

# first few rows
head(dat.csv)
# last few rows
tail(dat.csv)
# variable names
colnames(dat.csv)
# pop-up view of entire data set (uncomment to run)
# View(dat.csv)
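Besides head and tail, a few other base R functions give a quick overview of a data frame's shape and column types. A short sketch using the built-in mtcars data set instead of hsb2, so it runs without a network connection:

```r
# Number of rows and columns
print(dim(mtcars))
print(nrow(mtcars))
print(ncol(mtcars))

# Compact listing of each column with its type and first few values
str(mtcars)

# Basic summary statistics for one column
print(summary(mtcars$mpg))
```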

Datasets that have been read in are stored as data frames, which have a matrix-like structure of rows and columns. The most common method of indexing is object[row,column], but many others are available.

# single cell value
dat.csv[2, 3]
# omitting row value implies all rows; here all rows in column 3
dat.csv[, 3]
# omitting column values implies all columns; here all columns in row 2
dat.csv[2, ]
# can also use ranges - rows 2 and 3, columns 2 and 3
dat.csv[2:3, 2:3]
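Two of the other indexing styles are logical and negative indexing. The sketch below uses a small invented data frame named scores (not the hsb2 data) so it can be run on its own:

```r
# Invented three-row data frame for illustration.
scores <- data.frame(id = c(70, 121, 86),
                     female = c(0, 1, 0),
                     read = c(57, 68, 44))

# Position-based indexing, as above
print(scores[2, 3])       # single value: row 2, column 3
print(scores[2:3, 2:3])   # rows 2 and 3, columns 2 and 3

# Logical indexing: keep rows where the condition is TRUE
high_readers <- scores[scores$read > 50, ]
print(high_readers)

# Negative indexing: drop row 1, keep everything else
without_first <- scores[-1, ]
print(without_first)
```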

Variables can also be accessed via their names:

# get first 10 rows of variable female using two methods
dat.csv[1:10, "female"]
dat.csv$female[1:10]
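The name-based forms return the same vector, while single brackets with only a name keep the data frame wrapper. A small sketch on an invented data frame (scores is a made-up name, not part of hsb2):

```r
# Invented data frame for illustration.
scores <- data.frame(id = c(70, 121, 86), female = c(0, 1, 0))

# These three forms all extract the column as a plain vector.
by_dollar <- scores$female
by_bracket <- scores[, "female"]
by_double <- scores[["female"]]
print(identical(by_dollar, by_bracket))
print(identical(by_dollar, by_double))

# Single brackets with just a name return a one-column data frame instead.
as_frame <- scores["female"]
print(class(as_frame))
```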

The c function is used to combine values of common type together to form a vector:

# get column 1 for rows 1, 3 and 5
dat.csv[c(1, 3, 5), 1]
## [1]  70  86 172
# get row 1 values for variables female, prog and socst
dat.csv[1, c("female", "prog", "socst")]
##   female prog socst
## 1      0    1    57
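When the values passed to c are not of a common type, R coerces them to one. A brief sketch of that behaviour with invented values:

```r
# Numbers and a logical: the logical is coerced to a number (TRUE becomes 1).
nums <- c(1, 2.5, TRUE)
print(class(nums))

# Mixing in a string coerces everything to character.
mixed <- c(1, "a", TRUE)
print(class(mixed))
print(mixed)

# c() also flattens vectors passed to it into one vector.
flat <- c(c(1, 2), c(3, 4))
print(length(flat))
```

This is why a vector of column names and a vector of row numbers are built separately rather than in a single call to c.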

Setting column names:

colnames(dat.csv) <- c("ID", "Sex", "Ethnicity", "SES", "SchoolType", "Program", 
    "Reading", "Writing", "Math", "Science", "SocialStudies")

# to change one variable name, just use indexing
colnames(dat.csv)[1] <- "ID2"
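Renaming by position breaks if the column order changes, so matching on the current name can be more robust. A minimal sketch on an invented data frame (scores and the new name Sex are chosen only for illustration):

```r
# Invented data frame for illustration.
scores <- data.frame(id = c(70, 121), female = c(0, 1), read = c(57, 68))

# Find the column called "female" and rename it, wherever it sits.
colnames(scores)[colnames(scores) == "female"] <- "Sex"
print(colnames(scores))
```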

Saving data:

#write.csv(dat.csv, file = "path/to/save/filename.csv")
#write.table(dat.csv, file = "path/to/save/filename.txt", sep = "\t", na=".")
#write.dta(dat.csv, file = "path/to/save/filename.dta")
#write.xlsx(dat.csv, file = "path/to/save/filename.xlsx", sheetName="hsb2")
# save to binary R format (can save multiple datasets and R objects)
#save(dat.csv, dat.dta, dat.spss, dat.tab, file = "path/to/save/filename.RData")
# change the working directory (relative file paths resolve against it)
setwd("/home/a/Desktop/R/testspace1")
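To try saving without touching real paths, a round trip through the session's temporary directory works. This is a sketch with an invented data frame; scores, back, csv_path and rdata_path are illustrative names only:

```r
# Invented data frame for illustration.
scores <- data.frame(id = c(70, 121), read = c(57, 68))

# Write to a CSV file in the temporary directory, without row names.
csv_path <- file.path(tempdir(), "example.csv")
write.csv(scores, file = csv_path, row.names = FALSE)

# Read it back and check that the values survived.
back <- read.csv(csv_path)
print(back)

# Save both objects to R's binary format, remove them, then restore them.
rdata_path <- file.path(tempdir(), "example.RData")
save(scores, back, file = rdata_path)
rm(scores, back)
load(rdata_path)  # restores the objects under their original names
print(scores)
```

Passing row.names = FALSE keeps write.csv from adding an extra column of row numbers, which would otherwise come back as a new variable when the file is read again.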