Why use R?
R is an integrated suite of software facilities for data manipulation, calculation and graphical display. – http://cran.csiro.au/doc/manuals/r-release/R-intro.html
This is a valid question, considering that most languages and frameworks, including CUDA, have statistical analysis libraries available. Hopefully, running through some introductory exercises will reveal the benefits.
Associated GUIs and extensions:
- Weka – specific to machine learning algorithms
- R Commander – Data analysis GUI
Install on Ubuntu 12.04:
    # sudo does not apply to a >> redirect, so append via tee instead
    echo "deb http://cran.csiro.au/bin/linux/ubuntu precise/" | sudo tee -a /etc/apt/sources.list
    sudo apt-get update
    sudo apt-get install r-base
Then, to enter the R command line interface:
    $ R
For starters, we will run through an introduction from UCLA: http://www.ats.ucla.edu/stat/r/seminars/intro.htm
Within the R command line interface, a package must be installed before it can be used:
    # pass the package name as a string, e.g. install.packages("foreign")
    install.packages()
- foreign – package to read data files from other stats packages
- xlsx – package to read and write Excel files (requires Java of the same architecture as your R version, plus the rJava and xlsxjars packages)
- reshape2 – package to easily melt data to long form
- ggplot2 – package for elegant data visualization using the Grammar of Graphics
- GGally – package for scatter plot matrices
- vcd – package for visualizing and analyzing categorical data
Prerequisites (Java, needed by xlsx):
    sudo apt-get install openjdk-7-*
    sudo ln -s /usr/lib/jvm/java-7-openjdk-amd64/bin/java /etc/alternatives/java
    sudo R CMD javareconf
After installing R and the packages needed for a task, each package must be attached in every session that uses it:
    require(foreign)
    require(xlsx)
Once all of the required packages are attached to the current session, this can be confirmed via:
    sessionInfo()
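As a rough sketch of how attaching can be checked in code: require() returns TRUE or FALSE rather than stopping with an error, and search() lists what is currently attached. The package name "notARealPackage123" is made up here to show the failure case.

```r
# require() returns FALSE (with a warning) when a package is missing.
# "notARealPackage123" is a made-up name used only to trigger that case.
ok <- suppressWarnings(
  require("notARealPackage123", character.only = TRUE, quietly = TRUE)
)
print(ok)

# search() lists attached packages; stats is attached by default
# in an ordinary R session.
print("package:stats" %in% search())
```

Because require() reports failure through its return value, a script can test the result and stop early with its own message.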
R code can be entered into the command line directly or saved to a script which can be run inside a session using the ‘source’ function.
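A minimal sketch of the script workflow, assuming nothing beyond base R: the snippet writes a two-line script to a temporary file and runs it with source(). The script contents and variable names are invented for illustration.

```r
# Write a tiny script to a temporary file (hypothetical contents).
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "x <- c(2, 4, 6)",
  "x_mean <- mean(x)"
), script_path)

# Run the script inside the current session; by default the objects
# it creates are placed in the global environment.
source(script_path)

print(x_mean)
```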
Help can be obtained by preceding a function name with ?, for example ?read.csv.
R works most readily with datasets stored as text files, e.g. CSV.
Base R contains the functions read.table and read.csv; see their help files for the many options available.
    # comma separated values
    dat.csv <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")
    # tab separated values
    dat.tab <- read.table("http://www.ats.ucla.edu/stat/data/hsb2.txt", header=TRUE, sep = "\t")
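To try these functions without a network connection, here is a small sketch that reads from an inline string via the text argument. The values are made up; the point is only that read.csv behaves like read.table with header and sep already set.

```r
# Inline text standing in for a small CSV file (made-up values).
csv_text <- "id,female,read
1,0,57
2,1,68
3,0,44"

# read.csv assumes header = TRUE and sep = ","
dat <- read.csv(text = csv_text)

# read.table needs both spelled out to parse the same text
dat2 <- read.table(text = csv_text, header = TRUE, sep = ",")

str(dat)
print(all.equal(dat, dat2))
```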
Datasets from other statistical analysis software can be imported using the foreign package:
    require(foreign)
    # SPSS files
    dat.spss <- read.spss("http://www.ats.ucla.edu/stat/data/hsb2.sav", to.data.frame=TRUE)
    # Stata files
    dat.dta <- read.dta("http://www.ats.ucla.edu/stat/data/hsb2.dta")
If converting Excel spreadsheets to CSV is too much of a hassle, the xlsx package we attached earlier will do the job:
    # these two steps only needed to read excel files from the internet
    f <- tempfile("hsb2", fileext=".xls")
    download.file("http://www.ats.ucla.edu/stat/data/hsb2.xls", f, mode="wb")
    dat.xls <- read.xlsx(f, sheetIndex=1)
Viewing data:
    # first few rows
    head(dat.csv)
    # last few rows
    tail(dat.csv)
    # variable names
    colnames(dat.csv)
    # pop-up view of entire data set (uncomment to run)
    # View(dat.csv)
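The same inspection functions can be tried on mtcars, a dataset that ships with R, so no download is needed. This sketch also shows the n argument and two further base functions, dim and str, that the walkthrough above does not use.

```r
# mtcars is a built-in dataset, used here in place of dat.csv.
print(head(mtcars, n = 3))   # first 3 rows instead of the default 6
print(tail(mtcars, n = 3))   # last 3 rows
print(dim(mtcars))           # number of rows, then number of columns
str(mtcars)                  # compact listing of each column and its type
```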
Datasets that have been read in are stored as data frames, which have a matrix-like structure of rows and columns. The most common method of indexing is object[row,column], but many others are available.
    # single cell value
    dat.csv[2, 3]
    # omitting row value implies all rows; here all rows in column 3
    dat.csv[, 3]
    # omitting column values implies all columns; here all columns in row 2
    dat.csv[2, ]
    # can also use ranges - rows 2 and 3, columns 2 and 3
    dat.csv[2:3, 2:3]
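Two of the other indexing methods are logical and negative indices. The sketch below uses a small made-up data frame rather than the hsb2 data so it can be run on its own.

```r
# Small made-up data frame for illustration.
df <- data.frame(id = 1:5,
                 female = c(0, 1, 0, 1, 1),
                 read = c(57, 68, 44, 63, 47))

high_read   <- df[df$read > 50, ]          # logical index: rows where read exceeds 50
without_1   <- df[-1, ]                    # negative index: every row except row 1
female_read <- df[df$female == 1, "read"]  # row condition combined with a column name

print(high_read)
print(without_1)
print(female_read)
```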
Variables can also be accessed via their names:
    # get first 10 rows of variable female using two methods
    dat.csv[1:10, "female"]
    dat.csv$female[1:10]
The c function is used to combine values of common type together to form a vector:
    # get column 1 for rows 1, 3 and 5
    dat.csv[c(1, 3, 5), 1]
    ## [1]  70  86 172
    # get row 1 values for variables female, prog and socst
    dat.csv[1, c("female", "prog", "socst")]
    ##   female prog socst
    ## 1      0    1    57
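A short sketch of what c does on its own, outside of indexing. It also shows what happens when the values are not of a common type: R coerces them to a single type, here character.

```r
nums  <- c(1, 3, 5)        # numeric vector
more  <- c(nums, 7, 9)     # combining vectors gives one flat vector
mixed <- c(1, "a", TRUE)   # mixed types are coerced to character

print(class(nums))
print(length(more))
print(mixed)
```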
Setting column names:
    colnames(dat.csv) <- c("ID", "Sex", "Ethnicity", "SES", "SchoolType", "Program",
        "Reading", "Writing", "Math", "Science", "SocialStudies")
    # to change one variable name, just use indexing
    colnames(dat.csv)[1] <- "ID2"
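A minimal sketch of renaming a single column, using a made-up two-column data frame. Indexing by position works, but matching on the old name is an alternative that does not depend on column order.

```r
# Made-up data frame for illustration.
df <- data.frame(id = 1:3, female = c(0, 1, 0))

# Rename by position: only the first name changes.
colnames(df)[1] <- "ID2"

# Rename by matching the old name instead of counting columns.
colnames(df)[colnames(df) == "female"] <- "Sex"

print(colnames(df))
```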
Saving data:
    #write.csv(dat.csv, file = "path/to/save/filename.csv")
    #write.table(dat.csv, file = "path/to/save/filename.txt", sep = "\t", na=".")
    #write.dta(dat.csv, file = "path/to/save/filename.dta")
    #write.xlsx(dat.csv, file = "path/to/save/filename.xlsx", sheetName="hsb2")
    # save to binary R format (can save multiple datasets and R objects)
    #save(dat.csv, dat.dta, dat.spss, dat.tab, file = "path/to/save/filename.RData")
    # change the working directory
    setwd("/home/a/Desktop/R/testspace1")
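To see saving and re-reading end to end without touching real paths, here is a sketch that round-trips a made-up data frame through temporary files. The row.names = FALSE argument is a choice made here so that the CSV does not gain an extra column of row numbers.

```r
# Made-up data frame for illustration.
df <- data.frame(id = 1:3, score = c(57.5, 68.25, 44))

# CSV round trip via a temporary file rather than a real path.
csv_path <- tempfile(fileext = ".csv")
write.csv(df, file = csv_path, row.names = FALSE)
df_back <- read.csv(csv_path)

# save() stores R objects by name; load() restores them under the same names.
rdata_path <- tempfile(fileext = ".RData")
save(df, file = rdata_path)
rm(df)
load(rdata_path)

print(all.equal(df, df_back))
```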