Week 3 – Introduction to R (Cont’d)

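When you come back to R after closing it, the previous session can be restored by simply starting R in the workspace directory, provided the workspace was saved on exit (R loads the .RData file it finds there at startup).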

A history file can be specified via:

# recall your command history 
loadhistory(file="myfile") # default is ".Rhistory"

Workspace data (.RData files) can also be saved and loaded explicitly:

# save the workspace to the file .RData in the cwd 
save.image()

# save specific objects to a file
# if you don't specify the path, the cwd is assumed 
save(obj1, obj2, file="myfile.RData") # obj1, obj2 are placeholders for your own objects
# load a workspace into the current session
# if you don't specify the path, the cwd is assumed 
load("myfile.RData")
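As a rough illustration of the save/load round trip described above, here is a minimal sketch. The objects `scores` and `label` are made up for the example, and a temporary file stands in for "myfile.RData".

```r
# Hypothetical objects, invented for this example
scores <- c(52, 61, 47)
label  <- "pilot run"

# Save only these two objects to a temporary file
path <- tempfile(fileext = ".RData")
save(scores, label, file = path)

# Remove them from the workspace, then restore them from the file
rm(scores, label)
restored <- load(path)  # load() returns the names of the objects it restored

print(restored)
print(scores)
```

Note that load() silently overwrites any existing objects with the same names, so it is worth checking what a file contains before loading it into a busy session.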

Describing data:

# list the objects in the current workspace
ls()
# show dimensions (rows, columns) of a data object 'd'
dim(d)
# show structure of data object 'd'
str(d)
# summary of data 'd'
summary(d)
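To try these commands without the course data at hand, a small stand-in data frame is enough. The sketch below assumes invented column names and values; names(), head() and sapply() are extra base R helpers that are not part of the list above.

```r
# Toy stand-in for 'd'; the column names and values are invented
d <- data.frame(
  id    = 1:6,
  prog  = factor(c("general", "academic", "vocation",
                   "academic", "general", "academic")),
  read  = c(57, 68, 44, 63, 47, 70),
  write = c(52, 59, 33, 44, 52, 65)
)

dim(d)            # number of rows, then number of columns
names(d)          # column names
head(d, 3)        # first three rows
sapply(d, class)  # storage class of each column
```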

Subsetting the data is a logical next step:

summary(subset(d, read <= 60))
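The same selection can be written with plain logical indexing, but the two forms treat missing values differently. A minimal sketch on invented data (the `read` and `write` values here are made up):

```r
# Toy data; values are invented, with one missing 'read' score
d <- data.frame(read = c(55, 72, NA, 60), write = c(50, 68, 47, 61))

a  <- subset(d, read <= 60)       # rows where the condition is NA are dropped
b  <- d[d$read <= 60, ]           # an NA condition produces a row of NAs
c2 <- d[which(d$read <= 60), ]    # which() discards the NA, like subset()

nrow(a)   # 2
nrow(b)   # 3
nrow(c2)  # 2
```

subset() is convenient at the console; inside functions, explicit indexing with which() tends to be the safer choice.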

Grouping data is also fairly intuitive:

by(d[, 7:11], d$prog, colMeans)
by(d[, 7:11], d$prog, summary)
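To see what by() is doing, it helps to run it on a data frame small enough to check by hand. The sketch below uses invented values and selects columns by name rather than by position; aggregate() is shown as one alternative that returns the group means as a data frame.

```r
# Toy data; values are invented
d <- data.frame(
  prog  = factor(c("academic", "academic", "general", "general")),
  read  = c(60, 70, 40, 50),
  write = c(62, 66, 45, 55)
)

# by() splits the selected columns by group and applies the function to each piece
res <- by(d[, c("read", "write")], d$prog, colMeans)
print(res)

# aggregate() computes the same group means, returned as a data frame
agg <- aggregate(cbind(read, write) ~ prog, data = d, FUN = mean)
print(agg)
```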

Using histograms to plot variable distributions:

library(ggplot2) # ggplot() and the geoms come from the ggplot2 package
ggplot(d, aes(x = write)) + geom_histogram()
# Or kernel density plots
ggplot(d, aes(x = write)) + geom_density()
# Or boxplots showing the median, lower and upper quartiles, whiskers (up to 1.5 x IQR) and outliers as points
ggplot(d, aes(x = 1, y = math)) + geom_boxplot()

Let's look at some more ways to understand the data set:

# density plots by program type
ggplot(d, aes(x = write)) + geom_density() + facet_wrap(~prog)
# box plot of math scores for each teaching program
ggplot(d, aes(x = factor(prog), y = math)) + geom_boxplot()

Extending visualizations:

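library(reshape2) # melt() reshapes wide data into long (variable, value) format
# one boxplot per score variable
ggplot(melt(d[, 7:11]), aes(x = variable, y = value)) + geom_boxplot()
# break down by program:
ggplot(melt(d[, 6:11], id.vars = "prog"), aes(x = variable, y = value, fill = factor(prog))) + geom_boxplot()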
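The melt() calls above reshape the score columns from wide to long format, which is what lets ggplot draw one box per variable. To show the shape of that result without extra packages, the sketch below uses base R's stack() as a stand-in for melt(); it is not the same function, and the data are invented.

```r
# Toy wide-format data; values are invented
wide <- data.frame(read = c(57, 68), write = c(52, 59), math = c(41, 53))

# stack() is a base R stand-in for melt(): one row per (variable, value) pair
long <- stack(wide)
names(long) <- c("value", "variable")  # rename to match melt()'s column names

print(long)
nrow(long)  # 6: 2 rows x 3 columns
```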

Categorical variables can be analysed with frequency tables:

xtabs(~female, data = d)
xtabs(~race, data = d)
xtabs(~prog, data = d)
xtabs(~ses + schtyp, data = d)
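Counts are often easier to read alongside totals and proportions. A small sketch on invented data, using addmargins() and prop.table() from base R (neither appears in the original commands):

```r
# Toy data; values are invented
d <- data.frame(
  ses    = c("low", "middle", "high", "middle", "low", "middle"),
  schtyp = c("public", "public", "private", "public", "public", "private")
)

# two-way frequency table
tab <- xtabs(~ ses + schtyp, data = d)
print(tab)

addmargins(tab)              # add row and column totals
prop.table(tab, margin = 1)  # row proportions: each row sums to 1
```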

Finally, let's have a look at some bivariate (pairwise) correlations. If there is no missing data, the cor function can be used with its defaults; otherwise its use argument can drop incomplete observations (for example use = "complete.obs"):

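# correlation matrix of the score columns
cor(d[, 7:11])
# scatterplot matrix with correlations; ggpairs() comes from the GGally package
library(GGally)
ggpairs(d[, 7:11])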
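When some values are missing, the use argument of cor() controls how they are handled. A minimal sketch on invented data with one missing `read` score:

```r
# Toy data; values are invented
d <- data.frame(
  read  = c(50, 60, 70, 80, NA),
  write = c(52, 58, 71, 79, 65),
  math  = c(48, 63, 69, 82, 70)
)

# default: correlations between 'read' and the other columns come out as NA
plain <- cor(d)

# listwise deletion: drop every row that has any NA before computing
listwise <- cor(d, use = "complete.obs")

# pairwise deletion: each pair of columns uses all rows complete for that pair
pairwise <- cor(d, use = "pairwise.complete.obs")

print(plain)
print(listwise)
print(pairwise)
```

With listwise deletion every correlation is based on the same rows; with pairwise deletion the write/math correlation still uses all five rows, so the two matrices can differ.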

 
