To plot or to ggplot, that is not the question

Producing informative and aesthetically pleasing quantitative visualizations is hard work.  Any tool or library that helps me with this task is worth considering.  Since I do most of my work in R, I have a choice among plot, the default plotting function; lattice, a more powerful package; and ggplot (the ggplot2 package), which is based on the Grammar of Graphics.

There is usually a tradeoff between the expressiveness of a grammar and the learning curve necessary to master it.  I recently invested three days of my life learning the ins and outs of ggplot, and I have to say that it has been most rewarding.

The fundamental difference between plot and ggplot is that in plot you manipulate graphical elements directly using predefined functions, whereas in ggplot you build the plot one layer at a time and can supply your own functions.  You can do quite a bit (but not everything) with a function called qplot, which hides the layering from the user and works similarly to plot.  That makes qplot the natural place to start when upgrading from plot.

To demonstrate, the following R code uses the built-in plot function to partly visualize the famous iris dataset, which contains sepal and petal measurements for three species of iris flowers.

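# Tighten margins, axis-label spacing, and tick length; draw tick labels horizontally
par(mar=c(3,3,2,1), mgp=c(2,.7,0), tck=-.012, las=1)
# Scatter plot of sepal dimensions; Species codes 1:3 are shifted to palette colors 2:4
with(iris, plot(Sepal.Length, Sepal.Width, col=as.numeric(Species)+1, pch=20))
lbs = levels(iris$Species)
# plot() draws no legend, so add one by hand with matching colors and symbol
legend('topright', legend=lbs,
       col=2:4, cex=0.7, pch=20, box.lwd=0.5, pt.cex=0.6)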

One of the problems with plot is that its default plotting options are poorly chosen, so the first line of code fixes the margins, tick marks, and the orientation of the y-axis tick labels.  The parameter col=as.numeric(Species) + 1 shifts each species' color index by one, so the colors start at red, the second entry of the palette, rather than at the default black.  Type palette() at the R prompt to see the default color vector.
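To make the color offset concrete, here is a small sketch of how the species codes index into the palette.  The variable names codes, idx, and species_colors are mine and do not appear in the code above.  The exact color names returned by palette() depend on your R version, so no output is shown.

```r
# Species is a factor; as.numeric() returns its integer codes 1, 2, 3
codes <- as.numeric(iris$Species)

# Adding 1 moves the codes to palette positions 2, 3, 4 (skipping position 1)
idx <- sort(unique(codes + 1))

# Look up the palette entry each species is drawn with
species_colors <- setNames(palette()[idx], levels(iris$Species))
print(species_colors)
```

These are the same positions passed as col=2:4 to legend, which is why the legend colors line up with the points.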

The last complication is that plot does not draw the legend for you; it must be specified by hand.  And so, if you run the above code in R, you should get the following output.

It took a little bit of work, but the output looks pretty good.  Here is the equivalent task using ggplot’s qplot function.

qplot(Sepal.Length, Sepal.Width, data = iris, colour = Species, xlim=c(4,8))

As you can see, ggplot chooses much more sensible defaults, and in this particular case the interface for specifying the user's intent is very simple and intuitive.

A final word of caution.  Just as a skier who sticks to blue and green slopes is in danger of never making it out of intermediate hell, the qplot user risks never truly mastering the grammar of graphics.  For those who dare to use the much more expressive ggplot(…) function, the rewards are well worth the effort.
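As a starting point, here is a minimal sketch of how the qplot call above might be written with the layered interface.  It assumes the ggplot2 package is installed; the names p_points and p_fit are mine, and the fitted-line layer is an extra I added only to show what stacking a second layer looks like.  Note also that recent ggplot2 releases mark qplot as deprecated, which is one more reason to learn this form.

```r
library(ggplot2)

# Data plus aesthetic mappings: which variable drives x, y, and colour
p_points <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point() +   # layer 1: one point per flower
  xlim(4, 8)       # same x-axis limits as the qplot call

# Layers stack: add a per-species linear fit on top of the points
p_fit <- p_points +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

print(p_points)
print(p_fit)
```

Each + adds one piece to the plot, so you can build a figure up gradually and keep intermediate versions around, something qplot hides from you.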

Here are some of the ggplot references that I found valuable.

A Better Way to Learn Applied Statistics, Got Zat? (Part 2)

In the second semester of grad school, I remember sitting in a Statistical Inference class watching a very Russian-sounding instructor fast-forward through an overhead-projected PDF document filled with numbered equations, occasionally making comments like: “Vell, ve take zis eqazion on ze top and ve substitude it on ze butom, and zen it verk out.  Do you see zat?”  I did not see zat.  I don’t think many people saw zat.

In case I come off as an intolerant immigrant-hater, let me assure you that as an immigrant from the former Soviet bloc, I have all due respect for the very bright Russian and non-Russian scientists who came to the United States to seek intellectual and other freedoms.  But this post is not about immigration, which incidentally is in need of serious reform.  This is about an important subject, which on average is not being taught very well.

This is hardly news, but many courses in Statistics are taught by very talented statisticians who have no aptitude for, or interest in, teaching.  But poor instructors are not the only problem.  These courses are part of an institution, an institution that is no longer in the business of providing education.  Universities predominantly sell accreditation to students, and research to (mostly) the federal government.  While I believe that government-sponsored research should be a foundation of modern society, it does not have to be delivered within the confines of a teaching institution.  And a university diploma (i.e., accreditation), even from a top school, is at best a proxy for your knowledge and capabilities.  For example, if you are a software engineer, Stack Overflow and GitHub provide much more direct evidence of your abilities.

With the cost of higher education skyrocketing, it is reasonable to ask whether the traditional university education is still relevant.  I am not sure about medicine, but in statistics, the answer is a resounding ‘No.’  Unless you want to be a professor.  But chances are you will not be a professor, even if you get your coveted Ph.D.

So for all of you aspiring Data Geeks, I put together a table outlining Online Classes, Books, and Community and Q&A Sites that completely bypass the traditional channels.  And if you really want to go to school, most universities will allow you to audit classes, so that is always an option.  Got Zat?

| Subject | Online Classes | Books | Community / Q&A |
|---|---|---|---|
| Programming | Computer Science Courses at Udacity: currently Introduction to Computer Science, Logic and Discrete Mathematics (great for preparation for Probability), Programming Languages, Design of Computer Programs, and Algorithms. For a highly interactive experience, try Codecademy. | How to Think Like a Computer Scientist (Allen B. Downey); Code Complete (Steve McConnell) | Stack Overflow |
| Foundational Math | Single Variable Calculus Course on Coursera (they are adding others; check that site often); Khan Academy Linear Algebra Series; Khan Academy Calculus Series (including multivariate); Gilbert Strang’s Linear Algebra Course | Intro to Linear Algebra (Gilbert Strang); Calculus, an Intuitive and Physical Approach (Morris Kline) | Math Overflow |
| Intro to Probability and Statistics | Statistics One from Coursera (this course includes an introduction to the R language); Introduction to Statistics from Udacity | Stats: Data and Models (Richard De Veaux) | Cross Validated, which tends to be more advanced |
| Probability and Statistical Theory | It is very lonely here… | Introduction to Probability Models (Sheldon Ross); Statistical Inference (Casella and Berger) | Cross Validated |
| Applied and Computational Statistics | Machine Learning from Coursera; Statistics and Data Analysis curriculum from Coursera | Statistical Sleuth (Ramsey and Schafer); Data Analysis Using Regression and Multilevel Models (Gelman); Pattern Recognition and Machine Learning (Chris Bishop); Elements of Statistical Learning (Hastie, Tibshirani, Friedman) | Stack Overflow, especially under the R tag; New York Open Statistical Programming Meetup (try searching Meetups in your city) |
| Bayesian Statistics | Not to my knowledge, but check the above-mentioned sites. | Bayesian Data Analysis (Gelman); Doing Bayesian Data Analysis (Kruschke) | I don’t know of any specialized sites for this. |