I need your thoughts on teaching with R

On Wed, Mar 11, 2009 at 9:52 PM, Derek Ogle <DOgle at northland.edu> wrote:
I agree with what Derek, my neighbor to the north, has said.

I teach introductory engineering statistics using R and have done so
for several years, although I am never completely satisfied with how R
blends with the text in such a course.  I have tried using a standard
introductory engineering text, specifically Devore's "Probability and
Statistics for Engineering and the Sciences", supplemented with
material on R (see the Devore6 package on CRAN which John Verzani
updated for the 7th edition to Devore7), Peter Dalgaard's
"Introductory Statistics with R" and now Cohen and Cohen's "Statistics
and Data with R".  I have also looked at "Probability and Statistics
with R" by Ugarte et al.

With the exception of Peter's book I found myself fighting the text.
That is, I found myself saying "the text presents this material this
way but it is unnecessary and confusing.  Do things this other way."

In the case of Peter's book I could agree with his presentation but
the book is clearly oriented toward biostatistics and has little
coverage of probability.  It came about as a supplement to another
text used in a course and reads like that so it has to be supplemented
extensively, especially if your audience is not from medical fields.

I would dearly love to see an approach to teaching statistics that
takes advantage of the graphical and computational capabilities of R
to remove redundant topics from the typical introductory course.
Sadly the last two texts I list (Cohen and Cohen, 2008; Ugarte et al.,
2008) do exactly the opposite.  Instead of using R to simplify an
approach to statistics they complicate an introductory course by
adding page after page of confusing R code.

What do I mean by simplify?  There are many topics in an introductory
statistics course that are ingrained in the curriculum but really are
there for the sake of approximation or computational simplification.
How many introductory texts still describe how to approximate a
"difficult" distribution by a "simpler" distribution (hypergeometric
by binomial, binomial by Poisson or Gaussian, etc.)?  When you can
calculate the exact probability why do you want to waste time teaching
an approximation and rules like "when np > 5 ..."?  Even a basic
graphical presentation, the histogram, is outmoded.  The purpose of
the histogram is to give us a picture of the density.  Why not use a
density plot for this?  There is a great advantage in that you can
easily overlay density plots from different groups, not to mention the
fact that it shows a smooth approximation to the density.  In the past
we used histograms because it was comparatively simple to choose bins
and count the observations in the bins then produce a bar chart.  We
can do better than that now.
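
As a minimal sketch of the point about exact probabilities, the
calculation can be done directly in R and, if desired, set beside the
textbook approximation.  The sample size, success probability, and urn
composition below are made-up illustrative values, not taken from any of
the texts mentioned.

```r
# Illustrative values only (assumptions, not from any textbook)
size <- 20
prob <- 0.1

# Exact binomial probability P(X <= 3)
exact <- pbinom(3, size = size, prob = prob)

# Normal approximation with continuity correction, for comparison
approx <- pnorm(3.5, mean = size * prob,
                sd = sqrt(size * prob * (1 - prob)))

c(exact = exact, approx = approx)

# Exact hypergeometric probability: 5 items drawn without replacement
# from 10 defective and 40 good items, P(exactly 2 defective)
hyper <- dhyper(2, m = 10, n = 40, k = 5)
hyper
```

The exact calls are no longer than the approximate one, which is the
argument for dropping the approximation rules from the course.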

Think carefully about the graphics.  Deepayan Sarkar (lattice) and
Hadley Wickham (ggplot2) have provided powerful techniques for
exploring data.  Students should benefit from that if they can do so
without needing to learn many, many details of the language.
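
For instance, overlaid density plots for several groups take a single
lattice call.  This is only a sketch; the built-in iris data set is used
here as a convenient stand-in, not as a recommendation for an
engineering course.

```r
library(lattice)

# One density curve per species, overlaid on the same axes.
# 'iris' is a stand-in data set chosen for illustration.
p <- densityplot(~ Sepal.Length, data = iris, groups = Species,
                 plot.points = FALSE, auto.key = TRUE)
print(p)
```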

When teaching the principles of hypothesis testing I describe a
p-value as "the probability of seeing the data that we did or
something more unusual when the null hypothesis (usually meaning 'no
change') is true".  The closer that probability is to "impossible",
the stronger the evidence against the null hypothesis in favor of the
alternative.  The point is that we should go directly to the p-value.
All the confusing material about picking a level and calculating the
rejection region is there because we couldn't calculate that
probability when I took an introductory course more than 40 years ago.
All we had then were slide rules, pencil and paper, and a few tables
in a book.  We can do better than that now.
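
A small sketch of going directly to the p-value, using a hypothetical
experiment (14 successes in 20 trials, null hypothesis p = 0.5,
one-sided alternative).  The numbers are invented for illustration.

```r
# Hypothetical data: 14 successes in 20 trials
successes <- 14
trials <- 20

# "The data that we did see or something more unusual" under the null:
# P(X >= 14) when X ~ Binomial(20, 0.5)
p_direct <- pbinom(successes - 1, size = trials, prob = 0.5,
                   lower.tail = FALSE)

# The same quantity as reported by the built-in exact test
p_test <- binom.test(successes, trials, p = 0.5,
                     alternative = "greater")$p.value

c(direct = p_direct, binom.test = p_test)
```

No significance level or rejection region is needed to get to this
number; the definition translates into one line of code.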

Do we need to describe computational formulas in a textbook?  It
turns out that just about every formula in an introductory text,
beyond the calculation of the sample mean, is not really the way that
the calculation is done.  Most of us know that the "short cut" formula
for the sample variance has bad numerical properties and a few might
know that regression coefficients are not really evaluated by
inverting X'X.  Why teach a formula that is only good for a simplified
situation, like a simple linear regression model?  Why not say that we
minimize the residual sum of squares and leave it at that?  Pay more
attention to model building and examining residuals.
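
Both numerical points can be checked in a few lines.  This is a sketch
with contrived data: the large offset in x is chosen deliberately to
expose the cancellation in the "short cut" formula, and the regression
data are simulated.

```r
# Sample variance: data with a large mean and a small spread.
# Deviations from the mean are -6, -3, 3, 6, so the true variance is 90/3 = 30.
x <- 1e9 + c(4, 7, 13, 16)
n <- length(x)

shortcut <- (sum(x^2) - sum(x)^2 / n) / (n - 1)  # "short cut" formula
c(shortcut = shortcut, var = var(x))             # compare with var()

# Least squares through a QR decomposition, the approach lm() takes,
# instead of forming and inverting X'X.  Simulated data.
set.seed(1)
u <- 1:10
y <- 2 + 3 * u + rnorm(10)
X <- cbind(1, u)

beta_qr <- qr.coef(qr(X), y)
beta_lm <- coef(lm(y ~ u))
rbind(qr = beta_qr, lm = beta_lm)
```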

In teaching I think it is important to strive for simplicity and
consistency in the use of R.  Keep the R code as concise as possible.

I prefer to teach lattice graphics because I think the graphics are
informative and because all the lattice functions can be called with a
formula/data pair of arguments, just as t.test, aov, lm, glm, nls,
etc. can be called with formula/data.  I use Sweave and the beamer
LaTeX class to generate the slides for my classes so that I can
extract the R code and make that available on the course web site.
The slides and class presentations describe the graphics calls
succinctly, if at all, but the detailed code is available for
examination if the students want to delve deeper.
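
The consistency of the formula/data interface can be seen by reusing
one formula across a plot, a test, and a model fit.  This is a sketch
using the built-in sleep data set, picked only because it is small and
ships with R.

```r
library(lattice)

# The same formula/data pair drives the graphic, the test, and the model.
bw <- bwplot(extra ~ group, data = sleep)    # comparative boxplots
print(bw)

tt <- t.test(extra ~ group, data = sleep)    # two-sample t test
fit <- lm(extra ~ group, data = sleep)       # linear model

tt$p.value
coef(fit)
```

Once students have seen the pattern `response ~ explanatory, data = ...`
in one function, they can carry it over to the others with little
additional syntax.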

In short, the worst way to use R in an introductory course is to teach
the same-old-same-old material augmented with page after page of
confusing R code.  Try to use the power of the computer and the
software to aid insight into data and to simplify the ideas of
statistics.

I have over the years produced slides for classes based first on
Devore's books, then on Peter's book, and now on the Cohen and Cohen
book.  I am willing to make these available, including the source
code, so others can borrow code or presentation approaches if they
wish.  I am not familiar with open documentation licenses like
Creative Commons.  If it would help to stimulate discussion I will
make them available without copyright.  I would be particularly
interested in corresponding with potential textbook authors on some
of the techniques that I think can be used to simplify presentation of
R code and graphics.  I don't have plans to embark on writing a text
myself.