classical statistics in R

10 messages · Jordan Mayor, Christian A. Parker, Brian Campbell +5 more

#
Hi,

I've just received my copy of Ben Bolker's new book, "Ecological Models
and Data in R". I was a little surprised to see he recommended Sokal and
Rohlf's "Biometry" as an introduction to classical stats. Not because
there's anything wrong with S&R; it's comprehensive and well written.
My problem with this book is that it's written from the perspective of
filling out tables of sums of squares according to fixed recipes, while
R is geared towards more flexible linear models. Trying to translate the
more complex recipes into R code is not a trivial task.

In response to an email, Ben suggested that Gotelli and Ellison's
"Primer of Ecological Statistics" provides a more modern take on the
subject than S&R. I have to agree, G&E is one of the best intros I've
seen for ecologists. But it doesn't really go very far into the possible
complexities of ANOVA and linear regression, and doesn't specifically
address implementing tests in R.

Ben and I are both curious as to what other r-sig-eco readers think
about this issue. What are the best sources for learning about classical
statistics as implemented in R? S&R has been the standard reference for
quite a while, but it now appears to be dated. Is there a good standard
text that covers the same breadth of material with a modern, R-compatible
approach? Ben also recommended several books by Michael Crawley - any
strong feelings on these, or other suggestions?

Thanks!

Tyler
#
I agree with Jordan and will also throw in Gelman and Hill's "Data 
Analysis Using Regression and Multilevel/Hierarchical Models". It's a
social science based book but is very relevant to ecologists and
includes R code (and BUGS code).
-Chris
Jordan Mayor wrote:
#
In general, I would not choose a book to learn basic statistics based on
whether it has R content or not.  What's important is to learn the
concepts.  Learning how to use them in a particular software is useful,
but secondary.  If we're careless about this distinction, we risk
falling into habits promoted by most commercial software, where one
points and clicks without understanding what one is doing.  The risk is
there even in GNU R, as the number of functions and packages keeps
growing to help us save time developing procedures.  There's a balance
to be reached between the help received and intellectual independence.
For classical statistics, many books have long series of editions that
have made them superb with age (like good wine).  Zar's Biostatistical
Analysis is my favorite in this domain, but I enjoyed Sokal & Rohlf too.


Seb



On Mon, 10 Nov 2008 16:11:47 -0500,
Brian Campbell <jacarebrazil98 at hotmail.com> wrote:

Cheers,
#
"Sebastian P. Luque" <spluque at gmail.com>
writes:
That's an important point. I should clarify that, for myself, it's not
so important to have actual R code. But the 'sums of squares' framework
presented in S&R is, or at least appears to be, at odds with the linear
model framework used in R. I would appreciate a reference that takes the
same approach as that used in R, so that I can focus on learning the
statistics.

To use S&R as written, I can read through the examples, and implement
them in low-level R code. This is tedious and inflexible. If I properly
understood the linear modelling approach used in R, I expect I could use
higher-level functions, and wouldn't have to re-implement each variation
of a test from scratch. But there's a conceptual gap between R and S&R
that I'm missing.
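
To make the gap concrete, here is a minimal sketch (my own illustration, not taken from S&R or any of the books mentioned) of a one-way ANOVA done both ways. It assumes R's built-in PlantGrowth data set; the variable names are arbitrary. The hand-filled table and the lm() fit should give the same sums of squares and F statistic.

```r
# Assumption: the built-in PlantGrowth data (weight by treatment group).
data(PlantGrowth)
y <- PlantGrowth$weight
g <- PlantGrowth$group

# "Recipe" approach: fill in the sums-of-squares table by hand.
grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)      # one mean per group
group_n     <- tapply(y, g, length)    # one sample size per group
ss_between  <- sum(group_n * (group_means - grand_mean)^2)
ss_within   <- sum((y - ave(y, g))^2)  # ave() returns each observation's group mean
df_between  <- nlevels(g) - 1
df_within   <- length(y) - nlevels(g)
f_manual    <- (ss_between / df_between) / (ss_within / df_within)

# Linear-model approach: one formula, and anova() builds the table.
fit <- lm(weight ~ group, data = PlantGrowth)
tab <- anova(fit)
print(tab)

# The two routes should agree.
print(all.equal(f_manual, tab[["F value"]][1]))
```

With the lm() version, adding a covariate or a second factor is a change to the formula rather than a new recipe.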

Cheers,

Tyler
#
While I agree with this statement in principle, I disagree in
practice.  I think one of the challenges of teaching classical (or  
any) ecological stats and analysis of experiments to new students is  
being able to allow them to begin to understand not just the specifics  
and concepts of the methods as quickly as possible, but to also begin  
to confront the real problems of the data analyst.  With a given set  
of data, how will choice of method, violation of assumptions, that  
seagull that ate half of the treatment plots, etc. really affect the  
inferences I can draw?  A large hurdle I've seen in many classes,  
regardless of the package they chose to use, is actually getting  
students to learn and then work with the software.  This has often  
involved whole separate labs or classes that, at worst, can be mere  
exercises in button pushing and far abstracted from the course material.

Written languages, such as SAS and R, take some of that away, of  
course, and having a text that at least features examples in said  
language can more seamlessly integrate the two.  After working through  
Gelman and Hill, I was struck by how the code and the conceptual text  
worked together pretty seamlessly.  At the end, I was able to emerge  
with a working understanding of the concepts of multilevel modeling,  
Bayes, etc.  More importantly, I knew I had a toolset in hand that I  
could always turn to and work with in order to carry forth my data  
analysis.

And, indeed, that _always_ is one of the advantages to R.  It's free.   
You can learn it as an undergrad, keep working with it in grad school,  
end up working at a tiny research station with no money in the middle  
of the North Sea, and you will always be able to use it.  No site  
licenses, etc., needed.  I think this is an enormous practical
advantage in the long run.  And, indeed, though it will grow and
change with time, as it is free, open-source software, there is never any danger of
it disappearing from the earth, like many favorite canned statistical  
packages of yore.

Hence, why not learn it early, and why not a good book that integrates  
concept and practice?  It's what pleased me so much about G&H as well  
as Ben's book.

Perhaps it is time for a classical statistics book for ecology that
both emphasizes the conceptual meat of the material and is integrated
with R, allowing students to really explore that meat on their own?
On Nov 10, 2008, at 1:55 PM, Sebastian P. Luque wrote:
#
My own approach is to use a fairly standard text and then use R to
illustrate/demo the implementation of the concepts.  My goal is to
balance both a conceptual understanding and an ability to implement in
R.  Of course, during the demo, I tend to tackle the data from a data
modeling frame of mind rather than a cookbook approach.  I have been
using Zar for a number of years, and this year switched to Q&K, and
have been trying to write 'labs' that use the examples from the text
for illustration.

However, I must say that this is done within the context of a 4 hour,
integrated lab/lecture graduate course.  It would be much more
difficult without the integrated lab component.
2 days later
#
Although the subject line is 'classical statistics in R', the
discussion of sums of squares leads me to believe Tyler is looking for
a book on linear models in R.  There are several relevant books on
this page ( http://cran.r-project.org/other-docs.html ), and one that
comes to mind is Julian Faraway's.

Some confusion may come from the fact that historically ANOVA and
linear regression have been treated separately, when they are each
cases of the general linear model.  I don't think it's a bad thing
that ANOVA sums-of-squares tables are still taught -- understanding
how sources of variation are partitioned at various levels, along with
associated degrees of freedom is important, no matter which software
is chosen to ultimately estimate parameters. Also it's worth noting
that even if you learn about linear models from a linear algebra
approach, which is what I am guessing Tyler means by 'the linear
models framework used in R', the code to fit models has little
resemblance to what is taught in a linear models book -- i.e. you
won't see something like solve(t(X) %*% X) %*% t(X) %*% y to estimate
the betas because there are more computationally efficient and
numerically stable methods than explicit matrix inversion and
cross-products (lm() uses a QR decomposition).
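
As a rough sketch of both points (an illustration under my own assumptions, not code from any of the books listed), the example below uses R's built-in ToothGrowth data, which has a two-level factor (supp) and a numeric covariate (dose). A single design matrix holds both, so the ANOVA-style and regression-style terms are fitted as one linear model, and the textbook formula gives the same betas as lm().

```r
# Assumption: built-in ToothGrowth data; supp is a factor, dose is numeric.
X <- model.matrix(~ supp + dose, data = ToothGrowth)  # intercept, dummy column, covariate
y <- ToothGrowth$len
print(head(X))

# Textbook normal equations with an explicit inverse (illustration only).
beta_textbook <- solve(t(X) %*% X) %*% t(X) %*% y

# Same linear system solved without forming the inverse.
beta_crossprod <- solve(crossprod(X), crossprod(X, y))

# Least squares via a QR decomposition of X, the default method in lm().
beta_qr <- qr.coef(qr(X), y)

# High-level interface: the formula builds X behind the scenes.
beta_lm <- coef(lm(len ~ supp + dose, data = ToothGrowth))

# All four routes should agree up to numerical tolerance.
print(all.equal(as.vector(beta_textbook), unname(beta_lm)))
print(all.equal(unname(beta_qr), unname(beta_lm)))
```

The factor shows up in X as a 0/1 indicator column, which is all that separates the "ANOVA" part of the model from the "regression" part.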

In general, I agree with the earlier statement that it's better to
first learn the statistics and then try to learn how to write the R
code.  My belief is that learning about AN(C)OVA-type analyses via
sums of squares tables is effective and not at odds with using R (with
one notable exception -- estimating variance components -- see the
last paragraph of the following post for my take on that:
http://tolstoy.newcastle.edu.au/R/e4/help/08/05/11410.html ).  If you
have gotten a solid grasp on the statistics, and it is the code you
are having trouble with, then hopefully Julian's book or one of the
many others will help to clarify things.

hope this helps,

Kingsford Jones
On Mon, Nov 10, 2008 at 3:45 PM, tyler <tyler.smith at mail.mcgill.ca> wrote: