
Validation of R

4 messages · Jim_Garrett@bd.com, Douglas Bates, Richard Rowe +1 more

#
Like the original poster, I'm in a corporation that interacts with the FDA
(submissions for product approval, and potential for auditing of QC
procedures).  I fully expect to be asked to validate R, in some sense,
within the year, maybe two.  I have two main comments.

First, I would be interested in participating in a small sub-project
interested in exploring this in very practical ways, such as
1.   Documenting resistance or regulatory needs R users are encountering in
this environment, offline from the r-help list,
2.   Sharing experiences (what works and what doesn't for assuaging
managers' fears), and
3.   If any further validation activities are deemed helpful (such as
additional test cases and describing what the test cases are intended to
test), making sure that these activities are fed back to the R project in a
way that others can leverage them in the future.

If you would also like to participate in this off-line discussion, I will
be happy to collect names and e-mails.  Or, if anyone has other ideas or
feels motivated to drive something, feel free to step forward.

Second, just minutes ago I raised this question with our software tester
over lunch.  She tests SAS code used to generate reports of clinical trial
results, and other software used to get clinical data into a database.  In
retrospect she is a biased sample (of size 1!) because the open-source
software model de-emphasizes the role (and value) of the professional
software tester; nonetheless I thought her comments offer a taste of the
opposition some may encounter.  I'll tell you what she said, and then I'll
offer my impressions; please don't argue with her points, because I already
did!

(A bit of background:  we have chosen not to validate SAS procedures, and
we say so in our test documentation.  In practice, I think our clinical
reporting rarely strays far from base SAS--99% of our reporting is just
manipulating and tabulating data--and that may be a reason for the
decision.)

In a nutshell, she thought SAS was more trustworthy than R (to the extent
that she thought we should test R's functions) based on two points:
1.   SAS has a team of professional software testers who spend their time
coming up with test cases that are as esoteric and odd as they can think of
(within the limits of their specifications).  She was not convinced that a
large community of users is sufficient to flush out obscure bugs.  In her
view (not surprisingly), software testers will look at software with a
unique eye.  (Which I think is true--but an army of users also does pretty
well.)
2.   SAS has a long history of quality, and their market niche requires
them to pay close attention to quality.  This distinguishes them from
Microsoft, which has little financial incentive to pay close attention to
quality, and does not have a history of quality despite a large group of
professional software testers.

She and I agreed that if one must know for certain that a particular
function works, one must test it or find documentation indicating precisely
how someone else tested it.  Fortunately R packages come with test cases,
but they're not usually test cases designed to check a large number of
possible failure mechanisms.

My take on this is as follows:
1.   There seem to be two varieties of validation involved here.  The first
provides clear assurance that a specific application does a specific thing.
This is what software validation should really be, and no software, not
even SAS, is above this.  Then there is "warm and fuzzy" validation that
offers limited assurance that the software is generally of good quality.
This is subjective, a matter of reputation, and there is no testing or
documentation that can definitively address this ill-defined criterion.  A
software package could be excellent, with only one bug, but if your
application hits that bug, you have a problem.
2.   I think this thread is mainly addressing the "warm and fuzzy"
validation model.  R is going to encounter skepticism among people who
haven't been exposed to it before, especially if they also have not been
exposed to other open-source software (OSS).  In my experience, people who
have not been involved in any software development expect corporate support
to lead to quality software ("they have resources!").  We all know this is
a fallacy, but you can't argue it away, you just have to demonstrate the
software.  When they become familiar with it, they'll stop asking for the
warm and fuzzy validation.

If my reading of the situation is correct, then the right response is to
dazzle.  The warm-and-fuzzy validation is really an opportunity for a
software demo.  Demonstrate the functions you're likely to use, especially
(following Dr. Harrell's advice) using simulation.  Then repeat the
simulation but with outliers added, and use robust methods.  Read in a CSV
file from a network drive, create some beautiful plots, save the data in
compressed format and document file size (also document the original CSV's
file size), read the data back into a concurrently-running R process and
show it's the same.  Install a particularly impressive and esoteric package
that's remotely related to your problem and document what it does.
Generate pseudorandom data using three different generators, from a given
seed, and then reproduce the data.  Calculate P(Z <= -20) for Z ~ N(0, 1),
then calculate P(Z > 20) using lower.tail = FALSE.
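Two of the steps above (the reproducible-generator exercise and the extreme tail probability) can be sketched in a few lines of base R; the generator names are the standard ones base R ships, and the rest is plain base R:

```r
## Reproducible pseudorandom data from three different generators and a
## fixed seed: re-seeding must reproduce the stream exactly.
for (kind in c("Mersenne-Twister", "Wichmann-Hill", "Marsaglia-Multicarry")) {
  set.seed(1, kind = kind)
  x1 <- runif(5)
  set.seed(1, kind = kind)
  x2 <- runif(5)
  stopifnot(identical(x1, x2))
}

## Extreme tail probability: P(Z <= -20) computed directly and as the
## matching upper tail.  Both are tiny but still distinguishable from
## zero, which is hard to get right this far from the mean.
p1 <- pnorm(-20)
p2 <- pnorm(20, lower.tail = FALSE)
stopifnot(isTRUE(all.equal(p1, p2)), p1 > 0)
```

The point of the demo is not the numbers themselves but that every step is reproducible on the spot from a script.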

You will provide only an iota of assurance that a particular future
application will work, but you will have removed all doubt that R is a
serious, rigorous, powerful package.  And that addresses the concerns that
may not be voiced, but are underlying.

-Jim Garrett
Becton Dickinson Diagnostic Systems


#
Jim_Garrett at bd.com writes, quoting a software tester at BD.com:
Does she know how many such software testers are actively involved in
testing the accuracy/reliability of statistical procedures in SAS, or
is she just assuming that there will be a large number?

People often assume that a commercial software company has legions of
programmers working on program development and testing and frequently
this is not the case.  In a typical software company there are many
more employees working on marketing, customer support, etc. than on
development and testing.

I remember when a person told me that they expected that MathSoft (now
Insightful) would have 'at least a dozen' people working on the
development of lme and nlme.  I knew that the actual number was 0
because José Pinheiro and I wrote and contributed that code, and
neither of us works for Insightful.

I'm sure that most informal guesses of the number of professional
software testers working on accuracy/reliability of statistical
procedures in SAS will be overestimates.

I'm surprised that in this discussion of validation no one has quoted
ideas from "The Cathedral and the Bazaar" by Eric Raymond
(http://www.catb.org/~esr/writings/).  He has some very perceptive
observations in that essay including the observation that bug
detection and fixing is one of the few aspects of software development
that can be parallelized (provided, of course, that those detecting
the bugs have access to the sources).  A succinct expression is that
"Given enough eyeballs, all bugs are shallow".

In that sense I think it could be said that there are a lot more
software testers working on R than on any other statistical software
system.

Another important consideration in assessing the reliability of open
source software is that the people who develop this software do so
because they are interested in it, not because it is "just a job".
This makes it much more likely that the person developing open source
software will work on getting it "right" and not just getting it ready
to ship out the door.  A person once asked me why the functions for
probability densities, cumulative distribution functions, and
quantiles in R were demonstrably better than those in commercial
software packages.  I said that it was because we had an unfair
advantage - they just have a bunch of programmers working on their
code and we have Martin (Maechler).  To the other programmers getting
good answers is a job requirement; to Martin getting the best possible
answer is a passion.
#
OK - to hard reality.

R has become mainstream among practitioners BECAUSE IT IS 
GOOD.  Practitioners have been voting with their feet/time for years, but 
with recent publicity the tide is becoming a flood.

At some stage we (as in the R community, not the over-worked core) are 
going to have to do something to 'protect' our members in the commercial 
community (and with the push of 'accountability' and its legions of 
analphabet clerks into the academic/research community soon the rest of us).

I suggest that those interested in 'validation' form a group and set about 
systematically 'validating' R processes.
Prebuilt 'evil' datasets (like Anscombe's quartet) and simulation using a
range of different pseudorandom generators are probably the best way to
generate test data.  I once attended a lecture by John Tukey where he
described the 'tests' a measure should be put through with respect to input
structures (I remember one was characterised as a 'rabbit punch', another as
a 'knee-in-the-groin'); such a repertoire of exercises could be put in
place.  Testers need to
recognise that standard functions can be exposed to the weirdest 
distributions when they are called as an intermediate step in another 
calculation.
Code needs to do exactly what it is documented to do, and to squawk loudly 
when asked to do what it doesn't.
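Anscombe's quartet mentioned above ships with base R as the `anscombe` data frame and makes the point concretely; a short sketch (plain base R) showing that all four regressions agree to two decimal places even though the datasets are structurally very different:

```r
## Fit y_i ~ x_i for each of the four Anscombe datasets.
data(anscombe)
fits <- lapply(1:4, function(i)
  lm(as.formula(sprintf("y%d ~ x%d", i, i)), data = anscombe))
coefs <- sapply(fits, coef)

## Intercepts are all ~3.00 and slopes all ~0.50 ...
stopifnot(all(abs(coefs[1, ] - 3.0) < 0.01),
          all(abs(coefs[2, ] - 0.5) < 0.01))
## ... yet plotting each pair (e.g. plot(anscombe$x2, anscombe$y2))
## reveals four very different structures: a validation suite that
## checks only summary statistics would wave all four through.
```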

I am sure those who have built so much code would really appreciate getting 
a note from the validating group confirming it has been tested and hasn't 
broken down, or else getting a note documenting exactly where and how code 
doesn't work as expected ...

R is open source.  We all have access to the code.  We could also have open 
source published test datasets and outcomes ... which would actually 
present a challenge to the COTS industry to match.

If it is to happen then someone who feels strongly about this needs to get 
the ball rolling, and there would probably need to be a sympathetic conduit 
into/from the core.

For QA purposes the 'testers' will need to be independent of the 
'producers' ...

Richard Rowe
Senior Lecturer
Department of Zoology and Tropical Ecology, James Cook University
Townsville, Queensland 4811, Australia
fax (61)7 47 25 1570
phone (61)7 47 81 4851
e-mail: Richard.Rowe at jcu.edu.au
http://www.jcu.edu.au/school/tbiol/zoology/homepage.html
2 days later
#
I think there may be some exaggeration of how much effort and co-ordination is
needed in order to "validate" R (at least in a non-official sense). The QA
tools already in R are incredibly good. What is needed is for people to
actually use them. If you make a package with code in the tests directory
and in that code
you compare results with known results, and stop() if there is an error, then
the package will fail to build and will produce an error message indicating a
problem. Furthermore, the QA tools for checking documentation are exceptional.
If you make the package interesting enough that others may want to use it, and
submit it to CRAN, then the tests are run as part of the development cycle (I
believe) so the feedback to R core is automatic (although debugging may get
bounced back to you, especially if the problem is your code and not R itself).
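A minimal sketch of what such a tests-directory file can look like (the file name and the "known result" here are invented for illustration; the mechanism, an error from stop() failing the package check, is the standard one):

```r
## Hypothetical contents of tests/check-var.R in a package.  R CMD check
## runs every .R file under tests/; any error raised here (including one
## from stop()) makes the check fail, so comparisons against known
## results become automatic quality gates.
x <- c(1, 2, 3, 4, 5)
known <- 2.5                     # hand-computed sample variance of 1:5
got <- var(x)
if (!isTRUE(all.equal(got, known)))
  stop("var() disagrees with the known result: got ", got)
```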

For tests which may not be of special interest to others, you can set this up
yourself to run automatically and indicate only when there is a problem. In
addition to the tests in my packages on CRAN I have tests that I run myself for
days. These do simulations, estimations, compare results with known results, or
at least previous results, and do calculations multiple ways to test that
results are the same (for example, that the roots of a state space model are the
same as the roots of an equivalent ARMA model). I run about six hours of these
regularly on Linux and on Solaris with a few R release candidates and try to run
the whole suite at least once before a new release. This does not take any
"hands on time," it just takes computer time. On Linux I start it before going
to work (R 1.7.0beta was being released in the morning, my time) and the main
part is done when I get home. The hands-on time is to devise meaningful,
comprehensive tests (and to debug when there are problems).
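The "compute it multiple ways" idea generalises beyond state-space models. As a small base-R illustration (the polynomial is invented for the example), the roots of a polynomial found by polyroot() should match the eigenvalues of its companion matrix, two genuinely different algorithms arriving at the same answer:

```r
## 6 - 5z + z^2 has roots 2 and 3; polyroot() takes coefficients in
## increasing order of power.
p <- c(6, -5, 1)
r1 <- sort(Re(polyroot(p)))

## Companion matrix of the same (monic) polynomial: its eigenvalues
## are the polynomial's roots, computed by an unrelated algorithm.
comp <- matrix(c(0, -p[1] / p[3],
                 1, -p[2] / p[3]), nrow = 2)
r2 <- sort(Re(eigen(comp)$values))

## Two independent routes, one answer.
stopifnot(isTRUE(all.equal(r1, r2)))
```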

There may be less work involved in doing (un-official) validation than there is
in advertising how much is actually being done. Perhaps the simplest approach is
for individuals to put together packages of tests with descriptions that explain
the extent of the testing which is done, and then submit the packages to CRAN.

Paul Gilbert
Head Statistician/Statisticien en chef, 
Department of Monetary and Financial Analysis, 
     /Département des études monétaires et financiers, 
Bank of Canada/Banque du Canada