
SAS or R software

15 messages · Alexander C Cambon, Henric Nilsson, Jonathan Baron +6 more

#
I apologize for adding this so late to the "SAS or R software" thread.
This is a question, not a reply, but it seems to me to fit well with
the subject of this thread.

I would like to hear about anyone's experiences in the two areas
below.  I should add that I have no experience myself in these areas:

1) Migrating from SAS to R as the statistical software used
for FDA reporting.

 (For example, was there more effort involved in areas of
documentation, revision tracking, or validation of software code?)

2) Migrating from SAS to R as the statistical software used
for NIH reporting (or reporting to other US or non-US government agencies).

I find myself using R more and more and being continually amazed by its
breadth of capabilities, though I have not tried ordering pizza yet. I
use SAS, S-Plus, and, more recently, R for survival analysis and
recurrent events in clinical trials.

Alex Cambon
Biostatistician
School of Public Health and Information Sciences
University of Louisville
#
Alexander C Cambon wrote:
The FDA has no requirements.  They accept Minitab and even Excel.
The requirement is to be a good statistician doing quality, reproducible
work for its own sake.
No issues.

Frank

#
Alexander C Cambon wrote:
This brings up a question that I have often asked but have never had 
answered.  If someone asks me if R is "validated" I usually respond "by 
whom and for what?".  There seems to be a belief that the FDA validates 
software as acceptable for use in the analysis of data for a submission 
to the FDA.  However, I have never met anyone who can describe to me 
exactly what this entails.  So I can't say whether R is "validated" because 
I don't know what that means.

As I understand it the FDA does not certify or validate software as 
providing "correct" or acceptable answers.  I have been told that what 
the FDA requires is that the software used to produce the results quoted 
in a submission should be auditable.  That is, the FDA must be able to 
check exactly how the numerical results were produced, should they wish 
to do so.  This can be tricky for proprietary software because typically 
the group making the submission does not have access to the source code 
so there has to be a delicate three-way negotiation on the extent to 
which the software vendor will reveal their source code.  However, 
revealing source code is not a difficult issue in the open source world. 
Representatives of the FDA (or anyone else, for that matter) can read 
the source code any time they want to.  In fact they are encouraged to 
do so.

So if the standard is "auditable" I don't think you get much more 
auditable than R is.
#
On Fri, 2004-12-17 at 17:11 -0500, Alexander C Cambon wrote:
You will find that to be a non-issue from the FDA's perspective. This
has been discussed here with some frequency.  If you search the archives
you will find comments from Frank Harrell and others.

The FDA does not and cannot endorse a particular software product. Nor
does it validate any statistical software for a specific purpose. They
do need to be able to reproduce the results, which means they need to
know what software product was used, which version and on what platform,
etc.

The SAS XPORT Transport Format (which is openly defined and documented)
has been used for the transfer of data sets and has been available in
many statistical products.

There have been a variety of activities (CDISC, HL-7, etc) regarding the
electronic submission of data to the FDA. Some additional information is
here:

http://www.fda.gov/cder/regulatory/ersr/default.htm

and here:

http://www.cdisc.org/news/index.html

Any other issues impacting the selection of a particular statistical
application are more likely to be political issues within your working
environment, and FUD.

As you are likely aware, other statistically relevant issues are
contained in various ICH guidance documents regarding GCP considerations
and principles for clinical trials:

http://www.ich.org/UrlGrpServer.jser?@_ID=475&@_TEMPLATE=272


Keep in mind also that one big advantage R has (in my mind) is the use
of Sweave for the reproducible generation of reports, which to an extent
are self-documenting.
Since the FDA's role with computer software and validation has been
raised before, the following documents cover many of these areas. The
list is not meant to be exhaustive, but should give a flavor in this
domain.

There are specific guidance documents by the FDA pertaining to software
that is contained in a medical device (i.e., the firmware in a pacemaker
or medical monitoring equipment) or is used to develop a medical device.
The current guidance in this case is here:

http://www.fda.gov/cdrh/comp/guidance/938.html

Other guidance pertains to 21 CFR 11, which addresses data management
systems used for clinical trials and covers issues such as electronic
signatures, audit trails and the like. A guidance document for that is
here:

http://www.fda.gov/cder/guidance/5667fnl.htm

Keep in mind that even MS Excel and Access can be made to be 21 CFR 11
compliant, and there are companies whose business is focused on just
that task.

There is also a general guidance document for computer systems used in
clinical trials here:

http://www.fda.gov/ora/compliance_ref/bimo/ffinalcct.htm

Though it is to be superseded by a draft document here:

http://www.fda.gov/cder/guidance/6032dft.htm
Same here to my knowledge.

As I was typing this, I see Frank just responded.

I also just noted Doug's post, so perhaps some of the above information
will be helpful in clarifying some of his questions as well.

I believe that the above is factually correct, but if someone knows
anything to not be so, please correct me.

HTH,

Marc Schwartz
#
Marc Schwartz said the following on 2004-12-18 01:19:
ICH E9 states that (p. 27):
"The computer software used for data management and statistical analysis 
should be reliable, and documentation of appropriate software testing 
procedures should be available."

Some commercial software vendors (SAS, Insightful, and StatSoft) offer 
white papers stating that their software can work within a 21 CFR Part 
11 compliant system.

http://www.sas.com/industry/pharma/develop/papers.html

http://www.insightful.com/industry/pharm/21cfr_part11_Final.pdf

http://www.statsoft.com/support/whitepapers/pdf/STATISTICA_CFR.pdf

Some commercial vendors (SAS and Insightful) also offer tools for 
validation of the installation and operation of the software. SAS has

http://support.sas.com/documentation/installcenter/common/91/ts1m3/qualification_tools_guide.pdf

and S-PLUS has validate().

As a statistical consultant working within the pharmaceutical industry, 
I think that our clients regard the white papers as some kind of 
quality seal. They signal that someone has actually thought about the 
issues involved, written a document about them, and even stated that it 
can be done. Of course, there's a lot of FUD going on here. But if our 
lives can be made simpler by producing similar white papers and QA 
tools, why not?

(But for some people, only SAS will do:
Last week we were audited on behalf of a client. One of the specific 
issues discussed was the validation and Part 11 compliance of S-PLUS. 
In this specific trial, data are to be transferred from Oracle Clinical 
-> SAS -> S-PLUS, and the auditors were really worried about the first 
and last links of that chain. Finally, they suggested using only SAS... 
And in this particular case, Part 11 is really a non-issue since 
physical records exist (i.e., case report forms) and all final S-PLUS 
output and code will also be stored physically (i.e., as print-outs) -- no 
need for electronic signatures here!)
From the introduction (p. 2):
"This document provides guidance about computerized systems that are 
used to create, modify, maintain, archive, retrieve, or transmit 
clinical data required to be maintained and/or submitted to the Food and 
Drug Administration (FDA)"

The `retrieve' part is certainly applicable. If we regard R as 
off-the-shelf software, the guidance says (p. 11):
"For most off-the-shelf software, the design level validation will have 
already been done by the company that wrote the software. Given the 
importance of ensuring valid clinical trial data, FDA suggests that the 
sponsor or contract research organization (CRO) have documentation
(either original validation documents or on-site vendor audit documents) 
of this design level validation by the vendor and would itself have 
performed functional testing (e.g., by use of test data sets) and 
researched known software limitations, problems, and defect corrections. 
Detailed documentation of any additional validation efforts performed by 
the sponsor or CRO will preserve the findings of these efforts.

In the special case of database and spreadsheet software that is: (1) 
purchased off-the-shelf, (2) designed for and widely used for general 
purposes, (3) unmodified, and (4) not being used for direct entry of 
data, the sponsor or contract research organization may not have 
documentation of design level validation. FDA suggests that the sponsor 
or contract research organization perform functional testing (e.g., by 
use of test data sets) and research known software limitations,
problems, and defect corrections.

In the case of off-the-shelf software, we recommend that the following 
be available to the Agency on request:

* A written design specification that describes what the software is 
intended to do and how it is intended to do it;

* A written test plan based on the design specification, including both 
structural and functional analysis; and

* Test results and an evaluation of how these results demonstrate that 
the predetermined design specification has been met."

I think the guidance is quite clear here. We must demonstrate to the FDA, 
on request, that the software used is working properly. In order to do 
this, we seem to need documents describing the development process and 
the QA tools used by R Core. An idea of what we'll need may be found in 
`Computer Systems Validation in Clinical Research - A Practical 
Guide (Edition 1)' at

http://www.acdm.org.uk/public/publications/publications.htm

Especially sections 2.4, 5 (plus subsections), 8 (plus subsections), and 
9.7 (plus subsections) seem relevant. (I've ordered the 2nd edition, but 
it hasn't arrived yet.)


Henric
#
Marc Schwartz wrote:
In addition to the excellent points made by Marc, Doug, and Matt, I want 
to expand on the revision tracking point originally raised by Alexander. 
We use CVS for all pharmaceutical industry work.  Besides allowing the two 
statisticians working on each project to mirror each other's data and 
code (for backup when one is out and a pressing question is asked), the 
revision control and commented change tracking of CVS have proven to work 
incredibly well in this arena.

The one area where we use SAS for pharmaceutical industry work is 
running SAS PROC EXPORT to convert data to CSV format for importing with 
the Hmisc package's sasxport.get function (see 
http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/SASexportHowto). 
We found that reading binary SAS transport format datasets in R or with 
Stat/Transfer was not reliable enough.  We have a freely available SAS 
macro that runs PROC EXPORT in a loop to get all datasets in a data 
library, with metadata.  That way any SAS exporting errors can be blamed 
on SAS.  Ironically there is a bug in PROC EXPORT.  When a character 
field has an unmatched quote in it, the CSV file can result in an odd 
number of quotes for the field.  sasxport.get checks the number of 
records imported against the number reported by PROC CONTENTS, so this 
problem is easily detected and corrected with emacs.
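The symptom described here (an odd number of quotes in a field) is easy to
scan for before import. A minimal sketch in Python -- not the Hmisc code,
just an illustration of the kind of pre-import check one might run:

```python
def odd_quote_lines(lines):
    """Return indices of physical CSV lines containing an odd number
    of double quotes -- the symptom of the PROC EXPORT bug described
    above (an unmatched quote inside a character field)."""
    return [i for i, line in enumerate(lines) if line.count('"') % 2 == 1]

rows = [
    'id,comment',
    '1,"a normal field"',
    '2,"field with an unmatched " quote"',  # 3 quotes: suspect
    '3,"fine again"',
]
print(odd_quote_lines(rows))  # [2]
```

Any line flagged this way will confuse a CSV parser about where the record
ends, which is exactly why the record-count cross-check against PROC CONTENTS
catches the problem.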

Note that with literally billions of dollars at their disposal, SAS 
didn't take the time to really write a procedure for PROC EXPORT.  Like 
the R sas.get function, it generates voluminous SAS DATA step code to do 
the work.

Regarding CDISC, the SAS transport format that is now accepted by the FDA is 
deficient: there is no place for certain metadata (e.g., units of 
measurement), value labels are remote from the datasets, and variable names 
are truncated to 8 characters.  The preferred format for CDISC will 
become XML.
#
Henric Nilsson wrote:
...
That is not clear.  And since the FDA allows submissions using Excel, with 
not even an audit trail, and with known major statistical computing 
errors in Excel, I am fairly certain that it is not applicable or at the 
least is not enforced in any meaningful way.
#
Frank E Harrell Jr said the following on 2004-12-18 15:03:
Perhaps. And I think this is the issue. From the clients' perspective, 
not a single FDA document states that you can use software other than 
SAS. They haven't really thought about the fact that there aren't any FDA 
documents encouraging the use of SAS for statistical analyses either.

I don't think that the real problem is convincing regulatory authorities 
that R (or any other (open-source) software, for that matter) is 
operating adequately. But clients and auditors seem to reason along the 
lines of "better safe than sorry" and "nobody's ever been criticized 
for using SAS". From their perspective, when we propose using `some 
other' software they start thinking that it may jeopardize their 
trial results (and, all too often, "but doesn't the FDA require SAS?").

How to fight this? I don't know. Right now I'm thinking, "If you can't 
beat 'em, join 'em", and that the way to prove that `some other' 
software works is by having documents and tools similar to those of the 
commercial vendors.
The general preconception seems to be that neither SAS nor Excel needs 
validation. E.g. the British guideline referenced in my previous email 
states on p. 12 that
"It is generally considered that there is no requirement for validation 
of commercial hardware and established operating systems or for packages 
such as the SAS system, Oracle and MS Excel, as entities in their own 
right. However, most are configurable systems and so need adequate 
control of installation and their configuration parameters."

Luckily for Excel, not a single word about precision and adequacy...


Henric
#
Frank E Harrell Jr wrote:
... much discussion deleted ...
Since you brought up the SAS XPORT data format I have to respond with my 
usual rant about it.

<rant>
When it comes to the SAS XPORT data format, those are at best third- or 
fourth-order deficiencies in the metadata.  The first-order deficiency 
in the metadata is that it does not contain the number of records in a 
data set.  In this format a file can contain more than one data set, and 
a data set consists of an unknown number of fixed-length records. 
Because of the potential of more than one data set, you can't just read 
to the end of the file or use the file size and the record size to 
calculate the number of records.  You must read through the file 
examining each group of 80 characters (Why 80 characters?  Those of us 
who remember punched cards can tell you why.) and for each such group 
try to determine whether it is the beginning of another record in the 
current data set or the beginning of a new data set.  How is the 
beginning of a new data set indicated?  By a magic string of characters. 
What if, either perversely or accidentally, this magic string of 
characters were included as a text field at the beginning of a record? 
You wouldn't be able to tell if you have a new record or a new data set.

Even better than that, there are situations in which the number of 
records in a data set is not well-defined, due to the requirement of 
padding the last 80-character group with blanks.  (After all, when you 
create a punch card deck from your data set you want to get an integer 
number of punched cards.)  For example, if you are writing an odd number 
of 40-character records then you must pad the last 80-character group 
with blanks.  When reading this data set, how can you distinguish the odd 
number of records padded with blanks from an even number of records in 
which the last record happened to be all blanks?  You can't.

When I first encountered this, I thought that I must not understand the 
format properly.  I thought that SAS (and, through SAS, the FDA) 
couldn't really be using a format in which the number of records in a 
data set can be ambiguous.  This would mean that the operations of 
writing the XPORT data set and reading it back are not guaranteed to be 
inverses.  I started reading material on the SAS web site and discovered 
that SAS indeed was aware of this problem and had a solution: users 
should not create data sets that exhibit this ambiguity.  That's it. 
Their solution is "don't do that".
</rant>
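The padding ambiguity in the rant is easy to reproduce. A sketch in Python
(the 40-byte record length is a hypothetical choice for illustration; the
80-byte "card image" and blank padding are the format's actual rules):

```python
RECLEN = 40  # hypothetical fixed record length for this data set
BLOCK = 80   # XPORT data is written in 80-byte card images

def write_records(records):
    """Concatenate fixed-length records and blank-pad the result to an
    80-byte boundary, as the XPORT format requires."""
    data = b"".join(records)
    return data + b" " * ((-len(data)) % BLOCK)

three = [b"A" * RECLEN, b"B" * RECLEN, b"C" * RECLEN]
four = three + [b" " * RECLEN]  # a fourth record that is all blanks

# Both data sets serialize to byte-identical output, so a reader
# cannot recover the true record count from the bytes alone.
print(write_records(three) == write_records(four))  # True
```

Three 40-byte records occupy 120 bytes and are padded with 40 blanks; four
records where the last is all blanks occupy the same 160 bytes exactly.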

I think that replacing the SAS XPORT data format with XML will be a step 
forward.
#
Henric Nilsson wrote:
Right.  This reminds me of the worst movie of all time, Plan 9 From 
Outer Space, in which the narrator Criswell closes the movie by asking 
"Can you prove that this DIDN'T happen?".
Yes, that is the hurdle.
With the job market for statisticians being excellent, I've often 
wondered why clinical statisticians in industry are so often timid. 
Statisticians need to show strength and stamina, along with good 
teaching skills, on this issue.
This makes me wonder about the British system.  Have they not seen the 
serious calculation errors documented to be in Excel?
Right.  Thanks for your note Henric -Frank

#
On 19-Dec-04 Frank E Harrell Jr wrote:
Because, I fear (and I don't have good documentation on it, but I
do have quite a strong impression), such is not what their
employers and managers see as their role and function.

Others may wish to comment ...

Best wishes to all,
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
Date: 19-Dec-04                                       Time: 10:00:48
------------------------------ XFMail ------------------------------
#
All good points; in my current organization there seem to be 3 hurdles
that need to be crossed.  Most are internal issues, but all are related to
a conservative interpretation of Part 11.

1. Qualifications: installation, operational, and performance.  R
clearly satisfies the first and third; for the second, perhaps
someone in R Core or similar (i.e., a consultant, etc.) needs to provide
the OQ.

2. Statistical results as derived variables (i.e., data).  If so, then
Part 11 can apply; if not, it might not.

3. Removing the "Open Source" moniker (which gets legal people really
upset) and treating R as quality vendor-supplied code under a novel
licensing scheme which has source available and for which a business
case can be made.  Back in the old days (i.e., when I was in high
school in the 80s), our school minicomputers had source for the OSs
available, and for most critical vendor- or contractor-supplied
software, we had source.  In fact, it was standard!

Anyway, I'm slowly working on these issues internally.  At some point,
there will be a breakthrough at one pharma, making it easier for the
rest.  Right now my issue is how to deal with Clinical QA; the
equivalent group is, I'm sure, a nightmare bureaucracy to work through
at most large pharmas.

best,
-toniy



On Sat, 18 Dec 2004 14:10:40 +0100, Henric Nilsson
<henric.nilsson at statisticon.se> wrote:

#
R folks:

I appreciate and have learned from the recent "SAS vs R" and "Bad Excel
Calculations" threads. Not only civil, but even at times erudite,
discussion. So I apologize for the lateness of this remark and hope it isn't
redundant or trivial.

To those who may wonder why SAS is so dominant in the clinical arena despite
(better) alternatives: INERTIA. That is:

1) There is a huge infrastructure of SAS code already in place for
regulatory submissions, and SAS programmers to maintain and enlarge it. As a
practical matter, it is hard to imagine a large organization simply chucking
this and starting afresh. Clearly, change -- if it were to occur at all --
would have to be slow and incremental.

2) From my experience at presentations by recent biostatistics PhDs, for
most, their education continues to promulgate the use of SAS in
clinical/regulatory settings, undoubtedly due to 1).

3) As has already been noted, most existing FDA regulators -- statisticians
and clinicians alike -- are familiar with SAS, and therefore submissions
with other software (like R) might delay or complicate the review process.
We statisticians are not the biggest dogs in this arena, after all.

Reality bites! So R users must persevere.

-- Bert Gunter
#
On Mon, 2004-12-20 at 10:38 -0800, Berton Gunter wrote:
Since the notion of inertia was raised by Bert, for those interested in
at least one theory on the adoption of technology and product life
cycles (if one considers R as a software technology), the book "Crossing
the Chasm" by Geoffrey Moore might be of interest.

The Amazon.com link is:

http://www.amazon.com/exec/obidos/tg/detail/-/0066620023

and a very brief Wikipedia overview is here, with a diagram:

http://en.wikipedia.org/wiki/Crossing_the_Chasm

In many respects, the general and increasing adoption of open source
applications fits the theory well. One might consider the growth of
Linux and more recent specific examples of applications such as Firefox
and Thunderbird as replacements for Internet Explorer and Outlook
Express (anybody see the two-page Firefox ad in the New York Times?).

The potential impact of this particular theory, with respect to change,
was importantly noted when the National Academy of Sciences' Institute
of Medicine published a book as part of their Health Care Quality
Initiative, calling it "Crossing the Quality Chasm: A New Health System
for the 21st Century":

http://www.iom.edu/focuson.asp?id=8089

Another book, which I think dovetails with Moore's book, is "Only the
Paranoid Survive" by Andy Grove, the Chairman of the Board at Intel. In
some cases, the catalyst for crossing the chasm might be a shift in
marketplace dynamics, which sees a market leader falter when they fail
to effectively react to the shift, enabling a new company, technology or
product to take the leadership position.

Grove calls these situations "strategic inflection points", with a
meaning taken from the mathematical term. If the company properly reacts
to the shift, it experiences new positive growth, possibly under a
substantially altered business model. If it fails to react, it begins
a slide downhill, possibly never to recover or regain its dominance.

The Amazon.com link for Grove's book is:

http://www.amazon.com/exec/obidos/tg/detail/-/0385483821

HTH,

Marc Schwartz