
Tools For Preparing Data For Analysis

29 messages · Robert Duval, Frank E Harrell Jr, Douglas Bates +15 more

Messages 1–25 of 29

#
As noted on the R-project web site itself ( www.r-project.org ->
Manuals -> R Data Import/Export ), it can be cumbersome to prepare
messy and dirty data for analysis with the R tool itself. I've also
seen at least one S programming book (one of the yellow Springer ones)
that says, more briefly, the same thing.
The R Data Import/Export manual gives examples using SAS, Perl,
Python, and Java. It takes a bit of courage to say that (when you go
to a corporate software web site, you'll never see a page saying "This
is the type of problem that our product is not the best at; here's
what we suggest instead"). I'd like to provide a few more
suggestions, especially for volunteers who are willing to evaluate new
candidates.

SAS is fine if you're not paying for the license out of your own
pocket. But maybe one reason you're using R is you don't have
thousands of spare dollars.
Using Java for data cleaning is an exercise in sadomasochism: Java
has a learning curve (almost) as steep as C++'s.

There are different types of data transformation, and for some data
preparation problems an all-purpose programming language is a good
choice (e.g. Perl, or maybe Python or Ruby). Perl, for example, has
excellent regular expression facilities.

However, for some types of complex demanding data preparation
problems, an all-purpose programming language is a poor choice. For
example: cleaning up and preparing clinical lab data and adverse event
data - you could do it in Perl, but it would take way, way too much
time. A specialized programming language is needed. And since data
transformation is quite different from data query, SQL is not the
ideal solution either.

There are only three statistical programming languages that are
well-known, all dating from the 1970s: SPSS, SAS, and S. SAS is more
popular than S for data cleaning.

If you're an R user with difficult data preparation problems, frankly
you are out of luck, because the products I'm about to mention are
new, unknown, and therefore regarded as immature. And while the
founders of these products would be very happy if you kicked the
tires, most people don't like to look at brand new products. Most
innovators and inventors don't realize this; I've learned it the hard
way.

But if you are a volunteer who likes to help out by evaluating,
comparing, and reporting upon new candidates, well you could certainly
help out R users and the developers of the products by kicking the
tires of these products. And there is a huge need for such volunteers.

1. DAP
This is an open source implementation of SAS.
The founder: Susan Bassein
Find it at: directory.fsf.org/math/stats (GNU GPL)

2. PSPP
This is an open source implementation of SPSS.
The relatively early version number might not give a good idea of how
mature the data transformation features are; it reflects the fact that
the author has only started on the statistical tests.
The founder: Ben Pfaff, either a grad student or a professor in the Stanford CS dept.
Also at : directory.fsf.org/math/stats (GNU GPL)

3. Vilno
This uses a programming language similar to SPSS and SAS, but quite unlike S.
Essentially, it's a substitute for the SAS datastep, and also
transposes data and calculates averages and such. (No t-tests or
regressions in this version). I created this, during the years
2001-2006 mainly. It's version 0.85, and has a fairly low bug rate, in
my opinion. The tarball includes about 100 or so test cases used for
debugging - for logical calculation errors, but not for extremely high
volumes of data.
The maintenance of Vilno has slowed down, because I am currently
(desperately) looking for employment. But once I've found new
employment and living quarters and settled in, I will continue to
enhance Vilno in my spare time.
The founder: that would be me, Robert Wilkins
Find it at: code.google.com/p/vilno ( GNU GPL )
( In particular, the tarball at code.google.com/p/vilno/downloads/list
, since I have yet to figure out how to use Subversion ).


4. Who knows?
It was not easy to find out about the existence of DAP and PSPP. So
who knows what else is out there. However, I think you'll find a lot
more statistics software (regression, etc.) out there, and not so
much data transformation software. Not many people work on data
preparation software. In fact, the category is so obscure that there
isn't one agreed term: data cleaning, data munging, data crunching,
or just getting the data ready for analysis.
#
An additional option for Windows users is Micro Osiris

http://www.microsiris.com/

best
robert
On 6/7/07, Robert Wilkins <irishhacker at gmail.com> wrote:
#
Robert Wilkins wrote:
We deal with exactly those kinds of data solely using R.  R is 
exceptionally powerful for data manipulation, just a bit hard to learn. 
  Many examples are at 
http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf

Frank

  
    
#
On 08-Jun-07 08:27:21, Christophe Pallier wrote:
I want to join in with an enthusiastic "Me too!!". For anything
which has to do with basic checking for the kind of messes that
people can get data into when they "put it on the computer",
I think awk is ideal. It is very flexible (far more so than
many, even long-time, awk users suspect), very transparent
in its programming language (as opposed to, say, perl), fast,
and with light impact on system resources (a rare delight
these days, when upgrading your software may require upgrading
your hardware).

Although it may seem on the surface that awk is "two-dimensional"
in its view of data (line by line, and per field in a line),
it has some flexible internal data structures and recursive
function capability, which allows a lot more to be done with
the data that have been read in.

For example, I've used awk to trace ancestry through a genealogy,
given a data file where each line includes the identifier of an
individual and the identifiers of its male and female parents
(where known). And that was for pedigree dogs, where what happens
in real life makes Oedipus look trivial.
But then it is a good idea to process the binary file using an
instance of the creating software, to produce an ASCII file (say
in CSV format).
The main thing often useful for data cleaning that awk does
not have is any associated graphics. It is -- by design -- a
line-by-line text-file processor. While, for instance, you
could use awk to accumulate numerical histogram counts, you
would have to use something else to display the histogram.
And for scatter-plots there's probably not much point in
bringing awk into the picture at all (unless a preliminary
filtration of mess is needed anyway).

That being said, though, there can still be a use for it: extracting
data fields from a file for submission to other software.

Another kind of area where awk would not have much to offer
is where, as a part of your preliminary data inspection,
you want to inspect the results of some standard statistical
analyses.

As a final comment, utilities like awk can be used far more
fruitfully on operating systems (the unixoid family) which
incorporate at ground level the infrastructure for "plumbing"
together streams of data output from different programs.

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 08-Jun-07                                       Time: 10:43:05
------------------------------ XFMail ------------------------------
#
On 6/7/07, Robert Wilkins <irishhacker at gmail.com> wrote:
Thanks for bringing up this topic.  I think there is definitely a
place for such languages, which I would regard as data-filtering
languages, but I also think that trying to reproduce the facilities in
SAS or SPSS for data analysis is redundant.

Other responses in this thread have mentioned 'little language'
filters like awk, which is fine for those who were raised in the Bell
Labs tradition of programming ("why type three characters when two
character names should suffice for anything one wants to do on a
PDP-11") but the typical field scientist finds this a bit too terse to
understand and would rather write a filter as a paragraph of code that
they have a chance of reading and understanding a week later.

Frank Harrell indicated that it is possible to do a lot of difficult
data transformation within R itself if you try hard enough but that
sometimes means working against the S language and its "whole object"
view to accomplish what you want and it can require knowledge of
subtle aspects of the S language.

General scripting languages like Perl, Python and Ruby can certainly
be used for data filtering but that means learning the language and
its idiosyncrasies, and those idiosyncrasies are often exactly the
aspects that would be used to write a filter tersely.  Readability
suffers.  ("Hell is reading someone else's Perl code - purgatory is
reading your own Perl code.")  The very generality of the languages
means there is a lot to learn and understand before you can write
something like a simple filter.

So I do agree that it would be useful to have a language like the SAS
data step (but Open Source, of course) in which to write a data
filter.  I have one suggestion to make - use the R data frame
structure in the form of a .rda file as the binary output format for a
data table.  That way the user can get the best of both worlds by
using a language like Vilno to manipulate and rearrange huge data
files then switching to R for the graphics and data analysis.  As a
further enhancement one might provide the ability to take a .rda file
that contains a single data frame and select columns or rows,
including a random sample of the rows, as a filter.
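
A minimal sketch of that .rda hand-off (file, object, and column names
are invented for illustration):

# A filtering tool writes a data frame to cleaned.rda; R then picks it
# up for graphics and analysis, selecting columns and rows as needed.
cleaned <- data.frame(id = 1:1000, dose = runif(1000), lab = rnorm(1000))
save(cleaned, file = "cleaned.rda")       # binary output of the filter step
load("cleaned.rda")                       # back into R
subset(cleaned, select = c(id, lab))      # pick columns
cleaned[sample(nrow(cleaned), 100), ]     # a random sample of 100 rows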

Producing an R data frame may involve passing over the data twice,
once to determine the size of the resulting structure and the second
time to evaluate the data itself.  This would have been a horrific
penalty in the days that SAS and SPSS were developed but not now.
#
I had mentioned exactly the same thing to others and the feedback I got is -
'when you have a hammer, everything will look like a nail'
^_^.
On 6/7/07, Frank E Harrell Jr <f.harrell at vanderbilt.edu> wrote:

  
    
#
Is there an example available of this sort of problematic data that
requires this kind of data screening and filtering? For many of us, it
would be nice to learn about this issue and deal with it within R. If a
package could be created, that would be optimal for some of us. I
would like to learn a tad more, if it's not too much effort for
someone to point me in the right direction.
Cheers,
Hank
On Jun 8, 2007, at 8:47 AM, Douglas Bates wrote:

            
Dr. Hank Stevens, Assistant Professor
338 Pearson Hall
Botany Department
Miami University
Oxford, OH 45056

Office: (513) 529-4206
Lab: (513) 529-4262
FAX: (513) 529-4243
http://www.cas.muohio.edu/~stevenmh/
http://www.muohio.edu/ecology/
http://www.muohio.edu/botany/

"E Pluribus Unum"
#
Martin Henry H. Stevens sent the following  at 08/06/2007 15:11:
... rest snipped ...

OK, I can't resist that invitation.  I think there are many kinds of
problematic data.  I handle some nasty textish things in perl (and I
loved the purgatory quote) and I'm afraid I do some things in Excel and
some cleaning I can handle in R, but I never enter data directly into R.

However, one very common scenario I have faced all my working life is
psych data from questionnaires or interviews in low budget work, mostly
student research or routine entry of therapists' data.  Typically you
have an identifier, a date, some demographics and then a lot of item
data.  There's little money (usually zero) involved for data entry and
cleaning but I've produced a lot of good(ish) papers out of this sort of
very low budget work over the last 20 years.  (Right at the other end of
a financial spectrum from the FDA/validated s'ware thread but this is
about validation again!)

The problem I often face is that people are lousy data entry machines
(well, actually, they vary ... enormously) and if they mess up the data
entry we all know how horrible this can be.

SPSS (boo hiss) used to have an excellent "module", actually a
standalone PC/Windoze program, that allowed you to define variables so
they had allowed values, and it would refuse to accept out-of-range or
otherwise unacceptable entries. It also allowed you to create checking rules
and rules that would, in the light of earlier entries, set later values
and not ask about them.  In a rudimentary way you could also lay things
out on the screen so that it paginated where the q'aire or paper data
record did etc.  The final nice touch was that you could define some
variables as invariant and then set the thing so an independent data
entry person could re-enter the other data (i.e. pick up q'aire, see if
ID fits the one showing on screen, if so, enter the rest of the data).
It would bleep and not move on if you entered a value other than that
entered by the first person and you had to confirm that one of you was
right.

That saved me, I'm sure, weeks wasted on analysing data that turned out
to be awful, and I'd love to see someone build something to replace it.

Currently I tend to use (boo hiss) Excel for this as everyone I work
with seems to have it (and not all can install open office and anyway I
haven't had time to learn that properly yet either ...) and I set up
spreadsheets with validation rules set.  That doesn't get the branching
rules and checks (e.g. if male, skip questions about periods, PMT and
pregnancies), or at least, with my poor Excel skills it doesn't.  I just
skip a column to indicate page breaks in the q'aire, and I get, when I
can, two people to enter the data separately and then use R to compare
the two spreadsheets having yanked them into data frames.
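
That last comparison step can be sketched roughly like this, assuming
both entries share the same layout and an ID column (file and column
names invented):

entry1 <- read.csv("entry1.csv")
entry2 <- read.csv("entry2.csv")
entry2 <- entry2[match(entry1$ID, entry2$ID), ]    # align rows on the identifier
mismatch <- (entry1 != entry2) | (is.na(entry1) != is.na(entry2))
which(mismatch, arr.ind = TRUE)                    # row/column of each disagreement

The flagged cells can then be checked back against the paper forms.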

I would really, really love someone to develop (and perhaps replace) the
rather buggy edit() and fix() routines (they seem to hang on big data frames
in Rcmdr, which is what I'm trying to get students onto) with something
that did some or all of what SPSS/DE used to do for me or what I bodge now in
Excel.  If any generous coding whiz were willing to do this, I'll try to
alpha and beta test and write help etc.

There _may_ be good open source things out there that do what I need but
something that really integrated into R would be another huge step
forward in being able to phase out SPSS in my work settings and phase in R.

Very best all,

Chris
#
For windows users, EpiData Entry <http://www.epidata.dk/> is an
excellent (free) tool for data entry and documentation.    --Dale
On 6/8/07, Chris Evans <chrishold at psyctc.org> wrote:
#
Dale Steele wrote:
Note that EpiData seems to work well under linux using wine.
Frank
1 day later
#
That can be  elegantly handled in R through R's object oriented programming
by defining a class for the fancy input.  See this post:
  https://stat.ethz.ch/pipermail/r-help/2007-April/130912.html
for a simple example of that style.
On 6/9/07, Robert Wilkins <irishhacker at gmail.com> wrote:
This could be readily handled in R using object oriented programming.
You would specify a class for the strange input,
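
A bare-bones sketch of that style (the linked post may differ in detail;
all names here are invented): give the strange input its own class, with
a reader and a method that turns it into an ordinary data frame.

read.oddinput <- function(file) {
  # format-specific parsing would go here; keep the raw lines for now
  structure(list(lines = readLines(file)), class = "oddinput")
}
as.data.frame.oddinput <- function(x, ...) {
  # stub conversion: split on commas (assuming equal field counts)
  # and build a character data frame
  fields <- strsplit(x$lines, ",")
  as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
}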
#
On 10-Jun-07 02:16:46, Gabor Grothendieck wrote:
I hadn't heard of Vilno before (except as a variant of "Vilnius").
And it seems remarkably hard to find info about it from a Google
search. The best I've come up with, searching on

  vilno  data

is at
  http://www.xanga.com/datahelper

This is a blog site, apparently with postings by Robert Wilkins.

At the end of the Sunday, September 17, 2006 posting "Tedious
coding at the Pharmas" is a link:

  "I have created a new data crunching programming language."
   http://www.my.opera.com/datahelper

which appears to be totally empty. In another blog article:

  "go to the www.my.opera.com/datahelper site, go to the August 31
   blog article, and there you will find a tarball-file to download,
   called vilnoAUG2006package.tgz"

so again inaccessible; and a google on "vilnoAUG2006package.tgz"
gives a single hit which is simply the same article.

In the Xanga blog there are a few examples of tasks which are
no big deal in any programming language (and, relative to their
simplicity, appear a bit cumbersome in "Vilno"). 

I've not seen in the blog any instance of data transformation
which could not be quite easily done in any straightforward
language (even awk).
That's a fairly daunting description, though indeed not at all
extreme for the sort of data that can arise in practice (and
not just in pharmaceutical investigations). But the complexity
is in the situation, and, whatever language you use, the writing
of the program will involve the writer getting to grips with
the complexity, and the complexity will be present in the code
simply because of the need to accommodate all the special cases,
exceptions and faults that have to be anticipated in "feral" data.

Once these have been anticipated and incorporated in the code,
the actual transformations are again no big deal.

Frankly, I haven't yet seen anything in "Vilno" that couldn't be
accommodated in an 'awk' program. Not that I'm advocating awk for
universal use (I'm not that monolithic about it). But I'm using
it as my favourite example of a flexible, capable, transparent
and efficient data filtering language, as far as it goes.


SO: where can one find out more about Vilno, to see what it may
really be capable of that cannot be done so easily in other ways?


(As is implicit in many comments in Robert's blog, and indeed also
from many postings to this list over time and undoubtedly well
known to many of us in practice, a lot of the problems with data
files arise at the data gathering and entry stages, where people
can behave as if stuffing unpaired socks and unattributed underwear
randomly into a drawer, and then banging it shut).

Best wishes to all,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 10-Jun-07                                       Time: 09:28:10
------------------------------ XFMail ------------------------------
#
(Ted Harding) sent the following  at 10/06/2007 09:28:

... much snipped ...
And they look surprised when pointing a statistician at the chest of
drawers doesn't result in a cut price display worthy of Figleaf (or
Victoria's Secret I think for those of you in N.America) and get them
their degree, doctorate, latest publication ...

Ah me, how wonderfully, wonderfully ... sadly, accurate!

Thanks Ted, great thread, and I'm impressed with EpiData, which I've
discovered through this. I'd still like something that is even more
integrated with R, but maybe some day, if EpiData goes fully open source
as I think they are doing ("A full conversion plan to secure this and
convert the software to open-source has been made (See complete
description of license and principles)." at http://www.epidata.dk/, but
the link to http://www.epidata.dk/about.htm doesn't exactly clarify this,
I don't think. But I can hope.)

Thanks, yet again, to everyone who creates and contributes to the R
system and this list: wonderful!

C
#
Douglas Bates wrote:
Actually, I think Frank's point was subtly different: It is *because* of 
the differences in view that it sometimes seems difficult to find the 
way to do something in R that  is apparently straightforward in SAS. 
I.e. the solutions exist and are often elegant, but may require some 
lateral thinking.

Case in point: Finding the first or the last observation for each 
subject when there are multiple records for each subject. The SAS way 
would be a datastep with IF-THEN-DELETE, and a RETAIN statement so that 
you can compare the subject ID with the one from the previous record, 
working with data that are sorted appropriately.

You can do the same thing in R with a for loop, but there are better
ways, e.g.
subset(df, !duplicated(ID)) and subset(df, rev(!duplicated(rev(ID)))), or
maybe
do.call("rbind", lapply(split(df, df$ID), head, 1)), resp. tail. Or
something involving aggregate(). (The latter approaches generalize
better to other within-subject functionals like cumulative doses, etc.)
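
A tiny worked example of those idioms (data invented for illustration):

df <- data.frame(ID   = c(1, 1, 1, 2, 2, 3),
                 dose = c(10, 20, 30, 5, 15, 7))
subset(df, !duplicated(ID))                           # first record per subject
subset(df, rev(!duplicated(rev(ID))))                 # last record per subject
do.call("rbind", lapply(split(df, df$ID), head, 1))   # first record, via split()
df$cumdose <- ave(df$dose, df$ID, FUN = cumsum)       # within-subject cumulative dose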

The hardest cases that I know of are the ones where you need to turn one 
record into many, such as occurs in survival analysis with 
time-dependent, piecewise constant covariates. This may require 
"transposing the problem", i.e. for each  interval you find out which 
subjects contribute and with what, whereas the SAS way would be a 
within-subject loop over intervals containing an OUTPUT statement.
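
A rough base-R sketch of that record-splitting step, with invented column
names and cut points (a real analysis would also carry the covariate
values along):

expand_record <- function(rec, cuts) {
  # break one follow-up record into piecewise-constant intervals
  breaks <- sort(unique(c(rec$entry,
                          cuts[cuts > rec$entry & cuts < rec$exit],
                          rec$exit)))
  data.frame(id    = rec$id,
             start = head(breaks, -1),
             stop  = tail(breaks, -1),
             event = c(rep(0, length(breaks) - 2), rec$event))
}
one <- data.frame(id = 1, entry = 0, exit = 120, event = 1)
expand_record(one, cuts = c(30, 90, 180))   # three rows: 0-30, 30-90, 90-120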

Also, there are some really weird data formats, where e.g. the input 
format is different in different records. Back in the 80's, when
punched-card input was still common, it was quite popular to have one
card with background information on a patient plus several cards 
detailing visits, and you'd get a stack of cards containing both kinds. 
In R you would most likely split on the card type using grep() and then 
read the two kinds separately and merge() them later.
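
A hedged sketch of that grep()/merge() approach, with an invented card
layout ('P' for patient background, 'V' for visit records):

cards <- c("P,101,1951-03-07,F",
           "V,101,2007-01-15,130",
           "V,101,2007-02-20,127",
           "P,102,1948-11-30,M",
           "V,102,2007-01-18,141")
patients <- read.csv(textConnection(grep("^P,", cards, value = TRUE)),
                     header = FALSE, col.names = c("type", "id", "dob", "sex"))
visits   <- read.csv(textConnection(grep("^V,", cards, value = TRUE)),
                     header = FALSE, col.names = c("type", "id", "date", "sbp"))
merge(patients[-1], visits[-1], by = "id")   # one row per visit, background attached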
#
On 6/10/07, Ted Harding <ted.harding at nessie.mcc.ac.uk> wrote:

            
Not specifically R-related, but this would make a great fortune.

Sarah
#
Since R is supposed to be a complete programming language, I wonder
why these tools couldn't be implemented in R (unless speed is the
issue). Of course, it's a naive desire to have a single language that
does everything, but it seems that R currently has most of the
functions necessary to do the type of data cleaning described.

For instance, Gabor and Peter showed some snippets of ways to do this
elegantly; my [physical science] data is often not as horrendously
structured so usually I can get away with a program containing this
type of code

txtin <- scan(filename, what="", sep="\n")
filteredList <- lapply(strsplit(txtin, delimiter), FUN=filterfunction)
   # filterfunction() returns selected (and possibly transformed)
   # elements if present and NULL otherwise;
   # may include calls to grep(), regexpr(), gsub(), substring(),
   # nchar(), sprintf(), type.convert(), paste(), etc.
mydataframe <- do.call(rbind, filteredList)
   # then match(), subset(), aggregate(), etc.

In the case that the file is large, I open a file connection and scan
a single line + apply filterfunction() successively in a FOR-LOOP
instead of using lapply(). Of course, the devil is in the details of
the filtering function, but I believe most of the required text
processing facilities are already provided by R.
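
A minimal sketch of that connection-based loop, reusing the placeholder
filterfunction() from above:

con <- file("datafile.csv", open = "r")
rows <- list()
repeat {
  line <- readLines(con, n = 1)
  if (length(line) == 0) break                     # end of file
  rec <- filterfunction(strsplit(line, ",")[[1]])  # NULL for unwanted lines
  if (!is.null(rec)) rows[[length(rows) + 1]] <- rec
}
close(con)
mydataframe <- do.call(rbind, rows)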

I often have tasks that involve a combination of shell-scripting and
text processing to construct the data frame for analysis; I started
out using Python+NumPy to do the front-end work but have been using R
progressively more (frankly, all of it) to take over that portion
since I generally prefer the data structures and methods in R.
--- Peter Dalgaard <p.dalgaard at biostat.ku.dk> wrote:

            
#
On 10-Jun-07 14:04:44, Sarah Goslee wrote:
I'm not going to object to that!
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 10-Jun-07                                       Time: 21:18:45
------------------------------ XFMail ------------------------------
#
On 10-Jun-07 19:27:50, Stephen Tucker wrote:
In principle that is certainly true. A couple of comments,
though.

1. R's rich data structures are likely to be superfluous.
   Mostly, at the sanitisation stage, one is working with
   "flat" files (row & column). This straightforward format
   is often easier to handle using simple programs for the
   kind of basic filtering needed, rather then getting into
   the heavier programming constructs of R.

2. As a follow-on and a contrast at the same time, very often
   what should be a nice flat file with no rough edges is not.
   If there are variable numbers of fields per line, R will
   not handle it straightforwardly (you can force it in,
   but it's more elaborate). There are related issues as well.

a) If someone entering data into an Excel table lets their
   cursor wander outside the row/col range of the table,
   this can cause invisible entities to be planted in the
   extraneous cells. When saved as a CSV, this file then
   has variable numbers of fields per line, and possibly
   also extra lines with arbitrary blank fields.

   cat datafile.csv | awk 'BEGIN{FS=","}{n=NF;print n}'

   will give you the numbers of fields in each line.

   If you further pipe it into | sort -nu you will get
   the distinct field-numbers. If you know (by now) how many
   fields there should be (e.g. 10), then

   cat datafile.csv | awk 'BEGIN{FS=","} (NF != 10){print NR ", " NF}'

   will tell you which lines have the wrong number of fields,
   and how many fields they have. You can similarly count how
   many lines there are (e.g. pipe into wc -l).

b) People sometimes randomly use a blank space or a "." in a
   cell to denote a missing value. Consistent use of either
   is OK: ",," in a CSV will be treated as "NA" by R. The use
   of "." can be more problematic. If for instance you try to
   read the following CSV into R as a dataframe:

   1,2,.,4
   2,.,4,5
   3,4,.,6

   the "." in cols 2 and 3 is treated as the character ".",
   with the result that something complicated happens to
   the typing of the items.

   typeof(D[i,j]) is always integer. sum(D[1,1]) is 1, but
   sum(D[1,2]) gives a type error, even though the entry
   is in fact 2. And so on, in various combinations.

   And as.matrix(D) is of course a matrix of characters.

   In fact, columns 2 and 3 of D are treated as factors!

   for(i in (1:3)){ for(j in (1:4)){ print( (D[i,j]))}}
   [1] 1
   [1] 2
   Levels: . 2 4
   [1] .
   Levels: . 4
   [1] 4
   [1] 2
   [1] .
   Levels: . 2 4
   [1] 4
   Levels: . 4
   [1] 5
   [1] 3
   [1] 4
   Levels: . 2 4
   [1] .
   Levels: . 4
   [1] 6

   This is getting altogether too complicated for the job
   one wants to do!

   And it gets worse when people mix ",," and ",.,"!

   On the other hand, a simple brush with awk (or sed in
   this case) can sort it once and for all, without waking
   the sleeping dogs in R.

I could go on. R undoubtedly has the power, but it can very
quickly get over-complicated for simple jobs.

Best wishes to all,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 10-Jun-07                                       Time: 22:14:35
------------------------------ XFMail ------------------------------
#
An important potential benefit of R solutions, shared by awk, sed, ...,
is that they provide a reproducible way to document exactly how one got
from one version of the data to the next.  This seems to be the main
problem with handicraft methods like editing Excel files: it is too
easy to introduce new errors that can't be tracked down at later
stages of the analysis.


url:    www.econ.uiuc.edu/~roger                Roger Koenker
email   rkoenker at uiuc.edu                       Department of Economics
vox:    217-333-4558                            University of Illinois
fax:    217-244-6678                            Champaign, IL 61820
On Jun 10, 2007, at 4:14 PM, (Ted Harding) wrote:

            
#
Embarrassingly, I don't know awk or sed but R's code seems to be
shorter for most tasks than Python, which is my basis for comparison.

It's true that R's more powerful data structures usually aren't
necessary for the data cleaning, but sometimes in the filtering
process I will pick out lines that contain certain data, in which case
I have to convert text to numbers and perform operations like
which.min(), order(), etc., so in that sense I like to have R's
vectorized notation and the objects/functions that support it.

As far as some of the tasks you described, I've tried transcribing
them to R. I know you provided only the simplest examples, but even in
these cases I think R's functions for handling these situations
exemplify their usefulness in this step of the analysis. But perhaps
you would argue that this code is too long... In any event it will
still save the trouble of keeping track of an extra (intermediate)
file passed between awk and R.

(1) The numbers of fields in each line (the equivalent of the awk
one-liner above):

# R equivalent:
nFields <- count.fields("datafile.csv",sep=",")
# or 
nFields <- sapply(strsplit(readLines("datafile.csv"),","),length)

(2) which lines have the wrong number of fields, and how many fields
they have. You can similarly count how many lines there are (e.g. pipe
into wc -l).

# number of lines with the wrong number of fields
nWrongFields <- length(nFields[nFields != 10])

# select only first ten fields from each line
# and return a matrix
firstTenFields <- 
  do.call(rbind,
          lapply(strsplit(readLines("datafile.csv"),","),
                 function(x) x[1:10]))

# select only those lines which contain exactly ten fields
# and return a matrix
onlyTenFields <-
  do.call(rbind,
          lapply(strsplit(readLines("datafile.csv"),","),
                 function(x) if(length(x) == 10) x else NULL))

(3)
txtC <- textConnection(
"1,2,.,4
2,.,4,5
3,4,.,6")
# using read.csv(), specifying the na.strings argument:
read.csv(txtC, header=FALSE, na.strings=".")
  V1 V2 V3 V4
1  1  2 NA  4
2  2 NA  4  5
3  3  4 NA  6

# Of course, read.csv will work only if data is formatted correctly.
# More generally, using readLines(), strsplit(), etc., which are more
# flexible (re-create txtC first, since read.csv() has consumed it):
do.call(rbind,
        lapply(strsplit(readLines(txtC),","),
               type.convert, na.strings="."))
     [,1] [,2] [,3] [,4]
[1,]    1    2   NA    4
[2,]    2   NA    4    5
[3,]    3    4   NA    6

(4) Situations where people mix ",," and ",.,"!

# type.convert (and read.csv) will still work when missing values are ",,"
# and ",.," (automatically recognizes "" as NA and, through
# specification of 'na.strings', can recognize "." as NA)

# If it is desired to convert "." to "" first, this is simple as
# well:

m <- do.call(rbind,
        lapply(strsplit(readLines(txtC),","),
               function(x) gsub("^\\.$","",x)))
[,1] [,2] [,3] [,4]
[1,] "1"  "2"  ""   "4" 
[2,] "2"  ""   "4"  "5" 
[3,] "3"  "4"  ""   "6" 

# then
mode(m) <- "numeric"
# or
m <- apply(m,2,type.convert)
# will give
[,1] [,2] [,3] [,4]
[1,]    1    2   NA    4
[2,]    2   NA    4    5
[3,]    3    4   NA    6
--- Ted.Harding at manchester.ac.uk wrote:

            
#
Chris Evans wrote:

            
Perhaps what we need is an XML standard for describing record-oriented 
data and its validation? This could then be used to validate a set of 
records and possibly also to build input forms with built-in validation 
for new records.

  You could then write R code that did 'check this data frame against 
this XML description and tell me the invalid rows'. Or Python code.
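
A rough sketch of what such a check could look like; the XML parsing
layer is omitted and the description is hard-coded as an R list, with
all variable names and rules invented for illustration:

spec <- list(                      # would normally be read from the XML file
  id  = list(min = 1),
  sex = list(levels = c("M", "F")),
  age = list(min = 0, max = 110)
)
invalidRows <- function(df, spec) {
  bad <- rep(FALSE, nrow(df))
  for (v in names(spec)) {
    s <- spec[[v]]
    x <- df[[v]]
    if (!is.null(s$min))    bad <- bad | (!is.na(x) & x < s$min)
    if (!is.null(s$max))    bad <- bad | (!is.na(x) & x > s$max)
    if (!is.null(s$levels)) bad <- bad | (!is.na(x) & !(x %in% s$levels))
  }
  which(bad)                       # row numbers violating at least one rule
}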

  This is the kind of thing that is traditionally built using a database 
front-end, but keeping the description in XML means that alternate 
interfaces (web forms, standalone programs using Qt or GTK libraries) 
can be used on the same description set.

  I had a quick search to see if this kind of thing exists already, but 
google searches for 'data entry verification' indicate that I should 
really pay some people in India to do that kind of thing for me...

Barry
3 days later
#
As a tangent to this thread, there is a very relevant
article in the latest issue of the RSS magazine "Significance",
which I have just received:

  Dr Fisher's Casebook
  The trouble with data

Significance, Vol 4 (2007) Issue 2.

Full current contents at

http://www.blackwell-synergy.com/toc/sign/4/2

but unfortunately you can only read any of it by paying
money to Blackwell (unless you're an RSS member).

Best wishes to all,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 14-Jun-07                                       Time: 12:24:46
------------------------------ XFMail ------------------------------