
Way to handle variable length and numbers of columns using read.table(...)

6 messages · jim holtman, Jason Rupert, Gabor Grothendieck

#
I've got read.table to successfully read in my table of three columns.  Most of the time I will have a set number of rows, but sometimes that number will vary, and sometimes a row will contain only two fields, e.g.

Time Loc1 Loc2
1 22.33 44.55
2 66.77 88.99
3 222.33344.55
4 66.77 88.99

Is there any way to have read.table handle (1) a variable number of rows, and (2) rows that sometimes contain only two fields, as shown at Time = 3 above?

Just curious about how to handle this, and whether read.table is the right way to go about it or whether I should read in all the data and then try to parse it out as best I can.

Thanks again.
_                           
platform       i386-apple-darwin8.11.1     
arch           i386                        
os             darwin8.11.1                
system         i386, darwin8.11.1          
status                                     
major          2                           
minor          8.0                         
year           2008                        
month          10                          
day            20                          
svn rev        46754                       
language       R                           
version.string R version 2.8.0 (2008-10-20)
#
It's not clear exactly what the rules are for this, but if we assume
that numbers always end in a decimal point plus two digits, then
using strapply from the gsubfn package:
+ 1 22.33 44.55
+ 2 66.77 88.99
+ 3 222.33344.55
+ 4 66.77 88.99"
       [,1]   [,2]
[1,]  22.33  44.55
[2,]  66.77  88.99
[3,] 222.33 344.55
[4,]  66.77  88.99

See http://gsubfn.googlecode.com and for regular expressions see ?regex
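
The strapply call itself did not survive in this archive, so here is a minimal base-R sketch of the same idea. It is not the code from the original post: it swaps in regmatches() and gregexpr() for gsubfn's strapply, and it relies on the same assumption that every value ends in a decimal point plus two digits. The names Lines, rows, pat, vals and locs are illustrative.

```r
# Sample data from the question; row 3 has two values fused together.
Lines <- "Time Loc1 Loc2
1 22.33 44.55
2 66.77 88.99
3 222.33344.55
4 66.77 88.99"

# Split into lines and drop the header line.
rows <- strsplit(Lines, "\n")[[1]][-1]

# Assumed rule: optional digits, a dot, then exactly two digits.
pat <- "[0-9]*[.][0-9][0-9]"

# Pull every match out of each row and convert it to numeric.
vals <- lapply(regmatches(rows, gregexpr(pat, rows)), as.numeric)

# Stack the per-row vectors into a matrix, one row per input line.
locs <- do.call(rbind, vals)
print(locs)
```

Because the pattern requires a decimal point, the integer Time value is not matched here; only the two location columns are returned.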
On Mon, May 4, 2009 at 10:20 PM, Jason Rupert <jasonkrupert at yahoo.com> wrote:
#
Jim, 

You guessed it.  There are other "problems" with the data.  Here is a closer representation of the data:
Total time and location 
are listed below.

Time Loc1 Loc2
---------------
1 22.33 44.55
2 66.77 88.99
3 222.33344.55
4 66.77 88.99

Avg. Loc1 = 77.88
Avg. Loc2 = 55.66
Final Time = 4

Right now I am using "nrows" in order to only read Time 1-4 & "skip" to skip over the unusable header info, e.g.

read.table('clipboard', header=FALSE, fill=TRUE, skip=5, nrows=4)

Unfortunately, sometimes the number of "Time" rows varies, so I need to also account for that.  

Maybe I need to look into what Gabor suggested as well, i.e. library(gsubfn)

Thanks again for any feedback and advice on this one. The data I receive is out of my control, but I am also working to get them to fix it.
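
One possible way to cope with a varying number of "Time" rows, instead of hard-coding skip and nrows, is to read every line and keep only the ones that look like data. The sketch below assumes that data rows are the only lines that begin with a digit, which holds for the sample above but may not hold for every file. The variable names and the inline sample text are illustrative; with a real file, readLines() on the file path would replace the strsplit() call.

```r
# Sample text laid out like the file described above.
raw <- "Total time and location
are listed below.

Time Loc1 Loc2
---------------
1 22.33 44.55
2 66.77 88.99
3 222.33344.55
4 66.77 88.99

Avg. Loc1 = 77.88
Avg. Loc2 = 55.66
Final Time = 4"

# One element per line of input.
all_lines <- strsplit(raw, "\n")[[1]]

# Assumption: only data rows start with a digit, so the preamble,
# the column header, the dashed rule, and the summary lines are dropped.
data_rows <- grep("^[0-9]", all_lines, value = TRUE)

print(data_rows)
print(length(data_rows))
```

The selected lines can then be parsed with a regular expression or passed on to read.table, however many rows there turn out to be.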
--- On Mon, 5/4/09, jim holtman <jholtman at gmail.com> wrote:

            
As you can see, the value that has two decimal points is read in as a character string and causes the whole column to be converted to a factor.  It appears that you have some fixed-length fields that are overflowing.  You could read in the data and parse it with regular expressions; you just have to match a first part that has two decimal places and then extract the rest.  The question is, is this the only "problem" you have in the data?  If so, parsing it is not hard.
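
The output being referred to is not preserved in this archive, so the following is a small sketch that reproduces the behaviour described, not the original session. It sets stringsAsFactors = TRUE explicitly, since that was the default in R 2.8.0 but is no longer the default from R 4.0.0 onward. The names txt and dat are illustrative.

```r
# Sample data from the question; row 3 has two values fused together.
txt <- "Time Loc1 Loc2
1 22.33 44.55
2 66.77 88.99
3 222.33344.55
4 66.77 88.99"

# fill = TRUE pads the short row with NA instead of raising an error.
dat <- read.table(text = txt, header = TRUE, fill = TRUE,
                  stringsAsFactors = TRUE)

# Loc1 contains "222.33344.55", which is not a valid number, so the
# whole column is no longer numeric; Loc2 is numeric with an NA in row 3.
str(dat)
```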
#
The last line should be as follows (the previous post missed the
time column).
The regular expression says: either start from the beginning (^) and look
for a string of digits, [0-9]+, or look for digits [0-9]*, a dot [.] and two
more digits [0-9][0-9].  Each time strapply finds such a match,
as.numeric is applied to it.  Thus each line of input results in a numeric
vector, and we then simplify those vectors by rbind'ing them together.
     [,1]   [,2]   [,3]
[1,]    1  22.33  44.55
[2,]    2  66.77  88.99
[3,]    3 222.33 344.55
[4,]    4  66.77  88.99
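
The corrected call is missing from this archive. As a rough base-R equivalent of the regular expression described above, and not the original gsubfn code, the following sketch uses regmatches() and gregexpr() in place of strapply. The pattern is assembled from the pieces quoted in the explanation; the names rows, pat, m and result are illustrative.

```r
# Data rows from the question, without the header line.
rows <- c("1 22.33 44.55",
          "2 66.77 88.99",
          "3 222.33344.55",
          "4 66.77 88.99")

# Either a run of digits at the start of the line (the Time value),
# or optional digits, a dot, and exactly two more digits.
pat <- "^[0-9]+|[0-9]*[.][0-9][0-9]"

# All matches per row, as character vectors.
m <- regmatches(rows, gregexpr(pat, rows))

# Convert each row's matches to numeric and rbind them into a matrix.
result <- do.call(rbind, lapply(m, as.numeric))
print(result)
```

This only works cleanly if every row yields the same number of matches; a row with a missing value would need separate handling before the rbind step.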


On Mon, May 4, 2009 at 11:04 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote: