How do I read multiple rows of different lengths?

10 messages · smcguffee, David Winsemius, Phil Spector

#
Hi,
I have a data file with hundreds of rows. The rows come in repeating groups,
with successive lines in each group holding the numbers for x, y, fit, and
residuals, each preceded by a row name. The lines in any given group might be
anywhere from 10 to 20000 values long.

When I try 
fits=read.delim2("test")
Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
  more columns than column names

When I try
fits=readLines("test")
it reads the data, but doesn't separate it into values:
[1] "1\t30049\t30204\tsegment_4\t35\t."
[2] "bp\t30049\t30065\t30071\t30114\t30119\t30121\t30126\t30130\t30132\t30134\t30137\t30146\t30151\t30165\t30174\t30204\t"
[3] "origScore\t4\t1\t1\t2\t1\t1\t2\t2\t2\t6\t5\t2\t1\t2\t1\t2\t"
[4] "fit\t2.15669\t2.20976\t2.22514\t2.25842\t2.25336\t2.25082\t2.24318\t2.23576\t2.23161\t2.22718\t2.21999\t2.19465\t2.17818\t2.12334\t2.08169\t1.91129\t"
[5] "residuals\t1.84331\t-1.20976\t-1.22514\t-0.258424\t-1.25336\t-1.25082\t-0.243182\t-0.235756\t-0.23161\t3.77282\t2.78001\t-0.194654\t-1.17818\t-0.123344\t-1.08169\t0.0887071\t"

Can anyone help me do this?
Thanks,
Sean
#
On Oct 27, 2010, at 1:53 PM, smcguffee wrote:

            
Try adding
  , header=FALSE, quote="")
to the read arguments. (You have no header, and the leading and trailing
double-quotes result in everything getting read into one column.)
David Winsemius, MD
West Hartford, CT
#
fitLines=read.delim("testOut",header=FALSE,quote="")
worked pretty well.
I have a bunch of NA's that I don't want, but at least I can access the data
I do want.
Thank you,
Sean
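
For the unwanted NAs mentioned above, one possible approach is to strip the padding from a row after reading. This is only a sketch: the sample file and the helper name rowValues are made up for illustration, and it assumes the data itself contains no genuine missing values, since those would be dropped too.

```r
# Hypothetical sample file in the layout shown in the question
path <- tempfile()
writeLines(c("1\t30049\t30204\tsegment_4\t35\t.",
             "bp\t30049\t30065\t30071\t",
             "origScore\t4\t1\t1\t",
             "fit\t2.15669\t2.20976\t2.22514\t",
             "residuals\t1.84331\t-1.20976\t-1.22514\t"), path)

fitLines <- read.delim(path, header = FALSE, quote = "",
                       stringsAsFactors = FALSE)

# rowValues is an illustrative helper, not part of R:
# take row i, drop the label in column 1, convert to numbers,
# and discard the NA padding added to short rows
rowValues <- function(df, i) {
  v <- suppressWarnings(as.numeric(unlist(df[i, -1], use.names = FALSE)))
  v[!is.na(v)]
}

x   <- rowValues(fitLines, 2)
fit <- rowValues(fitLines, 4)
```

Non-numeric fields (such as segment_4 in the first line of each group) are also discarded by this helper, so it only suits the purely numeric rows.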
#
On Oct 27, 2010, at 3:27 PM, smcguffee wrote:

            
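I think read.table and its cousins read in only a modest fraction of the
data (the first five lines, according to ?read.table) to decide how many
columns to create.

You probably need to either preprocess with readLines if you want an
exact match of width, or maybe take a guess with  ... ,  col.names =
paste0("V", 1:250)  to create a sufficiently wide set of columns to
hold everything. (col.names is what ?read.table documents as setting the
width; colClasses = rep("numeric", 250) would fail on the text labels in
the first column.)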
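
Rather than guessing the width, it can be measured first. The following is a sketch under assumptions: the sample file is made up, and it relies on count.fields() to find the widest line and on col.names to make read.delim() create that many columns.

```r
# Hypothetical sample file: the sixth line is wider than the first five
path <- tempfile()
writeLines(c("1\t30049\t30204\tsegment_4\t35\t.",
             "bp\t30049\t30065\t30071\t",
             "origScore\t4\t1\t1\t",
             "fit\t2.15669\t2.20976\t2.22514\t",
             "residuals\t1.84331\t-1.20976\t-1.22514\t",
             "bp\t1\t2\t3\t4\t5\t6\t7\t8\t9\t"), path)

# Widest line in the whole file, not just the first five lines
width <- max(count.fields(path, sep = "\t", quote = ""))

# Supplying that many column names makes the table wide enough for every line
fitLines <- read.delim(path, header = FALSE, quote = "",
                       col.names = paste0("V", seq_len(width)),
                       stringsAsFactors = FALSE)
dim(fitLines)
```

Short rows are still padded with NA on the right, so this fixes the width problem but not the padding.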
#
Using the text itself works, although it is slightly annoying. I ended up just
processing the text each time I wanted a value inside it, which turns out to
be about the same thing. The key to processing the text ended up being the
command library("CGIwithR"), followed by scanText(textLine) to process each
text line. Below is an example where, if I change n, I can look at any section
of the data. It bewilders me that R doesn't make scanText available without
loading a library, but hey, this eventually worked for me and it only took
about a full day to figure out. I don't know of any other software that could
do the same task without the same effort in figuring out how to do it. Plus,
R is free and whatnot, so I think it is turning out to be worth the headache
of easy things not being obvious.
library("CGIwithR")   # provides scanText()
fitLines=readLines("testOut")
n=1
# numbers on line k of group n (6 lines per group), without the leading row name
vals=function(k) as.numeric(scanText(fitLines[(n-1)*6+k])[-1])
x=vals(2)
y=vals(3)
fit=vals(4)
res=vals(5)
plot(x,y)
lines(x,fit)
# line 6 of the group supplies the legend text
legend(x[1],max(y),legend=c(scanText(fitLines[(n-1)*6+6])))
#
Here's how I'd approach the problem. 
First, read the file:
Each different type of line can be identified by the 
string at the beginning of the line, based on these 
patterns.  It seems that you only want the numbers,
so we'll convert them as we extract them:
+    txt = grep(pat,a,value=TRUE)
+    lapply(lapply(strsplit(txt,'\t'),function(x)x[2:length(x)]),as.numeric)
+ }
So to reproduce your x, y, fit, and res:

x = result$x[[1]]
y = result$orig[[1]]
fit = result$fit[[1]]
res = result$res[[1]]

Of course, you can access any of the other results by changing the subscript.
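
The steps described above can be put together as a runnable sketch. Some of it is assumed rather than taken from the post: the sample data, the exact patterns, and the helper name getnums are illustrative guesses chosen to match the names result$x, result$orig, result$fit, and result$res used above.

```r
# Hypothetical sample data: two groups in the layout from the question
path <- tempfile()
writeLines(c(
  "1\t30049\t30204\tsegment_4\t35\t.",
  "bp\t30049\t30065\t30071\t",
  "origScore\t4\t1\t1\t",
  "fit\t2.15669\t2.20976\t2.22514\t",
  "residuals\t1.84331\t-1.20976\t-1.22514\t",
  "2\t100\t140\tsegment_5\t12\t.",
  "bp\t100\t120\t",
  "origScore\t3\t5\t",
  "fit\t3.5\t4.5\t",
  "residuals\t-0.5\t0.5\t"
), path)

# First, read the file
a <- readLines(path)

# Assumed patterns: one regular expression per line type
patterns <- c(x = "^bp", orig = "^origScore", fit = "^fit", res = "^residuals")

# Assumed helper name: select the matching lines, split them on tabs,
# drop the leading label, and convert the remaining fields to numbers
getnums <- function(pat) {
  txt <- grep(pat, a, value = TRUE)
  lapply(strsplit(txt, "\t"), function(x) as.numeric(x[-1]))
}

# One list per line type, each holding one numeric vector per group
result <- lapply(patterns, getnums)

x   <- result$x[[1]]
y   <- result$orig[[1]]
fit <- result$fit[[1]]
res <- result$res[[1]]
```

Because strsplit() drops the empty field left by a trailing tab, the vectors come back without NA padding, and result$x[[2]] gives the x values of the second group.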

 					- Phil Spector
 					 Statistical Computing Facility
 					 Department of Statistics
 					 UC Berkeley
 					 spector at stat.berkeley.edu