Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
Could you time these and see how each of them does:
# 1
ta.split <- strsplit(ta, split = ",")
ta.num <- lapply(ta.split, function(x) as.numeric(x[-(1:2)]))
# 2
ta0 <- sub("^[^,]*,[^,]*,", "", ta)   # strip the first two fields
ta.num <- lapply(ta0, function(x) scan(text = x, sep = ",", quiet = TRUE))
# 3 - loop version of #1
n <- length(ta)
ta.split <- strsplit(ta, split = ",")
ta.num <- vector("list", n)
for(i in 1:n) ta.num[[i]] <- as.numeric(ta.split[[i]][-(1:2)])
# 4 - loop version of #2
n <- length(ta)
ta0 <- sub("^[^,]*,[^,]*,", "", ta)
ta.num <- vector("list", n)
for(i in 1:n) ta.num[[i]] <- scan(text = ta0[[i]], sep = ",", quiet = TRUE)
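A minimal way to compare them (a sketch; readLines() is assumed to have
pulled the whole file into memory first, with the header dropped) would
be to wrap each snippet in system.time() and look at the elapsed
component, e.g. for #1:

ta <- readLines("foo.csv")[-1]   # drop the header line
t1 <- system.time({
  ta.split <- strsplit(ta, split = ",")
  ta.num <- lapply(ta.split, function(x) as.numeric(x[-(1:2)]))
})["elapsed"]
t1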
On 12/6/05, John McHenry wrote:
I should have mentioned that I already tried the readLines() approach:
ta <- readLines("foo.csv")
ptm <- proc.time()
f <- character(length(ta) - 1)
for (k in 2:length(ta)) {
  # <- PARSING EACH LINE AT THIS LEVEL IS WHERE THE REAL INEFFICIENCY IS
  f[k - 1] <- strsplit(ta[k], ",")[[1]][3]
}
(proc.time() - ptm)[3]
[1] 102.75
on a 62MB file, so I'm guessing that on my 1GB files this will take
about 28 minutes ... which is way, way too long.
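For reference, a vectorized sketch of the same extraction (assuming ta
is the readLines() output above): strsplit() accepts the whole
character vector in one call, so no per-line loop is needed.

f <- sapply(strsplit(ta[-1], ",", fixed = TRUE), "[", 3)   # third field of every record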
I'm new to R but I'm kind of surprised that this problem isn't well known
(couldn't find anything after a long hunt).
As I mentioned, MATLAB does it using textread, which makes a call to its
dataread DLL. The data are read using something like:

[name, startMonth, data] = textread(fileName, '%s%n%[^\n]', ...
    'delimiter', ',', 'bufsize', 1000000, 'headerlines', 1);

which is kind of fscanf-like. data in the above is then a cell array,
with each cell holding the variable-length data.
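A rough R analogue of that textread pattern (just a sketch, not a
drop-in equivalent) would be something like:

lines <- readLines("foo.csv")[-1]              # drop the header line
parts <- strsplit(lines, ",", fixed = TRUE)    # split every record at the commas
name       <- sapply(parts, "[", 1)
startMonth <- as.numeric(sapply(parts, "[", 2))
data       <- lapply(parts, function(x) as.numeric(x[-(1:2)]))   # variable-length tails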
"Liaw, Andy" wrote:
Use a file() connection in conjunction with readLines() and strsplit()
to do it. I would try to count the number of lines in the file first,
create a list with that many components, then fill it in. I believe the
"array of cells" in Matlab is sort of equivalent to a list in R, but
that's beyond my knowledge of Matlab...
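Something along the lines of this sketch, perhaps (an interpretation of
the suggestion above, not tested on large files):

con <- file("foo.csv", open = "r")
ta <- readLines(con)                   # pull in the raw lines
close(con)

n   <- length(ta) - 1                  # number of records (header dropped)
out <- vector("list", n)               # create the list up front
for (i in seq_len(n)) {
  fields <- strsplit(ta[i + 1], ",", fixed = TRUE)[[1]]
  out[[i]] <- list(name = fields[1],
                   startMonth = as.numeric(fields[2]),
                   data = as.numeric(fields[-(1:2)]))
}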
Andy
From: John McHenry
I have very large csv files (up to 1GB each of ASCII text).
I'd like to be able to read them directly into R. The
problem I am having is with the variable length of the data
in each record.
Here's a (simplified) example:
$ cat foo.csv
Name,Start Month,Data
Foo,10,-0.5615,2.3065,0.1589,-0.3649,1.5955
Bar,21,0.0880,0.5733,0.0081,2.0253,-0.7602,0.7765,0.2810,1.8546,0.2696,0.3316,0.1565,-0.4847,-0.1325,0.0454,-1.2114
The records consist of rows with a fixed set of comma-separated
fields (e.g. the "Name" & "Start Month" fields in the above), after
which the data follow as a variable-length list of comma-separated
values until a newline is encountered.
Now I can use e.g.
fileName <- "foo.csv"
ta <- read.csv(fileName, header = FALSE, skip = 1, sep = ",", dec = ".", fill = TRUE)
which does the job nicely:
   V1 V2      V3     V4     V5      V6      V7     V8    V9
1 Foo 10 -0.5615 2.3065 0.1589 -0.3649  1.5955     NA    NA
2 Bar 21  0.0880 0.5733 0.0081  2.0253 -0.7602 0.7765 0.281
     V10    V11    V12    V13     V14     V15    V16     V17
1     NA     NA     NA     NA      NA      NA     NA      NA
2 1.8546 0.2696 0.3316 0.1565 -0.4847 -0.1325 0.0454 -1.2114
but the problem is that with files on the order of 1GB this
either crunches forever or runs out of memory trying ...
plus having all those NAs isn't too pretty to look at.
(I have a MATLAB version that can read this stuff into an
array of cells in about 3 minutes).
I really want a fast way to read the data part into a list;
that way I can access the data in the list of records by doing
something like ta[[i]]$data.
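Hand-built from the foo.csv example above, purely to illustrate the
target structure (not how to build it fast), that would look like:

ta <- list(
  list(name = "Foo", startMonth = 10,
       data = c(-0.5615, 2.3065, 0.1589, -0.3649, 1.5955)),
  list(name = "Bar", startMonth = 21,
       data = c(0.0880, 0.5733, 0.0081, 2.0253, -0.7602, 0.7765,
                0.2810, 1.8546, 0.2696, 0.3316, 0.1565, -0.4847,
                -0.1325, 0.0454, -1.2114))
)
ta[[2]]$data   # the variable-length data for the "Bar" record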
Ideas?
Thanks,
Jack.