
Value Lookup from File without Slurping

18 messages · Gundala Viswanath, Carlos J. Gil Bellosta, Simon Pickett +7 more

#
Dear all,

I have a repository file (let's call it repo.txt)
that contains two columns like this:

# tag  value
AAA    0.2
AAT    0.3
AAC    0.02
AAG    0.02
ATA    0.3
ATT    0.7

Given another query vector (say, the tags AAC and ATT),
I would like to find the corresponding value for each query,
yielding:

0.02
0.7

However, I want to avoid slurping the whole of repo.txt into an object (e.g. a hash).
Is there any way to do that?

The reason is that repo.txt is very, very large (millions of lines,
with tag length > 30 bp), and my PC's memory is too small to hold it.

- Gundala Viswanath
Jakarta - Indonesia
#
On Fri, 2009-01-16 at 18:02 +0900, Gundala Viswanath wrote:
Hello,

You can always store your repo.txt in a database, say SQLite, and
select only the values you want via an SQL query.

That way you avoid loading the full file into memory.

Best regards,

Carlos J. Gil Bellosta
http://www.datanalytics.com
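A minimal sketch of this approach using DBI/RSQLite (the table and file names are illustrative, and the tiny inline table stands in for repo.txt; for a file too big to pass through R even once, SQLite's command-line .import can do the load outside R):

```r
# Load the tag/value table into SQLite once, then answer each lookup
# with a query, so the full table never has to sit in R's memory.
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "repo.db")

# One-time load of a small example table (stands in for repo.txt).
repo <- read.table(text = "AAA 0.2\nAAC 0.02\nATT 0.7",
                   col.names = c("tag", "value"))
dbWriteTable(con, "repo", repo, overwrite = TRUE)

# Only the matching rows ever come back into R.
hits <- dbGetQuery(con,
                   "SELECT value FROM repo WHERE tag IN ('AAC', 'ATT')")
print(hits$value)

dbDisconnect(con)
unlink("repo.db")
```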
#
you might try to iteratively read a limited number of lines in a
batch using readLines:

# filename, the name of your file
# n, the maximal count of lines to read in a batch
connection = file(filename, open="rt")
while (length(lines <- readLines(con=connection, n=n))) {
   # do your stuff here
}
close(connection)

?file
?readLines
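Putting this batch skeleton to work on the lookup problem, a self-contained sketch (the example file contents, chunk size, and query tags are illustrative):

```r
# Chunked lookup with readLines(): only `n` lines are held in memory
# at a time.  A tiny repo.txt is written first to keep the sketch
# self-contained.
writeLines(c("AAA 0.2", "AAT 0.3", "AAC 0.02",
             "AAG 0.02", "ATA 0.3", "ATT 0.7"), "repo.txt")

queries <- c("AAC", "ATT")
n <- 2                      # batch size; use a few thousand in practice
out <- numeric()

con <- file("repo.txt", open = "rt")
while (length(lines <- readLines(con, n = n))) {
  fields <- strsplit(lines, "[[:space:]]+")   # split each line into tag/value
  tags   <- sapply(fields, `[`, 1)
  vals   <- as.numeric(sapply(fields, `[`, 2))
  out    <- c(out, vals[tags %in% queries])
}
close(con)

print(out)   # 0.02 0.7
```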

vQ
Gundala Viswanath wrote:
#
The sqldf package can read a large file into a database without going
through R, and then extract just the part you want. The package makes it
easier to use RSQLite by setting up the database for you and, after
extracting the portion you want, removing the database automatically.
You can specify all this in two lines: one to name the file and one to
specify the extraction using SQL. See Example 6 on the home page:

http://sqldf.googlecode.com#Example_6._File_Input

On Fri, Jan 16, 2009 at 4:12 AM, Carlos J. Gil Bellosta
<cgb at datanalytics.com> wrote:
#
Hi all,

I want to calculate the number of unique observations of "y" in each level 
of "x" from my data frame "df".

This does the job, but it is very slow for this big data frame (159,503 rows,
11 columns):

group.list <- split(df$y,df$x)
count <- function(x) length(unique(na.omit(x)))
sapply(group.list, count, USE.NAMES=TRUE)

I couldn't find the answer searching for "slow split" and "split time" on
the help forum.

I am running R version 2.2.1, on a machine with 4 GB of memory, and I'm using
Windows 2000.

thanks in advance,

Simon.







----- Original Message ----- 
From: "Wacek Kusnierczyk" <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no>
To: "Gundala Viswanath" <gundalav at gmail.com>
Cc: "R help" <R-help at stat.math.ethz.ch>
Sent: Friday, January 16, 2009 9:30 AM
Subject: Re: [R] Value Lookup from File without Slurping
#
Something like this should work:

library(R.utils)               # for countLines()
out <- numeric()
qr <- c("AAC", "ATT")
n <- countLines("test.txt")
con <- file("test.txt", "r")
for (i in 1:n) {
    line <- readLines(con, n = 1)
    fields <- strsplit(line, split = "[[:space:]]+")[[1]]
    if (is.element(fields[1], qr)) {
        out <- c(out, as.numeric(fields[2]))
    }
}
close(con)

You may want to improve execution speed by reading the data in chunks
instead of line by line; the code would need only a little modification.
Carlos J. Gil Bellosta wrote:
#
df <- data.frame(x = sample(7:9, 100, rep = TRUE),
                 y = sample(1:5, 100, rep = TRUE))
fun <- function(x) length(unique(x))
by(df$y, df$x, fun)
Simon Pickett wrote:
#
if the file is really large, reading it twice (one pass to count the
lines, a second to process them) may add a considerable penalty.
If this is a one-go task, counting the lines does not pay, so why
bother.  If this is a repetitive task, a database-based solution will
probably be a better idea.

vQ
#
I agree on the database solution.
Databases are the right tool to solve this kind of problem.
Only consider the start-up cost of setting up the database: this could
be a very time-consuming task for someone not familiar with database
technology.

Using file() does not really read the whole file. This function
simply opens a connection to the file without reading it.
countLines() should do something like "wc -l" in a bash shell.

I would say that if this is a one-time job, this solution should work,
even though it is not the fastest. If the job is a repetitive one,
then a database solution is surely better.

A.
Wacek Kusnierczyk wrote:
#
Hi,

R version 2.2.1 is quite old. You may want to upgrade to the current version, R 2.8.1!

You can for example do

library(doBy)
dd <- data.frame(x=c(1,1,1,2,2,2), y=c(1,1,2, 1,1,1))
summaryBy(y~x, data=dd, FUN=function(x)length(unique(x)))
 
Regards
Søren


-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On behalf of Simon Pickett
Sent: 16 January 2009 11:10
To: R help
Subject: [R] faster version of split()?

#
r at quantide.com wrote:
and it won't pay if you want to do the lookup just once.
... and wc can count the lines of a file without loading it into memory

vQ
#
r at quantide.com wrote:
just for a test:

cat(rep('', 10^7), file='test.txt', fill=1)
library(R.utils)
system.time(countLines('test.txt'))

... and the file is just about 30MB (and it makes no real difference if
it is stuffed with newlines or not).

vQ
#
On Fri, Jan 16, 2009 at 5:52 AM, r at quantide.com <r at quantide.com> wrote:
Using sqldf, as mentioned previously on this thread, lets you use
the SQLite database with no setup at all.  sqldf automatically creates
the database, generates the record layout, and loads the file (not through
R but outside of R, so R does not slow it down).  It then extracts the
portion you want into R by issuing the appropriate calls to RSQLite/DBI,
and destroys the database afterwards, all automatically.  When you
install sqldf, it automatically installs RSQLite and the SQLite database
itself, so the entire installation is just one line: install.packages("sqldf")
#
Hi Gabor,

Do you mean that storing the data via "sqldf" doesn't take memory?
For example, I have a 3 GB data file. With a standard R object created
by read.table(), the object size will roughly double to ~6 GB. My
current 4 GB of RAM cannot handle that.

Do you mean that with "sqldf" this is not an issue?
Why is that?

Sorry for my naive question.

- Gundala Viswanath
Jakarta - Indonesia



On Fri, Jan 16, 2009 at 9:09 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
#
Only the portion you extract is ever in R -- the file itself is read
into a database
without ever going through R, so your memory requirements correspond to what
you extract, not the size of the file.
On Fri, Jan 16, 2009 at 10:49 AM, Gundala Viswanath <gundalav at gmail.com> wrote:
#
Henrique's solution seems sensible. Another might be:

 > df = data.frame(x = sample(7:9, 10, rep = T), y = sample(1:5, 10, rep = T))
 > table(df)
    y
x   1 2 3 4 5
   7 1 0 1 0 2
   8 0 1 0 0 1
   9 0 1 1 2 0

 > rowSums(table(df) >0)
7 8 9
3 2 3


#---------same as Henrique's--------
 > count <- function(x) length(unique(na.omit(x)))
 > with(df, tapply(y, x, count))
7 8 9
3 2 3
#
Simon Pickett wrote:
wouldn't something like this do

with(df,table(x, is.na(y)))[,1]

or

with(df, tapply(!is.na(y), x, sum))

?