
Value Lookup from File without Slurping

18 messages · Gundala Viswanath, Carlos J. Gil Bellosta, Simon Pickett +7 more

#
Dear all,

I have a repository file (let's call it repo.txt)
that contains two columns like this:

# tag  value
AAA    0.2
AAT    0.3
AAC    0.02
AAG    0.02
ATA    0.3
ATT    0.7

Given another query vector (say, the tags AAC and ATT),
I would like to find the corresponding value for each query,
yielding:

0.02
0.7

However, I want to avoid slurping the whole of repo.txt into an object (e.g. a hash).
Is there any way to do that?

The reason is that repo.txt is very, very large (millions of lines,
with tag length > 30 bp), and my PC's memory is too small to hold it.

- Gundala Viswanath
Jakarta - Indonesia
#
On Fri, 2009-01-16 at 18:02 +0900, Gundala Viswanath wrote:
Hello,

You can always store your repo.txt in a database, say SQLite, and
select only the values you want via an SQL query.

That way you avoid loading the full file into memory.

Best regards,

Carlos J. Gil Bellosta
http://www.datanalytics.com
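A minimal sketch of this approach using DBI/RSQLite (the table and file names are illustrative, and the tiny inline table stands in for repo.txt; for a file too big to pass through R even once, SQLite's command-line .import can do the load outside R):

```r
# Load the tag/value table into SQLite once, then answer each lookup
# with a query, so the full table never has to sit in R's memory.
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "repo.db")

# One-time load of a small example table (stands in for repo.txt).
repo <- read.table(text = "AAA 0.2\nAAC 0.02\nATT 0.7",
                   col.names = c("tag", "value"))
dbWriteTable(con, "repo", repo, overwrite = TRUE)

# Only the matching rows ever come back into R.
hits <- dbGetQuery(con,
                   "SELECT value FROM repo WHERE tag IN ('AAC', 'ATT')")
print(hits$value)

dbDisconnect(con)
unlink("repo.db")
```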
#
you might try to iteratively read a limited number of lines in a
batch using readLines:

# filename, the name of your file
# n, the maximal count of lines to read in a batch
connection = file(filename, open="rt")
while (length(lines <- readLines(con=connection, n=n))) {
   # do your stuff here
}
close(connection)

?file
?readLines
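Putting this batch skeleton to work on the lookup problem, a self-contained sketch (the example file contents, chunk size, and query tags are illustrative):

```r
# Chunked lookup with readLines(): only `n` lines are held in memory
# at a time.  A tiny repo.txt is written first to keep the sketch
# self-contained.
writeLines(c("AAA 0.2", "AAT 0.3", "AAC 0.02",
             "AAG 0.02", "ATA 0.3", "ATT 0.7"), "repo.txt")

queries <- c("AAC", "ATT")
n <- 2                      # batch size; use a few thousand in practice
out <- numeric()

con <- file("repo.txt", open = "rt")
while (length(lines <- readLines(con, n = n))) {
  fields <- strsplit(lines, "[[:space:]]+")   # split each line into tag/value
  tags   <- sapply(fields, `[`, 1)
  vals   <- as.numeric(sapply(fields, `[`, 2))
  out    <- c(out, vals[tags %in% queries])
}
close(con)

print(out)   # 0.02 0.7
```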

vQ
Gundala Viswanath wrote:
#
The sqldf package can read a large file into a database without going
through R, and then extract just the part you want. The package makes it
easier to use RSQLite by setting up the database for you and, after
extracting the portion you want, removing the database automatically.
You can specify all this in two lines: one to name the file and one to
specify the extraction using SQL. See Example 6 on the home page:

http://sqldf.googlecode.com#Example_6._File_Input

On Fri, Jan 16, 2009 at 4:12 AM, Carlos J. Gil Bellosta
<cgb at datanalytics.com> wrote:
#
Hi all,

I want to calculate the number of unique observations of "y" in each level 
of "x" from my data frame "df".

This does the job, but it is very slow for this big data frame (159,503 rows,
11 columns):

group.list <- split(df$y,df$x)
count <- function(x) length(unique(na.omit(x)))
sapply(group.list, count, USE.NAMES=TRUE)

I couldn't find the answer searching for "slow split" and "split time" on
the help forum.

I am running R version 2.2.1, on a machine with 4 GB of memory, and I'm using
Windows 2000.

thanks in advance,

Simon.







----- Original Message ----- 
From: "Wacek Kusnierczyk" <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no>
To: "Gundala Viswanath" <gundalav at gmail.com>
Cc: "R help" <R-help at stat.math.ethz.ch>
Sent: Friday, January 16, 2009 9:30 AM
Subject: Re: [R] Value Lookup from File without Slurping
#
Something like this should work:

library(R.utils)               # for countLines()
out <- numeric()
qr <- c("AAC", "ATT")
n <- countLines("test.txt")
con <- file("test.txt", "r")
for (i in 1:n) {
    line <- readLines(con, n = 1)
    fields <- strsplit(line, split = "[[:space:]]+")[[1]]
    if (is.element(fields[1], qr)) {
        out <- c(out, as.numeric(fields[2]))
    }
}
close(con)

You may want to improve execution speed by reading the data in chunks
instead of line by line; the code would need only a little modification.
Carlos J. Gil Bellosta wrote:
#
df <- data.frame(x = sample(7:9, 100, rep = TRUE),
                 y = sample(1:5, 100, rep = TRUE))
fun <- function(x) length(unique(x))
by(df$y, df$x, fun)
Simon Pickett wrote:
#
if the file is really large, reading it twice (one pass to count the
lines, a second to process them) may add a considerable penalty.
If this is a one-go task, counting the lines does not pay, so why
bother.  If this is a repetitive task, a database-based solution will
probably be a better idea.

vQ
#
I agree on the database solution.
Databases are the right tool to solve this kind of problem.
Only consider the start-up cost of setting up the database: this could
be a very time-consuming task for someone not familiar with database
technology.

Using file() does not really read the whole file. This function
simply opens a connection to the file without reading it.
countLines() should do something like "wc -l" in a bash shell.

I would say that if this is a one-time job, this solution should work,
even though it is not the fastest. If the job is a repetitive one,
then a database solution is surely better.

A.
Wacek Kusnierczyk wrote:
#
Hi,

R version 2.2.1 is quite old. You may want to upgrade to the current version, R 2.8.1!

You can for example do

library(doBy)
dd <- data.frame(x=c(1,1,1,2,2,2), y=c(1,1,2, 1,1,1))
summaryBy(y~x, data=dd, FUN=function(x)length(unique(x)))
 
Regards
Søren


-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On behalf of Simon Pickett
Sent: 16 January 2009 11:10
To: R help
Subject: [R] faster version of split()?

#
r at quantide.com wrote:
and it won't pay if you want to do the lookup just once.
... and wc can count the lines of a file without loading it into memory

vQ
#
r at quantide.com wrote:
just for a test:

cat(rep('', 10^7), file='test.txt', fill=1)
library(R.utils)
system.time(countLines('test.txt'))

... and the file is just about 30MB (and it makes no real difference if
it is stuffed with newlines or not).

vQ
#
On Fri, Jan 16, 2009 at 5:52 AM, r at quantide.com <r at quantide.com> wrote:
Using sqldf, as mentioned previously on this thread, lets you use
the SQLite database with no setup at all.  sqldf automatically creates
the database, generates the record layout, and loads the file (not through
R but outside of R, so R does not slow it down).  It then extracts the
portion you want into R by issuing the appropriate calls to RSQLite/DBI,
and destroys the database afterwards, all automatically.  When you
install sqldf, it automatically installs RSQLite and the SQLite database
itself, so the entire installation is just one line: install.packages("sqldf")
#
Hi Gabor,

Do you mean that storing the data via "sqldf" doesn't take memory?
For example, I have a 3 GB data file. With a standard R object created
by read.table(), the object size will roughly double to ~6 GB. My
current 4 GB of RAM cannot handle that.

Do you mean that with "sqldf" this is not an issue?
Why is that?

Sorry for my naive question.

- Gundala Viswanath
Jakarta - Indonesia



On Fri, Jan 16, 2009 at 9:09 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
#
Only the portion you extract is ever in R -- the file itself is read
into a database
without ever going through R, so your memory requirements correspond to what
you extract, not the size of the file.
On Fri, Jan 16, 2009 at 10:49 AM, Gundala Viswanath <gundalav at gmail.com> wrote:
#
Henrique's solution seems sensible. Another might be:

 > df = data.frame(x = sample(7:9, 10, rep = T), y = sample(1:5, 10, rep = T))
 > table(df)
    y
x   1 2 3 4 5
   7 1 0 1 0 2
   8 0 1 0 0 1
   9 0 1 1 2 0

 > rowSums(table(df) >0)
7 8 9
3 2 3


#---------same as Henrique's--------
 > count <- function(x) length(unique(na.omit(x)))
 > with(df, tapply(y, x, count))
7 8 9
3 2 3
#
Simon Pickett wrote:
wouldn't something like this do

with(df,table(x, is.na(y)))[,1]

or

with(df, tapply(!is.na(y), x, sum))

?