
sqldf for Very Large Tab Delimited Files

15 messages · HC, Gabor Grothendieck, jim holtman +1 more

HC
#
Hi All,

I have a very (very) large tab-delimited text file without headers. There
are only 8 columns and millions of rows. I want to make numerous pieces of
this file by sub-setting it for individual stations. The station is given in
the first column. I am trying to learn and use the sqldf package for this but
am stuck in a couple of places.

To simulate my requirement, I have taken the iris dataset as an example and
done the following:
(1) create a tab-delimited file without headers
(2) read it using the read.csv.sql command
(3) write the result of a query that gets the first 10 records

Here is the reproducible code that I am trying:
library(sqldf)
# Text data file
write.table(iris, "irisNoH.txt", sep = "\t", quote = FALSE,
            col.names = FALSE, row.names = FALSE)
# create an empty database called mytestdb (can skip this step if it already exists)
sqldf("attach 'mytestdb' as new")
f1 <- file("irisNoH.txt")
attr(f1, "file.format") <- list(header = FALSE, sep = "\t")
# read into a table called irisTab1 in the mytestdb SQLite database
read.csv.sql("irisNoH.txt",
             sql = "create table main.irisTab1 as select * from file",
             dbname = "mytestdb")
res1 <- sqldf("select * from main.irisTab1 limit 10", dbname = "mytestdb")
write.table(res1, "iris10.txt", sep = "\t", quote = FALSE,
            col.names = FALSE, row.names = FALSE)

# For querying records of a particular species - unresolved problems
#a1<-"virginica"
#attr(f1, "names") <- c("A1","A2","A3","A4","A5")
#res2<-fn$sqldf("select * from main.irisTab1 where A5 = '$a1'")

In the above, I am not able to:
(1) assign names to the various columns
(2) query for a particular value of a column, in this case a particular
species, say virginica
(3) I guess fn$sqldf can do the job, but it requires the column names to be
assigned first (a small illustration of the fn$ substitution is sketched below)
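
(As a minimal illustration of the fn$ substitution from gsubfn, which sqldf
loads, here is a sketch on the built-in iris data frame rather than on my
file; this is the kind of query I am after:

library(sqldf)
a1 <- "virginica"
fn$sqldf("select count(*) from iris where Species = '$a1'")

The part I cannot work out is how to do the same thing against the
file-backed table, where the column names have not been assigned.)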

Any help would be most appreciated.

Thanks
HC




Gabor Grothendieck
#
On Wed, Feb 1, 2012 at 11:57 PM, HC <hcatbr at yahoo.co.in> wrote:
Ignoring your iris file for a moment, to query the 5th column (getting
its name via sql rather than via R) we can do this:

library(sqldf)
species <- "virginica"
nms <- names(dbGetQuery(con, "select * from iris limit 0"))
fn$dbGetQuery(con, "select * from iris where `nms[5]` = '$species' limit 3")

Now, sqldf is best used when you are getting the data from R, but if
you want to store it in a database and just leave it there, then you
might be better off using RSQLite directly, like this (the eol = "\r\n"
in the dbWriteTable statement was needed on my Windows system, but you
may not need it depending on your platform):


write.table(iris, "irisNoH.txt", sep = "\t", quote = FALSE,
            col.names = FALSE, row.names = FALSE)

library(sqldf)
library(RSQLite)

con <- dbConnect(SQLite(), dbname = "mytestdb")

dbWriteTable(con, "iris", "irisNoH.txt", sep = "\t", eol = "\r\n")

species <- "virginica"
nms <- names(dbGetQuery(con, "select * from iris limit 0"))
fn$dbGetQuery(con, "select * from iris where `nms[5]` = '$species' limit 3")

dbDisconnect(con)
Gabor Grothendieck
#
On Thu, Feb 2, 2012 at 3:11 AM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
There seems to have been a pasting error here.  The first part was
intended to show how to do this using sqldf and the second using
RSQLite.    Thus the first part was intended to be:

library(sqldf)
species <- "virginica"

# obviously we could just do nms <- names(iris) but to get
# names from database instead
nms <- names(dbGetQuery(con, "select * from iris limit 0"))

# use 5th column
fn$sqldf("select * from iris where `nms[5]` = '$species' limit 3")


and the second part, which illustrates RSQLite, was OK. Note that fn$
comes from the gsubfn package, which sqldf loads.
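
A fully self-contained version of that first part would be (just a sketch;
it avoids the con connection entirely, since each sqldf call loads the iris
data frame into a temporary database of its own):

library(sqldf)
species <- "virginica"
nms <- names(sqldf("select * from iris limit 0"))
fn$sqldf("select * from iris where `nms[5]` = '$species' limit 3")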
HC
#
Hi Gabor,

Thank you very much for your guidance and help.

I could run the following code successfully on a 500 MB test data file. A
snapshot of the data file is attached herewith.

********code start***************
library(sqldf)
library(RSQLite)

iFile <- "Test100.txt"
con <- dbConnect(SQLite(), dbname = "myTest100")
dbWriteTable(con, "TestDB100", iFile, sep = "\t") #, eol = "\r\n")
nms <- names(dbGetQuery(con, "select * from TestDB100 limit 0"))

nRec <- dbGetQuery(con, "select count(*) from TestDB100")[1, 1]
aL1 <- 0   # SQLite offsets are 0-based, so start at 0 rather than 1

while (aL1 < nRec) {
  # fetch the next unseen record and take its station id from the first column
  res1 <- fn$dbGetQuery(con, "select * from TestDB100 limit '$aL1', 1")
  istn <- res1[1, 1]
  # pull every record for that station and write them to their own file
  res1 <- fn$dbGetQuery(con, "select * from TestDB100 where `nms[1]` = '$istn'")
  icount <- dim(res1)[1]
  oFile <- paste(istn, "_Test.txt", sep = "")
  write.table(res1, oFile, sep = "\t", quote = FALSE, col.names = FALSE,
              row.names = FALSE)
  aL1 <- aL1 + icount
}
dbDisconnect(con)
********code end***************
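
A variant of the same loop, sketched here under the same assumption that the
station id is in the first column, would be to iterate over the distinct
station values directly instead of stepping through offsets:

stations <- fn$dbGetQuery(con, "select distinct `nms[1]` from TestDB100")[, 1]
for (istn in stations) {
  res1 <- fn$dbGetQuery(con, "select * from TestDB100 where `nms[1]` = '$istn'")
  write.table(res1, paste(istn, "_Test.txt", sep = ""), sep = "\t",
              quote = FALSE, col.names = FALSE, row.names = FALSE)
}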

However, the actual data file that I want to handle is about *160 GB*, and
when I use the same code on that file, the dbWriteTable(con, ...) statement
gives the following error:
********error start**************
dbWriteTable(con, "TestDB", iFile, sep = "\t") #, eol = "\r\n")
Error in try({ : RS-DBI driver: (RS_sqlite_getline could not realloc)
[1] FALSE
********error end**************

I am not sure about the reason for this error. Is it due to the big file
size? I understood from the sqldf web page that SQLite can work with even
larger files than this and is restricted only by disk space, not RAM. I have
about 400 GB of free space on the PC I am using, with Windows 7 as the
operating system, so I am assuming that the above dbWriteTable command uses
only disk space and that this is not the issue.

In fact, this file was created using mysqldump and I do not have access
to the original MySQL database.
I want to know the following:
(1) Am I missing something in the above code that prevents it from handling
this big 160 GB file?
(2) Should this be handled outside of R, if R itself is the limitation?
If so, what is a possible way forward?

Thank you again for your quick response and all the help.
HC
http://r.789695.n4.nabble.com/file/n4353362/Test100.txt Test100.txt 
 





Gabor Grothendieck
#
On Thu, Feb 2, 2012 at 8:07 PM, HC <hcatbr at yahoo.co.in> wrote:
I think it's unlikely SQLite could handle a database that large unless
you can divide it into multiple separate databases. At one time the
SQLite site said it did not handle databases over 1 GB, and although I
think that is outdated by more recent versions of SQLite, it's still
likely true that your size is too large for it.
HC
#
Thank you for indicating that SQLite may not handle a file as big as 160 GB.

Would you know of any utility for *physically splitting* the 160 GB text
file into pieces, ideally one where the split can be controlled so that it
falls at the end of a record?

Thank you again.
HC

#
On Fri, Feb 3, 2012 at 6:03 AM, HC <hcatbr at yahoo.co.in> wrote:
If they are CSV files or similar data files, then you could use R or
any scripting language to do that.
#
On Fri, Feb 3, 2012 at 7:37 AM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
Or even the *nix `split` command ...

-steve
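
For example, called from R (just a sketch; it assumes GNU coreutils' split is
available, e.g. via Cygwin on Windows, and the input file name is only a
placeholder):

# split on whole lines, 1,250,000 per piece, so no record is cut in half;
# -d (a GNU extension) gives numeric suffixes like Split_00, Split_01, ...
system("split -l 1250000 -d myfile.txt Split_")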
#
On Fri, Feb 3, 2012 at 8:08 AM, HC <hcatbr at yahoo.co.in> wrote:
You can process a file N lines at a time like this:

con <- file("myfile.dat", "r")
while(length(Lines <- readLines(con, n = N)) > 0) {
  ... whatever...
}
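
For the station-splitting use case, the "... whatever ..." part could look
like this (a sketch; the chunk size, the file names and the assumption that
the station id is the first tab-delimited field are all placeholders):

con <- file("myfile.dat", "r")
N <- 1e6
while (length(Lines <- readLines(con, n = N)) > 0) {
  stn <- sub("\t.*", "", Lines)   # text before the first tab = station id
  for (s in unique(stn)) {
    # append this chunk's rows to one file per station
    # (assumes the *_Test.txt files do not already exist)
    out <- file(paste(s, "_Test.txt", sep = ""), "a")
    writeLines(Lines[stn == s], out)
    close(out)
  }
}
close(con)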
HC
#
Thank you.

The readLines command is working fine and I am able to read 10^6 lines in
one go and write them out using the write.table command.

Does readLines read in blocks to optimize I/O, or does it go line by line?

Steve has mentioned the *nix split command. Would there be any speed
benefit compared to readLines?

Thank you.
HC

HC
#
Bad news!

The readLines command works fine up to a certain limit. Once a few files
have been written, the R program crashes.

I used the following code:
*************************
iFile<-"Test.txt"
con <- file(iFile, "r")

N<-1250000; 
iLoop<-1
 
while(length(Lines <- readLines(con, n = N)) > 0 & iLoop<41) { 
oFile<-paste("Split_",iLoop,".txt",sep="")
  write.table(Lines, oFile, sep = "\t", quote = FALSE, col.names= FALSE,
row.names = FALSE)
  iLoop<-iLoop+1
} 
close(con)
********************
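
As an aside (just a sketch): since readLines returns the raw tab-delimited
text, each chunk could be written back verbatim with writeLines in place of
the write.table() call above, avoiding the quoting and conversion work that
write.table does on a character vector:

writeLines(Lines, oFile)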

With the above N = 1.25 million, it wrote 28 files of about 57 MB each, i.e.,
a total of about 1.6 GB, and then crashed.
I tried other values of N and it crashes at about the same place in terms of
total output size, i.e., about 1.6 GB.

Is this due to some limitation of Windows 7, such as the file pointer not
advancing beyond this size?

Your insight would be very helpful.

Thank you.
HC






jim holtman
#
Exactly what does "crashed" mean?  What was the error message?  Have
you tried putting:

rm(Lines)
gc()

at the end of the loop to free up and compact memory?  If you watch
the performance, does the R process seem to be growing in terms of the
amount of memory that is being used?  You can add:

memory.size()

before the above statements to see how much memory is being used.
This is just some more elementary debugging that you will have to
learn when using any system.
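
For instance, folding those calls into the loop from your previous post
(a sketch; memory.size() is Windows-only):

while (length(Lines <- readLines(con, n = N)) > 0 & iLoop < 41) {
  oFile <- paste("Split_", iLoop, ".txt", sep = "")
  write.table(Lines, oFile, sep = "\t", quote = FALSE,
              col.names = FALSE, row.names = FALSE)
  print(memory.size())  # MB currently used by the R process (Windows only)
  rm(Lines); gc()       # drop the chunk and compact memory before the next read
  iLoop <- iLoop + 1
}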
HC
#
Thank you Jim for your reply.

I could figure out that readLines works fine up to 35,841,335 lines
(records).

When the next line is about to be read, a window appears with the message
"R for Windows GUI front-end has stopped working", with an option to close
the program or check online for a solution.

The tab-separated .txt file I am working on has 9 columns and about 3.25
billion rows. I suspect that the 35,841,336th line is so big that it outruns
the RAM capacity. Perhaps the tabs expected after each value are missing,
making that line extremely long, but I am not sure.

Is there any way to skip this line with readLines or any other function? I
am only hoping that the data after this bad line is good and that I can read
it if I can skip the bad one.

Thank you.
HC

#
Hi,
On Fri, Feb 3, 2012 at 1:12 PM, HC <hcatbr at yahoo.co.in> wrote:
It's easy to test, right? Just give it a shot and see ...

-steve
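
For what it's worth, a sketch of such a test, using the line count from the
post above: keep the connection open, read past the lines known to be good,
and then look at the suspect line and at what follows it.

con <- file("Test.txt", "r")
invisible(readLines(con, n = 35841335))   # the lines known to read cleanly
nxt <- readLines(con, n = 1)              # the suspect line itself
nchar(nxt)                                # how long is it, really?
more <- readLines(con, n = 5)             # does anything sensible follow?
close(con)

If that 35,841,336th line really is a huge run of text with no newline, even
the single readLines call on it may exhaust memory, which would at least
confirm the diagnosis.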