Skip to content
Prev 241638 / 398500 Next

How to Read a Large CSV into a Database with R

On Mon, Nov 15, 2010 at 11:46 AM, Anthony Damico <ajdamico at gmail.com> wrote:
OK. Maybe its something else.

The reading in of the file into the database should not be a resource
problem provided you have enough disk space and appropriate
permissions. sqldf / RSQLite are used to get sqlite to do it so that
the data never goes through R at that stage so R limitations can't
affect the reading in to the sqlite database.   When you read it from
the sqlite database then R limitations come into effect so you just
have to be sure not to read too much in at a time.  The use of create
table ... as select ... is to prevent sqldf from deleting the table
since sqldf is normally used in a fashion where you don't want to know
about the back end databases so it tries to create them and delete
them behind the scenes but here you want to explicitly use them so
you have to work around that.

Try this example. It should be reproducible so you just have to copy
it and paste it into your R session.  Uncomment the indicated line if
you want to be able to remove any pre-existing mydb file in the
current directory.  Try it in a fresh R session just to be sure that
nothing mucks it up.

library(sqldf)

# uncomment next line to make sure we are starting clean
# if (file.exists("mydb")) file.remove("mydb")

# create new database
sqldf("attach 'mydb' as new")

# create a new file.  BOD is built into R and has 6 rows.
write.table(BOD, file = "tmp.csv", quote = FALSE, sep = ",")

# read new file into database
read.csv.sql("tmp.csv", sql = "create table mytab as select * from file",
	dbname = "mydb")

# how many records are in table?
N <- sqldf("select count(*) from mytab", dbname = "mydb")[[1]]

# read in chunks and display what we have read

k <- 4 # no of records to read at once
for(i in seq(0, N-1, k)) {
	s <- sprintf("select * from mytab limit %d, %d", i, k)
	print(sqldf(s, dbname = "mydb"))
}

On my machine I get this output:

  Time demand
1    1    8.3
2    2   10.3
3    3   19.0
4    4   16.0
  Time demand
1    5   15.6
2    7   19.8

showing that it read the 6 line BOD data frame in chunks of 4 as required.