Prev 207789 / 398502 Next

RMySQL - Bulk loading data and creating FK links

Gabor Grothendieck

Wed, Jan 27, 2010 5:10 PM

How it represents data internally should not be important as long as
you can do what you want.  SQL is declarative so you just specify what
you want rather than how to get it and invisibly to the user it
automatically draws up a query plan and then uses that plan to get the
result.

On Wed, Jan 27, 2010 at 12:48 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:

sqldf("select * from BOD order by Time desc limit 3")

Exactly. SQL requires use of order by. It knows the order, but it isn't
ordered. Thats not good, but might be fine, depending on what the real goal
is.


"Gabor Grothendieck" <ggrothendieck at gmail.com> wrote in message
news:971536df1001270629w4795da89vb7d77af6e4e8be7f at mail.gmail.com...
On Wed, Jan 27, 2010 at 8:56 AM, Matthew Dowle <mdowle at mdowle.plus.com>
wrote:

How many columns, and of what type are the columns ? As Olga asked too, it
would be useful to know more about what you're really trying to do.

3.5m rows is not actually that many rows, even for 32bit R. Its depends on
the columns and what you want to do with those columns.

At the risk of suggesting something before we know the full facts, one
possibility is to load the data from flat file into data.table. Use
setkey()
to set your keys. Use tables() to summarise your various tables. Then do
your joins etc all-in-R. data.table has fast ways to do those sorts of
joins (but we need more info about your task).

Alternatively, you could check out the sqldf website. There is an
sqlread.csv (or similar name) which can read your files directly into SQL

read.csv.sql

instead of going via R. Gabor has some nice examples there about that and
its faster.

You use some buzzwords which makes me think that SQL may not be
appropriate
for your task though. Can't say for sure (because we don't have enough
information) but its possible you are struggling because SQL has no row
ordering concept built in. That might be why you've created an increment

In the SQLite database it automatically assigns a self incrementing
hidden column called rowid to each row. e.g. using SQLite via the
sqldf package on CRAN and the BOD data frame which is built into R we
can display the rowid column explicitly by referring to it in our
select statement:

library(sqldf)
BOD

?Time demand
1 ? ?1 ? ?8.3
2 ? ?2 ? 10.3
3 ? ?3 ? 19.0
4 ? ?4 ? 16.0
5 ? ?5 ? 15.6
6 ? ?7 ? 19.8

sqldf("select rowid, * from BOD")

?rowid Time demand
1 ? ? 1 ? ?1 ? ?8.3
2 ? ? 2 ? ?2 ? 10.3
3 ? ? 3 ? ?3 ? 19.0
4 ? ? 4 ? ?4 ? 16.0
5 ? ? 5 ? ?5 ? 15.6
6 ? ? 6 ? ?7 ? 19.8

field? Do your queries include "order by incrementing field"? SQL is not
good at "first" and "last" type logic. An all-in-R solution may well be

In SQLite you can get the top 3 values, say, like this (continuing the
prior example):

sqldf("select * from BOD order by Time desc limit 3")

?Time demand
1 ? ?7 ? 19.8
2 ? ?5 ? 15.6
3 ? ?4 ? 16.0

better, since R is very good with ordered vectors. A 1GB data.table (or
data.frame) for example, at 3.5m rows, could have 76 integer columns, or
38 double columns. 1GB is well within 32bit and allows some space for
working copies, depending on what you want to do with the data. If you
have
38 or less columns, or you have 64bit, then an all-in-R solution *might*
get your task done quicker, depending on what your real goal is.

If this sounds plausible, you could post more details and, if its
appropriate, and luck is on your side, someone might even sketch out how
to
do an all-in-R solution.


"Nathan S. Watson-Haigh" <nathan.watson-haigh at csiro.au> wrote in message
news:4B5FDE1B.10806 at csiro.au...

I have a table (contact) with several fields and it's PK is an auto
increment field. I'm bulk loading data to this table from files which if
successful will be about 3.5million rows (approx 16000 rows per file).
However, I have a linking table (an_contact) to resolve a m:m relationship
between the an and contact tables. How can I retrieve the PK's for the
data
bulk loaded into contact so I can insert the relevant data into
an_contact.

I currently load the data into contact using:
dbWriteTable(con, "contact", dat, append=TRUE, row.names=FALSE)

But I then need to get all the PK's which this dbWriteTable() appended to
the contact table so I can load the data into my an_contact link table. I
don't want to issue a separate INSERT query for each row in dat and then
use MySQLs LAST_INSERT_ID() function....not when I have 3.5million rows
to
insert!

Any pointers welcome,
Nathan

--
--------------------------------------------------------
Dr. Nathan S. Watson-Haigh
OCE Post Doctoral Fellow
CSIRO Livestock Industries
University Drive
Townsville, QLD 4810
Australia

Tel: +61 (0)7 4753 8548
Fax: +61 (0)7 4753 8600
Web: http://www.csiro.au/people/Nathan.Watson-Haigh.html

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Thread (11 messages)

Nathan S. Watson-Haigh RMySQL - Bulk loading data and creating FK links Jan 26 Matt Dowle RMySQL - Bulk loading data and creating FK links Jan 27 Gabor Grothendieck RMySQL - Bulk loading data and creating FK links Jan 27 Matt Dowle RMySQL - Bulk loading data and creating FK links Jan 27 Gabor Grothendieck RMySQL - Bulk loading data and creating FK links Jan 27 Matt Dowle RMySQL - Bulk loading data and creating FK links Jan 28 Gabor Grothendieck RMySQL - Bulk loading data and creating FK links Jan 28 Matt Dowle RMySQL - Bulk loading data and creating FK links Jan 28 Gabor Grothendieck RMySQL - Bulk loading data and creating FK links Jan 28 Matt Dowle RMySQL - Bulk loading data and creating FK links Jan 28 Gabor Grothendieck RMySQL - Bulk loading data and creating FK links Jan 28