Efficient way to do a merge in R

4 messages · Aurélien PHILIPPOT, Rainer Schuermann, Joshua Wiley +1 more

#
I'll give it a shot, although I don't have answers, only some ideas about which avenues I would explore,
not being an expert at all:

1. I would try to be more restrictive with the columns used for merge, trying something like
m1 <- merge( x, y, by.x = "V1", by.y = "V1", all = TRUE )

2. It may be an option to use match() directly:
indices <- match( y$V1, x$V1 )
That should give you a vector of 300,000 indices mapping each y value to its corresponding x record. I assume that
every record in y matches exactly one record in x (match() returns NA for a y value with no match in x). You would
still need to write some code to add the corresponding y values to a new column in x.

3. If that fails, and nobody else has a better idea, I would consider using a database engine for the job.
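
Ideas 1 and 2 can be sketched on toy data as follows. This is only an illustration: the two small data frames, the value column V5 in y, and the helper variable ok are assumptions made for the example, not the poster's actual data.

```r
# Toy stand-ins for the large tables (hypothetical values).
x <- data.frame(V1 = c("a", "b", "c", "d"), V2 = 1:4, stringsAsFactors = FALSE)
y <- data.frame(V1 = c("c", "a"), V5 = c(30, 10), stringsAsFactors = FALSE)

# Idea 1: merge on the key column only, keeping all rows of both tables.
m1 <- merge(x, y, by = "V1", all = TRUE)

# Idea 2: for each row of y, find the position of its key in x.
indices <- match(y$V1, x$V1)  # NA would mark a y key that is absent from x

# Copy the y values into a new column of x at the matched positions,
# skipping any unmatched y rows.
x$V5 <- NA_real_
ok <- !is.na(indices)
x$V5[indices[ok]] <- y$V5[ok]
```

Rows of x without a partner in y keep NA in V5, which is also what merge() with all = TRUE produces for them.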

Again, no expert advice, just a few ideas!

Rgds,
Rainer
On Tuesday 04 October 2011 01:01:45 Aurélien PHILIPPOT wrote:
#
On Tue, Oct 4, 2011 at 12:40 AM, Rainer Schuermann
<rainer.schuermann at gmx.net> wrote:
I think this idea is a good one (though even match could be slow with
70 million observations).  I believe the extraction and assignment
methods for data frames end up making some extra copies of the data (at
least this is my understanding, experts may correct me), so I would
consider using a list instead.  You lose the built-in data frame check
that all variables have the same length (same number of rows), but I
think it makes it faster to work with.  If you know the indices in x
where the y values should go and the class of y (say numeric) then:
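# Preallocate one numeric slot per row of x (initialized to 0)
tmp <- vector("numeric", 70000000)
# Place each y value at the row of x it matched
tmp[indices] <- y$V5
x$V5 <- tmp
# Drop the temporary and release its memory
rm(tmp)
gc()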
and you're done.  This takes less than a minute to run on my little laptop
(8 GB RAM, 1.6 GHz dual core, only slightly faster than a netbook).
Not a bad idea for working with large datasets either.
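
A scaled-down sketch of the list-based variant described above, for illustration only. The sizes, the integer keys, the name xl, and the choice to fill unmatched rows with NA rather than 0 are all assumptions of this example.

```r
set.seed(1)
n <- 100000L                       # stand-in for the 70 million rows

# Hold the columns of x in a plain list instead of a data frame.
xl <- list(V1 = seq_len(n))

# Toy y with 1000 unique keys, all of which are present in x.
y <- data.frame(V1 = sample(n, 1000), V5 = rnorm(1000))

indices <- match(y$V1, xl$V1)      # no NA here because every key exists in x

tmp <- rep(NA_real_, n)            # unmatched rows stay NA in this sketch
tmp[indices] <- y$V5
xl$V5 <- tmp
rm(tmp)

# A list does not enforce equal column lengths, so check by hand.
stopifnot(length(unique(lengths(xl))) == 1L)

# Convert to a data frame only once all columns are in place.
x <- as.data.frame(xl)
```

If some y keys might be missing from x, indices would contain NA and the assignment into tmp would fail, so those rows would need to be filtered out first.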

  
    
#
"Joshua Wiley" <jwiley.psych at gmail.com> wrote in message 
news:CANz9Z_KopuwkzB-zxr96PVuLhHf2ZNxNtxSO9xnyhO-_JUMkcQ at mail.gmail.com...
or, the data.table package
http://datatable.r-forge.r-project.org/
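
For illustration, a minimal sketch of how the same lookup might be written with data.table, assuming the package is installed and recent enough to support the on= argument. The toy tables are hypothetical; the column names follow the thread.

```r
library(data.table)

# Toy stand-ins for the large tables (hypothetical values).
x <- data.table(V1 = c("a", "b", "c", "d"), V2 = 1:4)
y <- data.table(V1 = c("c", "a"), V5 = c(30, 10))

# Update join: add column V5 to x by reference, matching rows on V1.
# i.V5 refers to the V5 column of y (the table in the i position).
x[y, on = "V1", V5 := i.V5]
```

Because := modifies x in place, no merged copy of the large table is created; rows of x without a match in y get NA in V5.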

Matthew