
Efficient way to do a merge in R

On Tue, Oct 4, 2011 at 12:40 AM, Rainer Schuermann
<rainer.schuermann at gmx.net> wrote:
I think this idea is a good one (though even match() could be slow
with 70 million observations).  As I understand it (experts may
correct me), the extraction and assignment methods for data frames end
up making some extra copies of the data, so I would consider using a
list instead.  You lose the built-in data frame check that all
variables have the same length (the same number of rows), but I think
a list is faster to work with.  If you know the indices in x where the
y values should go and the class of the y values (say numeric), then:
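# pre-allocate one slot per row of x (positions not in indices stay 0)
tmp <- vector("numeric", 70000000)
# drop the y values into their matching rows of x
tmp[indices] <- y$V5
x$V5 <- tmp
# remove the temporary vector and let R reclaim the memory
rm(tmp)
gc()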
and you're done.  This takes less than a minute to run on my little
laptop (8 GB RAM, 1.6 GHz dual core, only slightly faster than a
netbook).
Not a bad idea for working with large datasets either.
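
For what it is worth, here is a small sketch of how the indices might be
obtained with match() and then used as in the snippet above.  The toy data
frames and the key column name `id` are assumptions made up for
illustration, not the original poster's data.  Unlike the snippet above,
the temporary vector is filled with NA rather than 0, so that rows of x
with no partner in y are marked as missing; that is a choice on my part.

```r
# Toy stand-ins for the large tables; the key column `id` is hypothetical.
x <- data.frame(id = c(10, 20, 30, 40, 50), V1 = letters[1:5])
y <- data.frame(id = c(40, 20), V5 = c(4.4, 2.2))

# For each row of y, the position of the same key in x (NA if absent).
indices <- match(y$id, x$id)
found <- !is.na(indices)

# Pre-allocate with NA so unmatched rows of x are marked missing.
tmp <- rep(NA_real_, nrow(x))

# Assign only the y values whose key was actually found in x.
tmp[indices[found]] <- y$V5[found]
x$V5 <- tmp
rm(tmp)

print(x)
```

Filtering on `found` matters because R refuses NA subscripts in a
subscripted assignment when more than one value is being assigned.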