Efficient way to do a merge in R

On Tue, Oct 4, 2011 at 12:40 AM, Rainer Schuermann
Any comments are very welcome,
So I give it a shot, although I don't have answers but only some ideas which avenues I would explore, not being an
expert at all:

1. I would try to be more restrictive with the columns used for merge, trying something like
m1 <- merge( x, y, by.x = "V1", by.y = "V1", all = TRUE )

2. It may be an option to use match() directly:
indices <- match( y$V1, x$V1 )
That should give you a vector of 300,000 indices mapping the y values to their corresponding x records. I assume that
there is always one record in y matching one record in x. You would still need to write some code to add the
corresponding y values to a new column in x.
I think this idea is a good one (though even match could be slow with
70 million observations).  I believe related to the extraction and
assignment methods for data frames, some extra copies of data end up
being made (at least this is my understanding, experts may correct
me), so I would consider possibly using a list (you lose the builtin
data frame checking that all variables are of the same length (same
number of rows), but I think it makes it faster to work with.  If you
know the indices in x where the y values should go and the class of y
(say numeric) then:
tmp <- vector("numeric", 70000000)
tmp[indices] <- y$V5
x$V5 <- tmp
rm(tmp)
gc()
and you're done.  Takes less than a minute to run on my little laptop
(8GB RAM, 1.6ghz dual core, only slightly faster than a netbook).
3. If that fails, and nobody else has a better idea, I would consider using a database engine for the job.
Not a bad idea for working with large datasets either.
Again, no expert advice, just a few ideas!

Rgds,
Rainer

On Tuesday 04 October 2011 01:01:45 Aur?lien PHILIPPOT wrote:
Dear all,
I am new in R and I have been faced with the following problem, that slows
me down a lot. ?I am short of ideas to circumvent it. So, any help would be
highly appreciated:

I have 2 dataframes x and y. ?x is very big (70 million observations),
whereas y is smaller (300000 observations).
All the observations of y are present in x. But y has one additional
variable that I would like to incorporate to the dataframe x.

For instance, imagine they have the following variable names:
colnames(x)<- c("V1", "V2", "V3", "V4") and colnames(y)<- c("V1", "V2",
"V5")

-Since the observations of y are present in x, my strategy was to merge x
and y so that the dataframe x would get the values of the variable V5 for
the observations that are both in x and y.

-So, I did the following:
dat<- merge(x, y, all=TRUE).

On a small example, it works fine. The only problem is that when I apply it
to my big dataframe x, it really take for ever (several days and not done
yet) and I have a very ?fast computer. So, I don't know whether I should
stop now or keep on waiting.

Does anyone have any idea to perform this operation in a more efficient way
(in terms of computation time)?
In addition, does anyone know how to incoporate some sort of counter in a
program to check what how much work has been done at a given point of time?

Any comments are very welcome,
Thanks,

Best,
Aurelien

? ? ? [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/

Efficient way to do a merge in R

Thread (4 messages)