Efficient way to do a merge in R
I'll give it a shot, although I have no definitive answers, only some ideas about avenues I would explore (I am not an expert at all):

1. I would try to be more restrictive with the columns used for the merge, trying something like

       m1 <- merge( x, y, by.x = "V1", by.y = "V1", all = TRUE )

2. It may be an option to use match() directly:

       indices <- match( y$V1, x$V1 )

   That should give you a vector of 300,000 indices mapping the y values to their corresponding x records. I assume that each record in y matches exactly one record in x. You would still need to write some code to add the corresponding y values to a new column in x.

3. If that fails, and nobody else has a better idea, I would consider using a database engine for the job.

Again, no expert advice, just a few ideas!

Rgds,
Rainer
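A minimal sketch of idea 2 on toy data, in case it helps. The values in x and y are made up, and the sketch assumes that V1 alone identifies a record, that each V1 value occurs only once in x, and that every y record is present in x (as stated in the question). It has not been tried on anything close to 70 million rows.

```r
# Toy stand-ins for the two data frames (all values are invented).
x <- data.frame(V1 = c(1, 2, 3, 4, 5),
                V2 = c("a", "b", "c", "d", "e"),
                V3 = 11:15,
                V4 = 21:25,
                stringsAsFactors = FALSE)
y <- data.frame(V1 = c(2, 4),
                V2 = c("b", "d"),
                V5 = c(100, 200),
                stringsAsFactors = FALSE)

# Position in x of each y record (first match only).
# Assumption: every y$V1 occurs in x$V1, so 'indices' contains no NA.
indices <- match(y$V1, x$V1)

# Start with an all-NA column, then fill in the rows that have a partner in y.
x$V5 <- NA_real_
x$V5[indices] <- y$V5

print(x)
```

If V1 can repeat in x, the reverse lookup x$V5 <- y$V5[match(x$V1, y$V1)] would fill every matching row of x rather than only the first one. If V1 and V2 together form the key, matching on a pasted key such as paste(x$V1, x$V2) might work, at some cost in memory.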
On Tuesday 04 October 2011 01:01:45 Aurélien PHILIPPOT wrote:
Dear all,
I am new to R and I have been faced with the following problem, which slows
me down a lot. I am short of ideas to circumvent it. So, any help would be
highly appreciated:
I have 2 dataframes x and y. x is very big (70 million observations),
whereas y is smaller (300000 observations).
All the observations of y are present in x. But y has one additional
variable that I would like to incorporate into the dataframe x.
For instance, imagine they have the following variable names:
colnames(x)<- c("V1", "V2", "V3", "V4") and colnames(y)<- c("V1", "V2",
"V5")
-Since the observations of y are present in x, my strategy was to merge x
and y so that the dataframe x would get the values of the variable V5 for
the observations that are both in x and y.
-So, I did the following:
dat<- merge(x, y, all=TRUE).
On a small example, it works fine. The only problem is that when I apply it
to my big dataframe x, it really takes forever (several days and not done
yet), and I have a very fast computer. So, I don't know whether I should
stop now or keep on waiting.
Does anyone have an idea of how to perform this operation more efficiently
(in terms of computation time)?
In addition, does anyone know how to incorporate some sort of counter in a
program to check how much work has been done at a given point in time?
Any comments are very welcome,
Thanks,
Best,
Aurelien
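On the counter question: as far as I know, merge() itself does not report progress, so one possible approach is to do the lookup in chunks and update a text progress bar after each chunk. The sketch below uses txtProgressBar() from the utils package on made-up toy data. The chunk size is a hypothetical value chosen only for the illustration, and V1 is assumed to be the only key.

```r
# Toy stand-ins for the two data frames (all values are invented).
x <- data.frame(V1 = c(1, 2, 3, 4, 5), V3 = 11:15)
y <- data.frame(V1 = c(2, 4), V5 = c(100, 200))

n <- nrow(x)
chunk_size <- 2                      # hypothetical; far larger for real data
starts <- seq(1, n, by = chunk_size) # first row of each chunk

# Collect results in a plain vector and attach it to x once at the end.
v5 <- rep(NA_real_, n)

pb <- txtProgressBar(min = 0, max = length(starts), style = 3)
for (i in seq_along(starts)) {
  rows <- starts[i]:min(starts[i] + chunk_size - 1, n)
  # Look up the y row for each x row in this chunk (NA if there is none).
  v5[rows] <- y$V5[match(x$V1[rows], y$V1)]
  setTxtProgressBar(pb, i)           # report that chunk i is done
}
close(pb)

x$V5 <- v5
```

Chunking adds some overhead of its own, so it is mainly useful for seeing whether the job is advancing at all.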
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.