Efficient way to do a merge in R

I'll give it a shot. I don't have answers, only some ideas about which avenues I would explore, and I am by no
means an expert:

1. I would try to be more restrictive with the columns used for merge, trying something like
m1 <- merge( x, y, by.x = "V1", by.y = "V1", all = TRUE )

2. It may be an option to use match() directly:
indices <- match( y$V1, x$V1 )
That should give you a vector of 300,000 indices mapping the y values to their corresponding x records. I assume that 
there is always one record in y matching one record in x. You would still need to write some code to add the 
corresponding y values to a new column in x.

3. If that fails, and nobody else has a better idea, I would consider using a database engine for the job.
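
To make the match() idea in point 2 concrete, here is a minimal sketch on made-up toy data. It assumes that V1 alone identifies a record, that V1 is unique in x, and that every V1 in y also occurs in x; none of this has been checked against the real data, so treat it as an illustration only.

```r
# Toy stand-ins for the real data frames (values are invented).
x <- data.frame(V1 = c(10, 20, 30, 40, 50),
                V2 = c("a", "b", "c", "d", "e"),
                V3 = 1:5,
                V4 = 6:10)
y <- data.frame(V1 = c(40, 20),
                V2 = c("d", "b"),
                V5 = c(0.4, 0.2))

# For each row of y, the position of its V1 value in x.
# Assumption: every y$V1 occurs in x$V1; otherwise indices contains NA
# and the subscripted assignment below fails.
indices <- match(y$V1, x$V1)

# Start with NA everywhere, then fill in the rows of x that have a partner in y.
x$V5 <- NA_real_
x$V5[indices] <- y$V5

# Variant: look up from the x side. Rows of x without a partner in y get NA,
# and repeated V1 values in x are all filled.
idx_x <- match(x$V1, y$V1)
V5_alt <- y$V5[idx_x]
```

If V1 and V2 together are needed to identify a record, the same pattern should work on a combined key such as paste(x$V1, x$V2), although building that key for 70 million rows has its own cost.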

Again, no expert advice, just a few ideas!

Rgds,
Rainer
Dear all,
I am new to R and I am facing the following problem, which slows
me down a lot. I am short of ideas to circumvent it, so any help would be
highly appreciated:

I have 2 dataframes x and y.  x is very big (70 million observations),
whereas y is smaller (300000 observations).
All the observations of y are present in x. But y has one additional
variable that I would like to incorporate to the dataframe x.

For instance, imagine they have the following variable names:
colnames(x) <- c("V1", "V2", "V3", "V4") and
colnames(y) <- c("V1", "V2", "V5")

-Since the observations of y are present in x, my strategy was to merge x
and y so that the dataframe x would get the values of the variable V5 for
the observations that are both in x and y.

-So, I did the following:
dat <- merge(x, y, all=TRUE)

On a small example, it works fine. The only problem is that when I apply it
to my big dataframe x, it takes forever (several days and not done
yet), even though I have a very fast computer. So, I don't know whether I
should stop now or keep on waiting.

Does anyone have any idea to perform this operation in a more efficient way
(in terms of computation time)?
In addition, does anyone know how to incorporate some sort of counter in a
program to check how much work has been done at a given point in time?

Any comments are very welcome,
Thanks,

Best,
Aurelien


______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
