Dear R-help ? would like to know if there is a short solution in R for this merging problem... Let say I have a dataset A as: TYPE DATE A 2 A 5 A 20 B 10 B 2 (there can be duplicates for the same type and date) and I have another dataset B as : TYPE Special_Date A 2 A 6 A 20 A 22 B 5 B 6 The question is : I would like to obtain the difference between the date of each observation in A and the closest special date in B with the same type. In case of ties I would take the latest date of the two. For example I would obtain here TYPE DATE Difference A 2 0=2-2 A 5 -1=5-6 A 20 0=20-20 B 10 +4=10-6 B 2 -3=2-5 Do you know how to (simply?) obtain this in R? Many thanks! Best Regards
merging and obtaining the nearest value
11 messages · Francesco, William Dunlap, Rui Barradas
Hello,
Try the following.
A <- read.table(text="
TYPE DATE
A 2
A 5
A 20
B 10
B 2
", header = TRUE)
B <- read.table(text="
TYPE Special_Date
A 2
A 6
A 20
A 22
B 5
B 6
", header = TRUE)
result <- do.call( rbind, lapply(split(merge(A, B), list(m$DATE,
m$TYPE)), function(x){
a <- abs(x$DATE - x$Special_Date)
if(nrow(x)) x[which(min(a) == a), ] }) )
result$Difference <- result$DATE - result$Special_Date
result$Special_Date <- NULL
rownames(result) <- seq_len(nrow(result))
result
Also, it's a good practice to post data examples using dput(). For instance,
dput(A)
structure(list(TYPE = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), DATE = c(2L, 5L, 20L, 10L, 2L)), .Names =
c("TYPE",
"DATE"), class = "data.frame", row.names = c(NA, -5L))
Now all we have to do is run the statement A <- structure(... etc...) to
have an exact copy of the data example.
Anyway, your example with input and the wanted result was very welcome.
Hope this helps,
Rui Barradas
Em 19-08-2012 11:10, Francesco escreveu:
Dear R-help ? would like to know if there is a short solution in R for this merging problem... Let say I have a dataset A as: TYPE DATE A 2 A 5 A 20 B 10 B 2 (there can be duplicates for the same type and date) and I have another dataset B as : TYPE Special_Date A 2 A 6 A 20 A 22 B 5 B 6 The question is : I would like to obtain the difference between the date of each observation in A and the closest special date in B with the same type. In case of ties I would take the latest date of the two. For example I would obtain here TYPE DATE Difference A 2 0=2-2 A 5 -1=5-6 A 20 0=20-20 B 10 +4=10-6 B 2 -3=2-5 Do you know how to (simply?) obtain this in R? Many thanks! Best Regards
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Dear Riu, Many thanks for your suggestion However these are just simplified examples... in reality the dataset A contains millions of observations and B several thousands of rows... Could I still use a modified form of your suggestion? Thanks
On 19 August 2012 12:51, Rui Barradas <ruipbarradas at sapo.pt> wrote:
Hello,
Try the following.
A <- read.table(text="
TYPE DATE
A 2
A 5
A 20
B 10
B 2
", header = TRUE)
B <- read.table(text="
TYPE Special_Date
A 2
A 6
A 20
A 22
B 5
B 6
", header = TRUE)
result <- do.call( rbind, lapply(split(merge(A, B), list(m$DATE, m$TYPE)),
function(x){
a <- abs(x$DATE - x$Special_Date)
if(nrow(x)) x[which(min(a) == a), ] }) )
result$Difference <- result$DATE - result$Special_Date
result$Special_Date <- NULL
rownames(result) <- seq_len(nrow(result))
result
Also, it's a good practice to post data examples using dput(). For instance,
dput(A)
structure(list(TYPE = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), DATE = c(2L, 5L, 20L, 10L, 2L)), .Names = c("TYPE",
"DATE"), class = "data.frame", row.names = c(NA, -5L))
Now all we have to do is run the statement A <- structure(... etc...) to
have an exact copy of the data example.
Anyway, your example with input and the wanted result was very welcome.
Hope this helps,
Rui Barradas
Em 19-08-2012 11:10, Francesco escreveu:
Dear R-help ? would like to know if there is a short solution in R for this merging problem... Let say I have a dataset A as: TYPE DATE A 2 A 5 A 20 B 10 B 2 (there can be duplicates for the same type and date) and I have another dataset B as : TYPE Special_Date A 2 A 6 A 20 A 22 B 5 B 6 The question is : I would like to obtain the difference between the date of each observation in A and the closest special date in B with the same type. In case of ties I would take the latest date of the two. For example I would obtain here TYPE DATE Difference A 2 0=2-2 A 5 -1=5-6 A 20 0=20-20 B 10 +4=10-6 B 2 -3=2-5 Do you know how to (simply?) obtain this in R? Many thanks! Best Regards
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Hello, Yes you can, if you have memory problems, say so and we'll see it then. In the mean time, there's something you should change, to allow for several minima but to only return one per combination of TYPE and DATE. Replace this x[which(min(a) == a), ] by this x[which.min(a), ] Rui Barradas Em 19-08-2012 12:00, Francesco escreveu:
Dear Riu, Many thanks for your suggestion However these are just simplified examples... in reality the dataset A contains millions of observations and B several thousands of rows... Could I still use a modified form of your suggestion? Thanks On 19 August 2012 12:51, Rui Barradas <ruipbarradas at sapo.pt> wrote:
Hello,
Try the following.
A <- read.table(text="
TYPE DATE
A 2
A 5
A 20
B 10
B 2
", header = TRUE)
B <- read.table(text="
TYPE Special_Date
A 2
A 6
A 20
A 22
B 5
B 6
", header = TRUE)
result <- do.call( rbind, lapply(split(merge(A, B), list(m$DATE, m$TYPE)),
function(x){
a <- abs(x$DATE - x$Special_Date)
if(nrow(x)) x[which(min(a) == a), ] }) )
result$Difference <- result$DATE - result$Special_Date
result$Special_Date <- NULL
rownames(result) <- seq_len(nrow(result))
result
Also, it's a good practice to post data examples using dput(). For instance,
dput(A)
structure(list(TYPE = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), DATE = c(2L, 5L, 20L, 10L, 2L)), .Names = c("TYPE",
"DATE"), class = "data.frame", row.names = c(NA, -5L))
Now all we have to do is run the statement A <- structure(... etc...) to
have an exact copy of the data example.
Anyway, your example with input and the wanted result was very welcome.
Hope this helps,
Rui Barradas
Em 19-08-2012 11:10, Francesco escreveu:
Dear R-help ? would like to know if there is a short solution in R for this merging problem... Let say I have a dataset A as: TYPE DATE A 2 A 5 A 20 B 10 B 2 (there can be duplicates for the same type and date) and I have another dataset B as : TYPE Special_Date A 2 A 6 A 20 A 22 B 5 B 6 The question is : I would like to obtain the difference between the date of each observation in A and the closest special date in B with the same type. In case of ties I would take the latest date of the two. For example I would obtain here TYPE DATE Difference A 2 0=2-2 A 5 -1=5-6 A 20 0=20-20 B 10 +4=10-6 B 2 -3=2-5 Do you know how to (simply?) obtain this in R? Many thanks! Best Regards
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Thank you very much Rui
On 19 August 2012 13:49, Rui Barradas <ruipbarradas at sapo.pt> wrote:
Hello, Yes you can, if you have memory problems, say so and we'll see it then. In the mean time, there's something you should change, to allow for several minima but to only return one per combination of TYPE and DATE. Replace this x[which(min(a) == a), ] by this x[which.min(a), ] Rui Barradas Em 19-08-2012 12:00, Francesco escreveu:
Dear Riu, Many thanks for your suggestion However these are just simplified examples... in reality the dataset A contains millions of observations and B several thousands of rows... Could I still use a modified form of your suggestion? Thanks On 19 August 2012 12:51, Rui Barradas <ruipbarradas at sapo.pt> wrote:
Hello,
Try the following.
A <- read.table(text="
TYPE DATE
A 2
A 5
A 20
B 10
B 2
", header = TRUE)
B <- read.table(text="
TYPE Special_Date
A 2
A 6
A 20
A 22
B 5
B 6
", header = TRUE)
result <- do.call( rbind, lapply(split(merge(A, B), list(m$DATE,
m$TYPE)),
function(x){
a <- abs(x$DATE - x$Special_Date)
if(nrow(x)) x[which(min(a) == a), ] }) )
result$Difference <- result$DATE - result$Special_Date
result$Special_Date <- NULL
rownames(result) <- seq_len(nrow(result))
result
Also, it's a good practice to post data examples using dput(). For
instance,
dput(A)
structure(list(TYPE = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), DATE = c(2L, 5L, 20L, 10L, 2L)), .Names =
c("TYPE",
"DATE"), class = "data.frame", row.names = c(NA, -5L))
Now all we have to do is run the statement A <- structure(... etc...) to
have an exact copy of the data example.
Anyway, your example with input and the wanted result was very welcome.
Hope this helps,
Rui Barradas
Em 19-08-2012 11:10, Francesco escreveu:
Dear R-help ? would like to know if there is a short solution in R for this merging problem... Let say I have a dataset A as: TYPE DATE A 2 A 5 A 20 B 10 B 2 (there can be duplicates for the same type and date) and I have another dataset B as : TYPE Special_Date A 2 A 6 A 20 A 22 B 5 B 6 The question is : I would like to obtain the difference between the date of each observation in A and the closest special date in B with the same type. In case of ties I would take the latest date of the two. For example I would obtain here TYPE DATE Difference A 2 0=2-2 A 5 -1=5-6 A 20 0=20-20 B 10 +4=10-6 B 2 -3=2-5 Do you know how to (simply?) obtain this in R? Many thanks! Best Regards
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
How many different types are there? Just a handful or many thousands? For this sort of problem it is often handy to write a function which generates datasets of the sort you are thinking of but parameterized by the number of rows, levels, etc., so you can see how the execution time varies with these things. If there are just a few types, try looping over types and using findInterval to see where A$Date fits into the sequence of B$Special_Date. Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Francesco Sent: Sunday, August 19, 2012 4:01 AM To: r-help at r-project.org Subject: Re: [R] merging and obtaining the nearest value Dear Riu, Many thanks for your suggestion However these are just simplified examples... in reality the dataset A contains millions of observations and B several thousands of rows... Could I still use a modified form of your suggestion? Thanks On 19 August 2012 12:51, Rui Barradas <ruipbarradas at sapo.pt> wrote:
Hello,
Try the following.
A <- read.table(text="
TYPE DATE
A 2
A 5
A 20
B 10
B 2
", header = TRUE)
B <- read.table(text="
TYPE Special_Date
A 2
A 6
A 20
A 22
B 5
B 6
", header = TRUE)
result <- do.call( rbind, lapply(split(merge(A, B), list(m$DATE, m$TYPE)),
function(x){
a <- abs(x$DATE - x$Special_Date)
if(nrow(x)) x[which(min(a) == a), ] }) )
result$Difference <- result$DATE - result$Special_Date
result$Special_Date <- NULL
rownames(result) <- seq_len(nrow(result))
result
Also, it's a good practice to post data examples using dput(). For instance,
dput(A)
structure(list(TYPE = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), DATE = c(2L, 5L, 20L, 10L, 2L)), .Names = c("TYPE",
"DATE"), class = "data.frame", row.names = c(NA, -5L))
Now all we have to do is run the statement A <- structure(... etc...) to
have an exact copy of the data example.
Anyway, your example with input and the wanted result was very welcome.
Hope this helps,
Rui Barradas
Em 19-08-2012 11:10, Francesco escreveu:
Dear R-help ? would like to know if there is a short solution in R for this merging problem... Let say I have a dataset A as: TYPE DATE A 2 A 5 A 20 B 10 B 2 (there can be duplicates for the same type and date) and I have another dataset B as : TYPE Special_Date A 2 A 6 A 20 A 22 B 5 B 6 The question is : I would like to obtain the difference between the date of each observation in A and the closest special date in B with the same type. In case of ties I would take the latest date of the two. For example I would obtain here TYPE DATE Difference A 2 0=2-2 A 5 -1=5-6 A 20 0=20-20 B 10 +4=10-6 B 2 -3=2-5 Do you know how to (simply?) obtain this in R? Many thanks! Best Regards
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
The following, f2(A,B), should do well with lots of rows in A and B
as long as the number of types is not huge.
f2 <- function(A, B) {
types <- as.character(unique(A$TYPE))
result <- numeric(nrow(A))
Bs <- split(B$Special_Date, B$TYPE)
for(type in types) {
w <- A$TYPE == type
# can omit the sort() below if you know that B$Special_Date is sorted.
result[w] <- closestValue(A$DATE[w], sort(Bs[[type]]))
}
A$Difference <- A$DATE - result
A
}
closestValue <- function (x, vec)
{
# for each value in x, find closest value in vec.
# Break ties by using highest.
# Assume vec is sorted.
intervalNo <- findInterval(x, vec)
lowerValue <- vec[pmax(1, intervalNo)]
upperValue <- vec[pmin(length(vec), intervalNo+1)]
ifelse(x - lowerValue < upperValue - x, lowerValue, upperValue)
}
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of William Dunlap Sent: Sunday, August 19, 2012 9:43 AM To: Francesco; r-help at r-project.org Subject: Re: [R] merging and obtaining the nearest value How many different types are there? Just a handful or many thousands? For this sort of problem it is often handy to write a function which generates datasets of the sort you are thinking of but parameterized by the number of rows, levels, etc., so you can see how the execution time varies with these things. If there are just a few types, try looping over types and using findInterval to see where A$Date fits into the sequence of B$Special_Date. Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Francesco Sent: Sunday, August 19, 2012 4:01 AM To: r-help at r-project.org Subject: Re: [R] merging and obtaining the nearest value Dear Riu, Many thanks for your suggestion However these are just simplified examples... in reality the dataset A contains millions of observations and B several thousands of rows... Could I still use a modified form of your suggestion? Thanks On 19 August 2012 12:51, Rui Barradas <ruipbarradas at sapo.pt> wrote:
Hello,
Try the following.
A <- read.table(text="
TYPE DATE
A 2
A 5
A 20
B 10
B 2
", header = TRUE)
B <- read.table(text="
TYPE Special_Date
A 2
A 6
A 20
A 22
B 5
B 6
", header = TRUE)
result <- do.call( rbind, lapply(split(merge(A, B), list(m$DATE, m$TYPE)),
function(x){
a <- abs(x$DATE - x$Special_Date)
if(nrow(x)) x[which(min(a) == a), ] }) )
result$Difference <- result$DATE - result$Special_Date
result$Special_Date <- NULL
rownames(result) <- seq_len(nrow(result))
result
Also, it's a good practice to post data examples using dput(). For instance,
dput(A)
structure(list(TYPE = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), DATE = c(2L, 5L, 20L, 10L, 2L)), .Names = c("TYPE",
"DATE"), class = "data.frame", row.names = c(NA, -5L))
Now all we have to do is run the statement A <- structure(... etc...) to
have an exact copy of the data example.
Anyway, your example with input and the wanted result was very welcome.
Hope this helps,
Rui Barradas
Em 19-08-2012 11:10, Francesco escreveu:
Dear R-help ? would like to know if there is a short solution in R for this merging problem... Let say I have a dataset A as: TYPE DATE A 2 A 5 A 20 B 10 B 2 (there can be duplicates for the same type and date) and I have another dataset B as : TYPE Special_Date A 2 A 6 A 20 A 22 B 5 B 6 The question is : I would like to obtain the difference between the date of each observation in A and the closest special date in B with the same type. In case of ties I would take the latest date of the two. For example I would obtain here TYPE DATE Difference A 2 0=2-2 A 5 -1=5-6 A 20 0=20-20 B 10 +4=10-6 B 2 -3=2-5 Do you know how to (simply?) obtain this in R? Many thanks! Best Regards
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
And the following, f4, uses the same algorithm as f2 but codes
it somewhat more efficiently. It uses the same closestValue()
function.
f4 <- function (A, B) {
A$TYPE <- as.factor(A$TYPE)
uA <- levels(A$TYPE)
As <- split(A$DATE, A$TYPE)
Bs <- split(B$Special_Date, factor(B$TYPE, levels = uA))
closest <- numeric(nrow(A))
split(closest, A$TYPE) <- mapply(closestValue, As, Bs)
A$Difference <- A$DATE - closest
A
}
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
Of William Dunlap
Sent: Sunday, August 19, 2012 1:49 PM
To: Francesco; r-help at r-project.org
Subject: Re: [R] merging and obtaining the nearest value
The following, f2(A,B), should do well with lots of rows in A and B
as long as the number of types is not huge.
f2 <- function(A, B) {
types <- as.character(unique(A$TYPE))
result <- numeric(nrow(A))
Bs <- split(B$Special_Date, B$TYPE)
for(type in types) {
w <- A$TYPE == type
# can omit the sort() below if you know that B$Special_Date is sorted.
result[w] <- closestValue(A$DATE[w], sort(Bs[[type]]))
}
A$Difference <- A$DATE - result
A
}
closestValue <- function (x, vec)
{
# for each value in x, find closest value in vec.
# Break ties by using highest.
# Assume vec is sorted.
intervalNo <- findInterval(x, vec)
lowerValue <- vec[pmax(1, intervalNo)]
upperValue <- vec[pmin(length(vec), intervalNo+1)]
ifelse(x - lowerValue < upperValue - x, lowerValue, upperValue)
}
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of William Dunlap Sent: Sunday, August 19, 2012 9:43 AM To: Francesco; r-help at r-project.org Subject: Re: [R] merging and obtaining the nearest value How many different types are there? Just a handful or many thousands? For this sort of problem it is often handy to write a function which generates datasets of the sort you are thinking of but parameterized by the number of rows, levels, etc., so you can see how the execution time varies with these things. If there are just a few types, try looping over types and using findInterval to see where A$Date fits into the sequence of B$Special_Date. Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf
Of Francesco Sent: Sunday, August 19, 2012 4:01 AM To: r-help at r-project.org Subject: Re: [R] merging and obtaining the nearest value Dear Riu, Many thanks for your suggestion However these are just simplified examples... in reality the dataset A contains millions of observations and B several thousands of rows... Could I still use a modified form of your suggestion? Thanks On 19 August 2012 12:51, Rui Barradas <ruipbarradas at sapo.pt> wrote:
Hello,
Try the following.
A <- read.table(text="
TYPE DATE
A 2
A 5
A 20
B 10
B 2
", header = TRUE)
B <- read.table(text="
TYPE Special_Date
A 2
A 6
A 20
A 22
B 5
B 6
", header = TRUE)
result <- do.call( rbind, lapply(split(merge(A, B), list(m$DATE, m$TYPE)),
function(x){
a <- abs(x$DATE - x$Special_Date)
if(nrow(x)) x[which(min(a) == a), ] }) )
result$Difference <- result$DATE - result$Special_Date
result$Special_Date <- NULL
rownames(result) <- seq_len(nrow(result))
result
Also, it's a good practice to post data examples using dput(). For instance,
dput(A)
structure(list(TYPE = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), DATE = c(2L, 5L, 20L, 10L, 2L)), .Names = c("TYPE",
"DATE"), class = "data.frame", row.names = c(NA, -5L))
Now all we have to do is run the statement A <- structure(... etc...) to
have an exact copy of the data example.
Anyway, your example with input and the wanted result was very welcome.
Hope this helps,
Rui Barradas
Em 19-08-2012 11:10, Francesco escreveu:
Dear R-help ? would like to know if there is a short solution in R for this merging problem... Let say I have a dataset A as: TYPE DATE A 2 A 5 A 20 B 10 B 2 (there can be duplicates for the same type and date) and I have another dataset B as : TYPE Special_Date A 2 A 6 A 20 A 22 B 5 B 6 The question is : I would like to obtain the difference between the date of each observation in A and the closest special date in B with the same type. In case of ties I would take the latest date of the two. For example I would obtain here TYPE DATE Difference A 2 0=2-2 A 5 -1=5-6 A 20 0=20-20 B 10 +4=10-6 B 2 -3=2-5 Do you know how to (simply?) obtain this in R? Many thanks! Best Regards
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Hello,
You're right, your solution is much faster, but it doesn't remove
duplicates.
When I ran f4() with larger datasets it poduced an error,
Error in findInterval(x, vec) : 'vec' must be sorted non-decreasingly
So here they all are.
f1 <- function(A, B){
m <- merge(A, B)
result <- do.call( rbind, lapply(split(m, list(m$DATE, m$TYPE)),
function(x){
if(nrow(x)){
a <- abs(x$DATE - x$Special_Date)
x[which.min(a), ] }}) )
result$Difference <- result$DATE - result$Special_Date
result$Special_Date <- NULL
rownames(result) <- seq_len(nrow(result))
result
}
closestValue <- function (x, vec)
{
# for each value in x, find closest value in vec.
# Break ties by using highest.
# Assume vec is sorted.
intervalNo <- findInterval(x, vec)
lowerValue <- vec[pmax(1, intervalNo)]
upperValue <- vec[pmin(length(vec), intervalNo+1)]
ifelse(x - lowerValue < upperValue - x, lowerValue, upperValue)
}
f4 <- function (A, B) {
A$TYPE <- as.factor(A$TYPE)
uA <- levels(A$TYPE)
As <- split(A$DATE, A$TYPE)
B <- B[order(B$TYPE, B$Special_Date), ]
Bs <- split(B$Special_Date, factor(B$TYPE, levels = uA))
closest <- numeric(nrow(A))
split(closest, A$TYPE) <- mapply(closestValue, As, Bs)
A$Difference <- A$DATE - closest
A
}
# Test data, not many types
nA <- 1e3
nB <- 1e4
set.seed(1)
ta <- sample(LETTERS, nA, TRUE); da <- sample(1e2, nA, TRUE)
tb <- sample(LETTERS, nB, TRUE); db <- sample(nB, nB, TRUE)
aa <- data.frame(TYPE = ta, DATE = da)
bb <- data.frame(TYPE = tb, Special_Date = db)
t1 <- system.time(r1 <- f1(aa, bb))
t4 <- system.time(r4 <- f4(aa, bb))
rbind(t1 = t1, t4 = t4)
sum( duplicated(r4) ) # 165
Rui Barradas
Em 19-08-2012 22:58, William Dunlap escreveu:
And the following, f4, uses the same algorithm as f2 but codes
it somewhat more efficiently. It uses the same closestValue()
function.
f4 <- function (A, B) {
A$TYPE <- as.factor(A$TYPE)
uA <- levels(A$TYPE)
As <- split(A$DATE, A$TYPE)
Bs <- split(B$Special_Date, factor(B$TYPE, levels = uA))
closest <- numeric(nrow(A))
split(closest, A$TYPE) <- mapply(closestValue, As, Bs)
A$Difference <- A$DATE - closest
A
}
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
Of William Dunlap
Sent: Sunday, August 19, 2012 1:49 PM
To: Francesco; r-help at r-project.org
Subject: Re: [R] merging and obtaining the nearest value
The following, f2(A,B), should do well with lots of rows in A and B
as long as the number of types is not huge.
f2 <- function(A, B) {
types <- as.character(unique(A$TYPE))
result <- numeric(nrow(A))
Bs <- split(B$Special_Date, B$TYPE)
for(type in types) {
w <- A$TYPE == type
# can omit the sort() below if you know that B$Special_Date is sorted.
result[w] <- closestValue(A$DATE[w], sort(Bs[[type]]))
}
A$Difference <- A$DATE - result
A
}
closestValue <- function (x, vec)
{
# for each value in x, find closest value in vec.
# Break ties by using highest.
# Assume vec is sorted.
intervalNo <- findInterval(x, vec)
lowerValue <- vec[pmax(1, intervalNo)]
upperValue <- vec[pmin(length(vec), intervalNo+1)]
ifelse(x - lowerValue < upperValue - x, lowerValue, upperValue)
}
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of William Dunlap Sent: Sunday, August 19, 2012 9:43 AM To: Francesco; r-help at r-project.org Subject: Re: [R] merging and obtaining the nearest value How many different types are there? Just a handful or many thousands? For this sort of problem it is often handy to write a function which generates datasets of the sort you are thinking of but parameterized by the number of rows, levels, etc., so you can see how the execution time varies with these things. If there are just a few types, try looping over types and using findInterval to see where A$Date fits into the sequence of B$Special_Date. Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf
Of Francesco Sent: Sunday, August 19, 2012 4:01 AM To: r-help at r-project.org Subject: Re: [R] merging and obtaining the nearest value Dear Riu, Many thanks for your suggestion However these are just simplified examples... in reality the dataset A contains millions of observations and B several thousands of rows... Could I still use a modified form of your suggestion? Thanks On 19 August 2012 12:51, Rui Barradas <ruipbarradas at sapo.pt> wrote:
Hello,
Try the following.
A <- read.table(text="
TYPE DATE
A 2
A 5
A 20
B 10
B 2
", header = TRUE)
B <- read.table(text="
TYPE Special_Date
A 2
A 6
A 20
A 22
B 5
B 6
", header = TRUE)
result <- do.call( rbind, lapply(split(merge(A, B), list(m$DATE, m$TYPE)),
function(x){
a <- abs(x$DATE - x$Special_Date)
if(nrow(x)) x[which(min(a) == a), ] }) )
result$Difference <- result$DATE - result$Special_Date
result$Special_Date <- NULL
rownames(result) <- seq_len(nrow(result))
result
Also, it's a good practice to post data examples using dput(). For instance,
dput(A)
structure(list(TYPE = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), DATE = c(2L, 5L, 20L, 10L, 2L)), .Names = c("TYPE",
"DATE"), class = "data.frame", row.names = c(NA, -5L))
Now all we have to do is run the statement A <- structure(... etc...) to
have an exact copy of the data example.
Anyway, your example with input and the wanted result was very welcome.
Hope this helps,
Rui Barradas
Em 19-08-2012 11:10, Francesco escreveu:
Dear R-help ? would like to know if there is a short solution in R for this merging problem... Let say I have a dataset A as: TYPE DATE A 2 A 5 A 20 B 10 B 2 (there can be duplicates for the same type and date) and I have another dataset B as : TYPE Special_Date A 2 A 6 A 20 A 22 B 5 B 6 The question is : I would like to obtain the difference between the date of each observation in A and the closest special date in B with the same type. In case of ties I would take the latest date of the two. For example I would obtain here TYPE DATE Difference A 2 0=2-2 A 5 -1=5-6 A 20 0=20-20 B 10 +4=10-6 B 2 -3=2-5 Do you know how to (simply?) obtain this in R? Many thanks! Best Regards
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Yes, if the Special_Dates are not sorted then f4 needs to sort them. Perhaps closestValue should just sort its vec argument. I didn't realize that the output should not have any duplicate entries. I thought it should have the same number of rows as the input A. Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com
-----Original Message-----
From: Rui Barradas [mailto:ruipbarradas at sapo.pt]
Sent: Sunday, August 19, 2012 3:54 PM
To: William Dunlap; Francesco
Cc: r-help
Subject: Re: [R] merging and obtaining the nearest value
Hello,
You're right, your solution is much faster, but it doesn't remove
duplicates.
When I ran f4() with larger datasets it poduced an error,
Error in findInterval(x, vec) : 'vec' must be sorted non-decreasingly
So here they all are.
f1 <- function(A, B){
m <- merge(A, B)
result <- do.call( rbind, lapply(split(m, list(m$DATE, m$TYPE)),
function(x){
if(nrow(x)){
a <- abs(x$DATE - x$Special_Date)
x[which.min(a), ] }}) )
result$Difference <- result$DATE - result$Special_Date
result$Special_Date <- NULL
rownames(result) <- seq_len(nrow(result))
result
}
closestValue <- function (x, vec)
{
# for each value in x, find closest value in vec.
# Break ties by using highest.
# Assume vec is sorted.
intervalNo <- findInterval(x, vec)
lowerValue <- vec[pmax(1, intervalNo)]
upperValue <- vec[pmin(length(vec), intervalNo+1)]
ifelse(x - lowerValue < upperValue - x, lowerValue, upperValue)
}
f4 <- function (A, B) {
A$TYPE <- as.factor(A$TYPE)
uA <- levels(A$TYPE)
As <- split(A$DATE, A$TYPE)
B <- B[order(B$TYPE, B$Special_Date), ]
Bs <- split(B$Special_Date, factor(B$TYPE, levels = uA))
closest <- numeric(nrow(A))
split(closest, A$TYPE) <- mapply(closestValue, As, Bs)
A$Difference <- A$DATE - closest
A
}
# Test data, not many types
nA <- 1e3
nB <- 1e4
set.seed(1)
ta <- sample(LETTERS, nA, TRUE); da <- sample(1e2, nA, TRUE)
tb <- sample(LETTERS, nB, TRUE); db <- sample(nB, nB, TRUE)
aa <- data.frame(TYPE = ta, DATE = da)
bb <- data.frame(TYPE = tb, Special_Date = db)
t1 <- system.time(r1 <- f1(aa, bb))
t4 <- system.time(r4 <- f4(aa, bb))
rbind(t1 = t1, t4 = t4)
sum( duplicated(r4) ) # 165
Rui Barradas
Em 19-08-2012 22:58, William Dunlap escreveu:
And the following, f4, uses the same algorithm as f2 but codes
it somewhat more efficiently. It uses the same closestValue()
function.
f4 <- function (A, B) {
A$TYPE <- as.factor(A$TYPE)
uA <- levels(A$TYPE)
As <- split(A$DATE, A$TYPE)
Bs <- split(B$Special_Date, factor(B$TYPE, levels = uA))
closest <- numeric(nrow(A))
split(closest, A$TYPE) <- mapply(closestValue, As, Bs)
A$Difference <- A$DATE - closest
A
}
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf
Of William Dunlap
Sent: Sunday, August 19, 2012 1:49 PM
To: Francesco; r-help at r-project.org
Subject: Re: [R] merging and obtaining the nearest value
The following, f2(A,B), should do well with lots of rows in A and B
as long as the number of types is not huge.
f2 <- function(A, B) {
types <- as.character(unique(A$TYPE))
result <- numeric(nrow(A))
Bs <- split(B$Special_Date, B$TYPE)
for(type in types) {
w <- A$TYPE == type
# can omit the sort() below if you know that B$Special_Date is sorted.
result[w] <- closestValue(A$DATE[w], sort(Bs[[type]]))
}
A$Difference <- A$DATE - result
A
}
closestValue <- function (x, vec)
{
# for each value in x, find closest value in vec.
# Break ties by using highest.
# Assume vec is sorted.
intervalNo <- findInterval(x, vec)
lowerValue <- vec[pmax(1, intervalNo)]
upperValue <- vec[pmin(length(vec), intervalNo+1)]
ifelse(x - lowerValue < upperValue - x, lowerValue, upperValue)
}
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf
Of William Dunlap Sent: Sunday, August 19, 2012 9:43 AM To: Francesco; r-help at r-project.org Subject: Re: [R] merging and obtaining the nearest value How many different types are there? Just a handful or many thousands? For this sort of problem it is often handy to write a function which generates datasets of the sort you are thinking of but parameterized by the number of rows, levels, etc., so you can see how the execution time varies with these things. If there are just a few types, try looping over types and using findInterval to see where A$Date fits into the sequence of B$Special_Date. Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf
Of Francesco Sent: Sunday, August 19, 2012 4:01 AM To: r-help at r-project.org Subject: Re: [R] merging and obtaining the nearest value Dear Riu, Many thanks for your suggestion However these are just simplified examples... in reality the dataset A contains millions of observations and B several thousands of rows... Could I still use a modified form of your suggestion? Thanks On 19 August 2012 12:51, Rui Barradas <ruipbarradas at sapo.pt> wrote:
Hello,
Try the following.
A <- read.table(text="
TYPE DATE
A 2
A 5
A 20
B 10
B 2
", header = TRUE)
B <- read.table(text="
TYPE Special_Date
A 2
A 6
A 20
A 22
B 5
B 6
", header = TRUE)
result <- do.call( rbind, lapply(split(merge(A, B), list(m$DATE, m$TYPE)),
function(x){
a <- abs(x$DATE - x$Special_Date)
if(nrow(x)) x[which(min(a) == a), ] }) )
result$Difference <- result$DATE - result$Special_Date
result$Special_Date <- NULL
rownames(result) <- seq_len(nrow(result))
result
Also, it's a good practice to post data examples using dput(). For instance,
dput(A)
structure(list(TYPE = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), DATE = c(2L, 5L, 20L, 10L, 2L)), .Names = c("TYPE",
"DATE"), class = "data.frame", row.names = c(NA, -5L))
Now all we have to do is run the statement A <- structure(... etc...) to
have an exact copy of the data example.
Anyway, your example with input and the wanted result was very welcome.
Hope this helps,
Rui Barradas
Em 19-08-2012 11:10, Francesco escreveu:
Dear R-help ? would like to know if there is a short solution in R for this merging problem... Let say I have a dataset A as: TYPE DATE A 2 A 5 A 20 B 10 B 2 (there can be duplicates for the same type and date) and I have another dataset B as : TYPE Special_Date A 2 A 6 A 20 A 22 B 5 B 6 The question is : I would like to obtain the difference between the date of each observation in A and the closest special date in B with the same type. In case of ties I would take the latest date of the two. For example I would obtain here TYPE DATE Difference A 2 0=2-2 A 5 -1=5-6 A 20 0=20-20 B 10 +4=10-6 B 2 -3=2-5 Do you know how to (simply?) obtain this in R? Many thanks! Best Regards
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20120820/7344c729/attachment.pl>