Hi,
I was unable to find a bug report for this with a cursory search, but would
like clarification if this is intended or unavoidable behaviour:
```{r}
# Create example data.frames
parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
sex=c("F", "M", "F", "M"),
age=c(41, 43, 36, 51))
children <- data.frame(parent=c("Sarah", "Max", "Qin"),
name=c("Oliver", "Sebastian", "Kai-lee"),
sex=c("M", "M", "F"),
age=c(5,8,7))
# Merge() creates a duplicated "name" column:
merge(parents, children, by.x = "name", by.y = "parent")
```
Output:
```
name sex.x age.x name sex.y age.y
1 Max M 43 Sebastian M 8
2 Qin F 36 Kai-lee F 7
3 Sarah F 41 Oliver M 5
Warning message:
In merge.data.frame(parents, children, by.x = "name", by.y = "parent") :
column name ?name? is duplicated in the result
```
Kind Regards,
Scott Ritchie
Duplicate column names created by base::merge() when by.x has the same name as a column in y
12 messages · Duncan Murdoch, Scott Ritchie, Gabriel Becker +1 more
Hi Scott,
It seems like reasonable behavior to me. What result would you expect?
That the second "name" should be called "name.y"?
The "merge" documentation says:
If the columns in the data frames not used in merging have any
common names, these have ?suffixes? (?".x"? and ?".y"? by default)
appended to try to make the names of the result unique.
Since the first "name" column was used in merging, leaving both
without a suffix seems consistent with the documentation...
Frederick
On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:
Hi,
I was unable to find a bug report for this with a cursory search, but would
like clarification if this is intended or unavoidable behaviour:
```{r}
# Create example data.frames
parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
sex=c("F", "M", "F", "M"),
age=c(41, 43, 36, 51))
children <- data.frame(parent=c("Sarah", "Max", "Qin"),
name=c("Oliver", "Sebastian", "Kai-lee"),
sex=c("M", "M", "F"),
age=c(5,8,7))
# Merge() creates a duplicated "name" column:
merge(parents, children, by.x = "name", by.y = "parent")
```
Output:
```
name sex.x age.x name sex.y age.y
1 Max M 43 Sebastian M 8
2 Qin F 36 Kai-lee F 7
3 Sarah F 41 Oliver M 5
Warning message:
In merge.data.frame(parents, children, by.x = "name", by.y = "parent") :
column name ?name? is duplicated in the result
```
Kind Regards,
Scott Ritchie
[[alternative HTML version deleted]]
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Hi Frederick, I would expect that any duplicate names in the resulting data.frame would have the suffixes appended to them, regardless of whether or not they are used as the join key. So in my example I would expect "names.x" and "names.y" to indicate their source data.frame. While careful reading of the documentation reveals this is not the case, I would argue the intent of the suffixes functionality should equally be applied to this type of case. If you agree this would be useful, I'm happy to write a patch for merge.data.frame that will add suffixes in this case - I intend to do the same for merge.data.table in the data.table package where I initially encountered the edge case. Best, Scott
On 17 February 2018 at 03:53, <frederik at ofb.net> wrote:
Hi Scott,
It seems like reasonable behavior to me. What result would you expect?
That the second "name" should be called "name.y"?
The "merge" documentation says:
If the columns in the data frames not used in merging have any
common names, these have ?suffixes? (?".x"? and ?".y"? by default)
appended to try to make the names of the result unique.
Since the first "name" column was used in merging, leaving both
without a suffix seems consistent with the documentation...
Frederick
On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:
Hi, I was unable to find a bug report for this with a cursory search, but
would
like clarification if this is intended or unavoidable behaviour:
```{r}
# Create example data.frames
parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
sex=c("F", "M", "F", "M"),
age=c(41, 43, 36, 51))
children <- data.frame(parent=c("Sarah", "Max", "Qin"),
name=c("Oliver", "Sebastian", "Kai-lee"),
sex=c("M", "M", "F"),
age=c(5,8,7))
# Merge() creates a duplicated "name" column:
merge(parents, children, by.x = "name", by.y = "parent")
```
Output:
```
name sex.x age.x name sex.y age.y
1 Max M 43 Sebastian M 8
2 Qin F 36 Kai-lee F 7
3 Sarah F 41 Oliver M 5
Warning message:
In merge.data.frame(parents, children, by.x = "name", by.y = "parent") :
column name ?name? is duplicated in the result
```
Kind Regards,
Scott Ritchie
[[alternative HTML version deleted]]
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
The attached patch.diff will make merge.data.frame() append the suffixes to columns with common names between by.x and names(y). Best, Scott Ritchie
On 17 February 2018 at 11:15, Scott Ritchie <s.ritchie73 at gmail.com> wrote:
Hi Frederick, I would expect that any duplicate names in the resulting data.frame would have the suffixes appended to them, regardless of whether or not they are used as the join key. So in my example I would expect "names.x" and "names.y" to indicate their source data.frame. While careful reading of the documentation reveals this is not the case, I would argue the intent of the suffixes functionality should equally be applied to this type of case. If you agree this would be useful, I'm happy to write a patch for merge.data.frame that will add suffixes in this case - I intend to do the same for merge.data.table in the data.table package where I initially encountered the edge case. Best, Scott On 17 February 2018 at 03:53, <frederik at ofb.net> wrote:
Hi Scott,
It seems like reasonable behavior to me. What result would you expect?
That the second "name" should be called "name.y"?
The "merge" documentation says:
If the columns in the data frames not used in merging have any
common names, these have ?suffixes? (?".x"? and ?".y"? by default)
appended to try to make the names of the result unique.
Since the first "name" column was used in merging, leaving both
without a suffix seems consistent with the documentation...
Frederick
On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:
Hi, I was unable to find a bug report for this with a cursory search, but
would
like clarification if this is intended or unavoidable behaviour:
```{r}
# Create example data.frames
parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
sex=c("F", "M", "F", "M"),
age=c(41, 43, 36, 51))
children <- data.frame(parent=c("Sarah", "Max", "Qin"),
name=c("Oliver", "Sebastian", "Kai-lee"),
sex=c("M", "M", "F"),
age=c(5,8,7))
# Merge() creates a duplicated "name" column:
merge(parents, children, by.x = "name", by.y = "parent")
```
Output:
```
name sex.x age.x name sex.y age.y
1 Max M 43 Sebastian M 8
2 Qin F 36 Kai-lee F 7
3 Sarah F 41 Oliver M 5
Warning message:
In merge.data.frame(parents, children, by.x = "name", by.y = "parent") :
column name ?name? is duplicated in the result
```
Kind Regards,
Scott Ritchie
[[alternative HTML version deleted]]
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
-------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: patch.diff URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20180217/efabe387/attachment.ksh>
Hi Scott, Thanks for the patch. I'm not really involved in R development; it will be up to someone in the R core team to apply it. I would hazard to say that even if correct (I haven't checked), it will not be applied because the change might break existing code. For example it seems like reasonable code might easily assume that a column with the same name as "by.x" exists in the output of 'merge'. That's just my best guess... I don't participate on here often. Cheers, Frederick
On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:
The attached patch.diff will make merge.data.frame() append the suffixes to columns with common names between by.x and names(y). Best, Scott Ritchie On 17 February 2018 at 11:15, Scott Ritchie <s.ritchie73 at gmail.com> wrote:
Hi Frederick, I would expect that any duplicate names in the resulting data.frame would have the suffixes appended to them, regardless of whether or not they are used as the join key. So in my example I would expect "names.x" and "names.y" to indicate their source data.frame. While careful reading of the documentation reveals this is not the case, I would argue the intent of the suffixes functionality should equally be applied to this type of case. If you agree this would be useful, I'm happy to write a patch for merge.data.frame that will add suffixes in this case - I intend to do the same for merge.data.table in the data.table package where I initially encountered the edge case. Best, Scott On 17 February 2018 at 03:53, <frederik at ofb.net> wrote:
Hi Scott,
It seems like reasonable behavior to me. What result would you expect?
That the second "name" should be called "name.y"?
The "merge" documentation says:
If the columns in the data frames not used in merging have any
common names, these have ?suffixes? (?".x"? and ?".y"? by default)
appended to try to make the names of the result unique.
Since the first "name" column was used in merging, leaving both
without a suffix seems consistent with the documentation...
Frederick
On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:
Hi, I was unable to find a bug report for this with a cursory search, but
would
like clarification if this is intended or unavoidable behaviour:
```{r}
# Create example data.frames
parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
sex=c("F", "M", "F", "M"),
age=c(41, 43, 36, 51))
children <- data.frame(parent=c("Sarah", "Max", "Qin"),
name=c("Oliver", "Sebastian", "Kai-lee"),
sex=c("M", "M", "F"),
age=c(5,8,7))
# Merge() creates a duplicated "name" column:
merge(parents, children, by.x = "name", by.y = "parent")
```
Output:
```
name sex.x age.x name sex.y age.y
1 Max M 43 Sebastian M 8
2 Qin F 36 Kai-lee F 7
3 Sarah F 41 Oliver M 5
Warning message:
In merge.data.frame(parents, children, by.x = "name", by.y = "parent") :
column name ?name? is duplicated in the result
```
Kind Regards,
Scott Ritchie
[[alternative HTML version deleted]]
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Index: src/library/base/R/merge.R
===================================================================
--- src/library/base/R/merge.R (revision 74264)
+++ src/library/base/R/merge.R (working copy)
@@ -157,6 +157,15 @@
}
if(has.common.nms) names(y) <- nm.y
+ ## If by.x %in% names(y) then duplicate column names still arise,
+ ## apply suffixes to these
+ dupe.keyx <- intersect(nm.by, names(y))
+ if(length(dupe.keyx)) {
+ if(nzchar(suffixes[1L]))
+ names(x)[match(dupe.keyx, names(x), 0L)] <- paste(dupe.keyx, suffixes[1L], sep="")
+ if(nzchar(suffixes[2L]))
+ names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx, suffixes[2L], sep="")
+ }
nm <- c(names(x), names(y))
if(any(d <- duplicated(nm)))
if(sum(d) > 1L)
On 17/02/2018 6:36 PM, frederik at ofb.net wrote:
Hi Scott, Thanks for the patch. I'm not really involved in R development; it will be up to someone in the R core team to apply it. I would hazard to say that even if correct (I haven't checked), it will not be applied because the change might break existing code. For example it seems like reasonable code might easily assume that a column with the same name as "by.x" exists in the output of 'merge'. That's just my best guess... I don't participate on here often.
I think you're right. If I were still a member of R Core, I would want to test this against all packages on CRAN and Bioconductor, and since that test takes a couple of days to run on my laptop, I'd probably never get around to it. There are lots of cases where "I would have done that differently", but most of them are far too much trouble to change now that R is more than 20 years old. And in many cases it will turn out that the way R does it actually does make more sense than the way I would have done it. Duncan Murdoch
Cheers, Frederick On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:
The attached patch.diff will make merge.data.frame() append the suffixes to columns with common names between by.x and names(y). Best, Scott Ritchie On 17 February 2018 at 11:15, Scott Ritchie <s.ritchie73 at gmail.com> wrote:
Hi Frederick, I would expect that any duplicate names in the resulting data.frame would have the suffixes appended to them, regardless of whether or not they are used as the join key. So in my example I would expect "names.x" and "names.y" to indicate their source data.frame. While careful reading of the documentation reveals this is not the case, I would argue the intent of the suffixes functionality should equally be applied to this type of case. If you agree this would be useful, I'm happy to write a patch for merge.data.frame that will add suffixes in this case - I intend to do the same for merge.data.table in the data.table package where I initially encountered the edge case. Best, Scott On 17 February 2018 at 03:53, <frederik at ofb.net> wrote:
Hi Scott,
It seems like reasonable behavior to me. What result would you expect?
That the second "name" should be called "name.y"?
The "merge" documentation says:
If the columns in the data frames not used in merging have any
common names, these have ?suffixes? (?".x"? and ?".y"? by default)
appended to try to make the names of the result unique.
Since the first "name" column was used in merging, leaving both
without a suffix seems consistent with the documentation...
Frederick
On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:
Hi, I was unable to find a bug report for this with a cursory search, but
would
like clarification if this is intended or unavoidable behaviour:
```{r}
# Create example data.frames
parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
sex=c("F", "M", "F", "M"),
age=c(41, 43, 36, 51))
children <- data.frame(parent=c("Sarah", "Max", "Qin"),
name=c("Oliver", "Sebastian", "Kai-lee"),
sex=c("M", "M", "F"),
age=c(5,8,7))
# Merge() creates a duplicated "name" column:
merge(parents, children, by.x = "name", by.y = "parent")
```
Output:
```
name sex.x age.x name sex.y age.y
1 Max M 43 Sebastian M 8
2 Qin F 36 Kai-lee F 7
3 Sarah F 41 Oliver M 5
Warning message:
In merge.data.frame(parents, children, by.x = "name", by.y = "parent") :
column name ?name? is duplicated in the result
```
Kind Regards,
Scott Ritchie
[[alternative HTML version deleted]]
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Index: src/library/base/R/merge.R
===================================================================
--- src/library/base/R/merge.R (revision 74264)
+++ src/library/base/R/merge.R (working copy)
@@ -157,6 +157,15 @@
}
if(has.common.nms) names(y) <- nm.y
+ ## If by.x %in% names(y) then duplicate column names still arise,
+ ## apply suffixes to these
+ dupe.keyx <- intersect(nm.by, names(y))
+ if(length(dupe.keyx)) {
+ if(nzchar(suffixes[1L]))
+ names(x)[match(dupe.keyx, names(x), 0L)] <- paste(dupe.keyx, suffixes[1L], sep="")
+ if(nzchar(suffixes[2L]))
+ names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx, suffixes[2L], sep="")
+ }
nm <- c(names(x), names(y))
if(any(d <- duplicated(nm)))
if(sum(d) > 1L)
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Thanks Duncan and Frederick, I suspected as much - there doesn't appear to be any reason why conflicts between by.x and names(y) shouldn't and cannot be checked, but I can see how this might be more trouble than its worth given it potentially may break downstream packages (i.e. any cases where this occurs but they expect the name of the key column(s) to remain the same). Best, Scott On 18 February 2018 at 11:48, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
On 17/02/2018 6:36 PM, frederik at ofb.net wrote:
Hi Scott, Thanks for the patch. I'm not really involved in R development; it will be up to someone in the R core team to apply it. I would hazard to say that even if correct (I haven't checked), it will not be applied because the change might break existing code. For example it seems like reasonable code might easily assume that a column with the same name as "by.x" exists in the output of 'merge'. That's just my best guess... I don't participate on here often.
I think you're right. If I were still a member of R Core, I would want to test this against all packages on CRAN and Bioconductor, and since that test takes a couple of days to run on my laptop, I'd probably never get around to it. There are lots of cases where "I would have done that differently", but most of them are far too much trouble to change now that R is more than 20 years old. And in many cases it will turn out that the way R does it actually does make more sense than the way I would have done it. Duncan Murdoch
Cheers, Frederick On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:
The attached patch.diff will make merge.data.frame() append the suffixes to columns with common names between by.x and names(y). Best, Scott Ritchie On 17 February 2018 at 11:15, Scott Ritchie <s.ritchie73 at gmail.com> wrote: Hi Frederick,
I would expect that any duplicate names in the resulting data.frame would have the suffixes appended to them, regardless of whether or not they are used as the join key. So in my example I would expect "names.x" and "names.y" to indicate their source data.frame. While careful reading of the documentation reveals this is not the case, I would argue the intent of the suffixes functionality should equally be applied to this type of case. If you agree this would be useful, I'm happy to write a patch for merge.data.frame that will add suffixes in this case - I intend to do the same for merge.data.table in the data.table package where I initially encountered the edge case. Best, Scott On 17 February 2018 at 03:53, <frederik at ofb.net> wrote: Hi Scott,
It seems like reasonable behavior to me. What result would you expect?
That the second "name" should be called "name.y"?
The "merge" documentation says:
If the columns in the data frames not used in merging have any
common names, these have ?suffixes? (?".x"? and ?".y"? by default)
appended to try to make the names of the result unique.
Since the first "name" column was used in merging, leaving both
without a suffix seems consistent with the documentation...
Frederick
On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:
Hi, I was unable to find a bug report for this with a cursory search, but
would
like clarification if this is intended or unavoidable behaviour:
```{r}
# Create example data.frames
parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
sex=c("F", "M", "F", "M"),
age=c(41, 43, 36, 51))
children <- data.frame(parent=c("Sarah", "Max", "Qin"),
name=c("Oliver", "Sebastian", "Kai-lee"),
sex=c("M", "M", "F"),
age=c(5,8,7))
# Merge() creates a duplicated "name" column:
merge(parents, children, by.x = "name", by.y = "parent")
```
Output:
```
name sex.x age.x name sex.y age.y
1 Max M 43 Sebastian M 8
2 Qin F 36 Kai-lee F 7
3 Sarah F 41 Oliver M 5
Warning message:
In merge.data.frame(parents, children, by.x = "name", by.y =
"parent") :
column name ?name? is duplicated in the result
```
Kind Regards,
Scott Ritchie
[[alternative HTML version deleted]]
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Index: src/library/base/R/merge.R
===================================================================
--- src/library/base/R/merge.R (revision 74264)
+++ src/library/base/R/merge.R (working copy)
@@ -157,6 +157,15 @@
}
if(has.common.nms) names(y) <- nm.y
+ ## If by.x %in% names(y) then duplicate column names still
arise,
+ ## apply suffixes to these
+ dupe.keyx <- intersect(nm.by, names(y))
+ if(length(dupe.keyx)) {
+ if(nzchar(suffixes[1L]))
+ names(x)[match(dupe.keyx, names(x), 0L)] <-
paste(dupe.keyx, suffixes[1L], sep="")
+ if(nzchar(suffixes[2L]))
+ names(y)[match(dupe.keyx, names(y), 0L)] <-
paste(dupe.keyx, suffixes[2L], sep="")
+ }
nm <- c(names(x), names(y))
if(any(d <- duplicated(nm)))
if(sum(d) > 1L)
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
It seems like there is a way that is backwards compatible-ish in the sense mentioned and still has the (arguably, but a good argument I think) better behavior: if by.x is 'name', (AND by.y is not also 'name'), then x's 'name' column is called name and y's 'name' column (not used int he merge) is changed to name.y. Now of course this would still change output, but it would change it to something I think would be better, while retaining the 'merge columns retain their exact names' mechanic as documented. ~G On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie <s.ritchie73 at gmail.com> wrote:
Thanks Duncan and Frederick, I suspected as much - there doesn't appear to be any reason why conflicts between by.x and names(y) shouldn't and cannot be checked, but I can see how this might be more trouble than its worth given it potentially may break downstream packages (i.e. any cases where this occurs but they expect the name of the key column(s) to remain the same). Best, Scott On 18 February 2018 at 11:48, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
On 17/02/2018 6:36 PM, frederik at ofb.net wrote:
Hi Scott, Thanks for the patch. I'm not really involved in R development; it will be up to someone in the R core team to apply it. I would hazard to say that even if correct (I haven't checked), it will not be applied because the change might break existing code. For example it seems like reasonable code might easily assume that a column with the same name as "by.x" exists in the output of 'merge'. That's just my best guess... I don't participate on here often.
I think you're right. If I were still a member of R Core, I would want
to
test this against all packages on CRAN and Bioconductor, and since that test takes a couple of days to run on my laptop, I'd probably never get around to it. There are lots of cases where "I would have done that differently", but most of them are far too much trouble to change now that R is more than
20
years old. And in many cases it will turn out that the way R does it actually does make more sense than the way I would have done it. Duncan Murdoch
Cheers, Frederick On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:
The attached patch.diff will make merge.data.frame() append the
suffixes
to columns with common names between by.x and names(y). Best, Scott Ritchie On 17 February 2018 at 11:15, Scott Ritchie <s.ritchie73 at gmail.com> wrote: Hi Frederick,
I would expect that any duplicate names in the resulting data.frame would have the suffixes appended to them, regardless of whether or not they are used as the join key. So in my example I would expect "names.x" and "names.y" to indicate their source data.frame. While careful reading of the documentation reveals this is not the case, I would argue the intent of the suffixes functionality should equally be applied to this type of case. If you agree this would be useful, I'm happy to write a patch for merge.data.frame that will add suffixes in this case - I intend to do the same for merge.data.table in the data.table package where I initially encountered the edge case. Best, Scott On 17 February 2018 at 03:53, <frederik at ofb.net> wrote: Hi Scott,
It seems like reasonable behavior to me. What result would you
expect?
That the second "name" should be called "name.y"?
The "merge" documentation says:
If the columns in the data frames not used in merging have any
common names, these have ?suffixes? (?".x"? and ?".y"? by
default)
appended to try to make the names of the result unique. Since the first "name" column was used in merging, leaving both without a suffix seems consistent with the documentation... Frederick On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:
Hi, I was unable to find a bug report for this with a cursory search,
but
would
like clarification if this is intended or unavoidable behaviour:
```{r}
# Create example data.frames
parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
sex=c("F", "M", "F", "M"),
age=c(41, 43, 36, 51))
children <- data.frame(parent=c("Sarah", "Max", "Qin"),
name=c("Oliver", "Sebastian", "Kai-lee"),
sex=c("M", "M", "F"),
age=c(5,8,7))
# Merge() creates a duplicated "name" column:
merge(parents, children, by.x = "name", by.y = "parent")
```
Output:
```
name sex.x age.x name sex.y age.y
1 Max M 43 Sebastian M 8
2 Qin F 36 Kai-lee F 7
3 Sarah F 41 Oliver M 5
Warning message:
In merge.data.frame(parents, children, by.x = "name", by.y =
"parent") :
column name ?name? is duplicated in the result
```
Kind Regards,
Scott Ritchie
[[alternative HTML version deleted]]
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Index: src/library/base/R/merge.R
===================================================================
--- src/library/base/R/merge.R (revision 74264)
+++ src/library/base/R/merge.R (working copy)
@@ -157,6 +157,15 @@
}
if(has.common.nms) names(y) <- nm.y
+ ## If by.x %in% names(y) then duplicate column names still
arise,
+ ## apply suffixes to these
+ dupe.keyx <- intersect(nm.by, names(y))
+ if(length(dupe.keyx)) {
+ if(nzchar(suffixes[1L]))
+ names(x)[match(dupe.keyx, names(x), 0L)] <-
paste(dupe.keyx, suffixes[1L], sep="")
+ if(nzchar(suffixes[2L]))
+ names(y)[match(dupe.keyx, names(y), 0L)] <-
paste(dupe.keyx, suffixes[2L], sep="")
+ }
nm <- c(names(x), names(y))
if(any(d <- duplicated(nm)))
if(sum(d) > 1L)
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[[alternative HTML version deleted]]
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Gabriel Becker, PhD Scientist (Bioinformatics) Genentech Research [[alternative HTML version deleted]]
Thanks Gabriel, I think your suggested approach is 100% backwards compatible Currently in the case of duplicate column names only the first can be indexed by its name. This will always be the column appearing in by.x, meaning the column in y with the same name cannot be accessed. Appending ".y" (suffixes[2L]) to this column means it can now be accessed, while keeping the current behaviour of making the key columns always accessible by using the names provided to by.x. I've attached a new patch that has this behaviour. Best, Scott
On 19 February 2018 at 05:08, Gabriel Becker <gmbecker at ucdavis.edu> wrote:
It seems like there is a way that is backwards compatible-ish in the sense mentioned and still has the (arguably, but a good argument I think) better behavior: if by.x is 'name', (AND by.y is not also 'name'), then x's 'name' column is called name and y's 'name' column (not used int he merge) is changed to name.y. Now of course this would still change output, but it would change it to something I think would be better, while retaining the 'merge columns retain their exact names' mechanic as documented. ~G On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie <s.ritchie73 at gmail.com> wrote:
Thanks Duncan and Frederick, I suspected as much - there doesn't appear to be any reason why conflicts between by.x and names(y) shouldn't and cannot be checked, but I can see how this might be more trouble than its worth given it potentially may break downstream packages (i.e. any cases where this occurs but they expect the name of the key column(s) to remain the same). Best, Scott On 18 February 2018 at 11:48, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
On 17/02/2018 6:36 PM, frederik at ofb.net wrote:
Hi Scott, Thanks for the patch. I'm not really involved in R development; it will be up to someone in the R core team to apply it. I would hazard to say that even if correct (I haven't checked), it will not be applied because the change might break existing code. For example it seems like reasonable code might easily assume that a column with the same name as "by.x" exists in the output of 'merge'. That's just my best guess... I don't participate on here often.
I think you're right. If I were still a member of R Core, I would want
to
test this against all packages on CRAN and Bioconductor, and since that test takes a couple of days to run on my laptop, I'd probably never get around to it. There are lots of cases where "I would have done that differently", but most of them are far too much trouble to change now that R is more than
20
years old. And in many cases it will turn out that the way R does it actually does make more sense than the way I would have done it. Duncan Murdoch
Cheers, Frederick On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:
The attached patch.diff will make merge.data.frame() append the
suffixes
to columns with common names between by.x and names(y). Best, Scott Ritchie On 17 February 2018 at 11:15, Scott Ritchie <s.ritchie73 at gmail.com> wrote: Hi Frederick,
I would expect that any duplicate names in the resulting data.frame would have the suffixes appended to them, regardless of whether or not they are used as the join key. So in my example I would expect "names.x" and "names.y" to indicate their source data.frame. While careful reading of the documentation reveals this is not the case, I would argue the intent of the suffixes functionality should equally
be
applied to this type of case. If you agree this would be useful, I'm happy to write a patch for merge.data.frame that will add suffixes in this case - I intend to do the same for merge.data.table in the data.table package where I initially encountered the edge case. Best, Scott On 17 February 2018 at 03:53, <frederik at ofb.net> wrote: Hi Scott,
It seems like reasonable behavior to me. What result would you
expect?
That the second "name" should be called "name.y"?
The "merge" documentation says:
If the columns in the data frames not used in merging have any
common names, these have ?suffixes? (?".x"? and ?".y"? by
default)
appended to try to make the names of the result unique. Since the first "name" column was used in merging, leaving both without a suffix seems consistent with the documentation... Frederick On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:
Hi, I was unable to find a bug report for this with a cursory search,
but
would
like clarification if this is intended or unavoidable behaviour:
```{r}
# Create example data.frames
parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
sex=c("F", "M", "F", "M"),
age=c(41, 43, 36, 51))
children <- data.frame(parent=c("Sarah", "Max", "Qin"),
name=c("Oliver", "Sebastian", "Kai-lee"),
sex=c("M", "M", "F"),
age=c(5,8,7))
# Merge() creates a duplicated "name" column:
merge(parents, children, by.x = "name", by.y = "parent")
```
Output:
```
name sex.x age.x name sex.y age.y
1 Max M 43 Sebastian M 8
2 Qin F 36 Kai-lee F 7
3 Sarah F 41 Oliver M 5
Warning message:
In merge.data.frame(parents, children, by.x = "name", by.y =
"parent") :
column name ?name? is duplicated in the result
```
Kind Regards,
Scott Ritchie
[[alternative HTML version deleted]]
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Index: src/library/base/R/merge.R
===================================================================
--- src/library/base/R/merge.R (revision 74264)
+++ src/library/base/R/merge.R (working copy)
@@ -157,6 +157,15 @@
}
if(has.common.nms) names(y) <- nm.y
+ ## If by.x %in% names(y) then duplicate column names still
arise,
+ ## apply suffixes to these
+ dupe.keyx <- intersect(nm.by, names(y))
+ if(length(dupe.keyx)) {
+ if(nzchar(suffixes[1L]))
+ names(x)[match(dupe.keyx, names(x), 0L)] <-
paste(dupe.keyx, suffixes[1L], sep="")
+ if(nzchar(suffixes[2L]))
+ names(y)[match(dupe.keyx, names(y), 0L)] <-
paste(dupe.keyx, suffixes[2L], sep="")
+ }
nm <- c(names(x), names(y))
if(any(d <- duplicated(nm)))
if(sum(d) > 1L)
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[[alternative HTML version deleted]]
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
-- Gabriel Becker, PhD Scientist (Bioinformatics) Genentech Research
-------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: patch.diff URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20180219/6c467cb5/attachment.ksh>
2 days later
Hi Scott, I think that's a good idea and I tried your patch on my copy of the repository. But it looks to me like the recent patch is identical to the previous one, can you confirm this? Frederick
On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott Ritchie wrote:
Thanks Gabriel, I think your suggested approach is 100% backwards compatible Currently in the case of duplicate column names only the first can be indexed by its name. This will always be the column appearing in by.x, meaning the column in y with the same name cannot be accessed. Appending ".y" (suffixes[2L]) to this column means it can now be accessed, while keeping the current behaviour of making the key columns always accessible by using the names provided to by.x. I've attached a new patch that has this behaviour. Best, Scott On 19 February 2018 at 05:08, Gabriel Becker <gmbecker at ucdavis.edu> wrote:
It seems like there is a way that is backwards compatible-ish in the sense mentioned and still has the (arguably, but a good argument I think) better behavior: if by.x is 'name', (AND by.y is not also 'name'), then x's 'name' column is called name and y's 'name' column (not used int he merge) is changed to name.y. Now of course this would still change output, but it would change it to something I think would be better, while retaining the 'merge columns retain their exact names' mechanic as documented. ~G On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie <s.ritchie73 at gmail.com> wrote:
Thanks Duncan and Frederick, I suspected as much - there doesn't appear to be any reason why conflicts between by.x and names(y) shouldn't and cannot be checked, but I can see how this might be more trouble than its worth given it potentially may break downstream packages (i.e. any cases where this occurs but they expect the name of the key column(s) to remain the same). Best, Scott On 18 February 2018 at 11:48, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
On 17/02/2018 6:36 PM, frederik at ofb.net wrote:
Hi Scott, Thanks for the patch. I'm not really involved in R development; it will be up to someone in the R core team to apply it. I would hazard to say that even if correct (I haven't checked), it will not be applied because the change might break existing code. For example it seems like reasonable code might easily assume that a column with the same name as "by.x" exists in the output of 'merge'. That's just my best guess... I don't participate on here often.
I think you're right. If I were still a member of R Core, I would want
to
test this against all packages on CRAN and Bioconductor, and since that test takes a couple of days to run on my laptop, I'd probably never get around to it. There are lots of cases where "I would have done that differently", but most of them are far too much trouble to change now that R is more than
20
years old. And in many cases it will turn out that the way R does it actually does make more sense than the way I would have done it. Duncan Murdoch
Cheers, Frederick On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:
The attached patch.diff will make merge.data.frame() append the
suffixes
to columns with common names between by.x and names(y). Best, Scott Ritchie On 17 February 2018 at 11:15, Scott Ritchie <s.ritchie73 at gmail.com> wrote: Hi Frederick,
I would expect that any duplicate names in the resulting data.frame would have the suffixes appended to them, regardless of whether or not they are used as the join key. So in my example I would expect "names.x" and "names.y" to indicate their source data.frame. While careful reading of the documentation reveals this is not the case, I would argue the intent of the suffixes functionality should equally
be
applied to this type of case. If you agree this would be useful, I'm happy to write a patch for merge.data.frame that will add suffixes in this case - I intend to do the same for merge.data.table in the data.table package where I initially encountered the edge case. Best, Scott On 17 February 2018 at 03:53, <frederik at ofb.net> wrote: Hi Scott,
It seems like reasonable behavior to me. What result would you
expect?
That the second "name" should be called "name.y"?
The "merge" documentation says:
If the columns in the data frames not used in merging have any
common names, these have ?suffixes? (?".x"? and ?".y"? by
default)
appended to try to make the names of the result unique. Since the first "name" column was used in merging, leaving both without a suffix seems consistent with the documentation... Frederick On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:
Hi, I was unable to find a bug report for this with a cursory search,
but
would
like clarification if this is intended or unavoidable behaviour:
```{r}
# Create example data.frames
parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
sex=c("F", "M", "F", "M"),
age=c(41, 43, 36, 51))
children <- data.frame(parent=c("Sarah", "Max", "Qin"),
name=c("Oliver", "Sebastian", "Kai-lee"),
sex=c("M", "M", "F"),
age=c(5,8,7))
# Merge() creates a duplicated "name" column:
merge(parents, children, by.x = "name", by.y = "parent")
```
Output:
```
name sex.x age.x name sex.y age.y
1 Max M 43 Sebastian M 8
2 Qin F 36 Kai-lee F 7
3 Sarah F 41 Oliver M 5
Warning message:
In merge.data.frame(parents, children, by.x = "name", by.y =
"parent") :
column name ?name? is duplicated in the result
```
Kind Regards,
Scott Ritchie
[[alternative HTML version deleted]]
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Index: src/library/base/R/merge.R
===================================================================
--- src/library/base/R/merge.R (revision 74264)
+++ src/library/base/R/merge.R (working copy)
@@ -157,6 +157,15 @@
}
if(has.common.nms) names(y) <- nm.y
+ ## If by.x %in% names(y) then duplicate column names still
arise,
+ ## apply suffixes to these
+ dupe.keyx <- intersect(nm.by, names(y))
+ if(length(dupe.keyx)) {
+ if(nzchar(suffixes[1L]))
+ names(x)[match(dupe.keyx, names(x), 0L)] <-
paste(dupe.keyx, suffixes[1L], sep="")
+ if(nzchar(suffixes[2L]))
+ names(y)[match(dupe.keyx, names(y), 0L)] <-
paste(dupe.keyx, suffixes[2L], sep="")
+ }
nm <- c(names(x), names(y))
if(any(d <- duplicated(nm)))
if(sum(d) > 1L)
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[[alternative HTML version deleted]]
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
-- Gabriel Becker, PhD Scientist (Bioinformatics) Genentech Research
Index: src/library/base/R/merge.R
===================================================================
--- src/library/base/R/merge.R (revision 74264)
+++ src/library/base/R/merge.R (working copy)
@@ -157,6 +157,15 @@
}
if(has.common.nms) names(y) <- nm.y
+ ## If by.x %in% names(y) then duplicate column names still arise,
+ ## apply suffixes to these
+ dupe.keyx <- intersect(nm.by, names(y))
+ if(length(dupe.keyx)) {
+ if(nzchar(suffixes[1L]))
+ names(x)[match(dupe.keyx, names(x), 0L)] <- paste(dupe.keyx, suffixes[1L], sep="")
+ if(nzchar(suffixes[2L]))
+ names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx, suffixes[2L], sep="")
+ }
nm <- c(names(x), names(y))
if(any(d <- duplicated(nm)))
if(sum(d) > 1L)
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Hi Frederick,
It looks like I didn't overwrite the patch.diff file after the last edits.
Here's the correct patch (attached and copied below):
Index: src/library/base/R/merge.R
===================================================================
--- src/library/base/R/merge.R (revision 74280)
+++ src/library/base/R/merge.R (working copy)
@@ -157,6 +157,14 @@
}
if(has.common.nms) names(y) <- nm.y
+ ## If by.x %in% names(y) then duplicate column names still arise,
+ ## apply suffixes to just y - this keeps backwards compatibility
+ ## when referring to by.x in the resulting data.frame
+ dupe.keyx <- intersect(nm.by, names(y))
+ if(length(dupe.keyx)) {
+ if(nzchar(suffixes[2L]))
+ names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx,
suffixes[2L], sep="")
+ }
nm <- c(names(x), names(y))
if(any(d <- duplicated(nm)))
if(sum(d) > 1L)
Best,
Scott
On 21 February 2018 at 08:23, <frederik at ofb.net> wrote:
Hi Scott, I think that's a good idea and I tried your patch on my copy of the repository. But it looks to me like the recent patch is identical to the previous one, can you confirm this? Frederick On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott Ritchie wrote:
Thanks Gabriel, I think your suggested approach is 100% backwards compatible Currently in the case of duplicate column names only the first can be indexed by its name. This will always be the column appearing in by.x, meaning the column in y with the same name cannot be accessed. Appending ".y" (suffixes[2L]) to this column means it can now be accessed, while keeping the current behaviour of making the key columns always accessible by using the names provided to by.x. I've attached a new patch that has this behaviour. Best, Scott On 19 February 2018 at 05:08, Gabriel Becker <gmbecker at ucdavis.edu>
wrote:
It seems like there is a way that is backwards compatible-ish in the
sense
mentioned and still has the (arguably, but a good argument I think)
better
behavior: if by.x is 'name', (AND by.y is not also 'name'), then x's 'name'
column
is called name and y's 'name' column (not used int he merge) is
changed to
name.y. Now of course this would still change output, but it would change it to something I think would be better, while retaining the 'merge columns retain their exact names' mechanic as documented. ~G On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie <s.ritchie73 at gmail.com> wrote:
Thanks Duncan and Frederick, I suspected as much - there doesn't appear to be any reason why
conflicts
between by.x and names(y) shouldn't and cannot be checked, but I can
see
how this might be more trouble than its worth given it potentially may break downstream packages (i.e. any cases where this occurs but they expect the name of the key column(s) to remain the same). Best, Scott On 18 February 2018 at 11:48, Duncan Murdoch <
murdoch.duncan at gmail.com>
wrote:
On 17/02/2018 6:36 PM, frederik at ofb.net wrote:
Hi Scott, Thanks for the patch. I'm not really involved in R development; it will be up to someone in the R core team to apply it. I would
hazard
to say that even if correct (I haven't checked), it will not be applied because the change might break existing code. For example
it
seems like reasonable code might easily assume that a column with
the
same name as "by.x" exists in the output of 'merge'. That's just my best guess... I don't participate on here often.
I think you're right. If I were still a member of R Core, I would
want
to
test this against all packages on CRAN and Bioconductor, and since
that
test takes a couple of days to run on my laptop, I'd probably never
get
around to it. There are lots of cases where "I would have done that differently",
but
most of them are far too much trouble to change now that R is more
than
20
years old. And in many cases it will turn out that the way R does
it
actually does make more sense than the way I would have done it. Duncan Murdoch
Cheers, Frederick On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:
The attached patch.diff will make merge.data.frame() append the
suffixes
to columns with common names between by.x and names(y). Best, Scott Ritchie On 17 February 2018 at 11:15, Scott Ritchie <
s.ritchie73 at gmail.com>
wrote: Hi Frederick,
I would expect that any duplicate names in the resulting
data.frame
would have the suffixes appended to them, regardless of whether or not
they
are used as the join key. So in my example I would expect "names.x"
and
"names.y" to indicate their source data.frame. While careful reading of the documentation reveals this is not
the
case, I would argue the intent of the suffixes functionality should
equally
be
applied to this type of case. If you agree this would be useful, I'm happy to write a patch for merge.data.frame that will add suffixes in this case - I intend
to do
the same for merge.data.table in the data.table package where I
initially
encountered the edge case. Best, Scott On 17 February 2018 at 03:53, <frederik at ofb.net> wrote: Hi Scott,
It seems like reasonable behavior to me. What result would you
expect?
That the second "name" should be called "name.y"?
The "merge" documentation says:
If the columns in the data frames not used in merging have
any
common names, these have ?suffixes? (?".x"? and ?".y"? by
default)
appended to try to make the names of the result unique. Since the first "name" column was used in merging, leaving both without a suffix seems consistent with the documentation... Frederick On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:
Hi, I was unable to find a bug report for this with a cursory
search,
but
would
like clarification if this is intended or unavoidable
behaviour:
```{r}
# Create example data.frames
parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
sex=c("F", "M", "F", "M"),
age=c(41, 43, 36, 51))
children <- data.frame(parent=c("Sarah", "Max", "Qin"),
name=c("Oliver", "Sebastian",
"Kai-lee"),
sex=c("M", "M", "F"),
age=c(5,8,7))
# Merge() creates a duplicated "name" column:
merge(parents, children, by.x = "name", by.y = "parent")
```
Output:
```
name sex.x age.x name sex.y age.y
1 Max M 43 Sebastian M 8
2 Qin F 36 Kai-lee F 7
3 Sarah F 41 Oliver M 5
Warning message:
In merge.data.frame(parents, children, by.x = "name", by.y =
"parent") :
column name ?name? is duplicated in the result
```
Kind Regards,
Scott Ritchie
[[alternative HTML version deleted]]
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Index: src/library/base/R/merge.R
============================================================
=======
--- src/library/base/R/merge.R (revision 74264)
+++ src/library/base/R/merge.R (working copy)
@@ -157,6 +157,15 @@
}
if(has.common.nms) names(y) <- nm.y
+ ## If by.x %in% names(y) then duplicate column names
still
arise,
+ ## apply suffixes to these
+ dupe.keyx <- intersect(nm.by, names(y))
+ if(length(dupe.keyx)) {
+ if(nzchar(suffixes[1L]))
+ names(x)[match(dupe.keyx, names(x), 0L)] <-
paste(dupe.keyx, suffixes[1L], sep="")
+ if(nzchar(suffixes[2L]))
+ names(y)[match(dupe.keyx, names(y), 0L)] <-
paste(dupe.keyx, suffixes[2L], sep="")
+ }
nm <- c(names(x), names(y))
if(any(d <- duplicated(nm)))
if(sum(d) > 1L)
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[[alternative HTML version deleted]]
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
-- Gabriel Becker, PhD Scientist (Bioinformatics) Genentech Research
Index: src/library/base/R/merge.R
===================================================================
--- src/library/base/R/merge.R (revision 74264)
+++ src/library/base/R/merge.R (working copy)
@@ -157,6 +157,15 @@
}
if(has.common.nms) names(y) <- nm.y
+ ## If by.x %in% names(y) then duplicate column names still
arise,
+ ## apply suffixes to these
+ dupe.keyx <- intersect(nm.by, names(y))
+ if(length(dupe.keyx)) {
+ if(nzchar(suffixes[1L]))
+ names(x)[match(dupe.keyx, names(x), 0L)] <-
paste(dupe.keyx, suffixes[1L], sep="")
+ if(nzchar(suffixes[2L])) + names(y)[match(dupe.keyx, names(y), 0L)] <-
paste(dupe.keyx, suffixes[2L], sep="")
+ }
nm <- c(names(x), names(y))
if(any(d <- duplicated(nm)))
if(sum(d) > 1L)
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
-------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: patch.diff URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20180221/7e77cbd6/attachment.ksh>
Hi Scott, I tried the new patch and can confirm that it has the advertised behavior on a couple of test cases. I think it makes sense to apply it, because any existing code which refers to a second duplicate data.frame column by name is already broken, while if the reference is by numerical index then changing the column name shouldn't break it. I don't know if you need to update the documentation as part of your patch, or if whoever applies it would be happy to do that. Somebody from R core want to weigh in on this? I attach a file with the test example from your original email as well as a second test case I added with two "by" columns. Thanks, Frederick
On Wed, Feb 21, 2018 at 10:06:21AM +1100, Scott Ritchie wrote:
Hi Frederick,
It looks like I didn't overwrite the patch.diff file after the last edits.
Here's the correct patch (attached and copied below):
Index: src/library/base/R/merge.R
===================================================================
--- src/library/base/R/merge.R (revision 74280)
+++ src/library/base/R/merge.R (working copy)
@@ -157,6 +157,14 @@
}
if(has.common.nms) names(y) <- nm.y
+ ## If by.x %in% names(y) then duplicate column names still arise,
+ ## apply suffixes to just y - this keeps backwards compatibility
+ ## when referring to by.x in the resulting data.frame
+ dupe.keyx <- intersect(nm.by, names(y))
+ if(length(dupe.keyx)) {
+ if(nzchar(suffixes[2L]))
+ names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx,
suffixes[2L], sep="")
+ }
nm <- c(names(x), names(y))
if(any(d <- duplicated(nm)))
if(sum(d) > 1L)
Best,
Scott
On 21 February 2018 at 08:23, <frederik at ofb.net> wrote:
Hi Scott, I think that's a good idea and I tried your patch on my copy of the repository. But it looks to me like the recent patch is identical to the previous one, can you confirm this? Frederick On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott Ritchie wrote:
Thanks Gabriel, I think your suggested approach is 100% backwards compatible Currently in the case of duplicate column names only the first can be indexed by its name. This will always be the column appearing in by.x, meaning the column in y with the same name cannot be accessed. Appending ".y" (suffixes[2L]) to this column means it can now be accessed, while keeping the current behaviour of making the key columns always accessible by using the names provided to by.x. I've attached a new patch that has this behaviour. Best, Scott On 19 February 2018 at 05:08, Gabriel Becker <gmbecker at ucdavis.edu>
wrote:
It seems like there is a way that is backwards compatible-ish in the
sense
mentioned and still has the (arguably, but a good argument I think)
better
behavior: if by.x is 'name', (AND by.y is not also 'name'), then x's 'name'
column
is called name and y's 'name' column (not used int he merge) is
changed to
name.y. Now of course this would still change output, but it would change it to something I think would be better, while retaining the 'merge columns retain their exact names' mechanic as documented. ~G On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie <s.ritchie73 at gmail.com> wrote:
Thanks Duncan and Frederick, I suspected as much - there doesn't appear to be any reason why
conflicts
between by.x and names(y) shouldn't and cannot be checked, but I can
see
how this might be more trouble than its worth given it potentially may break downstream packages (i.e. any cases where this occurs but they expect the name of the key column(s) to remain the same). Best, Scott On 18 February 2018 at 11:48, Duncan Murdoch <
murdoch.duncan at gmail.com>
wrote:
On 17/02/2018 6:36 PM, frederik at ofb.net wrote:
Hi Scott, Thanks for the patch. I'm not really involved in R development; it will be up to someone in the R core team to apply it. I would
hazard
to say that even if correct (I haven't checked), it will not be applied because the change might break existing code. For example
it
seems like reasonable code might easily assume that a column with
the
same name as "by.x" exists in the output of 'merge'. That's just my best guess... I don't participate on here often.
I think you're right. If I were still a member of R Core, I would
want
to
test this against all packages on CRAN and Bioconductor, and since
that
test takes a couple of days to run on my laptop, I'd probably never
get
around to it. There are lots of cases where "I would have done that differently",
but
most of them are far too much trouble to change now that R is more
than
20
years old. And in many cases it will turn out that the way R does
it
actually does make more sense than the way I would have done it. Duncan Murdoch
Cheers, Frederick On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:
The attached patch.diff will make merge.data.frame() append the
suffixes
to columns with common names between by.x and names(y). Best, Scott Ritchie On 17 February 2018 at 11:15, Scott Ritchie <
s.ritchie73 at gmail.com>
wrote: Hi Frederick,
I would expect that any duplicate names in the resulting
data.frame
would have the suffixes appended to them, regardless of whether or not
they
are used as the join key. So in my example I would expect "names.x"
and
"names.y" to indicate their source data.frame. While careful reading of the documentation reveals this is not
the
case, I would argue the intent of the suffixes functionality should
equally
be
applied to this type of case. If you agree this would be useful, I'm happy to write a patch for merge.data.frame that will add suffixes in this case - I intend
to do
the same for merge.data.table in the data.table package where I
initially
encountered the edge case. Best, Scott On 17 February 2018 at 03:53, <frederik at ofb.net> wrote: Hi Scott,
It seems like reasonable behavior to me. What result would you
expect?
That the second "name" should be called "name.y"?
The "merge" documentation says:
If the columns in the data frames not used in merging have
any
common names, these have ?suffixes? (?".x"? and ?".y"? by
default)
appended to try to make the names of the result unique. Since the first "name" column was used in merging, leaving both without a suffix seems consistent with the documentation... Frederick On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:
Hi, I was unable to find a bug report for this with a cursory
search,
but
would
like clarification if this is intended or unavoidable
behaviour:
```{r}
# Create example data.frames
parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
sex=c("F", "M", "F", "M"),
age=c(41, 43, 36, 51))
children <- data.frame(parent=c("Sarah", "Max", "Qin"),
name=c("Oliver", "Sebastian",
"Kai-lee"),
sex=c("M", "M", "F"),
age=c(5,8,7))
# Merge() creates a duplicated "name" column:
merge(parents, children, by.x = "name", by.y = "parent")
```
Output:
```
name sex.x age.x name sex.y age.y
1 Max M 43 Sebastian M 8
2 Qin F 36 Kai-lee F 7
3 Sarah F 41 Oliver M 5
Warning message:
In merge.data.frame(parents, children, by.x = "name", by.y =
"parent") :
column name ?name? is duplicated in the result
```
Kind Regards,
Scott Ritchie
[[alternative HTML version deleted]]
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Index: src/library/base/R/merge.R
============================================================
=======
--- src/library/base/R/merge.R (revision 74264)
+++ src/library/base/R/merge.R (working copy)
@@ -157,6 +157,15 @@
}
if(has.common.nms) names(y) <- nm.y
+ ## If by.x %in% names(y) then duplicate column names
still
arise,
+ ## apply suffixes to these
+ dupe.keyx <- intersect(nm.by, names(y))
+ if(length(dupe.keyx)) {
+ if(nzchar(suffixes[1L]))
+ names(x)[match(dupe.keyx, names(x), 0L)] <-
paste(dupe.keyx, suffixes[1L], sep="")
+ if(nzchar(suffixes[2L]))
+ names(y)[match(dupe.keyx, names(y), 0L)] <-
paste(dupe.keyx, suffixes[2L], sep="")
+ }
nm <- c(names(x), names(y))
if(any(d <- duplicated(nm)))
if(sum(d) > 1L)
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[[alternative HTML version deleted]]
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
-- Gabriel Becker, PhD Scientist (Bioinformatics) Genentech Research
Index: src/library/base/R/merge.R
===================================================================
--- src/library/base/R/merge.R (revision 74264)
+++ src/library/base/R/merge.R (working copy)
@@ -157,6 +157,15 @@
}
if(has.common.nms) names(y) <- nm.y
+ ## If by.x %in% names(y) then duplicate column names still
arise,
+ ## apply suffixes to these
+ dupe.keyx <- intersect(nm.by, names(y))
+ if(length(dupe.keyx)) {
+ if(nzchar(suffixes[1L]))
+ names(x)[match(dupe.keyx, names(x), 0L)] <-
paste(dupe.keyx, suffixes[1L], sep="")
+ if(nzchar(suffixes[2L])) + names(y)[match(dupe.keyx, names(y), 0L)] <-
paste(dupe.keyx, suffixes[2L], sep="")
+ }
nm <- c(names(x), names(y))
if(any(d <- duplicated(nm)))
if(sum(d) > 1L)
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Index: src/library/base/R/merge.R
===================================================================
--- src/library/base/R/merge.R (revision 74280)
+++ src/library/base/R/merge.R (working copy)
@@ -157,6 +157,14 @@
}
if(has.common.nms) names(y) <- nm.y
+ ## If by.x %in% names(y) then duplicate column names still arise,
+ ## apply suffixes to just y - this keeps backwards compatibility
+ ## when referring to by.x in the resulting data.frame
+ dupe.keyx <- intersect(nm.by, names(y))
+ if(length(dupe.keyx)) {
+ if(nzchar(suffixes[2L]))
+ names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx, suffixes[2L], sep="")
+ }
nm <- c(names(x), names(y))
if(any(d <- duplicated(nm)))
if(sum(d) > 1L)
-------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: ritchie-testcase URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20180220/b5b74207/attachment.ksh>