`merge()` not consistent in how it treats list columns

Sun, Jan 3, 2021 2:14 AM

Hi Gabe,

I wouldn't mind it failing explicitly, on the grounds that you don't join a
list column on a character column, and I wouldn't mind it succeeding
either, because that would be consistent with `c("a", "b") == list("a", "b")`
and `c("a", "b") %in% list("a", "b")` both returning `c(TRUE, TRUE)`. But I
feel strongly that it shouldn't behave differently depending on which data
frame is passed to the function first, and I do think that if we make it an
error, it is worth making that error understandable.
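
To illustrate what an order-independent, understandable failure could look
like, here is a minimal sketch of a user-level wrapper. The name
`merge_atomic_keys` and the wording of the message are hypothetical; this is
not part of base R and not a proposed patch, and it only handles the simple
case where both data frames share the key names given in `by`.

```r
# Hypothetical wrapper (not base R): refuse list columns as join keys,
# with the same message whichever data frame comes first.
merge_atomic_keys <- function(x, y, by = intersect(names(x), names(y)), ...) {
  for (nm in by) {
    if (is.list(x[[nm]]) || is.list(y[[nm]])) {
      stop(sprintf("key column '%s' is a list column; atomic keys are expected",
                   nm), call. = FALSE)
    }
  }
  merge(x, y, by = by, ...)
}

df1 <- data.frame(a = 1)
df1$id <- "ID"          # character key
df2 <- data.frame(b = 2)
df2$id <- list("ID")    # list key, prints like a character column

# Capture the error message for both argument orders
msg_12 <- tryCatch(merge_atomic_keys(df1, df2), error = conditionMessage)
msg_21 <- tryCatch(merge_atomic_keys(df2, df1), error = conditionMessage)
identical(msg_12, msg_21)

# With an atomic key the wrapper simply defers to merge()
df2$id <- "ID"
res <- merge_atomic_keys(df1, df2)
```

The check runs before `merge()` is called, so the outcome does not depend on
which internal code path `merge()` would otherwise take for each order.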

To provide context, what I did wrong in my real case was compute `df2$id <-
lapply(x, fun)`. That was a mistake, but the result looked fine when
printed. `vapply` solved the issue; `sapply` would still have been
problematic, because `df2$id` would be an empty list for an `x` of length 0.
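
A small sketch of that difference, with `toupper` as a hypothetical stand-in
for my real `fun` and made-up values for `x`:

```r
x   <- c("a", "b")
fun <- toupper  # hypothetical stand-in for the real function

df2 <- data.frame(b = 1:2)

# lapply() always returns a list, so this creates a list column
df2$id <- lapply(x, fun)
lapply_is_list <- is.list(df2$id)

# vapply() with a character(1) template returns a character vector
df2$id <- vapply(x, fun, character(1), USE.NAMES = FALSE)
vapply_is_chr <- is.character(df2$id)

# For input of length 0, sapply() has nothing to simplify and returns a list,
# while vapply() still returns a zero-length character vector
empty <- character(0)
sapply_empty_is_list <- is.list(sapply(empty, fun))
vapply_empty_is_chr  <- is.character(vapply(empty, fun, character(1)))
```

All four flags come out `TRUE`, which is why `vapply` is the only one of the
three that guarantees an atomic column regardless of the length of `x`.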

After correcting my mistake I tried to isolate the error and had trouble
reproducing it with my simple case because I was swapping the two data
frame arguments. This is how the inconsistency plus the cryptic message
caused me more trouble than I think it should have.

Imagine production code, maybe not written by me, that has worked for years
with `merge(df1, df2)`. I change it to `merge(df2, df1)` for some reason and
all hell breaks loose with `Error in sort.list(bx[m$xi]): 'x' must be
atomic for 'sort.list', method "shell" and "quick"`. If I'm not familiar
with list columns, and with the fact that they can print just like
character columns, I might have a rough day.

Here's another oddity that I think is worth fixing:

df1 <- data.frame(a=1, id = "ID")
df3 <- data.frame(c=character(), id = list())
merge(df3, df1)
#> [1] x[FALSE, ] a          id
#> <0 rows> (or 0-length row.names)
merge(df1, df3)
#> [1] a          id         y[FALSE, ]
#> <0 rows> (or 0-length row.names)

 [...]

The doc does say that "This is intended to work with data frames with
vector-like columns" in a note at the bottom, so anything we do is
consistent with the doc, and it is fine by me if it fails (that's how
{dplyr} joins work), but the order of the data frames should not matter. A
warning is another option.

Well yes, one can always say it's the developer's fault, but we all
appreciate software that guides us toward the light. List columns are not
a rare thing at all anymore, and using an `lapply` call instead of `sapply`
or `vapply` is probably not a rare mistake. And again, the inconsistency is
wrong in any case.

I'll read other answers when I get the digest.

Thanks,

Antoine
