-----Original Message-----
From: Rui Barradas [mailto:ruipbarradas at sapo.pt]
Sent: Tuesday, 4 September 2012 16:58
To: Meyners, Michael
Cc: r-help
Subject: Re: [R] unexpected (?) behavior of sort=TRUE in merge function
Hello,
You're right, I had missed the point, sorry.
I can't see a reason for that behavior, but it seems to be related to
all = T: remove it and the problem is gone. But that's probably not
what you want.
Could it be an issue with the NAs?
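To show why dropping all = T is probably not an option, here is a minimal sketch on made-up data (the objects `lev`, `grd`, `x`, `inner` and `outer` are hypothetical, not taken from the example in this thread). Without all = TRUE, merge keeps only the combinations present in both inputs, so the rows that would carry NA disappear altogether.

```r
# Hypothetical small data: 'x' covers only part of the grid 'grd'
lev <- c("Y1", "Y2", "G")
grd <- expand.grid(product = factor(lev, levels = lev), cong = c(-1, 0))
x <- data.frame(product = factor(c("Y1", "Y2", "G"), levels = lev),
                cong = -1, x = c(3.0, 4.2, 3.8))

inner <- merge(x, grd)              # only combinations present in both inputs
outer <- merge(x, grd, all = TRUE)  # every combination; unmatched x become NA

nrow(inner)          # rows kept without 'all'
nrow(outer)          # rows kept with all = TRUE
sum(is.na(outer$x))  # number of rows filled with NA
```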
Rui Barradas
On 04-09-2012 15:17, Meyners, Michael wrote:
Rui,
Thanks for looking into this. I apologize, I should have added my
output; maybe it looks different on my machine than on others. I also
should have made my question more explicit: I'm not looking for a
solution to get the sorting one way or another, I have that already.
Rather, I want to understand why the same code behaves differently on
two very similar datasets (one just having fewer rows, see below).
The first call gives the following for me:
lapply(test, function(x) merge(x, expand.grid(product=c("Y1", "Y2",
"G", "F", "L", "K"), cong=c(-1,0,1,11)), all=T, sort=TRUE))
[[1]]
   product cong        x
1        F   -1 5.857143
2        F    0 3.625000
3        F    1 4.782609
4        F   11 6.301887
5        G   -1 7.300000
6        G    0 4.800000
7        G    1 4.424242
8        G   11 5.781250
9        K   -1 4.375000
10       K    0 4.714286
11       K    1 3.804348
12       K   11 5.566038
13       L   -1 7.272727
14       L    0 6.250000
15       L    1 4.875000
16       L   11 6.877551
17      Y1   -1 5.857143
18      Y1    0 3.875000
19      Y1    1 3.535714
20      Y1   11 5.731707
21      Y2   -1 5.900000
22      Y2    0 2.500000
23      Y2    1 4.638889
24      Y2   11 5.419355
[[2]]
   product cong        x
1       Y1   -1 3.043478
2       Y1    0 4.887640
3       Y1    1       NA
4       Y1   11       NA
5       Y2   -1 4.181818
6       Y2    0 5.207921
7       Y2    1       NA
8       Y2   11       NA
9        G   -1 3.750000
10       G    0 5.680000
11       G    1       NA
12       G   11       NA
13       F   -1 4.315789
14       F    0 5.705263
15       F    1       NA
16       F   11       NA
17       L   -1 4.500000
18       L    0 6.386364
19       L    1       NA
20       L   11       NA
21       K   -1 3.739130
22       K    0 4.967033
23       K    1       NA
24       K   11       NA
So, unlike what you may have observed, the first data set [[1]] is
sorted here by the labels of "product", not by its values. As you
correctly stated, "Y1" is coded as 1, "Y2" as 2, etc., but the first
rows are for F, followed by G, etc. The second, [[2]], is sorted by
level (value). So I get different behavior on two very similar-looking
data sets, and hence, to me, at least one of them cannot be "right"
according to the documentation (though I agree with you that the second
is correct according to the help). In my larger example, it seems that
data sets that do not originally contain all combinations of product
and cong are sorted like [[2]], while those that are complete (all 24
combinations occur) are sorted like [[1]], which to me is still
"unexpected".
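A rough sketch of how this conjecture could be checked on synthetic data. The objects `complete` and `partial` and the helper `product_order` are hypothetical stand-ins, not the data sets from this thread, and the outcome may well depend on the R version, so no expected result is stated here.

```r
lev <- c("Y1", "Y2", "G", "F", "L", "K")
grd <- expand.grid(product = factor(lev, levels = lev), cong = c(-1, 0, 1, 11))

# Hypothetical stand-ins: one data set with all 24 combinations, one with 12
complete <- cbind(grd, x = seq_len(nrow(grd)))
partial  <- complete[complete$cong <= 0, ]

# Report whether the merged 'product' column follows the level order,
# the alphabetical label order, or neither
product_order <- function(d) {
  seen <- unique(as.character(merge(d, grd, all = TRUE)$product))
  if (identical(seen, lev)) "levels"
  else if (identical(seen, sort(lev))) "labels"
  else "other"
}

res <- sapply(list(complete = complete, partial = partial), product_order)
print(res)
```

If both entries agree, the two cases are sorted consistently in the R version at hand; if they differ, the behavior described above is reproduced.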
Hope this clarifies my question.
Any thoughts appreciated.
Michael
-----Original Message-----
From: Rui Barradas [mailto:ruipbarradas at sapo.pt]
Sent: Tuesday, 4 September 2012 14:01
To: Meyners, Michael
Cc: r-help
Subject: Re: [R] unexpected (?) behavior of sort=TRUE in merge function
Hello,
Inline.
On 04-09-2012 12:24, Meyners, Michael wrote:
All,
I realize from the archive that the sort argument in merge has been
the subject of discussion before, though I couldn't find an explanation
for this behavior. I tried to simplify this to a (kind of) minimal
example from a real one, as follows (and I have no doubt that there
are smart people around who could achieve the same with smarter code
:-)). I'm running R 2.15.1 64-bit under MS Windows 7; full session info
is below.
I have a list with two dataframes:
test <- list(structure(list(product = structure(c(1L, 2L, 3L, 4L, 5L,
6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L,
5L, 6L), .Label = c("Y1", "Y2", "G", "F", "L", "K"), class =
"factor"),
cong = c(-1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 1, 1, 11, 11, 11, 11, 11, 11), x = c(5.85714285714286,
5.9, 7.3, 5.85714285714286, 7.27272727272727, 4.375, 3.875,
2.5, 4.8, 3.625, 6.25, 4.71428571428571, 3.53571428571429,
4.63888888888889, 4.42424242424242, 4.78260869565217, 4.875,
3.80434782608696, 5.73170731707317, 5.41935483870968, 5.78125,
6.30188679245283, 6.87755102040816, 5.56603773584906)), .Names =
c("product", "cong", "x"), row.names = c(NA, -24L), class =
"data.frame"),
structure(list(product = structure(c(1L, 2L, 3L, 4L, 5L,
6L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("Y1", "Y2", "G",
"F", "L", "K"), class = "factor"), cong = c(-1, -1, -1, -1,
-1, -1, 0, 0, 0, 0, 0, 0), x = c(3.04347826086957,
4.18181818181818,
3.75, 4.31578947368421, 4.5, 3.73913043478261, 4.8876404494382,
5.20792079207921, 5.68, 5.70526315789474, 6.38636363636364,
4.96703296703297)), .Names = c("product", "cong", "x"),
row.names = c(NA,
-12L), class = "data.frame"))
The dataframes are pretty much the same, except for the values in the x
column and the fact that the second one has only half as many
observations, missing the second half of the expand.grid, if you will.
Now if I run
lapply(test, function(x) merge(x, expand.grid(product=c("Y1", "Y2",
"G", "F", "L", "K"), cong=c(-1,0,1,11)), all=T, sort=TRUE)) #
sort=TRUE is the default, so could be omitted
it sorts the first dataframe according to the labels of the factor
No, it doesn't. It sorts according to the columns, i.e., the values,
not according to the labels.
The help page clearly states that the argument 'sort' is "logical.
Should the results be sorted on the by columns?"
And "Y1" is coded as 1, "Y2" as 2, etc. The output is right.
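A small sketch of the difference between a factor's codes and its labels, using a made-up factor `f` with the same level order as 'product' (the vector itself is hypothetical, not part of the data above):

```r
# Hypothetical factor with the same level order as 'product'
lev <- c("Y1", "Y2", "G", "F", "L", "K")
f <- factor(c("K", "F", "Y1", "G"), levels = lev)

as.integer(f)          # internal codes: position of each label in 'lev'
as.character(sort(f))  # sort() on a factor follows the level order
sort(as.character(f))  # sorting the labels themselves is alphabetical
```

The second line returns the labels in level order ("Y1" first), the third in alphabetical order ("F" first).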
Try the following.
test2 <- test
test2[[1]]$product <- as.character(test[[1]]$product)
test2[[2]]$product <- as.character(test[[2]]$product)
# To make it more readable.
grd <- expand.grid(product=c("Y1", "Y2", "G", "F", "L", "K"),
                   cong=c(-1,0,1,11))
lapply(test2, function(x) merge(x, grd, all=T, sort=TRUE))
And now 'product' sorts from "F" to "Y2", even if grd$product is
still a factor with the same coding as in 'test'.
Hope this helps,
Rui Barradas
, while for the second one the order is maintained from the first
dataframe (x) passed to merge (which is the difference that I could not
find documented). Now if I run the same code with sort=FALSE
lapply(test, function(x) merge(x, expand.grid(product=c("Y1", "Y2",
"G", "F", "L", "K"), cong=c(-1,0,1,11)), all=T, sort=FALSE))
the results are at least consistent and fulfill my needs (this is,
btw, not unexpected from the documentation). Note that I get exactly
the same behavior if I apply merge separately to test[[1]] and
test[[2]], so it is not an issue with lapply. (I realize that my
dataframes are ordered by levels of product, but using test[[2]] <-
test[[2]][sample(12),] and applying the same code as above reveals
that indeed no sorting is done and the order is maintained from the
first dataframe.)
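For readers who need a fixed order regardless of what merge does, one possible approach is to merge with sort=FALSE and order the result explicitly. This is only a sketch on made-up data (`d`, `res`, `by_level` and `by_label` are hypothetical names), not the working solution mentioned below.

```r
lev <- c("Y1", "Y2", "G", "F", "L", "K")
grd <- expand.grid(product = factor(lev, levels = lev), cong = c(-1, 0, 1, 11))

# Hypothetical partial data set in arbitrary row order
d <- cbind(grd, x = seq_len(nrow(grd)))[c(8, 3, 11, 1, 6), ]

res <- merge(d, grd, all = TRUE, sort = FALSE)

# order() on a factor uses its integer codes, i.e. the level order
by_level <- res[order(res$product, res$cong), ]

# converting to character first gives alphabetical label order instead
by_label <- res[order(as.character(res$product), res$cong), ]
```

Ordering after the merge keeps the choice between level order and label order explicit in the code instead of relying on the sort argument.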
I have a working solution for myself, so I'm not after any advice on
how to achieve the sorting -- I'd just like to better understand
what's going on here and/or what I might have missed in the
documentation or in the list archives.
Thanks in advance,
Michael
Session info:
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_2.15.1