Skip to content

unexpected (?) behavior of sort=TRUE in merge function

6 messages · arun, Meyners, Michael, Rui Barradas

#
All,

I realize from the archive that the sort argument in merge has been subject to discussion before, though I couldn't find an explanation for this behavior. I tried to simplify this to (kind of) minimal code from a real example to the following (and I have no doubts that there are smart people around achieving the same with smarter code :-)). I'm running R 2.15.1 64bit under MS Windows 7, full session info below.
	
I do have a list with two dataframes:

test <- list(structure(list(product = structure(c(1L, 2L, 3L, 4L, 5L, 
6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 
4L, 5L, 6L), .Label = c("Y1", "Y2", "G", "F", "L", "K"), class = "factor"), 
    cong = c(-1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 1, 1, 
    1, 1, 1, 1, 11, 11, 11, 11, 11, 11), x = c(5.85714285714286, 
    5.9, 7.3, 5.85714285714286, 7.27272727272727, 4.375, 3.875, 
    2.5, 4.8, 3.625, 6.25, 4.71428571428571, 3.53571428571429, 
    4.63888888888889, 4.42424242424242, 4.78260869565217, 4.875, 
    3.80434782608696, 5.73170731707317, 5.41935483870968, 5.78125, 
    6.30188679245283, 6.87755102040816, 5.56603773584906)), .Names = c("product", 
"cong", "x"), row.names = c(NA, -24L), class = "data.frame"), 
    structure(list(product = structure(c(1L, 2L, 3L, 4L, 5L, 
    6L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("Y1", "Y2", "G", 
    "F", "L", "K"), class = "factor"), cong = c(-1, -1, -1, -1, 
    -1, -1, 0, 0, 0, 0, 0, 0), x = c(3.04347826086957, 4.18181818181818, 
    3.75, 4.31578947368421, 4.5, 3.73913043478261, 4.8876404494382, 
    5.20792079207921, 5.68, 5.70526315789474, 6.38636363636364, 
    4.96703296703297)), .Names = c("product", "cong", "x"), row.names = c(NA, 
    -12L), class = "data.frame"))


The dataframes are pretty much the same but for the values in the x-column and the fact that the second one has only half as many observations, missing the second half of the expand.grid if you like. Now if I run

lapply(test, function(x) merge(x, expand.grid(product=c("Y1", "Y2", "G", "F", "L", "K"), cong=c(-1,0,1,11)), all=T, sort=TRUE))	  # sort=TRUE is the default, so could be omitted

sorts the first dataframe according to the labels of factor "product", while for the second one the order is maintained from the first dataframes (x) to merge (which is the difference that I could not find being documented). Now I run the same code with sort=FALSE instead:

lapply(test, function(x) merge(x, expand.grid(product=c("Y1", "Y2", "G", "F", "L", "K"), cong=c(-1,0,1,11)), all=T, sort=FALSE))

The results are at least consistent and fulfill my needs (this is, btw, not unexpected from the documentation). Note that I get exactly the same behavior if I apply merge subsequently to test[[1]] and test[[2]], so it is not an issue from lapply. (I realize that my dataframes are ordered by levels of product, but using test[[2]] <- test[[2]][sample(12),] and applying the same code as above reveals that indeed no sorting is done but the order is maintained from the first dataframe.)

I have a working solution for myself, so I'm not after any advice on how to achieve the sorting -- I'd just like to better understand what's going on here and/or what I might have missed in the documentation or in the list archives. 

Thanks in advance, 
Michael



Session info:
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                    LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_2.15.1
#
Hi,
Try this:
convert.type1 <- function(obj,types){
??? for (i in 1:length(obj)){
??????? FUN <- switch(types[i],character = as.character, 
?????????????????????????????????? numeric = as.numeric, 
?????????????????????????????????? factor = as.factor)
??????? obj[,i] <- FUN(obj[,i])
??? }
??? obj
}

test1<-test
?test1[[1]]<-convert.type1(test1[[1]],c("character","numeric","numeric"))
?test1[[2]]<-convert.type1(test1[[2]],c("character","numeric","numeric"))
lapply(test1, function(x) merge(x, expand.grid(product=c("Y1", "Y2", "G", "F", "L", "K"), cong=c(-1,0,1,11)), all=T, sort=TRUE))?? 
------
------
[[2]]
?? product cong??????? x
1??????? F?? -1 4.315789
2??????? F??? 0 5.705263
3??????? F??? 1?????? NA
4??????? F?? 11?????? NA
5??????? G?? -1 3.750000
6??????? G??? 0 5.680000
7??????? G??? 1?????? NA
8??????? G?? 11?????? NA
9??????? K?? -1 3.739130
10?????? K??? 0 4.967033
11?????? K??? 1?????? NA
12?????? K?? 11?????? NA
13?????? L?? -1 4.500000
14?????? L??? 0 6.386364
15?????? L??? 1?????? NA
16?????? L?? 11?????? NA
17????? Y1?? -1 3.043478
18????? Y1??? 0 4.887640
19????? Y1??? 1?????? NA
20????? Y1?? 11?????? NA
21????? Y2?? -1 4.181818
22????? Y2??? 0 5.207921
23????? Y2??? 1?????? NA
24????? Y2?? 11?????? NA

A.K.




----- Original Message -----
From: "Meyners, Michael" <meyners.m at pg.com>
To: "r-help at r-project.org" <r-help at r-project.org>
Cc: 
Sent: Tuesday, September 4, 2012 7:24 AM
Subject: [R] unexpected (?) behavior of sort=TRUE in merge function

All,

I realize from the archive that the sort argument in merge has been subject to discussion before, though I couldn't find an explanation for this behavior. I tried to simplify this to (kind of) minimal code from a real example to the following (and I have no doubts that there are smart people around achieving the same with smarter code :-)). I'm running R 2.15.1 64bit under MS Windows 7, full session info below.
??? 
I do have a list with two dataframes:

test <- list(structure(list(product = structure(c(1L, 2L, 3L, 4L, 5L, 
6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 
4L, 5L, 6L), .Label = c("Y1", "Y2", "G", "F", "L", "K"), class = "factor"), 
? ? cong = c(-1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 1, 1, 
? ? 1, 1, 1, 1, 11, 11, 11, 11, 11, 11), x = c(5.85714285714286, 
? ? 5.9, 7.3, 5.85714285714286, 7.27272727272727, 4.375, 3.875, 
? ? 2.5, 4.8, 3.625, 6.25, 4.71428571428571, 3.53571428571429, 
? ? 4.63888888888889, 4.42424242424242, 4.78260869565217, 4.875, 
? ? 3.80434782608696, 5.73170731707317, 5.41935483870968, 5.78125, 
? ? 6.30188679245283, 6.87755102040816, 5.56603773584906)), .Names = c("product", 
"cong", "x"), row.names = c(NA, -24L), class = "data.frame"), 
? ? structure(list(product = structure(c(1L, 2L, 3L, 4L, 5L, 
? ? 6L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("Y1", "Y2", "G", 
? ? "F", "L", "K"), class = "factor"), cong = c(-1, -1, -1, -1, 
? ? -1, -1, 0, 0, 0, 0, 0, 0), x = c(3.04347826086957, 4.18181818181818, 
? ? 3.75, 4.31578947368421, 4.5, 3.73913043478261, 4.8876404494382, 
? ? 5.20792079207921, 5.68, 5.70526315789474, 6.38636363636364, 
? ? 4.96703296703297)), .Names = c("product", "cong", "x"), row.names = c(NA, 
? ? -12L), class = "data.frame"))


The dataframes are pretty much the same but for the values in the x-column and the fact that the second one has only half as many observations, missing the second half of the expand.grid if you like. Now if I run

lapply(test, function(x) merge(x, expand.grid(product=c("Y1", "Y2", "G", "F", "L", "K"), cong=c(-1,0,1,11)), all=T, sort=TRUE))??? ? # sort=TRUE is the default, so could be omitted

sorts the first dataframe according to the labels of factor "product", while for the second one the order is maintained from the first dataframes (x) to merge (which is the difference that I could not find being documented). Now I run the same code with sort=FALSE instead:

lapply(test, function(x) merge(x, expand.grid(product=c("Y1", "Y2", "G", "F", "L", "K"), cong=c(-1,0,1,11)), all=T, sort=FALSE))

The results are at least consistent and fulfill my needs (this is, btw, not unexpected from the documentation). Note that I get exactly the same behavior if I apply merge subsequently to test[[1]] and test[[2]], so it is not an issue from lapply. (I realize that my dataframes are ordered by levels of product, but using test[[2]] <- test[[2]][sample(12),] and applying the same code as above reveals that indeed no sorting is done but the order is maintained from the first dataframe.)

I have a working solution for myself, so I'm not after any advice on how to achieve the sorting -- I'd just like to better understand what's going on here and/or what I might have missed in the documentation or in the list archives. 

Thanks in advance, 
Michael



Session info:
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=German_Germany.1252? LC_CTYPE=German_Germany.1252? ? LC_MONETARY=German_Germany.1252 LC_NUMERIC=C? ? ? ? ? ? ? ? ? ? LC_TIME=German_Germany.1252? ? 

attached base packages:
[1] stats? ?  graphics? grDevices utils? ?  datasets? methods?  base? ? 

loaded via a namespace (and not attached):
[1] tools_2.15.1

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
#
Rui, 

Thanks for looking into this. I apologize, I should've added my output, maybe it looks differently on my machine than on others. I also should have made my question more explicit: I'm not looking for a solution to get the sorting one way or another, I have that already. I rather want to understand why the same code behaves differently on two very similar datasets (one just having less rows, see below).

The first call gives the following for me:
[[1]]
   product cong        x
1        F   -1 5.857143
2        F    0 3.625000
3        F    1 4.782609
4        F   11 6.301887
5        G   -1 7.300000
6        G    0 4.800000
7        G    1 4.424242
8        G   11 5.781250
9        K   -1 4.375000
10       K    0 4.714286
11       K    1 3.804348
12       K   11 5.566038
13       L   -1 7.272727
14       L    0 6.250000
15       L    1 4.875000
16       L   11 6.877551
17      Y1   -1 5.857143
18      Y1    0 3.875000
19      Y1    1 3.535714
20      Y1   11 5.731707
21      Y2   -1 5.900000
22      Y2    0 2.500000
23      Y2    1 4.638889
24      Y2   11 5.419355

[[2]]
   product cong        x
1       Y1   -1 3.043478
2       Y1    0 4.887640
3       Y1    1       NA
4       Y1   11       NA
5       Y2   -1 4.181818
6       Y2    0 5.207921
7       Y2    1       NA
8       Y2   11       NA
9        G   -1 3.750000
10       G    0 5.680000
11       G    1       NA
12       G   11       NA
13       F   -1 4.315789
14       F    0 5.705263
15       F    1       NA
16       F   11       NA
17       L   -1 4.500000
18       L    0 6.386364
19       L    1       NA
20       L   11       NA
21       K   -1 3.739130
22       K    0 4.967033
23       K    1       NA
24       K   11       NA
 

So different from what you may have observed, here the first data set [[1]] is sorted by label of "product", not by value. As you correctly stated, Y1" is coded as 1, "Y2" as 2, etc., but the first rows are for F, followed by G etc. The second [[2]] is sorted by level (value). So I have different behavior on very similar looking data sets, and hence to me at least one of those cannot be "right" according to documentation (but I agree with you that the second is correct according to the help). In my larger example, it seems as if data sets which do not originally have all combinations of product and cong anyway are sorted like [[2]], and those that are complete (all 24 combinations occur) are sorted like [[1]] is, which to me is still "unexpected".

Hope this clarifies my question.

Any thoughts appreciated.
Michael
#
Hello,

You're right I had missed the point, sorry.
I can't see a reason why that behavior, but it seems to have to do with 
all = T, remove it and the problem is gone. But that's probably not what 
you want.
NA's issue?

Rui Barradas

Em 04-09-2012 15:17, Meyners, Michael escreveu:
#
Rui,

yes, without all=T it works fine, but of course there is no point in the whole exercise if I'd drop that, as print(test) would do the same, unless I have other values of product or cong in any dataset, which I haven't. :-) 

The purpose of the merge is to have all combinations of the levels of product and cong in each dataframe in my list -- there might be smarter ways, but it does the trick, and I need a unified setup (layout, size and sorting of the data) to ease the following steps in my code. Of course, I could achieve that easily by sorting the data subsequently, so there are multiple ways to get what I want to have. However, the purpose being a bit beyond this post, it's really about the behavior here on data sets that look so similar, and about the fact that one of those is not treated like it should be according to documentation). 

Thanks again for taking the time to reply.

Cheers, Michael