
unexpected (?) behavior of sort=TRUE in merge function

Rui, 

Thanks for looking into this. I apologize, I should have included my output; maybe it looks different on my machine than on others. I also should have made my question more explicit: I'm not looking for a way to get the sorting one way or the other, I have that already. Rather, I want to understand why the same code behaves differently on two very similar data sets (one just having fewer rows, see below).

The first call gives the following for me:
[[1]]
   product cong        x
1        F   -1 5.857143
2        F    0 3.625000
3        F    1 4.782609
4        F   11 6.301887
5        G   -1 7.300000
6        G    0 4.800000
7        G    1 4.424242
8        G   11 5.781250
9        K   -1 4.375000
10       K    0 4.714286
11       K    1 3.804348
12       K   11 5.566038
13       L   -1 7.272727
14       L    0 6.250000
15       L    1 4.875000
16       L   11 6.877551
17      Y1   -1 5.857143
18      Y1    0 3.875000
19      Y1    1 3.535714
20      Y1   11 5.731707
21      Y2   -1 5.900000
22      Y2    0 2.500000
23      Y2    1 4.638889
24      Y2   11 5.419355

[[2]]
   product cong        x
1       Y1   -1 3.043478
2       Y1    0 4.887640
3       Y1    1       NA
4       Y1   11       NA
5       Y2   -1 4.181818
6       Y2    0 5.207921
7       Y2    1       NA
8       Y2   11       NA
9        G   -1 3.750000
10       G    0 5.680000
11       G    1       NA
12       G   11       NA
13       F   -1 4.315789
14       F    0 5.705263
15       F    1       NA
16       F   11       NA
17       L   -1 4.500000
18       L    0 6.386364
19       L    1       NA
20       L   11       NA
21       K   -1 3.739130
22       K    0 4.967033
23       K    1       NA
24       K   11       NA
 

So, unlike what you may have observed, here the first data set [[1]] is sorted by the label of "product", not by its underlying value. As you correctly stated, "Y1" is coded as 1, "Y2" as 2, etc., but the first rows are for F, followed by G, and so on. The second, [[2]], is sorted by level (value). So I get different behavior on two very similar-looking data sets, and hence, to me, at least one of them cannot be "right" according to the documentation (though I agree with you that the second is correct according to the help). In my larger example, it seems that data sets which do not originally contain all combinations of product and cong are sorted like [[2]], while those that are complete (all 24 combinations occur) are sorted like [[1]], which to me is still "unexpected".
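To make the distinction between "sorted by label" and "sorted by level" concrete, here is a minimal sketch. It does not call merge() and does not explain the behavior; it only reproduces the two orderings on a hypothetical factor. The level order (Y1, Y2, G, F, L, K) is an assumption read off the row order in [[2]].

```r
# Hypothetical factor; the level order is an assumption taken from [[2]]
# ("Y1" coded as 1, "Y2" as 2, then G, F, L, K).
product <- factor(c("F", "G", "K", "L", "Y1", "Y2"),
                  levels = c("Y1", "Y2", "G", "F", "L", "K"))

# Ordering the factor itself uses the integer codes, i.e. the level order,
# which is the row order seen in [[2]].
by_level <- as.character(product[order(product)])

# Ordering the character labels sorts alphabetically,
# which is the row order seen in [[1]].
by_label <- as.character(product[order(as.character(product))])

print(by_level)
print(by_label)
```

The first vector follows the level order and the second the alphabetical order of the labels, which matches the difference between [[2]] and [[1]] above.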

Hope this clarifies my question.

Any thoughts appreciated.
Michael