Dear useRs,
I have just stumbled across a behavior in aggregate() that I cannot
explain. Any help would be appreciated!
Sample data:
my_data <- structure(list(ID = c("FLINT-1", "FLINT-10", "FLINT-100",
"FLINT-101", "FLINT-102", "HORN-10", "HORN-100", "HORN-102", "HORN-103",
"HORN-104"), EdgeLength = c(130.75, 168.77, 142.79, 130.1, 140.41,
121.37, 70.52, 122.3, 71.01, 104.5), SurfaceArea = c(1736.87, 1571.83,
1656.46, 1247.18, 1177.47, 1169.26, 444.61, 1791.48, 461.15, 1127.2),
Length = c(44.384, 29.831, 43.869, 48.011, 54.109, 41.742, 23.854,
32.075, 21.337, 35.459), Width = c(45.982, 67.303, 52.679, 26.42,
25.149, 33.427, 20.683, 62.783, 26.417, 35.297), PLATWIDTH = c(38.84,
NA, 15.33, 30.37, 11.44, 14.88, 13.86, NA, NA, 26.71), PLATTHICK =
c(8.67, NA, 7.99, 11.69, 3.3, 16.52, 4.58, NA, NA, 9.35), EPA = c(78,
NA, 78, 54, 72, 49, 56, NA, NA, 56), THICKNESS = c(10.97, NA, 9.36, 6.4,
5.89, 11.05, 4.9, NA, NA, 10.08), WEIGHT = c(34.3, NA, 25.5, 18.6, 14.9,
29.5, 4.5, NA, NA, 23), RAWMAT = c("FLINT", "FLINT", "FLINT", "FLINT",
"FLINT", "HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS")),
row.names = c(1L, 2L, 3L, 4L, 5L, 111L, 112L, 113L, 114L, 115L), class =
"data.frame")
1) Simple aggregation with 2 variables:
aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data, FUN = mean,
na.rm = TRUE)
2) Using the dot notation - different results:
aggregate(. ~ RAWMAT, data = my_data[-1], FUN = mean, na.rm = TRUE)
3) Using dplyr, I get the same results as #1:
group_by(my_data, RAWMAT) %>%
? summarise(across(c("Length", "Width"), ~ mean(.x, na.rm = TRUE)))
4) It gets weirder: using all columns in #1 give the same results as in
#2 but different from #1 and #3
aggregate(cbind(EdgeLength, SurfaceArea, Length, Width, PLATWIDTH,
PLATTHICK, EPA, THICKNESS, WEIGHT) ~ RAWMAT, data = my_data, FUN = mean,
na.rm = TRUE)
So it seems it is not only due to the notation (cbind() vs. dot). Is it
a bug? A peculiar thing in my dataset? I tend to think this could be due
to some variables (or their names) as all notations seem to agree when I
remove some variables (although I haven't found out which variable(s) is
(are) at fault), e.g.:
my_data2 <- structure(list(ID = c("FLINT-1", "FLINT-10", "FLINT-100",
"FLINT-101", "FLINT-102", "HORN-10", "HORN-100", "HORN-102", "HORN-103",
"HORN-104"), EdgeLength = c(130.75, 168.77, 142.79, 130.1, 140.41,
121.37, 70.52, 122.3, 71.01, 104.5), SurfaceArea = c(1736.87, 1571.83,
1656.46, 1247.18, 1177.47, 1169.26, 444.61, 1791.48, 461.15, 1127.2),
Length = c(44.384, 29.831, 43.869, 48.011, 54.109, 41.742, 23.854,
32.075, 21.337, 35.459), Width = c(45.982, 67.303, 52.679, 26.42,
25.149, 33.427, 20.683, 62.783, 26.417, 35.297), RAWMAT = c("FLINT",
"FLINT", "FLINT", "FLINT", "FLINT", "HORNFELS", "HORNFELS", "HORNFELS",
"HORNFELS", "HORNFELS")), row.names = c(1L, 2L, 3L, 4L, 5L, 111L, 112L,
113L, 114L, 115L), class = "data.frame")
aggregate(cbind(EdgeLength, SurfaceArea, Length, Width) ~ RAWMAT, data =
my_data2, FUN = mean, na.rm = TRUE)
aggregate(. ~ RAWMAT, data = my_data2[-1], FUN = mean, na.rm = TRUE)
group_by(my_data2, RAWMAT) %>%
? summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
Thank you in advance for any hint.
Best wishes,
Ivan
*LEIBNIZ-ZENTRUM*
*F?R ARCH?OLOGIE*
*Dr. Ivan CALANDRA*
**Head of IMPALA (IMaging Platform At LeizA)
*MONREPOS* Archaeological Research Centre, Schloss Monrepos
56567 Neuwied, Germany
T: +49 2631 9772 243
T: +49 6131 8885 543
ivan.calandra at leiza.de
leiza.de <http://www.leiza.de/>
<http://www.leiza.de/>
ORCID <https://orcid.org/0000-0003-3816-6359>
ResearchGate
<https://www.researchgate.net/profile/Ivan_Calandra>
LEIZA is a foundation under public law of the State of
Rhineland-Palatinate and the City of Mainz. Its headquarters are in
Mainz. Supervision is carried out by the Ministry of Science and Health
of the State of Rhineland-Palatinate. LEIZA is a research museum of the
Leibniz Association.
aggregate formula - differing results
9 messages · Iago Giné Vázquez, Ivan Calandra, Rui Barradas +2 more
It seems that the issue are the missings. If in #1 you use the dataset na.omit(my_data) instead of my_data, you get the same output that in #2 and in #4, where all observations with missing data are removed since you are including all the variables. The second dataset has no issue since it has no missing data. Iago
De: R-help <r-help-bounces at r-project.org> de part de Ivan Calandra <ivan.calandra at leiza.de>
Enviat el: dilluns, 4 de setembre de 2023 11:44
Per a: R-help <r-help at r-project.org>
Tema: [R] aggregate formula - differing results
Dear useRs,
I have just stumbled across a behavior in aggregate() that I cannot
explain. Any help would be appreciated!
Sample data:
my_data <- structure(list(ID = c("FLINT-1", "FLINT-10", "FLINT-100",
"FLINT-101", "FLINT-102", "HORN-10", "HORN-100", "HORN-102", "HORN-103",
"HORN-104"), EdgeLength = c(130.75, 168.77, 142.79, 130.1, 140.41,
121.37, 70.52, 122.3, 71.01, 104.5), SurfaceArea = c(1736.87, 1571.83,
1656.46, 1247.18, 1177.47, 1169.26, 444.61, 1791.48, 461.15, 1127.2),
Length = c(44.384, 29.831, 43.869, 48.011, 54.109, 41.742, 23.854,
32.075, 21.337, 35.459), Width = c(45.982, 67.303, 52.679, 26.42,
25.149, 33.427, 20.683, 62.783, 26.417, 35.297), PLATWIDTH = c(38.84,
NA, 15.33, 30.37, 11.44, 14.88, 13.86, NA, NA, 26.71), PLATTHICK =
c(8.67, NA, 7.99, 11.69, 3.3, 16.52, 4.58, NA, NA, 9.35), EPA = c(78,
NA, 78, 54, 72, 49, 56, NA, NA, 56), THICKNESS = c(10.97, NA, 9.36, 6.4,
5.89, 11.05, 4.9, NA, NA, 10.08), WEIGHT = c(34.3, NA, 25.5, 18.6, 14.9,
29.5, 4.5, NA, NA, 23), RAWMAT = c("FLINT", "FLINT", "FLINT", "FLINT",
"FLINT", "HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS")),
row.names = c(1L, 2L, 3L, 4L, 5L, 111L, 112L, 113L, 114L, 115L), class =
"data.frame")
1) Simple aggregation with 2 variables:
aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data, FUN = mean,
na.rm = TRUE)
2) Using the dot notation - different results:
aggregate(. ~ RAWMAT, data = my_data[-1], FUN = mean, na.rm = TRUE)
3) Using dplyr, I get the same results as #1:
group_by(my_data, RAWMAT) %>%
summarise(across(c("Length", "Width"), ~ mean(.x, na.rm = TRUE)))
4) It gets weirder: using all columns in #1 give the same results as in
#2 but different from #1 and #3
aggregate(cbind(EdgeLength, SurfaceArea, Length, Width, PLATWIDTH,
PLATTHICK, EPA, THICKNESS, WEIGHT) ~ RAWMAT, data = my_data, FUN = mean,
na.rm = TRUE)
So it seems it is not only due to the notation (cbind() vs. dot). Is it
a bug? A peculiar thing in my dataset? I tend to think this could be due
to some variables (or their names) as all notations seem to agree when I
remove some variables (although I haven't found out which variable(s) is
(are) at fault), e.g.:
my_data2 <- structure(list(ID = c("FLINT-1", "FLINT-10", "FLINT-100",
"FLINT-101", "FLINT-102", "HORN-10", "HORN-100", "HORN-102", "HORN-103",
"HORN-104"), EdgeLength = c(130.75, 168.77, 142.79, 130.1, 140.41,
121.37, 70.52, 122.3, 71.01, 104.5), SurfaceArea = c(1736.87, 1571.83,
1656.46, 1247.18, 1177.47, 1169.26, 444.61, 1791.48, 461.15, 1127.2),
Length = c(44.384, 29.831, 43.869, 48.011, 54.109, 41.742, 23.854,
32.075, 21.337, 35.459), Width = c(45.982, 67.303, 52.679, 26.42,
25.149, 33.427, 20.683, 62.783, 26.417, 35.297), RAWMAT = c("FLINT",
"FLINT", "FLINT", "FLINT", "FLINT", "HORNFELS", "HORNFELS", "HORNFELS",
"HORNFELS", "HORNFELS")), row.names = c(1L, 2L, 3L, 4L, 5L, 111L, 112L,
113L, 114L, 115L), class = "data.frame")
aggregate(cbind(EdgeLength, SurfaceArea, Length, Width) ~ RAWMAT, data =
my_data2, FUN = mean, na.rm = TRUE)
aggregate(. ~ RAWMAT, data = my_data2[-1], FUN = mean, na.rm = TRUE)
group_by(my_data2, RAWMAT) %>%
summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
Thank you in advance for any hint.
Best wishes,
Ivan
*LEIBNIZ-ZENTRUM*
*F?R ARCH?OLOGIE*
*Dr. Ivan CALANDRA*
**Head of IMPALA (IMaging Platform At LeizA)
*MONREPOS* Archaeological Research Centre, Schloss Monrepos
56567 Neuwied, Germany
T: +49 2631 9772 243
T: +49 6131 8885 543
ivan.calandra at leiza.de
leiza.de <http://www.leiza.de/>
<http://www.leiza.de/>
ORCID <https://orcid.org/0000-0003-3816-6359>
ResearchGate
<https://www.researchgate.net/profile/Ivan_Calandra>
LEIZA is a foundation under public law of the State of
Rhineland-Palatinate and the City of Mainz. Its headquarters are in
Mainz. Supervision is carried out by the Ministry of Science and Health
of the State of Rhineland-Palatinate. LEIZA is a research museum of the
Leibniz Association.
______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Thanks Iago for the pointer. It then means that na.rm = TRUE is not applied in the same way within aggregate() as opposed to dplyr::group_by() + summarise(), right? Within aggregate, it behaves like na.omit(), that is, it excludes the incomplete cases (whole rows), whereas with group_by() + summarise() it is applied on each vector (variable), which is what I actually would expect. I hadn't showed it, but doBy::summaryBy() produces the same results as group_by() + summarise(). Ivan
On 04/09/2023 12:45, Iago Gin? V?zquez wrote:
It seems that the issue are the missings. If in? #1 you use the
dataset na.omit(my_data) instead of my_data, you get the same output
that in #2 and in #4, where all observations with missing data are
removed since you are including all the variables.
The second dataset has no issue since it has no missing data.
Iago
------------------------------------------------------------------------
*De:* R-help <r-help-bounces at r-project.org> de part de Ivan Calandra
<ivan.calandra at leiza.de>
*Enviat el:* dilluns, 4 de setembre de 2023 11:44
*Per a:* R-help <r-help at r-project.org>
*Tema:* [R] aggregate formula - differing results
Dear useRs,
I have just stumbled across a behavior in aggregate() that I cannot
explain. Any help would be appreciated!
Sample data:
my_data <- structure(list(ID = c("FLINT-1", "FLINT-10", "FLINT-100",
"FLINT-101", "FLINT-102", "HORN-10", "HORN-100", "HORN-102", "HORN-103",
"HORN-104"), EdgeLength = c(130.75, 168.77, 142.79, 130.1, 140.41,
121.37, 70.52, 122.3, 71.01, 104.5), SurfaceArea = c(1736.87, 1571.83,
1656.46, 1247.18, 1177.47, 1169.26, 444.61, 1791.48, 461.15, 1127.2),
Length = c(44.384, 29.831, 43.869, 48.011, 54.109, 41.742, 23.854,
32.075, 21.337, 35.459), Width = c(45.982, 67.303, 52.679, 26.42,
25.149, 33.427, 20.683, 62.783, 26.417, 35.297), PLATWIDTH = c(38.84,
NA, 15.33, 30.37, 11.44, 14.88, 13.86, NA, NA, 26.71), PLATTHICK =
c(8.67, NA, 7.99, 11.69, 3.3, 16.52, 4.58, NA, NA, 9.35), EPA = c(78,
NA, 78, 54, 72, 49, 56, NA, NA, 56), THICKNESS = c(10.97, NA, 9.36, 6.4,
5.89, 11.05, 4.9, NA, NA, 10.08), WEIGHT = c(34.3, NA, 25.5, 18.6, 14.9,
29.5, 4.5, NA, NA, 23), RAWMAT = c("FLINT", "FLINT", "FLINT", "FLINT",
"FLINT", "HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS")),
row.names = c(1L, 2L, 3L, 4L, 5L, 111L, 112L, 113L, 114L, 115L), class =
"data.frame")
1) Simple aggregation with 2 variables:
aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data, FUN = mean,
na.rm = TRUE)
2) Using the dot notation - different results:
aggregate(. ~ RAWMAT, data = my_data[-1], FUN = mean, na.rm = TRUE)
3) Using dplyr, I get the same results as #1:
group_by(my_data, RAWMAT) %>%
?? summarise(across(c("Length", "Width"), ~ mean(.x, na.rm = TRUE)))
4) It gets weirder: using all columns in #1 give the same results as in
#2 but different from #1 and #3
aggregate(cbind(EdgeLength, SurfaceArea, Length, Width, PLATWIDTH,
PLATTHICK, EPA, THICKNESS, WEIGHT) ~ RAWMAT, data = my_data, FUN = mean,
na.rm = TRUE)
So it seems it is not only due to the notation (cbind() vs. dot). Is it
a bug? A peculiar thing in my dataset? I tend to think this could be due
to some variables (or their names) as all notations seem to agree when I
remove some variables (although I haven't found out which variable(s) is
(are) at fault), e.g.:
my_data2 <- structure(list(ID = c("FLINT-1", "FLINT-10", "FLINT-100",
"FLINT-101", "FLINT-102", "HORN-10", "HORN-100", "HORN-102", "HORN-103",
"HORN-104"), EdgeLength = c(130.75, 168.77, 142.79, 130.1, 140.41,
121.37, 70.52, 122.3, 71.01, 104.5), SurfaceArea = c(1736.87, 1571.83,
1656.46, 1247.18, 1177.47, 1169.26, 444.61, 1791.48, 461.15, 1127.2),
Length = c(44.384, 29.831, 43.869, 48.011, 54.109, 41.742, 23.854,
32.075, 21.337, 35.459), Width = c(45.982, 67.303, 52.679, 26.42,
25.149, 33.427, 20.683, 62.783, 26.417, 35.297), RAWMAT = c("FLINT",
"FLINT", "FLINT", "FLINT", "FLINT", "HORNFELS", "HORNFELS", "HORNFELS",
"HORNFELS", "HORNFELS")), row.names = c(1L, 2L, 3L, 4L, 5L, 111L, 112L,
113L, 114L, 115L), class = "data.frame")
aggregate(cbind(EdgeLength, SurfaceArea, Length, Width) ~ RAWMAT, data =
my_data2, FUN = mean, na.rm = TRUE)
aggregate(. ~ RAWMAT, data = my_data2[-1], FUN = mean, na.rm = TRUE)
group_by(my_data2, RAWMAT) %>%
?? summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
Thank you in advance for any hint.
Best wishes,
Ivan
??????? *LEIBNIZ-ZENTRUM*
*F?R ARCH?OLOGIE*
*Dr. Ivan CALANDRA*
**Head of IMPALA (IMaging Platform At LeizA)
*MONREPOS* Archaeological Research Centre, Schloss Monrepos
56567 Neuwied, Germany
T: +49 2631 9772 243
T: +49 6131 8885 543
ivan.calandra at leiza.de
leiza.de <http://www.leiza.de/>
<http://www.leiza.de/>
ORCID <https://orcid.org/0000-0003-3816-6359>
ResearchGate
<https://www.researchgate.net/profile/Ivan_Calandra>
LEIZA is a foundation under public law of the State of
Rhineland-Palatinate and the City of Mainz. Its headquarters are in
Mainz. Supervision is carried out by the Ministry of Science and Health
of the State of Rhineland-Palatinate. LEIZA is a research museum of the
Leibniz Association.
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html <http://www.R-project.org/posting-guide.html> and provide commented, minimal, self-contained, reproducible code.
?s 10:44 de 04/09/2023, Ivan Calandra escreveu:
Dear useRs,
I have just stumbled across a behavior in aggregate() that I cannot
explain. Any help would be appreciated!
Sample data:
my_data <- structure(list(ID = c("FLINT-1", "FLINT-10", "FLINT-100",
"FLINT-101", "FLINT-102", "HORN-10", "HORN-100", "HORN-102", "HORN-103",
"HORN-104"), EdgeLength = c(130.75, 168.77, 142.79, 130.1, 140.41,
121.37, 70.52, 122.3, 71.01, 104.5), SurfaceArea = c(1736.87, 1571.83,
1656.46, 1247.18, 1177.47, 1169.26, 444.61, 1791.48, 461.15, 1127.2),
Length = c(44.384, 29.831, 43.869, 48.011, 54.109, 41.742, 23.854,
32.075, 21.337, 35.459), Width = c(45.982, 67.303, 52.679, 26.42,
25.149, 33.427, 20.683, 62.783, 26.417, 35.297), PLATWIDTH = c(38.84,
NA, 15.33, 30.37, 11.44, 14.88, 13.86, NA, NA, 26.71), PLATTHICK =
c(8.67, NA, 7.99, 11.69, 3.3, 16.52, 4.58, NA, NA, 9.35), EPA = c(78,
NA, 78, 54, 72, 49, 56, NA, NA, 56), THICKNESS = c(10.97, NA, 9.36, 6.4,
5.89, 11.05, 4.9, NA, NA, 10.08), WEIGHT = c(34.3, NA, 25.5, 18.6, 14.9,
29.5, 4.5, NA, NA, 23), RAWMAT = c("FLINT", "FLINT", "FLINT", "FLINT",
"FLINT", "HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS")),
row.names = c(1L, 2L, 3L, 4L, 5L, 111L, 112L, 113L, 114L, 115L), class =
"data.frame")
1) Simple aggregation with 2 variables:
aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data, FUN = mean,
na.rm = TRUE)
2) Using the dot notation - different results:
aggregate(. ~ RAWMAT, data = my_data[-1], FUN = mean, na.rm = TRUE)
3) Using dplyr, I get the same results as #1:
group_by(my_data, RAWMAT) %>%
? summarise(across(c("Length", "Width"), ~ mean(.x, na.rm = TRUE)))
4) It gets weirder: using all columns in #1 give the same results as in
#2 but different from #1 and #3
aggregate(cbind(EdgeLength, SurfaceArea, Length, Width, PLATWIDTH,
PLATTHICK, EPA, THICKNESS, WEIGHT) ~ RAWMAT, data = my_data, FUN = mean,
na.rm = TRUE)
So it seems it is not only due to the notation (cbind() vs. dot). Is it
a bug? A peculiar thing in my dataset? I tend to think this could be due
to some variables (or their names) as all notations seem to agree when I
remove some variables (although I haven't found out which variable(s) is
(are) at fault), e.g.:
my_data2 <- structure(list(ID = c("FLINT-1", "FLINT-10", "FLINT-100",
"FLINT-101", "FLINT-102", "HORN-10", "HORN-100", "HORN-102", "HORN-103",
"HORN-104"), EdgeLength = c(130.75, 168.77, 142.79, 130.1, 140.41,
121.37, 70.52, 122.3, 71.01, 104.5), SurfaceArea = c(1736.87, 1571.83,
1656.46, 1247.18, 1177.47, 1169.26, 444.61, 1791.48, 461.15, 1127.2),
Length = c(44.384, 29.831, 43.869, 48.011, 54.109, 41.742, 23.854,
32.075, 21.337, 35.459), Width = c(45.982, 67.303, 52.679, 26.42,
25.149, 33.427, 20.683, 62.783, 26.417, 35.297), RAWMAT = c("FLINT",
"FLINT", "FLINT", "FLINT", "FLINT", "HORNFELS", "HORNFELS", "HORNFELS",
"HORNFELS", "HORNFELS")), row.names = c(1L, 2L, 3L, 4L, 5L, 111L, 112L,
113L, 114L, 115L), class = "data.frame")
aggregate(cbind(EdgeLength, SurfaceArea, Length, Width) ~ RAWMAT, data =
my_data2, FUN = mean, na.rm = TRUE)
aggregate(. ~ RAWMAT, data = my_data2[-1], FUN = mean, na.rm = TRUE)
group_by(my_data2, RAWMAT) %>%
? summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
Thank you in advance for any hint.
Best wishes,
Ivan
????*LEIBNIZ-ZENTRUM*
*F?R ARCH?OLOGIE*
*Dr. Ivan CALANDRA*
**Head of IMPALA (IMaging Platform At LeizA)
*MONREPOS* Archaeological Research Centre, Schloss Monrepos
56567 Neuwied, Germany
T: +49 2631 9772 243
T: +49 6131 8885 543
ivan.calandra at leiza.de
leiza.de <http://www.leiza.de/>
<http://www.leiza.de/>
ORCID <https://orcid.org/0000-0003-3816-6359>
ResearchGate
<https://www.researchgate.net/profile/Ivan_Calandra>
LEIZA is a foundation under public law of the State of
Rhineland-Palatinate and the City of Mainz. Its headquarters are in
Mainz. Supervision is carried out by the Ministry of Science and Health
of the State of Rhineland-Palatinate. LEIZA is a research museum of the
Leibniz Association.
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Hello,
You can define a vector of the columns of interest and subset the data
with it. Then the default na.action = na.omit will no longer remove the
rows with NA vals in at least one column and the results are the same.
However, this will not give the mean values of the other numeric
columns, just of those two.
# define a vector of columns of interest
cols <- c("Length", "Width", "RAWMAT")
# 1) Simple aggregation with 2 variables, select cols:
aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data[cols], FUN =
mean, na.rm = TRUE)
# 2) Using the dot notation - if cols are selected, equal results:
aggregate(. ~ RAWMAT, data = my_data[cols], FUN = mean, na.rm = TRUE)
# 3) Using dplyr, the results are now the same results as #1 and #2:
my_data %>%
select(all_of(cols)) %>%
group_by(RAWMAT) %>%
summarise(across(c("Length", "Width"), ~ mean(.x, na.rm = TRUE)))
Hope this helps,
Rui Barradas
Thanks Rui for your help; that would be one possibility indeed. But am I the only one who finds that behavior of aggregate() completely unexpected and confusing? Especially considering that dplyr::summarise() and doBy::summaryBy() deal with NAs differently, even though they all use mean(na.rm = TRUE) to calculate the group stats. Best wishes, Ivan
On 04/09/2023 13:46, Rui Barradas wrote:
?s 10:44 de 04/09/2023, Ivan Calandra escreveu:
Dear useRs,
I have just stumbled across a behavior in aggregate() that I cannot
explain. Any help would be appreciated!
Sample data:
my_data <- structure(list(ID = c("FLINT-1", "FLINT-10", "FLINT-100",
"FLINT-101", "FLINT-102", "HORN-10", "HORN-100", "HORN-102",
"HORN-103", "HORN-104"), EdgeLength = c(130.75, 168.77, 142.79,
130.1, 140.41, 121.37, 70.52, 122.3, 71.01, 104.5), SurfaceArea =
c(1736.87, 1571.83, 1656.46, 1247.18, 1177.47, 1169.26, 444.61,
1791.48, 461.15, 1127.2), Length = c(44.384, 29.831, 43.869, 48.011,
54.109, 41.742, 23.854, 32.075, 21.337, 35.459), Width = c(45.982,
67.303, 52.679, 26.42, 25.149, 33.427, 20.683, 62.783, 26.417,
35.297), PLATWIDTH = c(38.84, NA, 15.33, 30.37, 11.44, 14.88, 13.86,
NA, NA, 26.71), PLATTHICK = c(8.67, NA, 7.99, 11.69, 3.3, 16.52,
4.58, NA, NA, 9.35), EPA = c(78, NA, 78, 54, 72, 49, 56, NA, NA, 56),
THICKNESS = c(10.97, NA, 9.36, 6.4, 5.89, 11.05, 4.9, NA, NA, 10.08),
WEIGHT = c(34.3, NA, 25.5, 18.6, 14.9, 29.5, 4.5, NA, NA, 23), RAWMAT
= c("FLINT", "FLINT", "FLINT", "FLINT", "FLINT", "HORNFELS",
"HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS")), row.names = c(1L,
2L, 3L, 4L, 5L, 111L, 112L, 113L, 114L, 115L), class = "data.frame")
1) Simple aggregation with 2 variables:
aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data, FUN = mean,
na.rm = TRUE)
2) Using the dot notation - different results:
aggregate(. ~ RAWMAT, data = my_data[-1], FUN = mean, na.rm = TRUE)
3) Using dplyr, I get the same results as #1:
group_by(my_data, RAWMAT) %>%
?? summarise(across(c("Length", "Width"), ~ mean(.x, na.rm = TRUE)))
4) It gets weirder: using all columns in #1 give the same results as
in #2 but different from #1 and #3
aggregate(cbind(EdgeLength, SurfaceArea, Length, Width, PLATWIDTH,
PLATTHICK, EPA, THICKNESS, WEIGHT) ~ RAWMAT, data = my_data, FUN =
mean, na.rm = TRUE)
So it seems it is not only due to the notation (cbind() vs. dot). Is
it a bug? A peculiar thing in my dataset? I tend to think this could
be due to some variables (or their names) as all notations seem to
agree when I remove some variables (although I haven't found out
which variable(s) is (are) at fault), e.g.:
my_data2 <- structure(list(ID = c("FLINT-1", "FLINT-10", "FLINT-100",
"FLINT-101", "FLINT-102", "HORN-10", "HORN-100", "HORN-102",
"HORN-103", "HORN-104"), EdgeLength = c(130.75, 168.77, 142.79,
130.1, 140.41, 121.37, 70.52, 122.3, 71.01, 104.5), SurfaceArea =
c(1736.87, 1571.83, 1656.46, 1247.18, 1177.47, 1169.26, 444.61,
1791.48, 461.15, 1127.2), Length = c(44.384, 29.831, 43.869, 48.011,
54.109, 41.742, 23.854, 32.075, 21.337, 35.459), Width = c(45.982,
67.303, 52.679, 26.42, 25.149, 33.427, 20.683, 62.783, 26.417,
35.297), RAWMAT = c("FLINT", "FLINT", "FLINT", "FLINT", "FLINT",
"HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS")),
row.names = c(1L, 2L, 3L, 4L, 5L, 111L, 112L, 113L, 114L, 115L),
class = "data.frame")
aggregate(cbind(EdgeLength, SurfaceArea, Length, Width) ~ RAWMAT,
data = my_data2, FUN = mean, na.rm = TRUE)
aggregate(. ~ RAWMAT, data = my_data2[-1], FUN = mean, na.rm = TRUE)
group_by(my_data2, RAWMAT) %>%
?? summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
Thank you in advance for any hint.
Best wishes,
Ivan
?????*LEIBNIZ-ZENTRUM*
*F?R ARCH?OLOGIE*
*Dr. Ivan CALANDRA*
**Head of IMPALA (IMaging Platform At LeizA)
*MONREPOS* Archaeological Research Centre, Schloss Monrepos
56567 Neuwied, Germany
T: +49 2631 9772 243
T: +49 6131 8885 543
ivan.calandra at leiza.de
leiza.de <http://www.leiza.de/>
<http://www.leiza.de/>
ORCID <https://orcid.org/0000-0003-3816-6359>
ResearchGate
<https://www.researchgate.net/profile/Ivan_Calandra>
LEIZA is a foundation under public law of the State of
Rhineland-Palatinate and the City of Mainz. Its headquarters are in
Mainz. Supervision is carried out by the Ministry of Science and
Health of the State of Rhineland-Palatinate. LEIZA is a research
museum of the Leibniz Association.
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Hello,
You can define a vector of the columns of interest and subset the data
with it. Then the default na.action = na.omit will no longer remove
the rows with NA vals in at least one column and the results are the
same.
However, this will not give the mean values of the other numeric
columns, just of those two.
# define a vector of columns of interest
cols <- c("Length", "Width", "RAWMAT")
# 1) Simple aggregation with 2 variables, select cols:
aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data[cols], FUN =
mean, na.rm = TRUE)
# 2) Using the dot notation - if cols are selected, equal results:
aggregate(. ~ RAWMAT, data = my_data[cols], FUN = mean, na.rm = TRUE)
# 3) Using dplyr, the results are now the same results as #1 and #2:
my_data %>%
? select(all_of(cols)) %>%
? group_by(RAWMAT) %>%
? summarise(across(c("Length", "Width"), ~ mean(.x, na.rm = TRUE)))
Hope this helps,
Rui Barradas
?s 12:51 de 04/09/2023, Ivan Calandra escreveu:
Thanks Rui for your help; that would be one possibility indeed. But am I the only one who finds that behavior of aggregate() completely unexpected and confusing? Especially considering that dplyr::summarise() and doBy::summaryBy() deal with NAs differently, even though they all use mean(na.rm = TRUE) to calculate the group stats. Best wishes, Ivan On 04/09/2023 13:46, Rui Barradas wrote:
?s 10:44 de 04/09/2023, Ivan Calandra escreveu:
Dear useRs,
I have just stumbled across a behavior in aggregate() that I cannot
explain. Any help would be appreciated!
Sample data:
my_data <- structure(list(ID = c("FLINT-1", "FLINT-10", "FLINT-100",
"FLINT-101", "FLINT-102", "HORN-10", "HORN-100", "HORN-102",
"HORN-103", "HORN-104"), EdgeLength = c(130.75, 168.77, 142.79,
130.1, 140.41, 121.37, 70.52, 122.3, 71.01, 104.5), SurfaceArea =
c(1736.87, 1571.83, 1656.46, 1247.18, 1177.47, 1169.26, 444.61,
1791.48, 461.15, 1127.2), Length = c(44.384, 29.831, 43.869, 48.011,
54.109, 41.742, 23.854, 32.075, 21.337, 35.459), Width = c(45.982,
67.303, 52.679, 26.42, 25.149, 33.427, 20.683, 62.783, 26.417,
35.297), PLATWIDTH = c(38.84, NA, 15.33, 30.37, 11.44, 14.88, 13.86,
NA, NA, 26.71), PLATTHICK = c(8.67, NA, 7.99, 11.69, 3.3, 16.52,
4.58, NA, NA, 9.35), EPA = c(78, NA, 78, 54, 72, 49, 56, NA, NA, 56),
THICKNESS = c(10.97, NA, 9.36, 6.4, 5.89, 11.05, 4.9, NA, NA, 10.08),
WEIGHT = c(34.3, NA, 25.5, 18.6, 14.9, 29.5, 4.5, NA, NA, 23), RAWMAT
= c("FLINT", "FLINT", "FLINT", "FLINT", "FLINT", "HORNFELS",
"HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS")), row.names = c(1L,
2L, 3L, 4L, 5L, 111L, 112L, 113L, 114L, 115L), class = "data.frame")
1) Simple aggregation with 2 variables:
aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data, FUN = mean,
na.rm = TRUE)
2) Using the dot notation - different results:
aggregate(. ~ RAWMAT, data = my_data[-1], FUN = mean, na.rm = TRUE)
3) Using dplyr, I get the same results as #1:
group_by(my_data, RAWMAT) %>%
?? summarise(across(c("Length", "Width"), ~ mean(.x, na.rm = TRUE)))
4) It gets weirder: using all columns in #1 give the same results as
in #2 but different from #1 and #3
aggregate(cbind(EdgeLength, SurfaceArea, Length, Width, PLATWIDTH,
PLATTHICK, EPA, THICKNESS, WEIGHT) ~ RAWMAT, data = my_data, FUN =
mean, na.rm = TRUE)
So it seems it is not only due to the notation (cbind() vs. dot). Is
it a bug? A peculiar thing in my dataset? I tend to think this could
be due to some variables (or their names) as all notations seem to
agree when I remove some variables (although I haven't found out
which variable(s) is (are) at fault), e.g.:
my_data2 <- structure(list(ID = c("FLINT-1", "FLINT-10", "FLINT-100",
"FLINT-101", "FLINT-102", "HORN-10", "HORN-100", "HORN-102",
"HORN-103", "HORN-104"), EdgeLength = c(130.75, 168.77, 142.79,
130.1, 140.41, 121.37, 70.52, 122.3, 71.01, 104.5), SurfaceArea =
c(1736.87, 1571.83, 1656.46, 1247.18, 1177.47, 1169.26, 444.61,
1791.48, 461.15, 1127.2), Length = c(44.384, 29.831, 43.869, 48.011,
54.109, 41.742, 23.854, 32.075, 21.337, 35.459), Width = c(45.982,
67.303, 52.679, 26.42, 25.149, 33.427, 20.683, 62.783, 26.417,
35.297), RAWMAT = c("FLINT", "FLINT", "FLINT", "FLINT", "FLINT",
"HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS")),
row.names = c(1L, 2L, 3L, 4L, 5L, 111L, 112L, 113L, 114L, 115L),
class = "data.frame")
aggregate(cbind(EdgeLength, SurfaceArea, Length, Width) ~ RAWMAT,
data = my_data2, FUN = mean, na.rm = TRUE)
aggregate(. ~ RAWMAT, data = my_data2[-1], FUN = mean, na.rm = TRUE)
group_by(my_data2, RAWMAT) %>%
?? summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
Thank you in advance for any hint.
Best wishes,
Ivan
?????*LEIBNIZ-ZENTRUM*
*F?R ARCH?OLOGIE*
*Dr. Ivan CALANDRA*
**Head of IMPALA (IMaging Platform At LeizA)
*MONREPOS* Archaeological Research Centre, Schloss Monrepos
56567 Neuwied, Germany
T: +49 2631 9772 243
T: +49 6131 8885 543
ivan.calandra at leiza.de
leiza.de <http://www.leiza.de/>
<http://www.leiza.de/>
ORCID <https://orcid.org/0000-0003-3816-6359>
ResearchGate
<https://www.researchgate.net/profile/Ivan_Calandra>
LEIZA is a foundation under public law of the State of
Rhineland-Palatinate and the City of Mainz. Its headquarters are in
Mainz. Supervision is carried out by the Ministry of Science and
Health of the State of Rhineland-Palatinate. LEIZA is a research
museum of the Leibniz Association.
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Hello,
You can define a vector of the columns of interest and subset the data
with it. Then the default na.action = na.omit will no longer remove
the rows with NA vals in at least one column and the results are the
same.
However, this will not give the mean values of the other numeric
columns, just of those two.
# define a vector of columns of interest
cols <- c("Length", "Width", "RAWMAT")
# 1) Simple aggregation with 2 variables, select cols:
aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data[cols], FUN =
mean, na.rm = TRUE)
# 2) Using the dot notation - if cols are selected, equal results:
aggregate(. ~ RAWMAT, data = my_data[cols], FUN = mean, na.rm = TRUE)
# 3) Using dplyr, the results are now the same results as #1 and #2:
my_data %>%
? select(all_of(cols)) %>%
? group_by(RAWMAT) %>%
? summarise(across(c("Length", "Width"), ~ mean(.x, na.rm = TRUE)))
Hope this helps,
Rui Barradas
Hello, Puzzling at first yes, unexpected no, it's documented behavior. This is the result of how the aggregate works, by first applying na.action to the data and only then applying the function to each column in the formula's LHS. dplyr works column by column, so there is no na.action involved. Hope this helps, Rui Barradas
Haha, got it now, there is an na.action argument (which defaults to na.omit) to aggregate() which is applied before calling mean(na.rm = TRUE). Thank you Rui for pointing this out. So running it with na.pass instead of na.omit gives the same results as dplyr::group_by()+summarise(): aggregate(. ~ RAWMAT, data = my_data[-1], FUN = mean, na.rm = TRUE, na.action = na.pass) Cheers, Ivan
On 04/09/2023 13:56, Rui Barradas wrote:
?s 12:51 de 04/09/2023, Ivan Calandra escreveu:
Thanks Rui for your help; that would be one possibility indeed. But am I the only one who finds that behavior of aggregate() completely unexpected and confusing? Especially considering that dplyr::summarise() and doBy::summaryBy() deal with NAs differently, even though they all use mean(na.rm = TRUE) to calculate the group stats. Best wishes, Ivan On 04/09/2023 13:46, Rui Barradas wrote:
?s 10:44 de 04/09/2023, Ivan Calandra escreveu:
Dear useRs,
I have just stumbled across a behavior in aggregate() that I cannot
explain. Any help would be appreciated!
Sample data:
my_data <- structure(list(ID = c("FLINT-1", "FLINT-10",
"FLINT-100", "FLINT-101", "FLINT-102", "HORN-10", "HORN-100",
"HORN-102", "HORN-103", "HORN-104"), EdgeLength = c(130.75, 168.77,
142.79, 130.1, 140.41, 121.37, 70.52, 122.3, 71.01, 104.5),
SurfaceArea = c(1736.87, 1571.83, 1656.46, 1247.18, 1177.47,
1169.26, 444.61, 1791.48, 461.15, 1127.2), Length = c(44.384,
29.831, 43.869, 48.011, 54.109, 41.742, 23.854, 32.075, 21.337,
35.459), Width = c(45.982, 67.303, 52.679, 26.42, 25.149, 33.427,
20.683, 62.783, 26.417, 35.297), PLATWIDTH = c(38.84, NA, 15.33,
30.37, 11.44, 14.88, 13.86, NA, NA, 26.71), PLATTHICK = c(8.67, NA,
7.99, 11.69, 3.3, 16.52, 4.58, NA, NA, 9.35), EPA = c(78, NA, 78,
54, 72, 49, 56, NA, NA, 56), THICKNESS = c(10.97, NA, 9.36, 6.4,
5.89, 11.05, 4.9, NA, NA, 10.08), WEIGHT = c(34.3, NA, 25.5, 18.6,
14.9, 29.5, 4.5, NA, NA, 23), RAWMAT = c("FLINT", "FLINT", "FLINT",
"FLINT", "FLINT", "HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS",
"HORNFELS")), row.names = c(1L, 2L, 3L, 4L, 5L, 111L, 112L, 113L,
114L, 115L), class = "data.frame")
1) Simple aggregation with 2 variables:
aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data, FUN =
mean, na.rm = TRUE)
2) Using the dot notation - different results:
aggregate(. ~ RAWMAT, data = my_data[-1], FUN = mean, na.rm = TRUE)
3) Using dplyr, I get the same results as #1:
group_by(my_data, RAWMAT) %>%
?? summarise(across(c("Length", "Width"), ~ mean(.x, na.rm = TRUE)))
4) It gets weirder: using all columns in #1 give the same results
as in #2 but different from #1 and #3
aggregate(cbind(EdgeLength, SurfaceArea, Length, Width, PLATWIDTH,
PLATTHICK, EPA, THICKNESS, WEIGHT) ~ RAWMAT, data = my_data, FUN =
mean, na.rm = TRUE)
So it seems it is not only due to the notation (cbind() vs. dot).
Is it a bug? A peculiar thing in my dataset? I tend to think this
could be due to some variables (or their names) as all notations
seem to agree when I remove some variables (although I haven't
found out which variable(s) is (are) at fault), e.g.:
my_data2 <- structure(list(ID = c("FLINT-1", "FLINT-10",
"FLINT-100", "FLINT-101", "FLINT-102", "HORN-10", "HORN-100",
"HORN-102", "HORN-103", "HORN-104"), EdgeLength = c(130.75, 168.77,
142.79, 130.1, 140.41, 121.37, 70.52, 122.3, 71.01, 104.5),
SurfaceArea = c(1736.87, 1571.83, 1656.46, 1247.18, 1177.47,
1169.26, 444.61, 1791.48, 461.15, 1127.2), Length = c(44.384,
29.831, 43.869, 48.011, 54.109, 41.742, 23.854, 32.075, 21.337,
35.459), Width = c(45.982, 67.303, 52.679, 26.42, 25.149, 33.427,
20.683, 62.783, 26.417, 35.297), RAWMAT = c("FLINT", "FLINT",
"FLINT", "FLINT", "FLINT", "HORNFELS", "HORNFELS", "HORNFELS",
"HORNFELS", "HORNFELS")), row.names = c(1L, 2L, 3L, 4L, 5L, 111L,
112L, 113L, 114L, 115L), class = "data.frame")
aggregate(cbind(EdgeLength, SurfaceArea, Length, Width) ~ RAWMAT,
data = my_data2, FUN = mean, na.rm = TRUE)
aggregate(. ~ RAWMAT, data = my_data2[-1], FUN = mean, na.rm = TRUE)
group_by(my_data2, RAWMAT) %>%
?? summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
Thank you in advance for any hint.
Best wishes,
Ivan
?????*LEIBNIZ-ZENTRUM*
*F?R ARCH?OLOGIE*
*Dr. Ivan CALANDRA*
**Head of IMPALA (IMaging Platform At LeizA)
*MONREPOS* Archaeological Research Centre, Schloss Monrepos
56567 Neuwied, Germany
T: +49 2631 9772 243
T: +49 6131 8885 543
ivan.calandra at leiza.de
leiza.de <http://www.leiza.de/>
<http://www.leiza.de/>
ORCID <https://orcid.org/0000-0003-3816-6359>
ResearchGate
<https://www.researchgate.net/profile/Ivan_Calandra>
LEIZA is a foundation under public law of the State of
Rhineland-Palatinate and the City of Mainz. Its headquarters are in
Mainz. Supervision is carried out by the Ministry of Science and
Health of the State of Rhineland-Palatinate. LEIZA is a research
museum of the Leibniz Association.
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Hello,
You can define a vector of the columns of interest and subset the
data with it. Then the default na.action = na.omit will no longer
remove the rows with NA vals in at least one column and the results
are the same.
However, this will not give the mean values of the other numeric
columns, just of those two.
# define a vector of columns of interest
cols <- c("Length", "Width", "RAWMAT")
# 1) Simple aggregation with 2 variables, select cols:
aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data[cols], FUN =
mean, na.rm = TRUE)
# 2) Using the dot notation - if cols are selected, equal results:
aggregate(. ~ RAWMAT, data = my_data[cols], FUN = mean, na.rm = TRUE)
# 3) Using dplyr, the results are now the same results as #1 and #2:
my_data %>%
? select(all_of(cols)) %>%
? group_by(RAWMAT) %>%
? summarise(across(c("Length", "Width"), ~ mean(.x, na.rm = TRUE)))
Hope this helps,
Rui Barradas
Hello, Puzzling at first yes, unexpected no, it's documented behavior. This is the result of how the aggregate works, by first applying na.action to the data and only then applying the function to each column in the formula's LHS. dplyr works column by column, so there is no na.action involved. Hope this helps, Rui Barradas
Ivan: Just one perhaps extraneous comment. You said that you were surprised that aggregate() and group_by() did not have the same behavior. That is a misconception on your part. As you know, the tidyverse recapitulates the functionality of many base R functions; but it makes no claims to do so in exactly the same way and, indeed, often makes deliberate changes to "improve" behavior. So if you wish to use both, you should *expect* such differences, which, of course, are documented in the man pages (and often elsewhere). Cheers, Bert
On Mon, Sep 4, 2023 at 5:21?AM Ivan Calandra <ivan.calandra at leiza.de> wrote:
Haha, got it now, there is an na.action argument (which defaults to na.omit) to aggregate() which is applied before calling mean(na.rm = TRUE). Thank you Rui for pointing this out. So running it with na.pass instead of na.omit gives the same results as dplyr::group_by()+summarise(): aggregate(. ~ RAWMAT, data = my_data[-1], FUN = mean, na.rm = TRUE, na.action = na.pass) Cheers, Ivan On 04/09/2023 13:56, Rui Barradas wrote:
?s 12:51 de 04/09/2023, Ivan Calandra escreveu:
Thanks Rui for your help; that would be one possibility indeed. But am I the only one who finds that behavior of aggregate() completely unexpected and confusing? Especially considering that dplyr::summarise() and doBy::summaryBy() deal with NAs differently, even though they all use mean(na.rm = TRUE) to calculate the group stats. Best wishes, Ivan On 04/09/2023 13:46, Rui Barradas wrote:
?s 10:44 de 04/09/2023, Ivan Calandra escreveu:
Dear useRs,
I have just stumbled across a behavior in aggregate() that I cannot
explain. Any help would be appreciated!
Sample data:
my_data <- structure(list(ID = c("FLINT-1", "FLINT-10",
"FLINT-100", "FLINT-101", "FLINT-102", "HORN-10", "HORN-100",
"HORN-102", "HORN-103", "HORN-104"), EdgeLength = c(130.75, 168.77,
142.79, 130.1, 140.41, 121.37, 70.52, 122.3, 71.01, 104.5),
SurfaceArea = c(1736.87, 1571.83, 1656.46, 1247.18, 1177.47,
1169.26, 444.61, 1791.48, 461.15, 1127.2), Length = c(44.384,
29.831, 43.869, 48.011, 54.109, 41.742, 23.854, 32.075, 21.337,
35.459), Width = c(45.982, 67.303, 52.679, 26.42, 25.149, 33.427,
20.683, 62.783, 26.417, 35.297), PLATWIDTH = c(38.84, NA, 15.33,
30.37, 11.44, 14.88, 13.86, NA, NA, 26.71), PLATTHICK = c(8.67, NA,
7.99, 11.69, 3.3, 16.52, 4.58, NA, NA, 9.35), EPA = c(78, NA, 78,
54, 72, 49, 56, NA, NA, 56), THICKNESS = c(10.97, NA, 9.36, 6.4,
5.89, 11.05, 4.9, NA, NA, 10.08), WEIGHT = c(34.3, NA, 25.5, 18.6,
14.9, 29.5, 4.5, NA, NA, 23), RAWMAT = c("FLINT", "FLINT", "FLINT",
"FLINT", "FLINT", "HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS",
"HORNFELS")), row.names = c(1L, 2L, 3L, 4L, 5L, 111L, 112L, 113L,
114L, 115L), class = "data.frame")
1) Simple aggregation with 2 variables:
aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data, FUN =
mean, na.rm = TRUE)
2) Using the dot notation - different results:
aggregate(. ~ RAWMAT, data = my_data[-1], FUN = mean, na.rm = TRUE)
3) Using dplyr, I get the same results as #1:
group_by(my_data, RAWMAT) %>%
summarise(across(c("Length", "Width"), ~ mean(.x, na.rm = TRUE)))
4) It gets weirder: using all columns in #1 give the same results
as in #2 but different from #1 and #3
aggregate(cbind(EdgeLength, SurfaceArea, Length, Width, PLATWIDTH,
PLATTHICK, EPA, THICKNESS, WEIGHT) ~ RAWMAT, data = my_data, FUN =
mean, na.rm = TRUE)
So it seems it is not only due to the notation (cbind() vs. dot).
Is it a bug? A peculiar thing in my dataset? I tend to think this
could be due to some variables (or their names) as all notations
seem to agree when I remove some variables (although I haven't
found out which variable(s) is (are) at fault), e.g.:
my_data2 <- structure(list(ID = c("FLINT-1", "FLINT-10",
"FLINT-100", "FLINT-101", "FLINT-102", "HORN-10", "HORN-100",
"HORN-102", "HORN-103", "HORN-104"), EdgeLength = c(130.75, 168.77,
142.79, 130.1, 140.41, 121.37, 70.52, 122.3, 71.01, 104.5),
SurfaceArea = c(1736.87, 1571.83, 1656.46, 1247.18, 1177.47,
1169.26, 444.61, 1791.48, 461.15, 1127.2), Length = c(44.384,
29.831, 43.869, 48.011, 54.109, 41.742, 23.854, 32.075, 21.337,
35.459), Width = c(45.982, 67.303, 52.679, 26.42, 25.149, 33.427,
20.683, 62.783, 26.417, 35.297), RAWMAT = c("FLINT", "FLINT",
"FLINT", "FLINT", "FLINT", "HORNFELS", "HORNFELS", "HORNFELS",
"HORNFELS", "HORNFELS")), row.names = c(1L, 2L, 3L, 4L, 5L, 111L,
112L, 113L, 114L, 115L), class = "data.frame")
aggregate(cbind(EdgeLength, SurfaceArea, Length, Width) ~ RAWMAT,
data = my_data2, FUN = mean, na.rm = TRUE)
aggregate(. ~ RAWMAT, data = my_data2[-1], FUN = mean, na.rm = TRUE)
group_by(my_data2, RAWMAT) %>%
summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
Thank you in advance for any hint.
Best wishes,
Ivan
*LEIBNIZ-ZENTRUM*
*F?R ARCH?OLOGIE*
*Dr. Ivan CALANDRA*
**Head of IMPALA (IMaging Platform At LeizA)
*MONREPOS* Archaeological Research Centre, Schloss Monrepos
56567 Neuwied, Germany
T: +49 2631 9772 243
T: +49 6131 8885 543
ivan.calandra at leiza.de
leiza.de <http://www.leiza.de/>
<http://www.leiza.de/>
ORCID <https://orcid.org/0000-0003-3816-6359>
ResearchGate
<https://www.researchgate.net/profile/Ivan_Calandra>
LEIZA is a foundation under public law of the State of
Rhineland-Palatinate and the City of Mainz. Its headquarters are in
Mainz. Supervision is carried out by the Ministry of Science and
Health of the State of Rhineland-Palatinate. LEIZA is a research
museum of the Leibniz Association.
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Hello,
You can define a vector of the columns of interest and subset the
data with it. Then the default na.action = na.omit will no longer
remove the rows with NA vals in at least one column and the results
are the same.
However, this will not give the mean values of the other numeric
columns, just of those two.
# define a vector of columns of interest
cols <- c("Length", "Width", "RAWMAT")
# 1) Simple aggregation with 2 variables, select cols:
aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data[cols], FUN =
mean, na.rm = TRUE)
# 2) Using the dot notation - if cols are selected, equal results:
aggregate(. ~ RAWMAT, data = my_data[cols], FUN = mean, na.rm = TRUE)
# 3) Using dplyr, the results are now the same results as #1 and #2:
my_data %>%
select(all_of(cols)) %>%
group_by(RAWMAT) %>%
summarise(across(c("Length", "Width"), ~ mean(.x, na.rm = TRUE)))
Hope this helps,
Rui Barradas
Hello, Puzzling at first yes, unexpected no, it's documented behavior. This is the result of how the aggregate works, by first applying na.action to the data and only then applying the function to each column in the formula's LHS. dplyr works column by column, so there is no na.action involved. Hope this helps, Rui Barradas
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
On Mon, 4 Sep 2023, Ivan Calandra wrote:
Thanks Rui for your help; that would be one possibility indeed. But am I the only one who finds that behavior of aggregate() completely unexpected and confusing? Especially considering that dplyr::summarise() and doBy::summaryBy() deal with NAs differently, even though they all use mean(na.rm = TRUE) to calculate the group stats.
I agree with Rui that this behaves as documented but I also agree with
Ivan that the behavior is potentially confusing. Not so much because other
packages behave differently but mostly because the handling of missing
values differs between the different aggregate() methods.
Based on my teaching experience, I feel that a default of
na.action=na.pass would be less confusing, especially in the case with
multivariate "response".
In the univeriate case the discrepancy can be surprising - in the default
method you need na.rm=TRUE but in the formula method you get the same
result without additional arguments (due to na.action=na.omit). But in the
multivariate case the discrepancy is not obvious, especially for
beginners, because the results in other variables without NAs are affected
as well.
A minimal toy example is the following data with two groups (x = A vs. B)
and two "responses" (y without NAs and z with NA):
d <- data.frame(x = c("A", "A", "B", "B"), y = 1:4, z = c(1:3, NA))
d
## x y z
## 1 A 1 1
## 2 A 2 2
## 3 B 3 3
## 4 B 4 NA
Except for naming of the columns, both of the following summaries for y by
x (without NAs) yield the same result:
aggregate(d$y, list(d$x), FUN = mean)
aggregate(y ~ x, data = d, FUN = mean)
## x y
## 1 A 1.5
## 2 B 3.5
For a single variable _with_ NAs, the default method needs the na.rm =
TRUE argument, the fomula method does not. Again, except for naming of the
columns:
aggregate(d$z, list(d$x), FUN = mean, na.rm = TRUE)
aggregate(z ~ x, data = d, FUN = mean)
## x z
## 1 A 1.5
## 2 B 3.0
Conversely, if you do want the NAs in the groups, the following two are
the same (except for naming):
aggregate(d$z, list(d$x), FUN = mean)
aggregate(z ~ x, data = d, FUN = mean, na.action = na.pass)
## x z
## 1 A 1.5
## 2 B NA
But in the multivariate case, it is not so obvious why the following two
commands differ in their results for y (!), the variable without NAs, in
group B:
aggregate(d[, c("y", "z")], list(d$x), FUN = mean, na.rm = TRUE)
## Group.1 y z
## 1 A 1.5 1.5
## 2 B 3.5 3.0
^^^
aggregate(cbind(y, z) ~ x, data = d, FUN = mean)
## x y z
## 1 A 1.5 1.5
## 2 B 3.0 3.0
^^^
Hence, in my programming courses I tell students to use na.action=na.pass
in the formula method and to handle NAs in the FUN argument themselves.
I guess that this is not important enough to change the default in
aggregate.formula. Or are there R core members who also find that this
inconsistency between the different methods is worth addressing?
If not, maybe an explicit example could be added on the help page? Showing
something like this might help:
## default: omit enitre row 4 where z=NA
aggregate(cbind(y, z) ~ x, data = d, FUN = mean)
## x y z
## 1 A 1.5 1.5
## 2 B 3.0 3.0
## alternatively: omit row 4 only for z result but not for y result
aggregate(cbind(y, z) ~ x, data = d, FUN = mean, na.action = na.pass, na.rm = TRUE)
## x y z
## 1 A 1.5 1.5
## 2 B 3.5 3.0
Best wishes,
Achim
On 04/09/2023 13:46, Rui Barradas wrote:
?s 10:44 de 04/09/2023, Ivan Calandra escreveu:
Dear useRs,
I have just stumbled across a behavior in aggregate() that I cannot
explain. Any help would be appreciated!
Sample data:
my_data <- structure(list(ID = c("FLINT-1", "FLINT-10", "FLINT-100",
"FLINT-101", "FLINT-102", "HORN-10", "HORN-100", "HORN-102", "HORN-103",
"HORN-104"), EdgeLength = c(130.75, 168.77, 142.79, 130.1, 140.41, 121.37,
70.52, 122.3, 71.01, 104.5), SurfaceArea = c(1736.87, 1571.83, 1656.46,
1247.18, 1177.47, 1169.26, 444.61, 1791.48, 461.15, 1127.2), Length =
c(44.384, 29.831, 43.869, 48.011, 54.109, 41.742, 23.854, 32.075, 21.337,
35.459), Width = c(45.982, 67.303, 52.679, 26.42, 25.149, 33.427, 20.683,
62.783, 26.417, 35.297), PLATWIDTH = c(38.84, NA, 15.33, 30.37, 11.44,
14.88, 13.86, NA, NA, 26.71), PLATTHICK = c(8.67, NA, 7.99, 11.69, 3.3,
16.52, 4.58, NA, NA, 9.35), EPA = c(78, NA, 78, 54, 72, 49, 56, NA, NA,
56), THICKNESS = c(10.97, NA, 9.36, 6.4, 5.89, 11.05, 4.9, NA, NA, 10.08),
WEIGHT = c(34.3, NA, 25.5, 18.6, 14.9, 29.5, 4.5, NA, NA, 23), RAWMAT =
c("FLINT", "FLINT", "FLINT", "FLINT", "FLINT", "HORNFELS", "HORNFELS",
"HORNFELS", "HORNFELS", "HORNFELS")), row.names = c(1L, 2L, 3L, 4L, 5L,
111L, 112L, 113L, 114L, 115L), class = "data.frame")
1) Simple aggregation with 2 variables:
aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data, FUN = mean, na.rm
= TRUE)
2) Using the dot notation - different results:
aggregate(. ~ RAWMAT, data = my_data[-1], FUN = mean, na.rm = TRUE)
3) Using dplyr, I get the same results as #1:
group_by(my_data, RAWMAT) %>%
?? summarise(across(c("Length", "Width"), ~ mean(.x, na.rm = TRUE)))
4) It gets weirder: using all columns in #1 give the same results as in #2
but different from #1 and #3
aggregate(cbind(EdgeLength, SurfaceArea, Length, Width, PLATWIDTH,
PLATTHICK, EPA, THICKNESS, WEIGHT) ~ RAWMAT, data = my_data, FUN = mean,
na.rm = TRUE)
So it seems it is not only due to the notation (cbind() vs. dot). Is it a
bug? A peculiar thing in my dataset? I tend to think this could be due to
some variables (or their names) as all notations seem to agree when I
remove some variables (although I haven't found out which variable(s) is
(are) at fault), e.g.:
my_data2 <- structure(list(ID = c("FLINT-1", "FLINT-10", "FLINT-100",
"FLINT-101", "FLINT-102", "HORN-10", "HORN-100", "HORN-102", "HORN-103",
"HORN-104"), EdgeLength = c(130.75, 168.77, 142.79, 130.1, 140.41, 121.37,
70.52, 122.3, 71.01, 104.5), SurfaceArea = c(1736.87, 1571.83, 1656.46,
1247.18, 1177.47, 1169.26, 444.61, 1791.48, 461.15, 1127.2), Length =
c(44.384, 29.831, 43.869, 48.011, 54.109, 41.742, 23.854, 32.075, 21.337,
35.459), Width = c(45.982, 67.303, 52.679, 26.42, 25.149, 33.427, 20.683,
62.783, 26.417, 35.297), RAWMAT = c("FLINT", "FLINT", "FLINT", "FLINT",
"FLINT", "HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS")),
row.names = c(1L, 2L, 3L, 4L, 5L, 111L, 112L, 113L, 114L, 115L), class =
"data.frame")
aggregate(cbind(EdgeLength, SurfaceArea, Length, Width) ~ RAWMAT, data =
my_data2, FUN = mean, na.rm = TRUE)
aggregate(. ~ RAWMAT, data = my_data2[-1], FUN = mean, na.rm = TRUE)
group_by(my_data2, RAWMAT) %>%
?? summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
Thank you in advance for any hint.
Best wishes,
Ivan
?????*LEIBNIZ-ZENTRUM*
*F?R ARCH?OLOGIE*
*Dr. Ivan CALANDRA*
**Head of IMPALA (IMaging Platform At LeizA)
*MONREPOS* Archaeological Research Centre, Schloss Monrepos
56567 Neuwied, Germany
T: +49 2631 9772 243
T: +49 6131 8885 543
ivan.calandra at leiza.de
leiza.de <http://www.leiza.de/>
<http://www.leiza.de/>
ORCID <https://orcid.org/0000-0003-3816-6359>
ResearchGate
<https://www.researchgate.net/profile/Ivan_Calandra>
LEIZA is a foundation under public law of the State of
Rhineland-Palatinate and the City of Mainz. Its headquarters are in Mainz.
Supervision is carried out by the Ministry of Science and Health of the
State of Rhineland-Palatinate. LEIZA is a research museum of the Leibniz
Association.
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Hello,
You can define a vector of the columns of interest and subset the data with
it. Then the default na.action = na.omit will no longer remove the rows
with NA vals in at least one column and the results are the same.
However, this will not give the mean values of the other numeric columns,
just of those two.
# define a vector of columns of interest
cols <- c("Length", "Width", "RAWMAT")
# 1) Simple aggregation with 2 variables, select cols:
aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data[cols], FUN = mean,
na.rm = TRUE)
# 2) Using the dot notation - if cols are selected, equal results:
aggregate(. ~ RAWMAT, data = my_data[cols], FUN = mean, na.rm = TRUE)
# 3) Using dplyr, the results are now the same results as #1 and #2:
my_data %>%
? select(all_of(cols)) %>%
? group_by(RAWMAT) %>%
? summarise(across(c("Length", "Width"), ~ mean(.x, na.rm = TRUE)))
Hope this helps,
Rui Barradas
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.