tidyverse: grouped summaries (with summerize)

12 messages · Eric Berger, Avi Gross, Bert Gunter +1 more

#
I changed the data files so the date-times are in five separate columns:
year, month, day, hour, and minute; for example,
year,month,day,hour,min,cfs
2016,03,03,12,00,149000
2016,03,03,12,10,150000
2016,03,03,12,20,151000
2016,03,03,12,30,156000
2016,03,03,12,40,154000
2016,03,03,12,50,150000
2016,03,03,13,00,153000
2016,03,03,13,10,156000
2016,03,03,13,20,154000

The script is based on the example (on page 59 of 'R for Data Science'):
library('tidyverse')
disc <- read.csv('../data/water/disc.dat', header = TRUE, sep = ',', stringsAsFactors = FALSE)
disc$year <- as.integer(disc$year)
disc$month <- as.integer(disc$month)
disc$day <- as.integer(disc$day)
disc$hour <- as.integer(disc$hour)
disc$min <- as.integer(disc$min)
disc$cfs <- as.double(disc$cfs, length = 6)

# use dplyr to filter() by year, month, day; summarize() to get monthly
# means, sds
disc_by_month <- group_by(disc, year, month)
summarize(disc_by_month, vol = mean(cfs, na.rm = TRUE))

but my syntax is off because the results are:
`summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
Warning messages:
1: In eval(ei, envir) : NAs introduced by coercion
2: In eval(ei, envir) : NAs introduced by coercion
[1] "disc"          "disc_by_month"
# A tibble: 590,940 × 6
# Groups:   year, month [66]
     year month   day  hour   min    cfs
    <int> <int> <int> <int> <int>  <dbl>
  1  2016     3     3    12     0 149000
  2  2016     3     3    12    10 150000
  3  2016     3     3    12    20 151000
  4  2016     3     3    12    30 156000
  5  2016     3     3    12    40 154000
  6  2016     3     3    12    50 150000
  7  2016     3     3    13     0 153000
  8  2016     3     3    13    10 156000
  9  2016     3     3    13    20 154000
 10  2016     3     3    13    30 155000
# … with 590,930 more rows

I have the same results if I use as.numeric rather than as.integer and
as.double. What am I doing incorrectly?

TIA,

Rich
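
The two "NAs introduced by coercion" warnings above mean that some values could not be converted to numbers. A minimal base-R sketch for locating such values, assuming the affected column was read as character; the vector raw_cfs and its contents are hypothetical, not taken from the actual data file:

```r
# Hypothetical raw values, as read from a file in which one entry is not a number.
raw_cfs <- c("149000", "150000", "Ice", "156000")

# Convert once, silencing the expected warning.
cfs <- suppressWarnings(as.numeric(raw_cfs))

# Entries that are NA after conversion but were not NA before are the
# ones that triggered the coercion warning.
bad <- which(is.na(cfs) & !is.na(raw_cfs))

bad            # positions of the offending entries
raw_cfs[bad]   # the strings that could not be converted
```

Running the same check on each column that is passed to as.integer() or as.double() would show which rows of the file need attention.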
#
Rich,

Did I miss something? The summarise() message is telling you that you had not explicitly said how to group the output, so it made a guess. The canonical way is:

... %>% group_by(year, month, day, hour) %>% summarise(...)


You decide which fields to group by, sometimes including others so they are in the output. 

Avi
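
For reference, the `summarise()` message is informational: by default dplyr drops only the last grouping level, so output summarised by year and month stays grouped by year. A small sketch of the `.groups` argument the message mentions, using hypothetical toy values in place of the real discharge data:

```r
library(dplyr)

# Toy data (hypothetical values) with the same column names as in the thread.
disc <- data.frame(
  year  = c(2016L, 2016L, 2016L, 2017L),
  month = c(3L, 3L, 4L, 3L),
  cfs   = c(100, 200, 300, 400)
)

# One mean per year-month combination. .groups = "drop" returns an
# ungrouped tibble and suppresses the informational message.
monthly <- disc %>%
  group_by(year, month) %>%
  summarise(vol = mean(cfs, na.rm = TRUE), .groups = "drop")

monthly
```

Leaving out `.groups` gives the same numbers; the only differences are the message and that the result remains grouped by year.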

-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Rich Shepard
Sent: Monday, September 13, 2021 4:53 PM
To: r-help at r-project.org
Subject: [R] tidyverse: grouped summaries (with summerize)


______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
#
On Mon, 13 Sep 2021, Avi Gross via R-help wrote:

            
Avi,

Probably not.
After sending the message I saw the example using %>% and didn't realize
that it made a difference from the previous example.
That's what I thought I did. I'll rewrite the script and work toward the
output I need.

Thanks,

Rich
#
On Mon, 13 Sep 2021, Rich Shepard wrote:

            
Still not the correct syntax. Command is now:
disc_by_month %>%
     group_by(year, month) %>%
     summarize(disc_by_month, vol = mean(cfs, na.rm = TRUE))

and results are:
`summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
# A tibble: 590,940 × 6
# Groups:   year, month [66]
     year month   day  hour   min    cfs
    <int> <int> <int> <int> <int>  <dbl>
  1  2016     3     3    12     0 149000
  2  2016     3     3    12    10 150000
  3  2016     3     3    12    20 151000
  4  2016     3     3    12    30 156000
  5  2016     3     3    12    40 154000
  6  2016     3     3    12    50 150000
  7  2016     3     3    13     0 153000
  8  2016     3     3    13    10 156000
  9  2016     3     3    13    20 154000
 10  2016     3     3    13    30 155000
# … with 590,930 more rows

The grouping is still not right. I expected to see a mean value for each
month of each year in the data set, not for each minute.

Rich
#
This code is not correct:
disc_by_month %>%
     group_by(year, month) %>%
     summarize(disc_by_month, vol = mean(cfs, na.rm = TRUE))

It should be:

disc %>% group_by(year, month) %>% summarize(vol = mean(cfs, na.rm = TRUE))





On Tue, Sep 14, 2021 at 12:51 AM Rich Shepard <rshepard at appl-ecosys.com>
wrote:

  
  
#
On Tue, 14 Sep 2021, Eric Berger wrote:

            
Eric/Avi:

That makes no difference:
# A tibble: 590,940 × 6
# Groups:   year, month [66]
     year month   day  hour   min    cfs
    <int> <int> <int> <int> <int>  <dbl>
  1  2016     3     3    12     0 149000
  2  2016     3     3    12    10 150000
  3  2016     3     3    12    20 151000
  4  2016     3     3    12    30 156000
  5  2016     3     3    12    40 154000
  6  2016     3     3    12    50 150000
  7  2016     3     3    13     0 153000
  8  2016     3     3    13    10 156000
  9  2016     3     3    13    20 154000
 10  2016     3     3    13    30 155000
# … with 590,930 more rows

I wondered if I need to group first by hour, then day, then year-month.
This, too, produces the same output:

disc %>%
     group_by(hour) %>%
     group_by(day) %>%
     group_by(year, month) %>%
     summarize(disc_by_month, vol = mean(cfs, na.rm = TRUE))

And disc shows the read dataframe.

I don't understand why the columns are not grouping.

Thanks,

Rich
#
As Eric has pointed out, perhaps Rich is not thinking pipelined. Summarize() takes a first argument as:
	summarise(.data=whatever, ...)

But in a pipeline, you OMIT the first argument and let the pipeline supply an argument silently.

What I think summarize saw was something like:

summarize(. , disc_by_month, vol = mean(cfs, na.rm = TRUE))

There is now a superfluous SECOND argument, in a place where summarize() expected not a data.frame but an expression built from the columns of the hidden data.frame-like object it was passed. You do not have a column called disc_by_month, and presumably some weird logic made it substitute something else, such as the first column.

I hope this makes sense. When you cobble a pipeline together from existing calls, carefully make sure that the first argument each call would otherwise take is NOT supplied.

And, just FYI, the subject line should not use a word that some see as the opposite companion of "winterize" ...
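
The rule that the pipe supplies the first argument can be checked directly. The sketch below, on hypothetical toy data, compares a piped call with the equivalent nested call in which the data frame is passed explicitly:

```r
library(dplyr)

# Toy data (hypothetical values) with the same column names as in the thread.
disc <- data.frame(
  year  = c(2016L, 2016L, 2016L, 2017L),
  month = c(3L, 3L, 4L, 3L),
  cfs   = c(100, 200, 300, 400)
)

# Piped form: each function receives the previous result as its first
# argument, so no data frame is named inside group_by() or summarise().
piped <- disc %>%
  group_by(year, month) %>%
  summarise(vol = mean(cfs, na.rm = TRUE), .groups = "drop")

# Nested form: the same calls with the first argument written out.
nested <- summarise(
  group_by(disc, year, month),
  vol = mean(cfs, na.rm = TRUE),
  .groups = "drop"
)

all.equal(as.data.frame(piped), as.data.frame(nested))
```

Either form works; what does not work is mixing them, that is, piping data into a call that also names a data frame as its first argument.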

-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Rich Shepard
Sent: Monday, September 13, 2021 5:51 PM
To: r-help at r-project.org
Subject: Re: [R] tidyverse: grouped summaries (with summerize)
#
On Mon, 13 Sep 2021, Avi Gross via R-help wrote:

            
Avi,

Thank you. I read your message carefully and re-read the example on the
bottom of page 60 and top of page 61. Then changed the command to:
disc_by_month = disc %>%
     group_by(year, month) %>%
     summarize(vol = mean(cfs, na.rm = TRUE))

And, the script now returns what I need:
# A tibble: 66 × 3
# Groups:   year [7]
     year month     vol
    <int> <int>   <dbl>
  1  2016     3 221840.
  2  2016     4 288589.
  3  2016     5 255164.
  4  2016     6 205371.
  5  2016     7 167252.
  6  2016     8 140465.
  7  2016     9  97779.
  8  2016    10 135482.
  9  2016    11 166808.
 10  2016    12 165787.

I missed the beginning of the command where the resulting dataframe needs to
be named first.

This clarifies my understanding and I appreciate your and Eric's help.

Regards,

Rich
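
A possible alternative reading, offered as an assumption rather than something the thread establishes: the "In eval(ei, envir)" text in the earlier warnings suggests the script was run with source(), and source() does not auto-print the value of top-level expressions by default. If so, an unassigned summarize() result would never be displayed, whatever its contents. A base-R sketch of that behaviour, using a hypothetical two-line script:

```r
# Write a small script to a temporary file (hypothetical content).
script <- tempfile(fileext = ".R")
writeLines(c("x <- c(1, 2, 3)", "mean(x)"), script)

# By default source() evaluates the file without printing the value
# of top-level expressions such as mean(x).
silent <- capture.output(source(script))

# With print.eval = TRUE (or an explicit print() inside the script)
# the value is shown.
shown <- capture.output(source(script, print.eval = TRUE))

length(silent)  # number of output lines from the default call
shown           # output lines when values are printed
```

Under that assumption, assigning the summary to a name and printing it later, or wrapping the call in print(), would both make the result visible.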
#
Just FYI, Rich, the pipeline idiom allows but does not require the form you used:

Yours was
  RESULT <-
    DATAFRAME %>%
    FN1(args) %>%
    ...
    FNn(args)
    
But equally valid are forms that assign the result at the end:

    DATAFRAME %>%
    FN1(args) %>%
    ...
    FNn(args) -> RESULT

Or that supply the first argument to just the first function:

    FN1(DATAFRAME, args) %>%
    ...
    FNn(args) -> RESULT

If you read some tutorials, you will find many other things you can do, including variants of the pipe symbol and ways to pass the piped value into a position other than the first argument of the function that follows. Some people spend most of their programming time almost purely in tidyverse functions without looking much at base R.

I am not saying that is a good thing.
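
The three assignment forms above, written out as a runnable sketch on hypothetical toy data (the names by_year_1 to by_year_3 are illustrative only):

```r
library(dplyr)

# Toy data (hypothetical values).
disc <- data.frame(
  year = c(2016L, 2016L, 2017L),
  cfs  = c(100, 200, 400)
)

# Form 1: assign at the start of the pipeline.
by_year_1 <- disc %>%
  group_by(year) %>%
  summarise(vol = mean(cfs))

# Form 2: assign at the end with the right-assignment operator.
disc %>%
  group_by(year) %>%
  summarise(vol = mean(cfs)) -> by_year_2

# Form 3: give the data frame to the first function directly, then pipe.
group_by(disc, year) %>%
  summarise(vol = mean(cfs)) -> by_year_3
```

All three produce the same two-row summary; which one to use is a matter of style.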


-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Rich Shepard
Sent: Monday, September 13, 2021 7:04 PM
To: r-help at r-project.org
Subject: Re: [R] tidyverse: grouped summaries (with summarize) [RESOLVED]
#
On Mon, 13 Sep 2021, Avi Gross via R-help wrote:

            
...
Avi,

I'll read more about tidyverse and summarize() in R and not just in the
book.

Most of what I've done has been in base R, but I've not previously grouped
hydraulic values before plotting them. Seasonal patterns are more
informative than daily ones.

Thanks again,

Rich
#
If you are interested in extracting seasonal patterns from time
series, you might wish to check out ?stl (in the stats package). Of
course, there are all sorts of ways in many packages to fit
seasonality in time series that are more sophisticated, but probably
also more complicated, than your manual summarization and plotting.
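
A minimal sketch of the stl() suggestion, assuming a regular monthly series. The values are simulated stand-ins (a sine wave plus noise), not the real monthly means; the length of 66 and the March 2016 start are chosen only to loosely mirror the thread, and the real data may have gaps:

```r
# Simulated monthly values (hypothetical), 66 months starting March 2016.
set.seed(1)
n <- 66
vol <- 180000 + 60000 * sin(2 * pi * (seq_len(n) + 2) / 12) + rnorm(n, sd = 10000)

# stl() needs a ts object with a seasonal frequency; 12 means monthly data.
vol_ts <- ts(vol, start = c(2016, 3), frequency = 12)

# Decompose into seasonal, trend and remainder components.
fit <- stl(vol_ts, s.window = "periodic")

head(fit$time.series)
```

Calling plot(fit) draws the series and its three components in stacked panels.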


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
On Mon, Sep 13, 2021 at 5:15 PM Rich Shepard <rshepard at appl-ecosys.com> wrote:
#
On Mon, 13 Sep 2021, Bert Gunter wrote:

            
Bert,

Thanks for the suggestions.

Regards,

Rich