dplyr/summarize does not create a true data frame

4 messages · John Posner, John Kane, Hadley Wickham

John Posner

Fri, Nov 21, 2014 9:10 AM #

I got an error when trying to extract a 1-column subset of a data frame (called "my.output") created by dplyr/summarize. The ncol() function says that my.output has 4 columns, but "my.output[4]" fails. Note that converting my.output using as.data.frame() makes for a happy ending.

Is this the intended behavior of dplyr?

Tx,
John

> library(dplyr)

> # set up data frame
> rows = 100
> repcnt = 50
> sexes = c("Female", "Male")
> heights = c("Med", "Short", "Tall")

> frm = data.frame(
+   Id = paste("P", sprintf("%04d", 1:rows), sep=""),
+   Sex = sample(rep(sexes, repcnt), rows, replace=T),
+   Height = sample(rep(heights, repcnt), rows, replace=T),
+   V1 = round(runif(rows)*25, 2) + 50,
+   V2 = round(runif(rows)*1000, 2) + 50,
+   V3 = round(runif(rows)*350, 2) - 175
+ )

> # use dplyr/summarize to create data frame
> my.output = frm %>%
+   group_by(Sex, Height) %>%
+   summarize(V1sum=sum(V1), V2sum=sum(V2))

> # work with columns in the output data frame
> ncol(my.output)
[1] 4

> my.output[1]
Source: local data frame [6 x 1]
Groups: Sex

     Sex
1 Female
2 Female
3 Female
4   Male
5   Male
6   Male

> my.output[4]
Error in eval(expr, envir, enclos) : index out of bounds  ######## ERROR HERE

> as.data.frame(my.output)[4]

John Kane

Fri, Nov 21, 2014 9:32 AM #

Your code in creating 'frm' is not working for me and it is complicated enough that I don't want to work it out. See ?dput for a better way to supply data. Also see:
https://github.com/hadley/devtools/wiki/Reproducibility
http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example

That said, I don't see why 'my.output[4]' is not working. Try something like str(frm) to see what you have there and/or resubmit the data in dput format.

See simple example below:

dat1  <- data.frame(aa = sample(1:20, 100, replace = TRUE), bb = 1:100 )
dat1[2]

John Kane
Kingston ON Canada

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
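
The advice to supply data with dput() can be sketched roughly as follows. This is only an illustration, not code from the thread: the data frame `small` and its values are made up. dput() prints an R expression that rebuilds the object, so a reader can paste that expression into their own session instead of re-running random sampling code.

```r
# Hypothetical toy data (not from the thread) standing in for the real data frame
small <- data.frame(
  Id    = c("P01", "P02", "P03"),
  Value = c(69.47, 64.61, 74.77)
)

# dput() prints an R expression that recreates the object;
# paste that printed expression into an email so others can rebuild the data
dput(small)

# Round trip: capture the printed expression as text and evaluate it again
expr_text <- paste(capture.output(dput(small)), collapse = "\n")
rebuilt   <- eval(parse(text = expr_text))

# Compare the rebuilt object with the original
all.equal(small, rebuilt)
```

Because the recipient evaluates a literal expression, they get the same rows and column types as the sender, which is not the case for data generated with sample() or runif() without a fixed seed.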


1 day later

John Posner

Sun, Nov 23, 2014 8:42 AM #

Thanks to John Kane for an off-list consultation. As the following annotated transcript shows, it's the group_by() function that transforms a data frame into something else:  a "grouped_df" object that *looks* identical to the original data frame (e.g. the rows are in the original order -- *not* grouped, as arrange() would do), but does not always act like a data frame.
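
One way to see the difference described above is to look at the class vector directly. The sketch below uses a small made-up data frame (the name `toy` and its values are assumptions, not the thread's data) and relies only on group_by(), ungroup(), and as.data.frame(). Either of the last two returns an object that no longer carries the grouped_df class.

```r
library(dplyr)

# Hypothetical stand-in data, not the data from the thread
toy <- data.frame(
  Sex    = c("Male", "Female", "Female", "Male"),
  Height = c("Short", "Short", "Tall", "Medium"),
  Value  = c(69.47, 64.61, 74.77, 73.31)
)

grouped <- toy %>% group_by(Sex, Height)

# group_by() stacks extra classes on top of "data.frame"
class(toy)
class(grouped)

# The rows keep their original order; only grouping metadata is added
all(grouped$Value == toy$Value)

# Two ways to drop the grouping before indexing by column position
plain    <- as.data.frame(grouped)  # back to a base data frame
no_group <- ungroup(grouped)        # keeps the tbl classes, drops the groups

plain[3]
no_group[3]
```

Checking class() before indexing is a cheap way to find out whether an object coming out of a dplyr pipeline is still grouped.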

> library(dplyr)

> # set up data frame, and show its structure [ see below for clean copy of dput() code ]
> frm = structure(list(Id = structure(1:10, .Label = c("P01", "P02",
+ "P03", "P04", "P05", "P06", "P07", "P08", "P09", "P10"), class = "factor"),
+     Sex = structure(c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L), .Label = c("Female",
+     "Male"), class = "factor"), Height = structure(c(1L, 1L,
+     3L, 2L, 1L, 3L, 1L, 2L, 1L, 1L), .Label = c("Short", "Medium",
+     "Tall"), class = "factor"), Value = c(69.47, 64.61, 74.77,
+     73.31, 64.76, 72.78, 64.64, 55.96, 60.45, 51.11)), .Names = c("Id",
+ "Sex", "Height", "Value"), row.names = c(NA, -10L), class = "data.frame")

> str(frm)
'data.frame':	10 obs. of  4 variables:
 $ Id    : Factor w/ 10 levels "P01","P02","P03",..: 1 2 3 4 5 6 7 8 9 10
 $ Sex   : Factor w/ 2 levels "Female","Male": 2 1 1 2 2 2 1 2 2 1
 $ Height: Factor w/ 3 levels "Short","Medium",..: 1 1 3 2 1 3 1 2 1 1
 $ Value : num  69.5 64.6 74.8 73.3 64.8 ...

> # run group_by() on data frame, and show resulting structure
> after.group_by = frm %>% group_by(Sex, Height)

> str(after.group_by)
Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':	10 obs. of  4 variables:
 $ Id    : Factor w/ 10 levels "P01","P02","P03",..: 1 2 3 4 5 6 7 8 9 10
 $ Sex   : Factor w/ 2 levels "Female","Male": 2 1 1 2 2 2 1 2 2 1
 $ Height: Factor w/ 3 levels "Short","Medium",..: 1 1 3 2 1 3 1 2 1 1
 $ Value : num  69.5 64.6 74.8 73.3 64.8 ...
 - attr(*, "vars")=List of 2
  ..$ : symbol Sex
  ..$ : symbol Height
 - attr(*, "drop")= logi TRUE
 - attr(*, "indices")=List of 5
  ..$ : int  1 6 9
  ..$ : int 2
  ..$ : int  0 4 8
  ..$ : int  3 7
  ..$ : int 5
 - attr(*, "group_sizes")= int  3 1 3 2 1
 - attr(*, "biggest_group_size")= int 3
 - attr(*, "labels")='data.frame':	5 obs. of  2 variables:
  ..$ Sex   : Factor w/ 2 levels "Female","Male": 1 1 2 2 2
  ..$ Height: Factor w/ 3 levels "Short","Medium",..: 1 3 1 2 3
  ..- attr(*, "vars")=List of 2
  .. ..$ : symbol Sex
  .. ..$ : symbol Height

> # the two data structures *seem* to be the same ...
> frm == after.group_by
        Id  Sex Height Value
 [1,] TRUE TRUE   TRUE  TRUE
 [2,] TRUE TRUE   TRUE  TRUE
 [3,] TRUE TRUE   TRUE  TRUE
   ...etc.

> # ... but they're not
> frm[4]
   Value
1  69.47
2  64.61
   ...etc.

> after.group_by[4]
Error in eval(expr, envir, enclos) : index out of bounds

> # fortunately, we can convert back to a true data frame
> as.data.frame(after.group_by)[4]
   Value
1  69.47
2  64.61
   ...etc.

################################## dput() code below

structure(list(Id = structure(1:10, .Label = c("P01", "P02", 
"P03", "P04", "P05", "P06", "P07", "P08", "P09", "P10"), class = "factor"), 
    Sex = structure(c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L), .Label = c("Female", 
    "Male"), class = "factor"), Height = structure(c(1L, 1L, 
    3L, 2L, 1L, 3L, 1L, 2L, 1L, 1L), .Label = c("Short", "Medium", 
    "Tall"), class = "factor"), Value = c(69.47, 64.61, 74.77, 
    73.31, 64.76, 72.78, 64.64, 55.96, 60.45, 51.11)), .Names = c("Id", 
"Sex", "Height", "Value"), row.names = c(NA, -10L), class = "data.frame")

Hadley Wickham

Sun, Nov 23, 2014 9:34 AM #

This bug is fixed in the dev version.
Hadley


http://had.co.nz/
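
A reader who wants to know whether their installed dplyr still shows the problem could run a check along these lines. This is a hedged sketch: the thread does not say which release contains the fix, so no version number is assumed, and the data frame `toy` is made up for the example. The code tries positional column subsetting on a grouped object and falls back to as.data.frame() if that raises an error.

```r
library(dplyr)

# Report the installed version for reference
packageVersion("dplyr")

# Hypothetical stand-in data, not the data from the thread
toy <- data.frame(
  Sex    = c("Male", "Female", "Male"),
  Height = c("Short", "Tall", "Tall"),
  Value  = c(1.5, 2.5, 3.5)
)
grouped <- toy %>% group_by(Sex, Height)

# Attempt the subsetting that failed in the thread, capturing any error
result <- tryCatch(grouped[3], error = function(e) e)

if (inherits(result, "error")) {
  # Fall back to the workaround from the thread
  message("grouped[3] failed: ", conditionMessage(result))
  result <- as.data.frame(grouped)[3]
} else {
  message("grouped[3] works with this dplyr version")
}

names(result)
```

Either branch leaves `result` holding just the third column, so downstream code does not need to know which path was taken.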
