Assuming I have a matrix of data (or under some restrictions that will
become obvious, possibly a data frame), I want to be able to apply a
list of functions (initially producing a single number from a vector)
to the data and produce a data frame (for compact output) with column
1 being the function results for the first function, column 2 being
the results for the second function and so on - with each row being
the columns of the original data.
The obvious application of this is to produce summaries of data sets
(a bit like summary() does on numeric matrices), but with user
supplied functions. I am content for the moment to leave it to the
user to supply functions that work with the data they supply so as to
produce results that will actually be data-frame-able, though I'd like
to ultimately make it a bit nicer than it currently is without
compromising the niceness of the output in the "good" cases.
The example below is a simplistic approach to this problem (it should
run as is). I have named it "fapply" for fairly obvious reasons, but
added the ".1" because it doesn't accept multidimensional arrays. I
have included the output I generated, which is what I want. There are
some obvious generalizations (e.g. being able to include functions
like range(), say, that produce several values on a vector, rather
than one, making the user's life simpler when a function already does
most of what they need).
The question is: this looks like a silly approach, growing a list
inside a for loop. Also I recall reading that if you find yourself
using "do.call" you should probably be doing something else.
So my question: Is there a better way to implement a function like this?
Or, even better, is there already a function that does this?
## example function and code to apply a list of functions to a matrix
(here a numeric data frame)
library(datasets)
fapply.1 <- function(x, fun.l, colnames=fun.l){
out.l <- list() ? # starts with an empty list
for (i in seq_along(fun.l)) out.l[[i]] <- apply(x,2,fun.l[[i]]) ? #
loop through list of functions
# set up names and make into a data frame
names(out.l) <- colnames
attr(out.l,"row.names") <- names(out.l[[1]])
attr(out.l,"class") <- "data.frame"
out.l
}
skewness <- function(x) mean(scale(x)^3) ? ? ?#define a simple numeric function
mean.gt.med <- function(x) mean(x)>median(x) ?# define a simple non-numeric fn
flist <- c("mean","sd","skewness","median","mean.gt.med") # make list
of fns to apply
fapply.1(attitude,flist)
? ? ? ? ? ? ? mean ? ? ? ?sd ? ?skewness median mean.gt.med
rating ? ? 64.63333 12.172562 -0.35792491 ? 65.5 ? ? ? FALSE
complaints 66.60000 13.314757 -0.21541749 ? 65.0 ? ? ? ?TRUE
privileges 53.13333 12.235430 ?0.37912287 ? 51.5 ? ? ? ?TRUE
learning ? 56.36667 11.737013 -0.05403354 ? 56.5 ? ? ? FALSE
raises ? ? 64.63333 10.397226 ?0.19754317 ? 63.5 ? ? ? ?TRUE
critical ? 74.76667 ?9.894908 -0.86577893 ? 77.5 ? ? ? FALSE
advance ? ?42.93333 10.288706 ?0.85039799 ? 41.0 ? ? ? ?TRUE
## end code and output
So did I miss something obvious?
Any suggestions as far as style or simple stability-enhancing
improvements would be handy.
regards,
Glen
best way to apply a list of functions to a dataset ?
8 messages · Glen Barnett, David Winsemius, Dennis Murphy +2 more
Erk. Sorry about the wrapping issue on the comments in the code, which will interfere with a straight copypaste. Glen
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20100720/3831fd0d/attachment.pl>
Hi Dennis, Thanks for the reply. Yes, that's easier, but the conversion to a matrix with rbind has converted the output of that final function to a numeric. I included that last function in the example secifically to preclude people assuming that functions would always return the same type. I guess this doesn't matter too much for a logical, but what if instead the function returned a character (say "mean", "median", or "equal" - indicating which one was larger, or "equal" which could easily happen with discrete data). This precludes using rbind (which I also used at first, before I noticed that sometimes I could have functions that don't return numerics). Glen
On Tue, Jul 20, 2010 at 6:55 PM, Dennis Murphy <djmuser at gmail.com> wrote:
Hi: This might be a little easier (?): library(datasets) skewness <- function(x) mean(scale(x)^3) mean.gt.med <- function(x) mean(x)>median(x) # ------ # construct the function to apply to each variable in the data frame f <- function(x) c(mean = mean(x), sd = sd(x), skewness = skewness(x), ???? median = median(x), mean.gt.med = mean.gt.med(x)) # map function to each variable with lapply and combine with do.call(): do.call(rbind, lapply(attitude, f)) ?????????????? mean??????? sd??? skewness median mean.gt.med rating???? 64.63333 12.172562 -0.35792491?? 65.5?????????? 0 complaints 66.60000 13.314757 -0.21541749?? 65.0?????????? 1 privileges 53.13333 12.235430? 0.37912287?? 51.5?????????? 1 learning?? 56.36667 11.737013 -0.05403354?? 56.5?????????? 0 raises???? 64.63333 10.397226? 0.19754317?? 63.5?????????? 1 critical?? 74.76667? 9.894908 -0.86577893?? 77.5?????????? 0 advance??? 42.93333 10.288706? 0.85039799?? 41.0?????????? 1 HTH, Dennis On Mon, Jul 19, 2010 at 10:51 PM, Glen Barnett <glnbrntt at gmail.com> wrote:
Assuming I have a matrix of data (or under some restrictions that will
become obvious, possibly a data frame), I want to be able to apply a
list of functions (initially producing a single number from a vector)
to the data and produce a data frame (for compact output) with column
1 being the function results for the first function, column 2 being
the results for the second function and so on - with each row being
the columns of the original data.
The obvious application of this is to produce summaries of data sets
(a bit like summary() does on numeric matrices), but with user
supplied functions. I am content for the moment to leave it to the
user to supply functions that work with the data they supply so as to
produce results that will actually be data-frame-able, though I'd like
to ultimately make it a bit nicer than it currently is without
compromising the niceness of the output in the "good" cases.
The example below is a simplistic approach to this problem (it should
run as is). I have named it "fapply" for fairly obvious reasons, but
added the ".1" because it doesn't accept multidimensional arrays. I
have included the output I generated, which is what I want. There are
some obvious generalizations (e.g. being able to include functions
like range(), say, that produce several values on a vector, rather
than one, making the user's life simpler when a function already does
most of what they need).
The question is: this looks like a silly approach, growing a list
inside a for loop. Also I recall reading that if you find yourself
using "do.call" you should probably be doing something else.
So my question: Is there a better way to implement a function like this?
Or, even better, is there already a function that does this?
## example function and code to apply a list of functions to a matrix
(here a numeric data frame)
library(datasets)
fapply.1 <- function(x, fun.l, colnames=fun.l){
out.l <- list() ? # starts with an empty list
for (i in seq_along(fun.l)) out.l[[i]] <- apply(x,2,fun.l[[i]]) ? #
loop through list of functions
# set up names and make into a data frame
names(out.l) <- colnames
attr(out.l,"row.names") <- names(out.l[[1]])
attr(out.l,"class") <- "data.frame"
out.l
}
skewness <- function(x) mean(scale(x)^3) ? ? ?#define a simple numeric
function
mean.gt.med <- function(x) mean(x)>median(x) ?# define a simple
non-numeric fn
flist <- c("mean","sd","skewness","median","mean.gt.med") # make list
of fns to apply
fapply.1(attitude,flist)
? ? ? ? ? ? ? mean ? ? ? ?sd ? ?skewness median mean.gt.med
rating ? ? 64.63333 12.172562 -0.35792491 ? 65.5 ? ? ? FALSE
complaints 66.60000 13.314757 -0.21541749 ? 65.0 ? ? ? ?TRUE
privileges 53.13333 12.235430 ?0.37912287 ? 51.5 ? ? ? ?TRUE
learning ? 56.36667 11.737013 -0.05403354 ? 56.5 ? ? ? FALSE
raises ? ? 64.63333 10.397226 ?0.19754317 ? 63.5 ? ? ? ?TRUE
critical ? 74.76667 ?9.894908 -0.86577893 ? 77.5 ? ? ? FALSE
advance ? ?42.93333 10.288706 ?0.85039799 ? 41.0 ? ? ? ?TRUE
## end code and output
So did I miss something obvious?
Any suggestions as far as style or simple stability-enhancing
improvements would be handy.
regards,
Glen
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
On Jul 20, 2010, at 8:37 PM, Glen Barnett wrote:
Hi Dennis, Thanks for the reply. Yes, that's easier, but the conversion to a matrix with rbind has converted the output of that final function to a numeric. I included that last function in the example secifically to preclude people assuming that functions would always return the same type. I guess this doesn't matter too much for a logical, but what if instead the function returned a character (say "mean", "median", or "equal" - indicating which one was larger, or "equal" which could easily happen with discrete data). This precludes using rbind (which I also used at first, before I noticed that sometimes I could have functions that don't return numerics).
Have you considered the summaryBy function in package doBy? (Why reinvent the wheel?)
David.
>
> Glen
>
>
> On Tue, Jul 20, 2010 at 6:55 PM, Dennis Murphy <djmuser at gmail.com>
> wrote:
>> Hi:
>>
>> This might be a little easier (?):
>>
>> library(datasets)
>> skewness <- function(x) mean(scale(x)^3)
>> mean.gt.med <- function(x) mean(x)>median(x)
>>
>> # ------
>> # construct the function to apply to each variable in the data frame
>> f <- function(x) c(mean = mean(x), sd = sd(x), skewness =
>> skewness(x),
>> median = median(x), mean.gt.med = mean.gt.med(x))
>>
>> # map function to each variable with lapply and combine with
>> do.call():
>> do.call(rbind, lapply(attitude, f))
>> mean sd skewness median mean.gt.med
>> rating 64.63333 12.172562 -0.35792491 65.5 0
>> complaints 66.60000 13.314757 -0.21541749 65.0 1
>> privileges 53.13333 12.235430 0.37912287 51.5 1
>> learning 56.36667 11.737013 -0.05403354 56.5 0
>> raises 64.63333 10.397226 0.19754317 63.5 1
>> critical 74.76667 9.894908 -0.86577893 77.5 0
>> advance 42.93333 10.288706 0.85039799 41.0 1
>>
>> HTH,
>> Dennis
>>
>>
>> On Mon, Jul 19, 2010 at 10:51 PM, Glen Barnett <glnbrntt at gmail.com>
>> wrote:
>>>
>>> Assuming I have a matrix of data (or under some restrictions that
>>> will
>>> become obvious, possibly a data frame), I want to be able to apply a
>>> list of functions (initially producing a single number from a
>>> vector)
>>> to the data and produce a data frame (for compact output) with
>>> column
>>> 1 being the function results for the first function, column 2 being
>>> the results for the second function and so on - with each row being
>>> the columns of the original data.
>>>
>>> The obvious application of this is to produce summaries of data sets
>>> (a bit like summary() does on numeric matrices), but with user
>>> supplied functions. I am content for the moment to leave it to the
>>> user to supply functions that work with the data they supply so as
>>> to
>>> produce results that will actually be data-frame-able, though I'd
>>> like
>>> to ultimately make it a bit nicer than it currently is without
>>> compromising the niceness of the output in the "good" cases.
>>>
>>> The example below is a simplistic approach to this problem (it
>>> should
>>> run as is). I have named it "fapply" for fairly obvious reasons, but
>>> added the ".1" because it doesn't accept multidimensional arrays. I
>>> have included the output I generated, which is what I want. There
>>> are
>>> some obvious generalizations (e.g. being able to include functions
>>> like range(), say, that produce several values on a vector, rather
>>> than one, making the user's life simpler when a function already
>>> does
>>> most of what they need).
>>>
>>> The question is: this looks like a silly approach, growing a list
>>> inside a for loop. Also I recall reading that if you find yourself
>>> using "do.call" you should probably be doing something else.
>>>
>>> So my question: Is there a better way to implement a function like
>>> this?
>>>
>>> Or, even better, is there already a function that does this?
>>>
>>> ## example function and code to apply a list of functions to a
>>> matrix
>>> (here a numeric data frame)
>>>
>>> library(datasets)
>>>
>>> fapply.1 <- function(x, fun.l, colnames=fun.l){
>>> out.l <- list() # starts with an empty list
>>> for (i in seq_along(fun.l)) out.l[[i]] <- apply(x,2,fun.l[[i]]) #
>>> loop through list of functions
>>>
>>> # set up names and make into a data frame
>>> names(out.l) <- colnames
>>> attr(out.l,"row.names") <- names(out.l[[1]])
>>> attr(out.l,"class") <- "data.frame"
>>> out.l
>>> }
>>>
>>> skewness <- function(x) mean(scale(x)^3) #define a simple
>>> numeric
>>> function
>>> mean.gt.med <- function(x) mean(x)>median(x) # define a simple
>>> non-numeric fn
>>> flist <- c("mean","sd","skewness","median","mean.gt.med") # make
>>> list
>>> of fns to apply
>>>
>>> fapply.1(attitude,flist)
>>> mean sd skewness median mean.gt.med
>>> rating 64.63333 12.172562 -0.35792491 65.5 FALSE
>>> complaints 66.60000 13.314757 -0.21541749 65.0 TRUE
>>> privileges 53.13333 12.235430 0.37912287 51.5 TRUE
>>> learning 56.36667 11.737013 -0.05403354 56.5 FALSE
>>> raises 64.63333 10.397226 0.19754317 63.5 TRUE
>>> critical 74.76667 9.894908 -0.86577893 77.5 FALSE
>>> advance 42.93333 10.288706 0.85039799 41.0 TRUE
>>>
>>> ## end code and output
>>>
>>> So did I miss something obvious?
>>>
>>> Any suggestions as far as style or simple stability-enhancing
>>> improvements would be handy.
>>>
>>> regards,
>>> Glen
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD
West Hartford, CT
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20100720/0ef9efa0/attachment.pl>
Dennis' ddply solution would be my choice. Here is
a small variation that makes it easy to modify what
list of functions is applied:
#----
ma<- melt(attitude)
f <- function(x,v) summarise(x,
mean = mean(v),
sd = sd(v),
skewness = skewness(v),
mean.gt.med = mean.gt.med(v)
)
ddply(ma, .(variable), function(x) f(x, v = x[["value"]]))
#----
Another option is to use data.frame in place of summarise:
#----
f <- function(x,v) data.frame(
mean = mean(v),
sd = sd(v),
skewness = skewness(v),
mean.gt.med = mean.gt.med(v)
)
#----
-Peter Ehlers
On 2010-07-21 0:41, Dennis Murphy wrote:
Hi: On Tue, Jul 20, 2010 at 5:37 PM, Glen Barnett<glnbrntt at gmail.com> wrote:
Hi Dennis, Thanks for the reply. Yes, that's easier, but the conversion to a matrix with rbind has converted the output of that final function to a numeric.
If you look at the output of lapply(attitude, f), you'll see that the conversion from logical to numeric has already taken place. Different components of lists can have different types, but within a component, all of the elements must have the same class. You can patch up the result as follows: x<- data.frame(do.call(rbind, lapply(attitude, f))) x[, 5]<- as.logical(x[, 5])
x
mean sd skewness median mean.gt.med rating 64.63333 12.172562 -0.35792491 65.5 FALSE complaints 66.60000 13.314757 -0.21541749 65.0 TRUE privileges 53.13333 12.235430 0.37912287 51.5 TRUE learning 56.36667 11.737013 -0.05403354 56.5 FALSE raises 64.63333 10.397226 0.19754317 63.5 TRUE critical 74.76667 9.894908 -0.86577893 77.5 FALSE advance 42.93333 10.288706 0.85039799 41.0 TRUE but if you're doing this sort of thing over a large data frame such fixes may be impractical. I included that last function in the example secifically to preclude
people assuming that functions would always return the same type.
There is a plyr solution, although it's a little more long-winded than I'd prefer in the end: library(ggplot2) # melt the data frame so that the variables become factor levels ma<- melt(attitude) Using as id variables
dim(ma)
[1] 210 2
# Use ddply to get the set of summaries by variable:
ddply(ma, .(variable), summarise, mean = mean(value), sd = sd(value),
skewness = skewness(value), median = median(value),
mean.gt.med = mean.gt.med(value))
variable mean sd skewness median mean.gt.med
1 rating 64.63333 12.172562 -0.35792491 65.5 FALSE
2 complaints 66.60000 13.314757 -0.21541749 65.0 TRUE
3 privileges 53.13333 12.235430 0.37912287 51.5 TRUE
4 learning 56.36667 11.737013 -0.05403354 56.5 FALSE
5 raises 64.63333 10.397226 0.19754317 63.5 TRUE
6 critical 74.76667 9.894908 -0.86577893 77.5 FALSE
7 advance 42.93333 10.288706 0.85039799 41.0 TRUE
Notice that now the logical class of mean.gt.med is preserved. The trick
with ddply() in package plyr is that, in this case, it is convenient to melt
the data frame first before doing the summarizations. This is because
ddply() requires a grouping variable - in this example, the groups are the
variables themselves.
HTH,
Dennis
I guess this doesn't matter too much for a logical, but what if
instead the function returned a character (say "mean", "median", or "equal" - indicating which one was larger, or "equal" which could easily happen with discrete data). This precludes using rbind (which I also used at first, before I noticed that sometimes I could have functions that don't return numerics). Glen On Tue, Jul 20, 2010 at 6:55 PM, Dennis Murphy<djmuser at gmail.com> wrote:
Hi:
This might be a little easier (?):
library(datasets)
skewness<- function(x) mean(scale(x)^3)
mean.gt.med<- function(x) mean(x)>median(x)
# ------
# construct the function to apply to each variable in the data frame
f<- function(x) c(mean = mean(x), sd = sd(x), skewness = skewness(x),
median = median(x), mean.gt.med = mean.gt.med(x))
# map function to each variable with lapply and combine with do.call():
do.call(rbind, lapply(attitude, f))
mean sd skewness median mean.gt.med
rating 64.63333 12.172562 -0.35792491 65.5 0
complaints 66.60000 13.314757 -0.21541749 65.0 1
privileges 53.13333 12.235430 0.37912287 51.5 1
learning 56.36667 11.737013 -0.05403354 56.5 0
raises 64.63333 10.397226 0.19754317 63.5 1
critical 74.76667 9.894908 -0.86577893 77.5 0
advance 42.93333 10.288706 0.85039799 41.0 1
HTH,
Dennis
On Mon, Jul 19, 2010 at 10:51 PM, Glen Barnett<glnbrntt at gmail.com>
wrote:
Assuming I have a matrix of data (or under some restrictions that will
become obvious, possibly a data frame), I want to be able to apply a
list of functions (initially producing a single number from a vector)
to the data and produce a data frame (for compact output) with column
1 being the function results for the first function, column 2 being
the results for the second function and so on - with each row being
the columns of the original data.
The obvious application of this is to produce summaries of data sets
(a bit like summary() does on numeric matrices), but with user
supplied functions. I am content for the moment to leave it to the
user to supply functions that work with the data they supply so as to
produce results that will actually be data-frame-able, though I'd like
to ultimately make it a bit nicer than it currently is without
compromising the niceness of the output in the "good" cases.
The example below is a simplistic approach to this problem (it should
run as is). I have named it "fapply" for fairly obvious reasons, but
added the ".1" because it doesn't accept multidimensional arrays. I
have included the output I generated, which is what I want. There are
some obvious generalizations (e.g. being able to include functions
like range(), say, that produce several values on a vector, rather
than one, making the user's life simpler when a function already does
most of what they need).
The question is: this looks like a silly approach, growing a list
inside a for loop. Also I recall reading that if you find yourself
using "do.call" you should probably be doing something else.
So my question: Is there a better way to implement a function like this?
Or, even better, is there already a function that does this?
## example function and code to apply a list of functions to a matrix
(here a numeric data frame)
library(datasets)
fapply.1<- function(x, fun.l, colnames=fun.l){
out.l<- list() # starts with an empty list
for (i in seq_along(fun.l)) out.l[[i]]<- apply(x,2,fun.l[[i]]) #
loop through list of functions
# set up names and make into a data frame
names(out.l)<- colnames
attr(out.l,"row.names")<- names(out.l[[1]])
attr(out.l,"class")<- "data.frame"
out.l
}
skewness<- function(x) mean(scale(x)^3) #define a simple numeric
function
mean.gt.med<- function(x) mean(x)>median(x) # define a simple
non-numeric fn
flist<- c("mean","sd","skewness","median","mean.gt.med") # make list
of fns to apply
fapply.1(attitude,flist)
mean sd skewness median mean.gt.med
rating 64.63333 12.172562 -0.35792491 65.5 FALSE
complaints 66.60000 13.314757 -0.21541749 65.0 TRUE
privileges 53.13333 12.235430 0.37912287 51.5 TRUE
learning 56.36667 11.737013 -0.05403354 56.5 FALSE
raises 64.63333 10.397226 0.19754317 63.5 TRUE
critical 74.76667 9.894908 -0.86577893 77.5 FALSE
advance 42.93333 10.288706 0.85039799 41.0 TRUE
## end code and output
So did I miss something obvious?
Any suggestions as far as style or simple stability-enhancing
improvements would be handy.
regards,
Glen
ddply(ma, .(variable), summarise, mean = mean(value), sd = sd(value), ? ? ? skewness = skewness(value), median = median(value), ? ? ? mean.gt.med = mean.gt.med(value))
In principle, you should be able to do: ddply(ma, .(variable), colwise(each(mean, sd, skewness, median, mean.gt.med))) but currently colwise and each don't work together that well. Hadley
Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University http://had.co.nz/