any updates w.r.t. lapply, sapply, apply retaining classes
7 messages · Mike Williamson, Joshua Wiley, Richard M. Heiberger +1 more
Hi Mike,
This isn't really an answer to your question, but perhaps it will serve
to continue the discussion. I think that there are some fundamental
issues when working with special classes. As a thought example,
suppose I wrote a class, "posreal", which inherits from the numeric
class. It is only valid for positive, real numbers. I use it in a
package, but do not develop methods for it. A user comes along and
creates a vector, x, that is a posreal, then tries: mean(x * -3).
Since I never bothered to write a special method for mean for my
class, R falls back to the inherited numeric method, but gives a value
that is clearly not valid for posreal. What should happen? S3 classes
do not really have validation, so in principle, one could write a
function like:
f <- function(x) {
  vclass <- class(x)
  res <- mean(x)
  class(res) <- vclass
  return(res)
}
which "retains" the appropriate class, but in name only. R core
cannot possibly know or imagine all classes that may be written that
inherit from more basic types but with possible special aspects and
requirements. I think the inherited type is considered to be more
generic, and that is what is returned. It is usually up to the user to
ensure that the function (whose methods were not specific to that
special class but to the inherited one) is valid for that class, and
to manually convert the result back:
res <- as.posreal(res)
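
To make the thought experiment concrete, here is a minimal sketch of such a class. The posreal class is hypothetical, and the constructor as.posreal() with its validation rule is an assumption made for illustration, not code from any package. The point is only that when validation lives in the constructor, converting back re-checks the values instead of relabelling them.

```r
# Hypothetical S3 class "posreal": doubles that must all be positive.
as.posreal <- function(x) {
  x <- as.numeric(x)
  if (anyNA(x) || any(x <= 0)) {
    stop("posreal values must be positive real numbers")
  }
  structure(x, class = "posreal")
}

x <- as.posreal(c(1, 2, 3))

# No mean() method exists for posreal, so mean.default() does the work.
res <- mean(x * -3)

# Relabelling with class(res) <- "posreal" would succeed silently;
# going through the constructor refuses the invalid value instead.
checked <- tryCatch(as.posreal(res), error = function(e) conditionMessage(e))
print(checked)
```

With this design the "retains the class in name only" problem turns into an explicit error at the point of conversion.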
What about lapply and sapply? Neither is generic or has methods for
difftime, and so they do some unexpected/undesirable things. Again,
without methods defined for a particular class, they cannot know what
is special about it or the appropriate way to handle it. They use
defaults which sometimes work but may give unexpected or undesirable
results, but what else can be done? (Okay, they could just throw an
error.) If a function is naive about a class, it does not seem right
to operate on it using unknown methods and then pretend to be
returning the same type of data. As it stands, they convert to a data
type they know and return that.
Now, you mention that for loops are slow in R, and this is true to a
degree. However, the *apply functions are basically just internal
loops, so they do not really save you much time (they are certainly
not vectorized!), though they are more elegant than explicit loops
IMO. One way to use them while retaining class would be like:
sapply(seq_along(test), function(i) class(test[i]))
This is less efficient than sapply(test, class), but the relative
overhead drops considerably as the function does nontrivial
calculations.
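
The index-based pattern can be sketched a little further. The example below reuses the difftime vector from the original question; the doubling step is a made-up stand-in for a real per-element computation. It relies on the `[` method for difftime, so each piece is still a difftime when the function sees it.

```r
test <- as.difftime(c(1, 1, 8, 0.25, 8, 1.25), units = "days")

# test[i] is extracted with the `[` method for difftime, so the class
# survives inside the anonymous function.
classes_by_index <- sapply(seq_along(test), function(i) class(test[i]))

# Hypothetical per-element computation that relies on the class:
# double each interval and keep the results in a list of difftimes.
doubled <- lapply(seq_along(test), function(i) test[i] * 2)

print(classes_by_index)
print(doubled[[3]])
```

Keeping the results in a list avoids the simplification step, which is where sapply decides on an output type.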
Finally, I find the (relatively) new compiler package really shines at
making functions that are just wrappers for for loops more efficient.
Take a look at the examples from:
require(compiler)
?cmpfun
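
As a rough sketch of what that looks like in practice, the block below byte-compiles a small loop-based function with cmpfun(). The function running_total() is made up for illustration. Note that since R 3.4.0 the JIT byte-compiler is enabled by default, so on current versions the explicit call may change little.

```r
library(compiler)

# A plain for-loop wrapper: running total of a numeric vector.
running_total <- function(x) {
  out <- numeric(length(x))
  total <- 0
  for (i in seq_along(x)) {
    total <- total + x[i]
    out[i] <- total
  }
  out
}

# cmpfun() returns a byte-compiled copy with the same behavior.
running_total_cmp <- cmpfun(running_total)

print(running_total_cmp(1:5))
```

Timing the two versions with system.time() on a long vector is the simplest way to see whether compilation helps on a given R version.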
I am not familiar with NumPy, so I do not know how it handles new
classes, but with some tweaks to my workflow, I do not find myself
running into problems with how R handles them. I definitely
appreciate your position because I have been there. As I have become
more familiar with R, classes, and methods, I find I work in a way
that avoids passing objects to functions that do not know how to
handle them properly.
Cheers,
Josh
On Thu, Nov 3, 2011 at 11:08 AM, Mike Williamson <this.is.mvw at gmail.com> wrote:
Hi All,

I don't have a "I need help" question, so much as a query into any
update whether 'R' has made any progress with some of the core
functions retaining classes. As an example, because it's one of the
cases that most egregiously impacts me & my work and keeps pushing me
away from 'R' and into other numerical languages (such as NumPy in
python), I will use sapply / lapply to demonstrate, but this behavior
is ubiquitous throughout 'R'.

Let's say I have a class which is theoretically supported, but not one
of the core "numeric" or "character" classes (and, to some degree,
"factor" classes). Many of the basic functions will convert my desired
class into either numeric or character, so that my returned answer is
gibberish. E.g.:

test= as.difftime(c(1, 1, 8, 0.25, 8, 1.25), units= "days")  ## create a small array of time differences
class(test)  ## this will return the proper class, "difftime"
class(test[1] ) ## this will also return the proper class, "difftime"
sapply(test, class)  ## this will return *numerics* for all of the classes.  Ack!!

In the example I give above, the impact might seem small, but the
implications are *huge*. This means that I am, in effect, not allowed
to use *any* of the vectoring functions in 'R', which avoid performing
loops thereby speeding up process time extraordinarily. Many can
sympathize that 'R' is ridiculously slow with "for" loops, compared to
other languages. But that's theoretically OK, a good statistician or
data analyst should be able to work comfortably with matrices and
vectors. However, *'R' cannot work comfortably* with matrices or
vectors, *unless* they are using the numeric or character classes.
Many of the classes suffer the problem I just described, although I
only used "difftime" in the example. Factors seem a bit more
"comfortable", and can be handled most of the time, but not as well as
numerics, and at times functions working on factors can return the
numerical representation of the factor instead of the original factor.

Is there any progress in guaranteeing that all core functions either
(a) ideally return exactly the classes, and hierarchy of classes, that
they received (e.g., a list of data frames with difftimes & dates &
characters would return a list of data frames with difftimes & dates &
characters), or (b) barring that, the function should at least error
out with a clear error explaining that sapply, for example, cannot
vectorize on the class being used? Returning incorrect answers is far
worse than returning an error, from a perspective of stability.

This is, by far, the largest Achilles' heel to 'R'. Personally, as my
career advances and I work on more technical things, I am finding that
I have to leave 'R' by the wayside and use other languages for robust
numerical calculations and programming. This saddens me, because there
are so many wonderful packages developed by the community. The example
above came up because I am using the "forecast" library to great
effect in predicting how long our product cycle time will be. However,
I spend much of my time fighting all these class & typing bugs in 'R'
(and we have to start recognizing that they are bugs, otherwise they
may never get resolved), such that many of the improvements in my
productivity due to all the wonderful computational packages are
entirely offset by the time I spend fighting this issue of poor
classes.

Thanks & Regards!
Mike
---
XKCD <http://www.xkcd.com>
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Joshua Wiley Ph.D. Student, Health Psychology Programmer Analyst II, ATS Statistical Consulting Group University of California, Los Angeles https://joshuawiley.com/
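
The factor pitfall mentioned in the quoted message, where a function hands back the numerical representation instead of the factor's values, can be reproduced in a few lines. This is a minimal sketch with made-up data; it shows one common instance (as.numeric on a factor), not every case the poster may have had in mind.

```r
f <- factor(c("10", "20", "10", "30"))

# as.numeric() on a factor returns the internal integer codes.
codes <- as.numeric(f)

# Going through the labels recovers the numbers they spell out.
values <- as.numeric(as.character(f))

print(codes)
print(values)
```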
In the example I give above, the impact might seem small, but the
implications are *huge*. This means that I am, in effect, not allowed
to use *any* of the vectoring functions in 'R', which avoid performing
loops thereby speeding up process time extraordinarily. Many can
sympathize that 'R' is ridiculously slow with "for" loops, compared to
other languages. But that's theoretically OK, a good statistician or
data analyst should be able to work comfortably with matrices and
vectors.
Two comments:

* sapply is generally only _slightly_ faster than a for loop
* it's almost always better to use vapply instead of sapply.

But I agree that simplify2array should be a generic so that you can
write custom methods to support new classes.

Hadley
Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University http://had.co.nz/
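
The vapply recommendation can be illustrated with a short sketch. The list `pieces` is made up for the example. vapply() takes a template for each result (FUN.VALUE) and raises an error when a result does not match it, which is close to the "error rather than wrong answer" behavior asked for earlier in the thread.

```r
pieces <- list(a = 1:3, b = 4:6, c = integer(0))

# vapply() checks every result against the template integer(1).
lens <- vapply(pieces, length, FUN.VALUE = integer(1))

# On empty input sapply() falls back to an empty list, while vapply()
# still returns the declared type.
empty_s <- sapply(list(), length)
empty_v <- vapply(list(), length, FUN.VALUE = integer(1))

# A result of the wrong type is an error rather than a silent coercion.
mismatch <- tryCatch(
  vapply(pieces, function(p) "text", FUN.VALUE = integer(1)),
  error = function(e) "error"
)

print(lens)
print(mismatch)
```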
I agree that it is non-trivial to solve the cases you & I have posed.
However, I would wholeheartedly support having an error spit back for
any function that does not explicitly support a class. In this case,
if I attempt to do sapply(x, class), and 'x' is of class "difftime",
then I should receive an error "sapply cannot function upon class
'difftime' ". Why do I take this stance? There are at least 2 strong
reasons:
I don't see why that command should be a problem because class()
returns a string. A better example might be sapply(x, identity), which
in general you would hope to be identical to x:

x <- structure(1:10, class = "blah")
identical(x, sapply(x, identity))
# [1] FALSE

Hadley
Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University http://had.co.nz/
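
A small follow-up sketch on the identity example: since sapply() returns a bare vector, one manual workaround is to copy the attributes back afterwards. This is an illustration only, not a general fix. It is safe here solely because identity() changes neither the length nor the meaning of the values, which is exactly the guarantee sapply() cannot make for an arbitrary function.

```r
x <- structure(1:10, class = "blah")

# The result comes back as a plain integer vector with no attributes.
res <- sapply(x, identity)
stripped <- is.null(attributes(res))

# Copy the class (and any other attributes) back by hand.
attributes(res) <- attributes(x)

print(stripped)
print(identical(x, res))
```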
Hi Mike,

I definitely understand your point. I don't have any particularly
good ideas, though I think you might like S4, which is the newer
formal class/methods system. As a note, I misspoke (or miswrote) that
difftime inherits from numeric: its mode is numeric, but it does not
inherit from numeric.

Cheers,
Josh
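
For readers curious what the S4 suggestion buys, here is a minimal sketch of a validated class. The class name "PosReal" and its rule are hypothetical, echoing the earlier "posreal" thought example. It uses only setClass() and new() from the methods package.

```r
library(methods)

# Hypothetical S4 counterpart of the "posreal" thought example.
setClass(
  "PosReal",
  contains = "numeric",
  validity = function(object) {
    if (any(object@.Data <= 0)) "all values must be positive" else TRUE
  }
)

ok <- new("PosReal", c(1, 2.5, 4))

# new() runs the validity function, so an invalid object is rejected
# at construction time.
bad <- tryCatch(new("PosReal", c(1, -2)), error = function(e) "rejected")

print(is(ok, "PosReal"))
print(bad)
```

Validity is checked when the object is created; whether later operations on the object re-check it depends on the methods involved, so validObject() may still need to be called by hand.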
On Thu, Nov 3, 2011 at 4:49 PM, Mike Williamson <this.is.mvw at gmail.com> wrote:
Hi Joshua,

Thank you for the input!

I agree that it is non-trivial to solve the cases you & I have posed.
However, I would wholeheartedly support having an error spit back for
any function that does not explicitly support a class. In this case,
if I attempt to do sapply(x, class), and 'x' is of class "difftime",
then I should receive an error "sapply cannot function upon class
'difftime' ". Why do I take this stance? There are at least 2 strong
reasons:

Most importantly, an incorrect answer is far more dangerous than no
answer. E.g., if I ask "what is 3 + 3?", I would far prefer to receive
"I don't know" than "5". The former lets me know I need to choose
another path, the latter mistakenly makes me think I have an answer,
when I do not, and I continue with analyses on the assumption that
answer is correct. In the case of dates, this happens often. E.g., is
the numeric that is returned from sapply, for instance, the # of
seconds since 1970-01-01, or the number of days since 1970-01-01. This
depends upon how 'R' internally attempts to fix any incongruities.

But also very significantly, an error will get me in the habit of
avoiding any marginalized class types. I keep thinking, for instance,
that I can use the "Dates" class, since 'R' says that it supports
them. But if I got into the habit of converting all dates into
numerics myself beforehand (maybe counting the number of seconds from
1970-01-01, since that seems a magic date), then I would be guaranteed
that a function will either (a) cause an error (e.g., if I try a
character function on it), or (b) function properly. However, since I
don't overtly receive errors when attempting to use dates (or
difftimes, or factors, or whatever), I keep using them, instead of
relying solely upon the true & trusted classes.

The trickiest here is really factors. Factors are, by most accounts,
considered a core class. In some cases, you can only use factors.
E.g., when you want some sort of ordinal categorical variable.
Therefore, the fact that factors also barf similarly to other classes
like difftime (albeit much more rarely), is especially dangerous.

There are, of course, habits that we can create to make ourselves
better programmers, and I will recognize that I can improve. However,
this issue of functions generating "wrong" answers is such a huge
problem with 'R', and other languages are catching up to 'R' so
quickly, as far as their capability to handle higher level math, that
this issue is making 'R' a less desirable language to use, as time
progresses. I don't mean to claim that my opinion is the
end-all-be-all, but I would like to hear others chime in, whether this
is a large concern, or whether there is a very small minority of folks
impacted by it.

Regards,
Mike
---
XKCD

On Thu, Nov 3, 2011 at 2:51 PM, Joshua Wiley <jwiley.psych at gmail.com> wrote:
Hi Mike,
This isn't really an answer to your question, but perhaps will serve
to continue discussion. [...]
Joshua Wiley Ph.D. Student, Health Psychology Programmer Analyst II, ATS Statistical Consulting Group University of California, Los Angeles https://joshuawiley.com/
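
The ambiguity raised in the thread, whether a date stripped of its class counts days or seconds since 1970-01-01, comes from the two base date classes using different units. A minimal sketch with a made-up date:

```r
d  <- as.Date("1970-01-10")
ts <- as.POSIXct("1970-01-10 00:00:00", tz = "UTC")

# Date stores days since 1970-01-01; POSIXct stores seconds.
days_since_epoch    <- unclass(d)
seconds_since_epoch <- as.numeric(ts)

print(days_since_epoch)
print(seconds_since_epoch)
```

So the same calendar day yields two different bare numbers, depending on which class it had before the class was dropped.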