"PD" == Peter Dalgaard <p.dalgaard at biostat.ku.dk>
on 28 Jun 2005 14:57:42 +0200 writes:
PD> "Liaw, Andy" <andy_liaw at merck.com> writes:
>> The issue is not with boxplot, but with split. boxplot.formula()
>> calls boxplot(split(split(mf[[response]], mf[-response]), ...),
>> but look at what split() returns when there are empty levels in
>> the factor:
>>
>> > f <- factor(gl(3, 6), levels=1:5)
>> > y <- rnorm(f)
>> > split(y, f)
>> $"1"
>> [1] 0.4832124 1.1924811 0.3657797 1.7400198 0.5577356 0.9889520
>>
>> $"2"
>> [1] -1.1296642 -0.4808355 -0.2789933 0.1220718 0.1287742 -0.7573801
>>
>> $"3"
>> [1] 1.2320902 0.5090700 -1.5508074 2.1373780 1.1681297 -0.7151561
>>
>> The "culprit" is the following in split.default():
>>
>> f <- factor(f)
>>
>> which drops empty levels in f, if there are any. BTW, ?split doesn't
>> mention what it does in such situation. Perhaps it should?
>>
>> If this is to be "fixed", I suppose an additional argument, e.g.,
>> drop=TRUE, can be added, and the corresponding line mentioned
>> above changed to something like:
>>
>> if (drop || !is.factor(f)) f <- factor(f)
>>
>> Then this additional argument can be pass on from boxplot.formula() to
>> split().
PD> Alternatively, I suspect that the intention was as.factor() rather
PD> than factor().
at first I thought Peter was right; but the real source of
split.default contains a comment (!) and that line is
f <- factor(f) # drop extraneous levels
so it seems, this was done there very much on purpose.
OTOH, S(-plus) has implemented it quite a bit differently, and actually
does keep the empty levels in the example
f <- factor(rep(1:3, each=6), levels=1:5); y <- rnorm(f); split(y, f)
PD> It does require a bit of care to fix it that way,
PD> though. There could be problems with empty levels popping up in
PD> unexpected places.
Indeed!
Given the new facts, I think we want to go in Andy's direction
with a new argument, 'drop'
A Peter mentioned, the real question is about its default.
"drop = TRUE" would be fully compatible with previous versions of R.
"drop = FALSE" would be compatible with S and S-plus.
I'm going to implement it, and try to see if 'drop = FALSE'
gives changes for R and its standard packages; if 'yes', that
would be an indication that such a R-back-compatibility breaking
change was not a good idea. If 'no', I could commit it and see
if it has an effect on the CRAN packages....
Of course, since split() and split()<- are S3 generics, and
since there's also unsplit(), this entails a whole slew of
changes {adding a "drop = FALSE" argument everywhere!}
and I presume will break everyone's code who has written own
split.foobar methods....
great...
Martin
[ Hmm, is everyone of those interested in changes inside R "sleeping" ,
uninterested, ...
]
"MM" == Martin Maechler <maechler at stat.math.ethz.ch>
on Fri, 1 Jul 2005 18:36:54 +0200 writes:
"PD" == Peter Dalgaard <p.dalgaard at biostat.ku.dk>
on 28 Jun 2005 14:57:42 +0200 writes:
PD> "Liaw, Andy" <andy_liaw at merck.com> writes:
>>> The issue is not with boxplot, but with split. boxplot.formula()
>>> calls boxplot(split(split(mf[[response]], mf[-response]), ...),
>>> but look at what split() returns when there are empty levels in
>>> the factor:
>>>
>>> > f <- factor(gl(3, 6), levels=1:5)
>>> > y <- rnorm(f)
>>> > split(y, f)
>>> $"1"
>>> [1] 0.4832124 1.1924811 0.3657797 1.7400198 0.5577356 0.9889520
>>>
>>> $"2"
>>> [1] -1.1296642 -0.4808355 -0.2789933 0.1220718 0.1287742 -0.7573801
>>>
>>> $"3"
>>> [1] 1.2320902 0.5090700 -1.5508074 2.1373780 1.1681297 -0.7151561
>>>
>>> The "culprit" is the following in split.default():
>>>
>>> f <- factor(f)
>>>
>>> which drops empty levels in f, if there are any. BTW, ?split doesn't
>>> mention what it does in such situation. Perhaps it should?
>>>
>>> If this is to be "fixed", I suppose an additional argument, e.g.,
>>> drop=TRUE, can be added, and the corresponding line mentioned
>>> above changed to something like:
>>>
>>> if (drop || !is.factor(f)) f <- factor(f)
>>>
>>> Then this additional argument can be pass on from boxplot.formula() to
>>> split().
PD> Alternatively, I suspect that the intention was as.factor() rather
PD> than factor().
MM> at first I thought Peter was right; but the real source of
MM> split.default contains a comment (!) and that line is
MM> f <- factor(f) # drop extraneous levels
MM> so it seems, this was done there very much on purpose.
MM> OTOH, S(-plus) has implemented it quite a bit differently, and actually
MM> does keep the empty levels in the example
MM> f <- factor(rep(1:3, each=6), levels=1:5); y <- rnorm(f); split(y, f)
PD> It does require a bit of care to fix it that way,
PD> though. There could be problems with empty levels popping up in
PD> unexpected places.
MM> Indeed!
MM> Given the new facts, I think we want to go in Andy's direction
MM> with a new argument, 'drop'
MM> A Peter mentioned, the real question is about its default.
MM> "drop = TRUE" would be fully compatible with previous versions of R.
MM> "drop = FALSE" would be compatible with S and S-plus.
MM> I'm going to implement it, and try to see if 'drop = FALSE'
MM> gives changes for R and its standard packages; if 'yes', that
MM> would be an indication that such a R-back-compatibility breaking
MM> change was not a good idea. If 'no', I could commit it and see
MM> if it has an effect on the CRAN packages....
MM> Of course, since split() and split()<- are S3 generics, and
MM> since there's also unsplit(), this entails a whole slew of
MM> changes {adding a "drop = FALSE" argument everywhere!}
MM> and I presume will break everyone's code who has written own
MM> split.foobar methods....
MM> great...
MM> Martin
The change doesn't seem to affect the "standard" packages at all
which is good. On CRAN, it seems there are two packages only that
have split() or split()<- methods, namely 'spatstat' and 'compositions'.
If we introduced the extra argument 'drop',
these and every other user code defining split methods would
have to be updated to be compatible with the changed (S3)
generic having an extra argument 'drop'.
With this in mind, after more thought, I think that Peter's
initial proposal ---just replacing 'factor()' by 'as.factor()'
inside split--- seems to be nicer than introducing 'drop' and
*change* the default behavior to 'drop = FALSE' for the
following reasons :
1) people who rely on the current behavior would have to change
their calls to split() anyway;
2) instead of calling
split(x, f, drop=TRUE)
they can as well go for
split(x, factor(f))
which has identical effect but does not introduce an extra
argument 'drop'.
3) advantage of slightly higher compatibility with S
---
I intend to change this in R-devel
{with appropriate notes in NEWS !} during this week, unless
someone finds good reasons for a different (or no) change.
Martin
I have now committed the new split(x, f, drop = FALSE)
to R-devel --- entailing non-backward compatible behavior,
but consistency with factor indexing (and with S-plus) ---
split() and "split<-" and unsplit() functions and methods to
R-devel.
This does automatically fix the original posters "boxplot by
factor" bug.
"MM" == Martin Maechler <maechler at stat.math.ethz.ch>
on Mon, 4 Jul 2005 09:15:59 +0200 writes:
"PD" == Peter Dalgaard <p.dalgaard at biostat.ku.dk>
on 28 Jun 2005 14:57:42 +0200 writes:
PD> "Liaw, Andy" <andy_liaw at merck.com> writes:
>>>> The issue is not with boxplot, but with split. boxplot.formula()
>>>> calls boxplot(split(split(mf[[response]], mf[-response]), ...),
>>>> but look at what split() returns when there are empty levels in
>>>> the factor:
>>>>
>>>> > f <- factor(gl(3, 6), levels=1:5)
>>>> > y <- rnorm(f)
>>>> > split(y, f)
>>>> $"1"
>>>> [1] 0.4832124 1.1924811 0.3657797 1.7400198 0.5577356 0.9889520
>>>>
>>>> $"2"
>>>> [1] -1.1296642 -0.4808355 -0.2789933 0.1220718 0.1287742 -0.7573801
>>>>
>>>> $"3"
>>>> [1] 1.2320902 0.5090700 -1.5508074 2.1373780 1.1681297 -0.7151561
>>>>
>>>> The "culprit" is the following in split.default():
>>>>
>>>> f <- factor(f)
>>>>
>>>> which drops empty levels in f, if there are any. BTW, ?split doesn't
>>>> mention what it does in such situation. Perhaps it should?
>>>>
>>>> If this is to be "fixed", I suppose an additional argument, e.g.,
>>>> drop=TRUE, can be added, and the corresponding line mentioned
>>>> above changed to something like:
>>>>
>>>> if (drop || !is.factor(f)) f <- factor(f)
>>>>
>>>> Then this additional argument can be pass on from boxplot.formula() to
>>>> split().
PD> Alternatively, I suspect that the intention was as.factor() rather
PD> than factor().
MM> at first I thought Peter was right; but the real source of
MM> split.default contains a comment (!) and that line is
MM> f <- factor(f) # drop extraneous levels
MM> so it seems, this was done there very much on purpose.
MM> OTOH, S(-plus) has implemented it quite a bit differently, and actually
MM> does keep the empty levels in the example
MM> f <- factor(rep(1:3, each=6), levels=1:5); y <- rnorm(f); split(y, f)
PD> It does require a bit of care to fix it that way,
PD> though. There could be problems with empty levels popping up in
PD> unexpected places.
MM> Indeed!
MM> Given the new facts, I think we want to go in Andy's direction
MM> with a new argument, 'drop'
MM> A Peter mentioned, the real question is about its default.
MM> "drop = TRUE" would be fully compatible with previous versions of R.
MM> "drop = FALSE" would be compatible with S and S-plus.
MM> I'm going to implement it, and try to see if 'drop = FALSE'
MM> gives changes for R and its standard packages; if 'yes', that
MM> would be an indication that such a R-back-compatibility breaking
MM> change was not a good idea. If 'no', I could commit it and see
MM> if it has an effect on the CRAN packages....
MM> Of course, since split() and split()<- are S3 generics, and
MM> since there's also unsplit(), this entails a whole slew of
MM> changes {adding a "drop = FALSE" argument everywhere!}
MM> and I presume will break everyone's code who has written own
MM> split.foobar methods....
MM> great...
MM> Martin
MM> The change doesn't seem to affect the "standard" packages at all
MM> which is good. On CRAN, it seems there are two packages only that
MM> have split() or split()<- methods, namely 'spatstat' and 'compositions'.
MM> If we introduced the extra argument 'drop',
MM> these and every other user code defining split methods would
MM> have to be updated to be compatible with the changed (S3)
MM> generic having an extra argument 'drop'.
MM> With this in mind, after more thought, I think that Peter's
MM> initial proposal ---just replacing 'factor()' by 'as.factor()'
MM> inside split--- seems to be nicer than introducing 'drop' and
MM> *change* the default behavior to 'drop = FALSE' for the
MM> following reasons :
MM> 1) people who rely on the current behavior would have to change
MM> their calls to split() anyway;
MM> 2) instead of calling
MM> split(x, f, drop=TRUE)
MM> they can as well go for
MM> split(x, factor(f))
MM> which has identical effect but does not introduce an extra
MM> argument 'drop'.
MM> 3) advantage of slightly higher compatibility with S
MM> ---
MM> I intend to change this in R-devel
MM> {with appropriate notes in NEWS !} during this week, unless
MM> someone finds good reasons for a different (or no) change.
MM> Martin
MM> ______________________________________________
MM> R-devel at r-project.org mailing list
MM> https://stat.ethz.ch/mailman/listinfo/r-devel
MM> !DSPAM:42c8e272288132092019954!