Unexpected behaviour of identical (PR#6799)

"Swinton, Jonathan" <Jonathan.Swinton@astrazeneca.com> writes:
# works as expected
ac <- c('A', 'B')
identical(ac, ac[1:2])
[1] TRUE

# but
af <- factor(ac)
identical(af, af[1:2])
[1] FALSE

Any opinions?
Did a cross-check with Splus and it doesn't do that, so I think it
qualifies as a bug. Shouldn't be too hard to fix (might lose a little
efficiency though).
   O__  ---- Peter Dalgaard             Blegdamsvej 3
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)             FAX: (+45) 35327907
No, it comes from

get("[.factor")
function (x, i, drop = FALSE)
{
    y <- NextMethod("[")
    class(y) <- oldClass(x)
    attr(y, "contrasts") <- attr(x, "contrasts")
    attr(y, "levels") <- attr(x, "levels")
    if (drop)
        factor(y)
    else y
}

attributes(af[1:2, drop=TRUE])
$levels
[1] "A" "B"

$class
[1] "factor"

attributes(af[1:2, drop=FALSE])
$class
[1] "factor"

$levels
[1] "A" "B"

and one needs to swap the order in which the class and levels attributes
are set.  I am about to commit the change.
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
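
To illustrate the diagnosis above, here is a minimal sketch (not the actual
patch) of how setting the class before the levels produces the mismatch, and
how swapping the two assignments avoids it. The helper names subset_old and
subset_fixed are hypothetical, and unclass(x)[i] stands in for
NextMethod("["). The sketch assumes a version of R in which identical() has
an attrib.as.set argument; it is set to FALSE so that the comparison is
sensitive to attribute order, as it was in this thread.

```r
# Hypothetical stand-ins for the "[.factor" method shown above.
# unclass(x)[i] replaces NextMethod("[") so they can be called directly.

subset_old <- function(x, i) {
    y <- unclass(x)[i]                      # plain integer codes, attributes dropped
    class(y) <- oldClass(x)                 # class first ...
    attr(y, "contrasts") <- attr(x, "contrasts")
    attr(y, "levels") <- attr(x, "levels")  # ... then levels
    y
}

subset_fixed <- function(x, i) {
    y <- unclass(x)[i]
    attr(y, "contrasts") <- attr(x, "contrasts")
    attr(y, "levels") <- attr(x, "levels")  # levels first, as factor() does
    class(y) <- oldClass(x)                 # class last
    y
}

af <- factor(c("A", "B"))

# Order-sensitive comparisons (assumption: attrib.as.set is available).
old_same   <- identical(af, subset_old(af, 1:2),   attrib.as.set = FALSE)
fixed_same <- identical(af, subset_fixed(af, 1:2), attrib.as.set = FALSE)
```

Under these assumptions old_same should be FALSE and fixed_same TRUE, because
only subset_fixed stores the attributes in the same order as factor().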
I got to about the same spot and started thinking about methods
for putting attributes back in the same order that they were found, as
in

A <- attributes(x)
attributes(y) <- A[names(A) %in% c("class","contrasts","levels")]

Just swapping the order is probably fine, though.
   O__  ---- Peter Dalgaard             Blegdamsvej 3
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)             FAX: (+45) 35327907
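
The idea of putting attributes back in the order in which they were found
can be wrapped in a small helper. This is only a sketch built around the
two-line fragment above; the name restore_attrs and its keep argument are
hypothetical, and unclass(af)[1:2] again stands in for the result of
NextMethod("[").

```r
# Hypothetical helper: copy selected attributes from x to y,
# keeping the order in which they are stored on x.
restore_attrs <- function(x, y, keep = c("class", "contrasts", "levels")) {
    A <- attributes(x)
    attributes(y) <- A[names(A) %in% keep]  # replaces ALL attributes of y
    y
}

af <- factor(c("A", "B"))
y  <- restore_attrs(af, unclass(af)[1:2])

# The subset now carries its attributes in the same order as the original.
same_order <- identical(names(attributes(y)), names(attributes(af)))
```

Because attributes<- replaces every attribute of y, anything not listed in
keep (names, for example) would be dropped by this sketch.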
What about changing identical() to ignore the order of attributes?  Is 
there any code anywhere that depends on the order of attributes, other than 
identical()?  I've only seen attributes treated as an unordered set, and 
never as an ordered list.  There are some functions in S-plus that change 
the order of attributes, and the only thing this affects is 
identical().  (Which in S-plus also pays attention to the order of attributes.)

-- Tony Plate
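
One way to picture the proposal above is a wrapper that sorts attributes by
name before comparing. This is only a sketch: sorted_attrs and
identical_unordered are hypothetical names, and only top-level attributes
are normalised. It assumes a version of R whose identical() has an
attrib.as.set argument, set to FALSE here so that the wrapper, not
identical() itself, is what removes the dependence on order.

```r
# Return the attributes of obj as a list sorted by attribute name.
sorted_attrs <- function(obj) {
    a <- attributes(obj)
    if (is.null(a)) NULL else a[order(names(a))]
}

# Compare the bare data and the name-sorted attributes separately.
identical_unordered <- function(x, y) {
    ax <- sorted_attrs(x)
    ay <- sorted_attrs(y)
    attributes(x) <- NULL
    attributes(y) <- NULL
    identical(x, y, attrib.as.set = FALSE) &&
        identical(ax, ay, attrib.as.set = FALSE)
}

x <- factor(c("A", "B"))           # stored order: levels, class
y <- c(1L, 2L)
class(y) <- "factor"               # stored order: class, levels
attr(y, "levels") <- c("A", "B")

strict  <- identical(x, y, attrib.as.set = FALSE)  # order-sensitive
relaxed <- identical_unordered(x, y)               # order-insensitive
```

Here strict should be FALSE and relaxed TRUE, since the two objects differ
only in the order in which their attributes are stored.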
I wondered that, but I think we need to hear from the author of 
identical().

It is neater to have attributes printed in a consistent order, though.

Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
AFAIK identical() was first introduced by Chambers in "Programming with
Data".  On p. 262 he writes:

identical: The two objects must be exactly equal in all respects; if not 
identical returns FALSE
all.equal: The two objects are expected to be identical up to small 
differences that might be considered irrelevant...

Taken literally, this would seem to argue against identical() treating 
attributes as a set (unless one were to tighten up the definition of 
attributes in Section 2.2 of the R Language Definition to explicitly state 
that attributes are to be treated as an unordered set).

However, given the primary use of identical() on complex objects is in 
software testing, and AFAIK no software depends on the order of attributes, 
I still think it would be reasonable for attributes to be treated as a set 
by identical().  (Unless anyone can show that it's important to recognize 
order of attributes in some code.)

I'm proposing a more general fix because I strongly suspect that factor
subsetting is not the only operation that can change the order of
attributes, and because I've wasted many hours tracking down failures that
turned out to be caused by the behaviour of data.dump() and identical() in
S-plus.  Another possible fix might be for the attr() and attributes()
replacement functions to store attributes as a sorted list.  I don't know
if this would be easy or difficult to implement, or what consequences it
might have for existing tests that involve printed output of attributes.

-- Tony Plate
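
The second suggestion above, storing attributes in sorted order, can be
imitated at user level. The sketch below is an assumption about what such
behaviour might look like, not a description of how R stores attributes;
set_attr_sorted is a hypothetical helper, and the comparison again assumes
the attrib.as.set argument of identical().

```r
# Hypothetical setter: assign one attribute, then rewrite all attributes
# sorted by name so the stored order no longer depends on assignment order.
set_attr_sorted <- function(obj, name, value) {
    attr(obj, name) <- value
    a <- attributes(obj)
    attributes(obj) <- a[order(names(a))]
    obj
}

# Same attributes, assigned in opposite orders.
x <- set_attr_sorted(c(1L, 2L), "levels", c("A", "B"))
x <- set_attr_sorted(x, "class", "factor")

y <- set_attr_sorted(c(1L, 2L), "class", "factor")
y <- set_attr_sorted(y, "levels", c("A", "B"))

same <- identical(x, y, attrib.as.set = FALSE)  # order-sensitive comparison
```

One caveat: attributes<- always assigns a dim attribute before the others,
so a strictly alphabetical order is not guaranteed for objects with
dimensions.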
Tony Plate wrote:

AFAIK identical() was first introduced by Chambers in "Programming with
Data".  On p. 262 he writes:

identical: The two objects must be exactly equal in all respects; if not
identical returns FALSE
all.equal: The two objects are expected to be identical up to small
differences that might be considered irrelevant...

Taken literally, this would seem to argue against identical() treating
attributes as a set (unless one were to tighten up the definition of
attributes in Section 2.2 of the R Language Definition to explicitly state
that attributes are to be treated as an unordered set).

We're certainly down in the fine points here, so arguments either way
aren't very strong, but on the whole it seems cleaner to keep identical
on the pedantic side, dealing with what's actually in the object, rather
than what was "meant".

Yes, for practical purposes attribute order had better NOT matter, but we
do store the attributes in a way that creates an "order", i.e., as an
internal vector or list structure rather than, say, a hash table.

Tony Plate wrote:

However, given the primary use of identical() on complex objects is in
software testing, and AFAIK no software depends on the order of attributes,
I still think it would be reasonable for attributes to be treated as a set
by identical().  (Unless anyone can show that it's important to recognize
order of attributes in some code.)

Treating attributes as a set would have some logical appeal, but it
seems likely the fix would have to be more widespread than just to
identical().  Otherwise, for example, you could find yourself in a
situation where:
  identical(x,y)
was TRUE but
  identical(attributes(x), attributes(y))
was FALSE, because attributes() just reported out the attributes in
their (irrelevant) stored order.
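
The situation described above can be reproduced with a short sketch. It
assumes a version of R in which identical() compares attributes as a set
by default (attrib.as.set = TRUE); the object y is built by hand so that
its attributes are stored in the opposite order to those of x.

```r
x <- factor(c("A", "B"))              # stored order: levels, class
y <- c(1L, 2L)
class(y) <- "factor"                  # stored order: class, levels
attr(y, "levels") <- c("A", "B")

# Attributes compared as a set (the assumed default).
objects_same <- identical(x, y)

# attributes() returns plain lists in stored order, compared element by element.
attrs_same <- identical(attributes(x), attributes(y))
```

Under that assumption objects_same is TRUE while attrs_same is FALSE, which
is the inconsistency the paragraph above warns about.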

Tony Plate wrote:

I'm proposing a more general fix because I strongly suspect that factor
subsetting is not the only operation that can change the order of
attributes, and because I've wasted many hours tracking down failures that
turned out to be caused by the behaviour of data.dump() and identical() in
S-plus.  Another possible fix might be for the attr() and attributes()
replacement functions to store attributes as a sorted list.  I don't know
if this would be easy or difficult to implement, or what consequences it
might have for existing tests that involve printed output of attributes.

Yes, as above, it does seem that a satisfactory solution would require
treating attributes() as something other than a vector, returned in
internal order.

Once started down this path, there are a number of other cases where a
vector has been used, for convenience, when an unordered set was the
more likely model.  I think there have been debates over whether the
order of the levels of an unordered factor should be considered
relevant.

It would increase consistency to replace vectors in these examples with
an efficient structure that only depended on the set of values
(presumably a suitable hashing mechanism would do).  But it's not too
likely to get to the head of the priority queue, I'd guess.

It's not out of the question, as an alternative that doesn't require
deep changes to the system, to write methods for identical() for some
classes of objects.

John M. Chambers                  jmc@bell-labs.com
Bell Labs, Lucent Technologies    office: (908)582-2681
700 Mountain Avenue, Room 2C-282  fax:    (908)582-3340
Murray Hill, NJ  07974            web: http://www.cs.bell-labs.com/~jmc
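
The last alternative in the message above, class-specific comparison
methods, might look like the following sketch. identical() is not an S3
generic in base R, so the sketch uses a hypothetical generic, same_object.
The factor method compares class, levels and integer codes and ignores how
the attributes happen to be stored; it is deliberately simplified and also
ignores contrasts and names, which a real method would need to consider.

```r
# Hypothetical generic standing in for "methods for identical()".
same_object <- function(x, y) UseMethod("same_object")

# Default: strict, order-sensitive comparison
# (assumption: identical() has the attrib.as.set argument).
same_object.default <- function(x, y) identical(x, y, attrib.as.set = FALSE)

# Factors: compare what the object means, not how it is stored.
same_object.factor <- function(x, y) {
    is.factor(y) &&
        identical(class(x), class(y)) &&
        identical(levels(x), levels(y)) &&
        identical(as.integer(x), as.integer(y))
}

x <- factor(c("A", "B"))           # stored order: levels, class
y <- c(1L, 2L)
class(y) <- "factor"               # stored order: class, levels
attr(y, "levels") <- c("A", "B")

same_factor <- same_object(x, y)
```

With this method same_factor should be TRUE even though the two factors
store their attributes in different orders.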