setdiff bizarre (was: odd behavior out of setdiff)

Stavros Macrakis wrote:
     '1:3' %in% data.frame(a=2:4,b=1:3)  # TRUE

utterly weird.  so what would x have to be so that

    x %in% data.frame('a')
    # TRUE

hint: 

    '1' %in% data.frame(1)
    # TRUE

vQ
%in% is a thin wrapper on a call to match().  match() is
not a generic function (and is not documented to be one),
so it treats data.frames as lists, as their underlying
representation is a list of columns.  match is documented
to convert lists to character and to then run the character
version of match on that character data.  match does not
bail out if the types of the x and table arguments don't match
(that would be undesirable in the integer/numeric mismatch case).
Hence
   '1' %in% data.frame(1) # -> TRUE
is acting consistently with
   match(as.character(pi), c(1, pi, exp(1))) # -> 2
and
   1L %in% c(1.0, 2.0, 3.0) # -> TRUE
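
To make the "thin wrapper" point concrete, here is a minimal sketch, assuming only base R. The name in_via_match is made up for illustration; it mirrors how %in% is written in terms of match(), and the comparisons below do not depend on what as.character() returns for a data frame.

```r
# Hypothetical stand-in for %in%, written directly in terms of match().
in_via_match <- function(x, table) match(x, table, nomatch = 0L) > 0L

df <- data.frame(a = 2:4, b = 1:3)

# match() sees the data frame as a list of columns, so the table
# has two elements here, not three rows.
n_table_elements <- length(unclass(df))

# Both spellings go through match(), so they must agree.
same_scalar <- identical(in_via_match("1:3", df), "1:3" %in% df)
same_vector <- identical(in_via_match(1:3, df), 1:3 %in% df)

# The integer/double case: no bail-out on a type mismatch.
int_in_double <- 1L %in% c(1.0, 2.0, 3.0)
```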

The related functions, duplicated() and unique(), do have
row-wise data.frame methods.  E.g.,
   > duplicated(data.frame(x=c(1,2,2,3,3),y=letters[c(1,1,2,2,2)]))
   [1] FALSE FALSE FALSE FALSE  TRUE
Perhaps match() ought to have one also.  S+'s match is generic
and has a data.frame method (which is row-oriented) so there we get:
   > match(data.frame(x=c(1,3,5), y=letters[c(1,3,5)]),
   +       data.frame(x=1:10,y=letters[1:10]))
   [1] 1 3 5
   > is.element(data.frame(x=1:10,y=letters[1:10]),
   +            data.frame(x=c(1,3,5), y=letters[c(1,3,5)]))
    [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE

I think that %in% and is.element() ought to remain calls to match()
and that if you want them to work row-wise on data.frames then
match should get a data.frame method.

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
-----Original Message-----
From: r-devel-bounces at r-project.org 
[mailto:r-devel-bounces at r-project.org] On Behalf Of Wacek Kusnierczyk
Sent: Tuesday, June 02, 2009 9:11 AM
To: Stavros Macrakis
Cc: r-devel at r-project.org; dwinsemius at comcast.net
Subject: Re: [Rd] setdiff bizarre

Stavros Macrakis wrote:
     '1:3' %in% data.frame(a=2:4,b=1:3)  # TRUE

utterly weird.  so what would x have to be so that

    x %in% data.frame('a')
    # TRUE

hint: 

    '1' %in% data.frame(1)
    # TRUE

vQ


but simply treats the data frame as a *character* list:

    1 %in% data.frame(a=2,b=1)  # TRUE
    '1' %in% data.frame(a=2,b=1)  # TRUE
    1 %in% data.frame(a=2:3,b=1:2) # FALSE
    1:3 %in% data.frame(a=2:4,b=1:3)  # FALSE FALSE FALSE
    '1:3' %in% data.frame(a=2:4,b=1:3)  # TRUE

It applies as.character to the dataframe:

 > z=data.frame(a=2:4,b=1:3)
 > as.character(z)
 [1] "2:4" "1:3"

  The as.character method for data frames seems to spot integer
sequences (but only for int types and not num types) and show the a:b
notation:

 > x=data.frame(z=as.integer(c(1,2,3,4,5)))
 > str(x)
 'data.frame':	5 obs. of  1 variable:
  $ z: int  1 2 3 4 5
 > as.character(x)
 [1] "1:5"

 Obviously it doesn't do this for vectors:

 > as.character(x$z)
 [1] "1" "2" "3" "4" "5"

 I suspect it's using 'deparse()' to get the character representation.
This function is mentioned in ?as.character, but as.character.default
disappears into the infernal .Internal and I don't have time to chase
source code - it's sunny outside!
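
A quick way to see the difference between the list path and the vector path, as a sketch in base R. The paste(deparse(...)) line is only there for comparison in the integer case shown here; it is not a claim about what the internal code actually calls.

```r
x <- data.frame(z = as.integer(c(1, 2, 3, 4, 5)))

# On the data frame (a list of columns): one string per column.
per_column <- as.character(x)

# On the column itself (an atomic vector): one string per element.
per_element <- as.character(x$z)

# deparse() of the same integer column, collapsed to a single string,
# to set beside per_column.
deparsed <- paste(deparse(x$z), collapse = "")

print(per_column)
print(per_element)
print(deparsed)
```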

Barry

Bill Dunlap wrote:
> %in% is a thin wrapper on a call to match().  match() is
> not a generic function (and is not documented to be one),
> so it treats data.frames as lists, as their underlying
> representation is a list of columns.  match is documented
> to convert lists to character and to then run the character
> version of match on that character data.  match does not
> bail out if the types of the x and table arguments don't match
> (that would be undesirable in the integer/numeric mismatch case).

yes, i understand that this is documented behaviour, and that it's not a
bug.  nevertheless, the example is odd, and hints that there's a design
flaw.  i also do not understand why the following should be useful and
desirable:

    as.character(list('a'))
    # "a"

    as.character(data.frame('a'))
    # "1"

and hence

    'a' %in% list('a')
    # TRUE

while

    'a' %in% data.frame('a')
    # FALSE
    '1' %in% data.frame('a')
    # TRUE

there is a mechanistic explanation for how this works, but is there one
for why this works this way?
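
One mechanistic detail that may help here: at the time of this thread, data.frame() converted character columns to factors by default, so the column in data.frame('a') is a factor whose single value is stored as the integer code 1. The sketch below makes that explicit with stringsAsFactors = TRUE (newer R versions default to FALSE, so results without that argument may differ). It only inspects the pieces and prints the conversions rather than asserting what as.character() returns for the data frame.

```r
# Force the old default so the character column becomes a factor.
df <- data.frame("a", stringsAsFactors = TRUE)

col <- df[[1]]
col_is_factor <- is.factor(col)   # a factor, not a character vector
col_levels <- levels(col)         # the label "a" lives in the levels
col_codes <- as.integer(col)      # the underlying data is an integer code

# A plain list keeps the string itself as its element.
lst <- list("a")
lst_is_character <- is.character(lst[[1]])

# What match() gets to compare against after conversion to character.
print(as.character(df))
print(as.character(lst))
```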

> Hence
>    '1' %in% data.frame(1) # -> TRUE
> is acting consistently with
>    match(as.character(pi), c(1, pi, exp(1))) # -> 2
> and
>    1L %in% c(1.0, 2.0, 3.0) # -> TRUE
>
> The related functions, duplicated() and unique(), do have
> row-wise data.frame methods.  E.g.,
>    > duplicated(data.frame(x=c(1,2,2,3,3),y=letters[c(1,1,2,2,2)]))
>    [1] FALSE FALSE FALSE FALSE  TRUE
> Perhaps match() ought to have one also.  S+'s match is generic
> and has a data.frame method (which is row-oriented) so there we get:
>    >  match(data.frame(x=c(1,3,5), y=letters[c(1,3,5)]),
> data.frame(x=1:10,y=letters[1:10]))
>    [1] 1 3 5
>    > is.element(data.frame(x=1:10,y=letters[1:10]),
> data.frame(x=c(1,3,5), y=letters[c(1,3,5)]))
>     [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
>
> I think that %in% and is.element() ought to remain calls to match()
> and that if you want them to work row-wise on data.frames then
> match should get a data.frame method.

sounds good to me.  how is

    'a' %in% data.frame('a')

in S+?

thanks for the response.

regards,
vQ
S+ gives:
   >  'a' %in% data.frame(letters)
   [1] TRUE
   > 'a' %in% data.frame(letters[2:26])
   [1] FALSE
but that special case, x a scalar and table a data.frame with
one column, gets by more or less by accident.
   > 'a' %in% data.frame(letters, num=1:26)
   Problem in match.data.frame(x, table, nomatch, incom..: table must be a list the same length as x
   > c('a', 'b') %in% data.frame(letters)
   Problem in match.data.frame(x, table, nomatch, incom..: table must be a list the same length as x
The intent is that the x and table arguments to match be
compatible data.frames.

S+'s match works differently on lists than R's does.  It is set
up to work on data.frame-like things: x and table must be
lists of the same length and within each list, each element
must have the same length.  It acts like
  match(do.call("paste",x), do.call("paste",table))
but doesn't actually do the conversion to character implied in
that (it hashes all the entries in each 'row' into one hash table
entry, using the usual type-specific hash number computation
on each entry and combining them to make the row hash number).
E.g.,
   > match(list(c(3,2), c(1,7), c(4,1)),
   +       list(c(1,4,2,3),c(0,6,7,1),c(0,5,1,4)))
   [1] 4 3

(Its match.data.frame() doesn't actually call this, for historical/inertial
reasons.  It goes the paste() route.)
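
For R, the paste() route can be sketched in a few lines. This is only an illustration, not S+'s match.data.frame() and not an existing R function: the name match_rows and the "\r" separator are assumptions, and like any paste-based key it can be fooled if the data themselves contain the separator.

```r
# Hypothetical row-wise matcher: one pasted key per row, then plain match().
match_rows <- function(x, table, nomatch = NA_integer_) {
  stopifnot(is.data.frame(x), is.data.frame(table),
            length(x) == length(table))
  row_key <- function(df) {
    # unname() so that a column called "sep" or "collapse" is not
    # mistaken for an argument of paste().
    do.call(paste, c(unname(as.list(df)), sep = "\r"))
  }
  match(row_key(x), row_key(table), nomatch = nomatch)
}

x     <- data.frame(x = c(1, 3, 5), y = letters[c(1, 3, 5)])
table <- data.frame(x = 1:10, y = letters[1:10])

rows <- match_rows(x, table)                     # row positions in table
hits <- match_rows(table, x, nomatch = 0L) > 0L  # is.element-style answer
```

With the two data frames from the earlier S+ example, rows comes out as 1 3 5 and hits is TRUE at positions 1, 3 and 5, which agrees with the S+ output quoted earlier in the thread.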

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com

Barry wrote:
> [...]
> I suspect it's using 'deparse()' to get the character representation.
> This function is mentioned in ?as.character, but as.character.default
> disappears into the infernal .Internal and I don't have time to chase
> source code - it's sunny outside!

on the side, as.character triggers do_ascharacter, which in turn calls
DispatchOrEval, a function with the following beautiful comment:

"To call this an ugly hack would be to insult all existing ugly hacks at
large in the world."

a fortune?

vQ