"[.data.frame" and lapply

Wacek Kusnierczyk · 2009-03-28T18:47:20Z



Romain Francois wrote:

obviously.  it seems that there is a bug here, and that it results from
the lack of clear design specification.

yes;  i didn't take it into consideration, but (still without detailed
analysis) i guess it should not be difficult to extend the code to
handle this.

this should be easy to handle by checking if i is a matrix and then
indexing by its first column as i and the second as j.

yes, here's how it's done in [.data.frame (the comment is from the
original source):

    if(is.matrix(i))
        return(as.matrix(x)[i])  # desperate measures

and i can easily add this to my code, at virtually no additional expense.

it's probably not a good idea to convert x to a matrix:  x would often
hold much more data than the index matrix m, so it's presumably much more
efficient, on average, to fiddle with i instead.
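
a minimal sketch of what "fiddling with i" could look like, assuming a
two-column (row, column) numeric index matrix.  index_by_matrix and the toy
data frame d are made up for illustration;  this is neither the code in
[.data.frame nor the subdf code.  each element is picked from its own
column, so x is never converted as a whole;  values are coerced to a common
type only at the very end, and numbers may come out formatted differently
than with as.matrix(x)[i].

```r
# hypothetical helper, not part of base r: element-wise extraction from a
# data frame by a two-column (row, column) index matrix, without as.matrix(x)
index_by_matrix <- function(x, m) {
    stopifnot(is.data.frame(x), is.matrix(m), ncol(m) == 2L)
    rows <- m[, 1]
    cols <- m[, 2]
    # take the k-th requested element from its own column
    values <- lapply(seq_len(nrow(m)), function(k) x[[cols[k]]][rows[k]])
    # simplify to a vector; mixed types are coerced here, as with c()
    unlist(values)
}

# toy data, assumed for illustration only
d <- data.frame(a = 1:10, b = letters[1:10], c = 11:20,
                stringsAsFactors = FALSE)
m <- cbind(8:10, 1:3)

index_by_matrix(d, m)
```

only nrow(m) elements are touched here, whatever the size of d.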

there are some potentially confusing issues here:

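    # d is a data frame not defined in this message;  for the index below
    # it needs at least 10 rows and 3 columns
    m = cbind(8:10, 1:3)

    d[m]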
    # 3-element vector, as you could expect

    d[t(m)]
    # 6-element vector

t(m) has dimensionality inappropriate for matrix indexing (it has 3
columns), so it gets flattened into a vector of element positions;
however, this does not behave like a single vector index, which would
select columns:

    d[as.vector(t(m))]
    # error: undefined columns selected

i think it would be more appropriate to raise an error in a case like
d[t(m)].
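
a sketch of the kind of check i have in mind, assuming only numeric index
matrices need to be policed (logical matrices are left alone).
check_index_matrix is a made-up name, and nothing like it exists in
[.data.frame as far as this message is concerned:

```r
# hypothetical check, not what [.data.frame does: accept a numeric index
# matrix only if it has exactly two columns (row, column)
check_index_matrix <- function(i) {
    if (is.matrix(i) && is.numeric(i) && ncol(i) != 2L)
        stop("index matrix must have exactly 2 columns, got ", ncol(i))
    invisible(i)
}

m <- cbind(8:10, 1:3)
check_index_matrix(m)                                 # passes silently
res <- try(check_index_matrix(t(m)), silent = TRUE)   # 3 columns: refused
inherits(res, "try-error")
```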

furthermore, if a matrix is used in a two-index form, the matrix is
flattened again and is used to select rows (not elements, as in
d[t(m)]).  note also that the help page says that "for extraction, 'x'
is first coerced to a matrix".  it fails to explain that if *two*
indices are used of which at least one is a matrix, no coercion is
done.  that is, the matrix is again flattened into a vector, but here
[.data.frame forgets that it was a matrix (unlike in d[t(m)]):

    is(d[m])
    # a character vector, matrix indexing

    is(d[t(m)])
    # a character vector, vector indexing of elements, not columns

    is(d[m,])
    # a data frame, row indexing
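
the three cases can be reproduced with a toy data frame.  the d below is
an assumption made for illustration (10 rows, 3 columns, one of them
character), not the d used above:

```r
# toy data, assumed for illustration only
d <- data.frame(a = 1:10, b = letters[1:10], c = 11:20,
                stringsAsFactors = FALSE)
m <- cbind(8:10, 1:3)

length(d[m])      # 3: one element per row of m (matrix indexing)
length(d[t(m)])   # 6: t(m) is used as a plain vector of element positions
dim(d[m, ])       # 6 3: the flattened m selects rows, all columns are kept
```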
   
and finally, the fact that d[m] converts x (i.e., d) to a matrix before
the indexing means that the values in some columns of d may get coerced
to another type:

    d[,2] = as.character(d[,2])
    is(d[,1])
    # integer vector
    is(d[,2])
    # character vector

    is(d[1:2, 1])
    # integer vector
    is(d[cbind(1:2, 1)])
    # character vector
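
a self-contained version of the same observation, with a toy data frame
assumed for illustration (class() is used instead of is() to keep the
output short):

```r
# toy data, assumed for illustration: one integer and one character column
d <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

class(d[1:2, 1])                     # two-index form: column keeps its type
class(d[cbind(1:2, 1)])              # matrix form: goes through as.matrix(d)
class(as.matrix(d)[cbind(1:2, 1)])   # the same coercion, made explicit
```

the first call gives "integer", the other two give "character", because
the character column y forces the whole matrix to character.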


for what it's worth, i think matrix indexing of data frames should be
dropped:

    d[m]
    # error: ...

and if one needs it, it's as simple as

    as.matrix(d)[m]

where the conversion of d to a matrix is explicit.

on the side, [.data.frame is able to index matrices:

    '[.data.frame'(as.matrix(d), m)
    # same as as.matrix(d)[m]

which is, so to speak, nonsense, since '[.data.frame' is designed
specifically to handle data frames;  i'd expect an error to be raised
here (or a warning, at the very least).
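
a sketch of the guard i would expect, written as a wrapper so that nothing
in base r is touched.  strict_subset is a made-up name, it handles the
single-index form only, and the toy d is again an assumption:

```r
# hypothetical wrapper, not base r behaviour: refuse to run the data frame
# method on anything that is not a data frame
strict_subset <- function(x, i) {
    if (!is.data.frame(x))
        stop("expected a data frame, got an object of class ", class(x)[1])
    `[.data.frame`(x, i)
}

# toy data, assumed for illustration only
d <- data.frame(a = 1:10, b = letters[1:10], c = 11:20,
                stringsAsFactors = FALSE)
m <- cbind(8:10, 1:3)

identical(strict_subset(d, m), d[m])                        # data frame: as before
res <- try(strict_subset(as.matrix(d), m), silent = TRUE)   # matrix: refused
inherits(res, "try-error")
```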

to summarize, the fact that subdf does not handle matrix indices is not
an issue.  anyway, thanks for the comment!

best,
vQ