"[.data.frame" and lapply
Romain Francois wrote:
Wacek Kusnierczyk wrote:
redirected to r-devel, because there are implementational details of [.data.frame discussed here. spoiler: at the bottom there is a fairly interesting performance result.
Romain Francois wrote:
Hi,
This is a bug I think. [.data.frame treats its arguments differently
depending on the number of arguments.
you might want to hesitate a bit before you say that something in r is a bug, if only because it drives certain people mad. r is carefully tested software, and [.data.frame is such a basic function that if what you talk about were a bug, it wouldn't have persisted until now.
I did hesitate, and would be prepared to look the other way if someone shows me proper evidence that this makes sense.
d <- data.frame( x = 1:10, y = 1:10, z = 1:10 )
d[ j=1 ]
    x  y  z
1   1  1  1
2   2  2  2
3   3  3  3
4   4  4  4
5   5  5  5
6   6  6  6
7   7  7  7
8   8  8  8
9   9  9  9
10 10 10 10
"If a single index is supplied, it is interpreted as indexing the list of columns". Clearly this does not happen here, and this is because NextMethod gets confused.
obviously. it seems that there is a bug here, and that it results from the lack of clear design specification.
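The behaviour described above can be reproduced in a few lines. This is only a minimal sketch: it assumes an R version in which [.data.frame hands back x unchanged when i is missing and fewer than three arguments are supplied, which is what the output above suggests. Some versions may also warn about named arguments, hence the suppressWarnings() call; the names by_position and by_name are made up for the illustration.

```r
d <- data.frame(x = 1:10, y = 1:10, z = 1:10)

# single positional index: selects the first column, as the help page says
by_position <- d[1]

# the same index passed by name as j: the whole data frame comes back
by_name <- suppressWarnings(d[j = 1])

dim(by_position)
dim(by_name)
identical(by_name, d)
```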
I have not looked at your implementation in detail, but it misses array indexing, as in:
yes; i didn't take it into consideration, but (still without detailed analysis) i guess it should not be difficult to extend the code to handle this.
d <- data.frame( x = 1:10, y = 1:10, z = 1:10 )
m <- cbind( 5:7, 1:3 )
m
     [,1] [,2]
[1,]    5    1
[2,]    6    2
[3,]    7    3
d[m]
[1] 5 6 7
subdf( d, m )
Error in subdf(d, m) : undefined columns selected
this should be easy to handle by checking if i is a matrix and then indexing by its first column as i and the second as j.
"Matrix indexing using '[' is not recommended, and barely
supported. For extraction, 'x' is first coerced to a matrix. For
replacement a logical matrix (only) can be used to select the
elements to be replaced in the same way as for a matrix."
yes, here's how it's done (original comment):
if(is.matrix(i))
return(as.matrix(x)[i]) # desperate measures
and i can easily add this to my code, at virtually no additional expense.
it's probably not a good idea to convert x to a matrix: x would often be
much more data than the index matrix m, so it's presumably much more
efficient, on average, to fiddle with i instead.
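One way to "fiddle with i instead" might look like the sketch below. index_by_matrix is a hypothetical helper invented for this illustration (it is neither part of base R nor the subdf code discussed in this thread); it assumes a two-column numeric index matrix and picks each element straight from its column, so the data frame is never coerced to a matrix.

```r
# hypothetical helper, not part of base R or of subdf
index_by_matrix <- function(x, m) {
  stopifnot(is.data.frame(x), is.matrix(m), ncol(m) == 2L)
  # first column of m gives the rows (i), second column gives the columns (j);
  # each element is taken directly from its own column of x
  picked <- Map(function(r, k) x[[k]][r], m[, 1], m[, 2])
  unlist(picked)
}

d <- data.frame(x = 1:10, y = 1:10, z = 1:10)
m <- cbind(5:7, 1:3)
index_by_matrix(d, m)
```

If the selected elements come from columns of different types, unlist() still coerces them to a common type, but only the selected values are touched rather than the whole of x.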
there are some potentially confusing issues here:
m = cbind(8:10, 1:3)
d[m]
# 3-element vector, as you could expect
d[t(m)]
# 6-element vector
t(m) has dimensionality inappropriate for matrix indexing (it has 3
columns), so it gets flattened into a vector; however, it does not work
like in the case of a single vector index where columns would be selected:
d[as.vector(t(m))]
# error: undefined columns selected
i think it would be more appropriate to raise an error in a case like
d[t(m)].
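The three cases can be compared side by side. This is a sketch that assumes the behaviour reported in this thread (other R versions may differ); the variable names are chosen only for the illustration.

```r
d <- data.frame(x = 1:10, y = 1:10, z = 1:10)
m <- cbind(8:10, 1:3)

# two-column matrix: one element per row of m
n_matrix <- length(d[m])

# three-column matrix: flattened and used as a vector of element positions
n_flat <- length(d[t(m)])

# plain vector with the same values: interpreted as column numbers, so it fails
res <- tryCatch(d[as.vector(t(m))], error = function(e) e)

n_matrix
n_flat
inherits(res, "error")
```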
furthermore, if a matrix is used in a two-index form, the matrix is
flattened again and is used to select rows (not elements, as in
d[t(m)]). note also that the help page says that "for extraction, 'x'
is first coerced to a matrix". it fails to explain that if *two*
indices are used of which at least one is a matrix, no coercion is
done. that is, the matrix is again flattened into a vector, but here
[.data.frame forgets that it was a matrix (unlike in d[t(m)]):
is(d[m])
# a character vector, matrix indexing
is(d[t(m)])
# a character vector, vector indexing of elements, not columns
is(d[m,])
# a data frame, row indexing
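The difference between the one-index and the two-index form can be checked as in the sketch below, which assumes the behaviour described above: with a trailing comma the matrix is flattened and its six values are used as row numbers.

```r
d <- data.frame(x = 1:10, y = 1:10, z = 1:10)
m <- cbind(8:10, 1:3)

one_index <- d[m]     # matrix indexing: one element per row of m
two_index <- d[m, ]   # m flattened to 8, 9, 10, 1, 2, 3 and used as row numbers

length(one_index)
is.data.frame(two_index)
nrow(two_index)
```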
and finally, the fact that d[m] in fact converts x (i.e., d) to a matrix
before the indexing means that the types of values in some columns in
d may get coerced to another type:
d[,2] = as.character(d[,2])
is(d[,1])
# integer vector
is(d[,2])
# character vector
is(d[1:2, 1])
# integer vector
is(d[cbind(1:2, 1)])
# character vector
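The same comparison as a self-contained sketch, using class() rather than is() to keep the output short. It relies on as.matrix() turning a data frame with a character column into a character matrix, which is why the matrix-indexed result changes type.

```r
d <- data.frame(x = 1:10, y = 1:10, z = 1:10)
d[, 2] <- as.character(d[, 2])

# ordinary row/column indexing keeps the type of column 1
by_rows <- d[1:2, 1]

# matrix indexing goes through as.matrix(d), so the values become character
by_matrix <- d[cbind(1:2, 1)]

class(by_rows)
class(by_matrix)
```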
for what it's worth, i think matrix indexing of data frames should be
dropped:
d[m]
# error: ...
and if one needs it, it's as simple as
as.matrix(d)[m]
where the conversion of d to a matrix is explicit.
on the side, [.data.frame is able to index matrices:
'[.data.frame'(as.matrix(d), m)
# same as as.matrix(d)[m]
which is, so to speak, nonsense, since '[.data.frame' is designed
specifically to handle data frames; i'd expect an error to be raised
here (or a warning, at the very least).
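A short sketch of both points, assuming the method still contains the is.matrix(i) shortcut quoted earlier: the explicit conversion gives the same result as d[m], and calling the method directly on a matrix is accepted without complaint.

```r
d <- data.frame(x = 1:10, y = 1:10, z = 1:10)
m <- cbind(8:10, 1:3)

# explicit conversion, as suggested above
explicit <- as.matrix(d)[m]

# calling the data frame method directly on something that is not a data frame
direct <- `[.data.frame`(as.matrix(d), m)

identical(explicit, d[m])
identical(direct, explicit)
```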
to summarize, the fact that subdf does not handle matrix indices is not
an issue. anyway, thanks for the comment!
best,
vQ