Infelicity in print output with matrix indexing of `[.data.frame`

7 messages · David Winsemius, Bert Gunter, Jeff Newmiller +1 more

#
This puzzle started with an SO posting where the questioner showed output from a dataframe that had been indexed with a matrix. The output appeared to show that numeric values had been coerced to character. Once I had a reproducible example, I discovered that the print output was the problem and that the actual values had not been coerced. I've created a much smaller test case, and it appears from the testing below that matrix-indexed output from a dataframe with mixed numeric and character types will be printed as character even if none of the values indexed are character:
  A B C  D
1 1 4 7 10
2 2 5 8 11
3 3 6 9 12
     row col
[1,]   3   3
[2,]   2   4
[1] 20 20

That was as expected. Now repeat the process with a dataframe of mixed types.
  A B  C  D
1 a 4  7 10
2 a 5  8 NA
3 a 6 NA 12
[1] "20" "20"

Quoted print output was not what I was expecting.
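
The R commands that produced the output above did not survive in the archive. A minimal sketch that reproduces it is given here. The object names `num_df`, `mixed_df` and `idx` are hypothetical, and the assignment of 20 at the two indexed positions is an assumption inferred from the printed results:

```r
# Hypothetical reconstruction: object names and exact commands are assumptions.
# 2-column index matrix selecting (row 3, col 3) and (row 2, col 4)
idx <- cbind(row = c(3, 2), col = c(3, 4))

# All-numeric data frame
num_df <- data.frame(A = 1:3, B = 4:6, C = 7:9, D = 10:12)
num_df[idx] <- 20            # matrix-indexed assignment
num_res <- num_df[idx]       # matrix-indexed extraction
print(num_res)               # numeric

# Mixed-type data frame: column A is character
mixed_df <- data.frame(A = "a", B = 4:6, C = c(7, 8, NA), D = c(10, NA, 12),
                       stringsAsFactors = FALSE)
mixed_df[idx] <- 20
mixed_res <- mixed_df[idx]
print(mixed_res)             # character, although only numeric cells were indexed

# The columns stored in the data frame are still numeric
print(sapply(mixed_df, class))
```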
#
It's documented, David:
"Matrix indexing (x[i] with a logical or a 2-column integer matrix i)
using [ is not recommended. For extraction, x is first coerced to a
matrix..."

ergo characters for a mixed mode frame.
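
The coercion described in that help text can be seen directly by calling `as.matrix()` on a mixed-type data frame. A small sketch, using a hypothetical data frame `mixed_df`:

```r
# mixed_df is a hypothetical example; column A is character, the rest numeric
mixed_df <- data.frame(A = "a", B = 4:6, C = c(7, 8, 20), D = c(10, 20, 12),
                       stringsAsFactors = FALSE)

# A matrix has a single storage mode, so the whole frame becomes character
m <- as.matrix(mixed_df)
print(typeof(m))
print(m)

# Restricting to the numeric columns first keeps the matrix numeric
num_m <- as.matrix(mixed_df[c("B", "C", "D")])
print(typeof(num_m))
```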

Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sat, Dec 17, 2016 at 10:49 AM, David Winsemius
<dwinsemius at comcast.net> wrote:
#
Can we agree that it is most ironic that `[<-.data.frame` does not impose coercion on its `x` argument with a 2 column matrix as `i`, but that `[.data.frame` does? I had initially assumed that the coercion had occurred at the time of assignment which would have made more sense (to me, anyway).
#
No, cannot agree. The result of using an n by 2 matrix to index into a rectangular object is a vector. A vector can only have one storage mode for all elements. Some type coercion is necessary to accommodate this.
#
I have no argument with the premise that an atomic vector must be of a single mode. But the exact same values were placed by a numeric vector into the positions indexed by the 2-column matrix. Why does extraction need to coerce the entire dataframe to a matrix when none of the extracted values are character? I suppose my request concerns this very simple line in `[.data.frame`:


    if (is.matrix(i))
        return(as.matrix(x)[i])

If it were replaced by code that extracted only the block of values needed and then used a shifted version of the selection matrix, you would get values that had not been coerced merely for being innocent bystanders of an irrelevant dataframe column:

    as.matrix(x[min(i[, 1]):max(i[, 1]), min(i[, 2]):max(i[, 2])])[
        cbind(i[, 1] - min(i[, 1]) + 1, i[, 2] - min(i[, 2]) + 1)]
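
To make the proposal easier to try, the expression can be wrapped in a function. This is only a sketch: `extract_block` is a hypothetical helper, not part of base R, and it assumes a numeric 2-column index matrix.

```r
# Hypothetical helper wrapping the proposed expression; not part of base R.
extract_block <- function(x, i) {
  stopifnot(is.data.frame(x), is.matrix(i), is.numeric(i), ncol(i) == 2)
  rows <- min(i[, 1]):max(i[, 1])          # smallest row range covering the index
  cols <- min(i[, 2]):max(i[, 2])          # smallest column range covering the index
  # Shift the index so it addresses cells inside the extracted block
  shifted <- cbind(i[, 1] - min(i[, 1]) + 1, i[, 2] - min(i[, 2]) + 1)
  as.matrix(x[rows, cols, drop = FALSE])[shifted]
}

mixed_df <- data.frame(A = "a", B = 4:6, C = c(7, 8, 20), D = c(10, 20, 12),
                       stringsAsFactors = FALSE)
idx <- cbind(c(3, 2), c(3, 4))

block_res <- extract_block(mixed_df, idx)
print(block_res)        # numeric: only columns C and D were converted
print(mixed_df[idx])    # character: the whole frame was converted
```

Note that the block is a rectangle, so a character column lying between two indexed numeric columns would still force coercion to character.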
#
Ah, "why"... perhaps because the speed reduction involved in successive indexing operations on data frames was considered unacceptable to the programmer? (Also the code would essentially have to check for type conversion of the result vector as every row of the index matrix was retrieved.) Perhaps for backward compatibility?

You could code your own version that behaved the way you like, but I think the usual expectation is that indexing should be faster than an R for loop, so hiding such behavior behind [.data.frame seems a bit deceptive to me. 

It seems much more straightforward to me to explicitly convert that portion of the data frame that you intend to do matrix indexing with into a matrix of known type for the purposes of this task, rather than expecting [.data.frame to figure out that you don't plan to retrieve values from the non-numeric columns of the data frame. (Sometimes the fact that things are hard is a hint that you should re-think your solution.)
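
A sketch of that explicit approach follows. The data frame `mixed_df`, the index `idx` and the intermediate names are hypothetical; the column positions in the index are remapped because the matrix holds only the numeric columns.

```r
mixed_df <- data.frame(A = "a", B = 4:6, C = c(7, 8, 20), D = c(10, 20, 12),
                       stringsAsFactors = FALSE)
idx <- cbind(row = c(3, 2), col = c(3, 4))   # positions in the full data frame

# Keep only the numeric columns and convert them to a matrix of known type
num_cols <- which(vapply(mixed_df, is.numeric, logical(1)))
m <- as.matrix(mixed_df[num_cols])

# Translate data-frame column positions into positions within m
shifted <- cbind(idx[, 1], match(idx[, 2], num_cols))
stopifnot(!anyNA(shifted))   # fails if the index points at a non-numeric column

res <- m[shifted]
print(res)                   # numeric
```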
#
More likely, to avoid having the type of the result depend on the value of the index. Also, sub-index consistency: does one really want D[M][1:2] to be of a different type than D[M[1:2,]]?
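
The consistency point can be illustrated with a small sketch. Here `extract_touched` is a hypothetical, value-dependent extractor written only for this illustration (it converts just the columns named in the index); it is not anything in base R.

```r
D <- data.frame(A = "a", B = 4:6, C = c(7, 8, 20), D = c(10, 20, 12),
                stringsAsFactors = FALSE)
M <- cbind(c(3, 2, 1), c(3, 4, 1))   # the third row points into character column A

# Current behaviour: both routes give the same type
print(identical(D[M][1:2], D[M[1:2, ]]))

# Hypothetical extractor that converts only the columns the index touches
extract_touched <- function(x, i) {
  cols <- sort(unique(i[, 2]))
  as.matrix(x[cols])[cbind(i[, 1], match(i[, 2], cols))]
}

a <- extract_touched(D, M)[1:2]      # index touches column A, so character
b <- extract_touched(D, M[1:2, ])    # index touches only C and D, so numeric
print(typeof(a))
print(typeof(b))
```

With such an extractor the same two cells come back with different types depending on which other rows happen to be in the index.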

-pd