data frame subscription operator

5 messages · Vladimir Dergachev, Brian Ripley, Gabor Grothendieck

#
Hi all, 

   I was looking at the data frame subscription operator (attached at the end 
of this e-mail) and got puzzled by the following line:

    class(x) <- attr(x, "row.names") <- NULL

This appears to set the class and row.names attributes of the incoming data 
frame to NULL. So far I was not able to figure out why this is necessary - 
could anyone help ?

The reason I am looking at it is that changing attributes forces duplication 
of the data frame and this is the largest cause of slowness of data.frames in 
general.

                           thank you very much !

                                            Vladimir Dergachev
function (x, i, j, drop = if (missing(i)) TRUE else length(cols) ==
    1)
{
    mdrop <- missing(drop)
    Narg <- nargs() - (!mdrop)
    if (Narg < 3) {
        if (!mdrop)
            warning("drop argument will be ignored")
        if (missing(i))
            return(x)
        if (is.matrix(i))
            return(as.matrix(x)[i])
        y <- NextMethod("[")
        nm <- names(y)
        if (!is.null(nm) && any(is.na(nm)))
            stop("undefined columns selected")
        if (any(duplicated(nm)))
            names(y) <- make.unique(nm)
        return(structure(y, class = oldClass(x), row.names = attr(x,
            "row.names")))
    }
    rows <- attr(x, "row.names")
    cols <- names(x)
    cl <- oldClass(x)
    class(x) <- attr(x, "row.names") <- NULL
    if (missing(i)) {
        if (!missing(j))
            x <- x[j]
        cols <- names(x)
        if (any(is.na(cols)))
            stop("undefined columns selected")
    }
    else {
        if (is.character(i))
            i <- pmatch(i, as.character(rows), duplicates.ok = TRUE)
        rows <- rows[i]
        if (!missing(j)) {
            x <- x[j]
            cols <- names(x)
            if (any(is.na(cols)))
                stop("undefined columns selected")
        }
        for (j in seq_along(x)) {
            xj <- x[[j]]
            x[[j]] <- if (length(dim(xj)) != 2)
                xj[i]
            else xj[i, , drop = FALSE]
        }
    }
    if (drop) {
        drop <- FALSE
        n <- length(x)
        if (n == 1) {
            x <- x[[1]]
            drop <- TRUE
        }
        else if (n > 1) {
            xj <- x[[1]]
            nrow <- if (length(dim(xj)) == 2)
                dim(xj)[1]
            else length(xj)
            if (!mdrop && nrow == 1) {
                drop <- TRUE
                names(x) <- cols
                attr(x, "row.names") <- NULL
            }
        }
    }
    if (!drop) {
        names(x) <- cols
        if (any(is.na(rows) | duplicated(rows))) {
            rows[is.na(rows)] <- "NA"
            rows <- make.unique(rows)
        }
        if (any(duplicated(nm <- names(x))))
            names(x) <- make.unique(nm)
        attr(x, "row.names") <- rows
        class(x) <- cl
    }
    x
}
<environment: namespace:base>
1 day later
#
'[' is the 'subscript' or 'extraction', not 'subscription' operator: this 
is also called 'indexing', as in 'An Introduction to R'.
On Mon, 6 Nov 2006, Vladimir Dergachev wrote:

> This appears to set the class and row.names attributes of the incoming
> data frame to NULL.

Actually no, it removes them: see ?attr and ?class.

> So far I was not able to figure out why this is necessary

You need to remove the class to avoid recursion: a few lines later x[i]
needs to be a call to the primitive and not the data frame method.

> [...] changing attributes forces duplication of the data frame and this
> is the largest cause of slowness of data.frames in general.

Do you have evidence of that?  R has facilities to profile its code, and I 
have never seen [.data.frame taking a significant proportion of the total 
time.  If it does for your application, consider whether a data frame is an 
appropriate way to store your data.  I am not sure we would accept that
data frames do have 'slowness in general', but their generality does make 
them slower than alternatives where the generality is not needed.
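
Both points can be checked in a session. The following is only a minimal sketch on a small made-up data frame (`df` and its columns are assumptions for illustration, not taken from the thread): assigning NULL removes an attribute instead of storing a NULL value, and once the class is gone `[` no longer reaches the data frame method.

```r
# Illustrative data frame; the object and column names are assumptions.
df <- data.frame(a = 1:3, b = c("x", "y", "z"))

x <- df
class(x) <- attr(x, "row.names") <- NULL   # same statement as in `[.data.frame`

# Both attributes have been removed; only "names" is left on the plain list.
attr_names <- names(attributes(x))
print(attr_names)

# With the class removed, x["a"] uses the primitive list subsetting and
# returns a plain list instead of calling `[.data.frame` again.
y <- x["a"]
print(class(y))

# The original object still dispatches to the data frame method.
print(is.data.frame(df["a"]))
```

Without the class removal, the `x[j]` inside the method would call the method itself again.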

[...]
#
On Wednesday 08 November 2006 3:21 am, Prof Brian Ripley wrote:

> You need to remove the class to avoid recursion [...]

I see. Is there a way to get at the primitive directly, i.e. something like
`[.list`(x, i) ?

> Do you have evidence of that?

Evidence:

	# This can be copied and pasted directly into an R session.
	# Small N: both system.time() calls report small but comparable running times.
	N <- 100000
	A <- data.frame(X = 1:N, Y = rnorm(N), Z = as.character(rnorm(N)))
	system.time(B <- A[, 1])
	system.time(B <- A[1, 1])

	# Larger N: both times are larger and still comparable.
	N <- 1000000
	A <- data.frame(X = 1:N, Y = rnorm(N), Z = as.character(rnorm(N)))
	system.time(B <- A[, 1])
	system.time(B <- A[1, 1])
        
The running times also grow with the number of columns. In addition, I have 
modified the 2.4.0 version of R to print out large allocations, and I get the 
impression that the data frame is being duplicated. The same happens for 
`[<-.data.frame` - but that function has much more complex code, and I have 
not looked through it yet.

Of course, getting a small portion (i.e. A[1:5,]) also takes a lot of time - 
but the examples shown above should be O(1).

My data is the result of a database query - it naturally has columns of 
different types and the columns are named (no row.names though) - which is 
why I used data.frames. What would you suggest ?

                    thank you very much !

                             Vladimir Dergachev
#
.subset and .subset2 are equivalent to [ and [[ except that
dispatch does not take place.  See ?.subset
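
A minimal sketch of the difference, on a small made-up data frame (`df` and its columns are assumptions for illustration):

```r
# Illustrative data frame; the object and column names are assumptions.
df <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))

by_method <- df["a"]            # dispatches to `[.data.frame`
by_subset <- .subset(df, "a")   # list-style subsetting, no dispatch
col       <- .subset2(df, "a")  # like df[["a"]], no dispatch

print(class(by_method))
print(class(by_subset))
print(identical(col, df[["a"]]))
```

Because no method is called, `.subset` returns a plain list rather than a data frame, so the class and row names have to be restored by the caller if a data frame is wanted.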
On 11/8/06, Vladimir Dergachev <vdergachev at rcgardis.com> wrote:
#
On Wednesday 08 November 2006 11:41 am, Gabor Grothendieck wrote:
Thank you Gabor !

I made an experiment and got rid of 

 class(x) <- attr(x, "row.names") <- NULL

 while replacing all occurrences of x[ and x[[ with .subset and .subset2.

 Results:

    X<-A[,1]  is now instantaneous, as it should be.

    X<-A[1,1] is faster for data frames with many columns, but still appears 
to make a copy of A[,1] before indexing. Not sure why.

                 thank you

                    Vladimir Dergachev
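
For the single-cell case, the idea of the experiment above can be sketched as follows. This is not the modified `[.data.frame` itself. `cell` is a hypothetical helper, not part of base R, and it assumes a numeric or logical row index and plain vector columns (no matrix columns, no row-name lookup, no `drop` handling):

```r
# Hypothetical helper (not part of base R): return element(s) i of column j
# without modifying any attribute of the data frame x.
# Assumes i is a numeric or logical index and column j is a plain vector.
cell <- function(x, i, j) .subset2(x, j)[i]

N <- 100000
A <- data.frame(X = 1:N, Y = rnorm(N))

# Same value as the data frame method for a single cell.
same_value <- identical(cell(A, 1, 1), A[1, 1])
print(same_value)

# Timings depend on the machine and R version, so none are shown here.
print(system.time(B <- cell(A, 1, 1)))
print(system.time(B <- A[1, 1]))
```

The helper never assigns to the class or row.names attributes of `x`, which is the step that the experiment removed.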