Message-ID: <971536df0611080841h3350d9at3674bdc6139de4b6@mail.gmail.com>
Date: 2006-11-08T16:41:12Z
From: Gabor Grothendieck
Subject: data frame subscription operator
In-Reply-To: <200611081113.54152.vdergachev@rcgardis.com>

.subset and .subset2 are equivalent to [ and [[ except that
method dispatch does not take place.  See ?.subset
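
For instance, here is a minimal sketch of how they might be used. The
small data frame mirrors the example quoted below, and the class name
"myrec" and its `[` method are made up purely for illustration, so this
is not how `[.data.frame` itself is written:

```r
# Toy data frame with the same columns as the quoted example (smaller N).
N <- 1000
A <- data.frame(X = 1:N, Y = rnorm(N), Z = as.character(rnorm(N)))

# .subset2 behaves like [[ but skips dispatch to `[[.data.frame`.
col <- .subset2(A, 1)
stopifnot(identical(col, A[[1]]))

# .subset behaves like [ on the underlying list: the result is a plain
# list rather than a data frame, since no method restores the class.
lst <- .subset(A, 1)
stopifnot(is.list(lst), !is.data.frame(lst))

# Hypothetical S3 class "myrec" (an assumption for this sketch).
# Calling x[i] inside this method would dispatch back to the method
# itself; .subset goes straight to the primitive, so there is no
# recursion and no need to strip the class from x first.
`[.myrec` <- function(x, i) {
  structure(.subset(x, i), class = "myrec")
}

r <- structure(list(a = 1, b = "two", c = 3), class = "myrec")
s <- r[c("a", "c")]
```

Whether this avoids the duplication described below would still need to
be checked by profiling; the sketch only shows that dispatch is skipped.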


On 11/8/06, Vladimir Dergachev <vdergachev at rcgardis.com> wrote:
> On Wednesday 08 November 2006 3:21 am, Prof Brian Ripley wrote:
> >
> > > So far I was not able to figure out why this is necessary -
> > > could anyone help ?
> >
> > You need to remove the class to avoid recursion: a few lines later x[i]
> > needs to be a call to the primitive and not the data frame method.
>
> I see. Is there a way to get at the primitive directly, i.e. something like
> `[.list`(x, i) ?
>
> >
> > > The reason I am looking at it is that changing attributes forces
> > > duplication of the data frame and this is the largest cause of slowness
> > > of data.frames in general.
> >
> > Do you have evidence of that?  R has facilities to profile its code, and I
> > have never seen  [.data.frame taking a significant proportion of the total
> > time.  If it does for your application, consider if a data frame is an
> > appropriate way to store your data.  I am not sure we would accept that
> > data frames do have 'slowness in general', but their generality does make
> > them slower than alternatives where the generality is not needed.
>
> Evidence:
>
>        # this can be copy'n'pasted directly into an R session
>        # small N - both system calls return small, but comparable running times
>        N<-100000
>        A<-data.frame(X=1:N, Y=rnorm(N), Z=as.character(rnorm(N)))
>        system.time(B<-A[,1])
>        system.time(B<-A[1,1])
>
>
>        #larger N - both times are larger and still comparable
>        N<-1000000
>        A<-data.frame(X=1:N, Y=rnorm(N), Z=as.character(rnorm(N)))
>        system.time(B<-A[,1])
>        system.time(B<-A[1,1])
>
> The running times would also grow with the number of columns. Also, I have
> modified the 2.4.0 version of R to print out large allocations, and I get the
> impression that the data frame is being duplicated. The same happens for
> `[<-.data.frame` - but this function has much more complex code, and I have
> not looked through it yet.
>
> Of course, getting a small portion (i.e. A[1:5,]) also takes a lot of time -
> but the examples shown above should be O(1).
>
> My data is the result of a database query - it naturally has columns of
> different types and the columns are named (no row.names though) - which is
> why I used data.frames. What would you suggest ?
>
>                    thank you very much !
>
>                             Vladimir Dergachev
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>