[datatable-help] speeding up perception

Matt Dowle

Tue, Jul 5, 2011 11:08 AM

Simon (and all),

I've tried to make assignment as fast as calling `[<-.data.table`
directly, for user convenience. Profiling shows (IIUC) that it isn't
dispatch, but x being copied. Is there a way to prevent '[<-' from
copying x?  Small reproducible example in vanilla R 2.13.0 :

[1] "<0xa1ec758>"

tracemem[0xa1ec758 -> 0xa1ec558]:    # but, x is still copied, why?

I've tried returning NULL from [<-.newclass but then x gets assigned
NULL :

tracemem[0xa1ec558 -> 0x9c5f318]:

NULL

Any pointers much appreciated. If that copy is preventable it should
save the user needing to use `[<-.data.table`(...) syntax to get the
best speed (20 times faster on the small example used so far).
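
The commands that produced the tracemem output above are not preserved in this archive. As a rough sketch only, a test of this kind might look like the following. The class names `newclass` and `newclass2`, the list contents, and the do-nothing methods are all hypothetical and are not the original example. Whether a `tracemem[...]` line is printed depends on the R version and build, so no output is shown.

```r
# Hypothetical object and class; not the original example from the post.
x <- structure(list(a = 1:3, b = 4:6), class = "newclass")

# A do-nothing replacement method: it returns x untouched.
`[<-.newclass` <- function(x, i, j, value) x

# tracemem() needs an R build with memory profiling enabled.
if (capabilities("profmem")) tracemem(x)   # prints the address of x

# R evaluates this as x <- `[<-`(x, 1, 2, value = 9L).
# Any tracemem[... -> ...] line printed here signals that x was duplicated.
x[1, 2] <- 9L

# Returning NULL from the method does not help: x is assigned
# whatever the replacement function returns.
y <- structure(list(a = 1:3), class = "newclass2")
`[<-.newclass2` <- function(x, i, j, value) NULL
y[1, 1] <- 9L
is.null(y)
```

Because the assignment form always rebinds the variable to the method's return value, a method cannot opt out of that step by returning nothing, which matches the NULL result reported above.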

Matthew

On Tue, 2011-07-05 at 08:32 +0100, Matthew Dowle wrote:

Simon,

Thanks for the great suggestion. I've written a skeleton assignment
function for data.table which incurs no copies, which works for this
case. For completeness, if I understand correctly, this is for : 
  i) convenience of new users who don't know how to vectorize yet
  ii) more complex examples which can't be vectorized.

Before:

system.time(for (r in 1:R) DT[r,20] <- 1.0)

   user  system elapsed 
 12.792   0.488  13.340 

After :

system.time(for (r in 1:R) DT[r,20] <- 1.0)

   user  system elapsed 
  2.908   0.020   2.935

Where this can be reduced further as follows :

system.time(for (r in 1:R) `[<-.data.table`(DT,r,2,1.0))

   user  system elapsed 
  0.132   0.000   0.131

Still working on it. When it doesn't break other data.table tests, I'll
commit to R-Forge ...

Matthew


On Mon, 2011-07-04 at 12:41 -0400, Simon Urbanek wrote:

Timothée,

On Jul 4, 2011, at 2:47 AM, Timothée Carayol wrote:

Hi --

It's my first post on this list; as a relatively new user with little
knowledge of R internals, I am a bit intimidated by the depth of some
of the discussions here, so please spare me if I say something
incredibly silly.

I feel that someone at this point should mention Matthew Dowle's
excellent data.table package
(http://cran.r-project.org/web/packages/data.table/index.html) which
seems to me to address many of the inefficiencies of data.frame.
data.tables have no row names; and operations that only need data from
one or two columns are (I believe) just as quick whether the total
number of columns is 5 or 1000. This results in very quick operations
(and, often, elegant code as well).

I agree that data.table is a very good alternative (for other reasons) that should be promoted more. The only slight snag is that it doesn't help with the issue at hand since it simply does a pass-through for subassignments to data frame's methods and thus suffers from the same problems (in fact there is a rather stark asymmetry in how it handles subsetting vs subassignment - which is a bit surprising [if I read the code correctly you can't use the same indexing in both]). In fact I would propose that it should not do that but handle the simple cases itself more efficiently without unneeded copies. That would make it indeed a very interesting alternative.

Cheers,
Simon

On Mon, Jul 4, 2011 at 6:19 AM, ivo welch <ivo.welch at gmail.com> wrote:

thank you, simon.  this was very interesting indeed.  I also now
understand how far out of my depth I am here.

fortunately, as an end user, obviously, *I* now know how to avoid the
problem.  I particularly like the as.list() transformation and back to
as.data.frame() to speed things up without loss of (much)
functionality.
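
The as.list() round trip mentioned above can be sketched roughly as follows, using base R only. The data frame, its column names, and the size `n` are made up for illustration, and the timings will vary by machine and R version, so none are shown.

```r
# Illustrative data; sizes and column names are not from the thread.
n  <- 1000L
df <- data.frame(x = numeric(n), y = numeric(n))

# Element-wise writes to a data frame go through `[<-.data.frame` each time.
t_df <- system.time(for (r in seq_len(n)) df[r, "y"] <- r)

# Work on a plain list of column vectors instead, then convert back once.
cols   <- as.list(df)
t_list <- system.time(for (r in seq_len(n)) cols$y[r] <- r)
df2    <- as.data.frame(cols)

# Both approaches should yield the same column values.
all.equal(df$y, df2$y)

# Compare elapsed times for the two loops.
print(rbind(data.frame = t_df, list = t_list))
```

The list version skips the data frame method on every iteration, which is where the per-element cost discussed in this thread comes from.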


more broadly, I view the avoidance of individual access through the
use of apply and vector operations as a mixed "IQ test" and "knowledge
test" (which I often fail).  However, even for the most clever, there
are also situations where the KISS programming principle makes
explicit loops still preferable.  Personally, I would have preferred
it if R had, in its standard "statistical data set" data structure,
foregone the row names feature in exchange for retaining fast direct
access.  R could have reserved its current implementation "with row
names but slow access" for a less common (possibly pseudo-inheriting)
data structure.


If end users commonly do iterations over a data frame, which I would
guess to be the case, then the impression of R by (novice) end users
could be greatly enhanced if the extreme penalties could be eliminated
or at least flagged.  For example, I wonder if modest special internal
code could store data frames internally and transparently as lists of
vectors UNTIL a row name is assigned to.  Easier and uglier, a simple
but specific warning message could be issued with a suggestion if
there is an individual read/write into a data frame ("Warning: data
frames are much slower than lists of vectors for individual element
access").


I would also suggest changing the "Introduction to R" 6.3  from "A
data frame may for many purposes be regarded as a matrix with columns
possibly of differing modes and attributes. It may be displayed in
matrix form, and its rows and columns extracted using matrix indexing
conventions." to "A data frame may for many purposes be regarded as a
matrix with columns possibly of differing modes and attributes. It may
be displayed in matrix form, and its rows and columns extracted using
matrix indexing conventions.  However, data frames can be much slower
than matrices or even lists of vectors (which, like data frames, can
contain different types of columns) when individual elements need to
be accessed."  Reading about it immediately upon introduction could
flag the problem in a more visible manner.


regards,

/iaw

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
