How x[, 'colname1'] is implemented? - R-devel

Peng Yu

Thu, Dec 31, 2009 3:27 PM #

I don't see where describes the implementation of '[]'.

For example, if x is a matrix or a data.frame, how the lookup of
'colname1' is x[, 'colname1'] executed. Does R perform a lookup in the
a hash of the colnames? Is the reference O(1) or O(n), where n is the
second dim of x?

Barry Rowlingson

Fri, Jan 1, 2010 4:52 AM #

On Thu, Dec 31, 2009 at 11:27 PM, Peng Yu <pengyu.ut at gmail.com> wrote:

Where have you looked? I doubt this kind of implementation detail is
in the .Rd documentation since a regular user doesn't care for it.

 As Obi-wan Kenobi may have said in Star Wars: "Use the source, Luke!":

 Line 450 of subscript.c of the source code of R 2.10 is the
stringSubscript function. It has this comment:

/* The original code (pre 2.0.0) used a ns x nx loop that was too
 * slow.  So now we hash.  Hashing is expensive on memory (up to 32nx
 * bytes) so it is only worth doing if ns * nx is large.  If nx is
 * large, then it will be too slow unless ns is very small.
 */

The definition of "large" and "small" here appears to be such that:

457: Rboolean usehashing = in && ( ((ns > 1000 && nx) || (nx > 1000 &&
ns)) || (ns * nx > 15*nx + ns) );

  Barry

Peng Yu

Fri, Jan 1, 2010 1:40 PM #

On Fri, Jan 1, 2010 at 6:52 AM, Barry Rowlingson

<b.rowlingson at lancaster.ac.uk> wrote:

I'm not complaining that it is not documented.

Could you explain what ns and nx represent?

Seth Falcon

Fri, Jan 1, 2010 4:56 PM #

On 1/1/10 1:40 PM, Peng Yu wrote:

integers :-)

Consider a 5x5 matrix m and a call like m[ , c("C", "D")], then
in the call to stringSubscript:

  s - The character vector of subscripts, here c("C", "D")

  ns - length of s, here 2

  nx - length of the dimension being subscripted, here 5

  names - the dimnames being subscripted.  Here, perhaps
          c("A", "B", "C", "D", "E")

The 'in' argument is always TRUE AFAICS so this boils down to:

Use hashing for x[i] if either length(x) > 1000 or length(i) > 1000 (and
we aren't in the trivial case where either length(x) == 0 or length(i) == 0)

OR use hashing if (ns * nx > 15*nx + ns)


+ seth

Barry Rowlingson

Sat, Jan 2, 2010 2:41 AM #

On Fri, Jan 1, 2010 at 9:40 PM, Peng Yu <pengyu.ut at gmail.com> wrote:

Yes but you didn't answer my question. When you ask a question on
these (or any mailing lists) you should always say what efforts you've
made to answer the question. I first looked in the help for '[' but
didn't find anything there. If you had already done that and told us
then I wouldn't have wasted my time. If you'd gone on to say you'd
also looked in the source code, and in which files, then I wouldn't
have wasted my time looking in those files. Note that this is all for
your benefit as well because we'll be able to help you quicker!

No, you need to take Obi-wan's advice (or Seth Falcon's, who got to
this before me) and 'Use the source'. I would hope that since you have
an understanding of what a hashing algorithm is that you should also
have some understanding of programming, and even if not in C it's not
too hard to figure out.

 You could make the effort to look in subscript.c and work it out for yourself.

 If you are very interested in the low-level working of R then you
should compile R with debugging turned on, then run R in a debugger
and set a breakpoint in the array subscript function. Then you can
inspect the C variables when you run it.

Barry