I don't see where describes the implementation of '[]'. For example, if x is a matrix or a data.frame, how the lookup of 'colname1' is x[, 'colname1'] executed. Does R perform a lookup in the a hash of the colnames? Is the reference O(1) or O(n), where n is the second dim of x?
How x[, 'colname1'] is implemented?
5 messages · Barry Rowlingson, Peng Yu, Seth Falcon
On Thu, Dec 31, 2009 at 11:27 PM, Peng Yu <pengyu.ut at gmail.com> wrote:
I don't see where describes the implementation of '[]'. For example, if x is a matrix or a data.frame, how the lookup of 'colname1' is x[, 'colname1'] executed. Does R perform a lookup in the a hash of the colnames? Is the reference O(1) or O(n), where n is the second dim of x?
Where have you looked? I doubt this kind of implementation detail is in the .Rd documentation since a regular user doesn't care for it. As Obi-wan Kenobi may have said in Star Wars: "Use the source, Luke!": Line 450 of subscript.c of the source code of R 2.10 is the stringSubscript function. It has this comment: /* The original code (pre 2.0.0) used a ns x nx loop that was too * slow. So now we hash. Hashing is expensive on memory (up to 32nx * bytes) so it is only worth doing if ns * nx is large. If nx is * large, then it will be too slow unless ns is very small. */ The definition of "large" and "small" here appears to be such that: 457: Rboolean usehashing = in && ( ((ns > 1000 && nx) || (nx > 1000 && ns)) || (ns * nx > 15*nx + ns) ); Barry
On Fri, Jan 1, 2010 at 6:52 AM, Barry Rowlingson
<b.rowlingson at lancaster.ac.uk> wrote:
On Thu, Dec 31, 2009 at 11:27 PM, Peng Yu <pengyu.ut at gmail.com> wrote:
I don't see where describes the implementation of '[]'. For example, if x is a matrix or a data.frame, how the lookup of 'colname1' is x[, 'colname1'] executed. Does R perform a lookup in the a hash of the colnames? Is the reference O(1) or O(n), where n is the second dim of x?
?Where have you looked? I doubt this kind of implementation detail is in the .Rd documentation since a regular user doesn't care for it.
I'm not complaining that it is not documented.
?As Obi-wan Kenobi may have said in Star Wars: "Use the source, Luke!": ?Line 450 of subscript.c of the source code of R 2.10 is the stringSubscript function. It has this comment: /* The original code (pre 2.0.0) used a ns x nx loop that was too ?* slow. ?So now we hash. ?Hashing is expensive on memory (up to 32nx ?* bytes) so it is only worth doing if ns * nx is large. ?If nx is ?* large, then it will be too slow unless ns is very small. ?*/
Could you explain what ns and nx represent?
The definition of "large" and "small" here appears to be such that: 457: Rboolean usehashing = in && ( ((ns > 1000 && nx) || (nx > 1000 && ns)) || (ns * nx > 15*nx + ns) );
On 1/1/10 1:40 PM, Peng Yu wrote:
On Fri, Jan 1, 2010 at 6:52 AM, Barry Rowlingson <b.rowlingson at lancaster.ac.uk> wrote:
On Thu, Dec 31, 2009 at 11:27 PM, Peng Yu <pengyu.ut at gmail.com> wrote:
I don't see where describes the implementation of '[]'. For example, if x is a matrix or a data.frame, how the lookup of 'colname1' is x[, 'colname1'] executed. Does R perform a lookup in the a hash of the colnames? Is the reference O(1) or O(n), where n is the second dim of x?
Where have you looked? I doubt this kind of implementation detail is in the .Rd documentation since a regular user doesn't care for it.
I'm not complaining that it is not documented.
As Obi-wan Kenobi may have said in Star Wars: "Use the source, Luke!": Line 450 of subscript.c of the source code of R 2.10 is the stringSubscript function. It has this comment: /* The original code (pre 2.0.0) used a ns x nx loop that was too * slow. So now we hash. Hashing is expensive on memory (up to 32nx * bytes) so it is only worth doing if ns * nx is large. If nx is * large, then it will be too slow unless ns is very small. */
Could you explain what ns and nx represent?
integers :-)
Consider a 5x5 matrix m and a call like m[ , c("C", "D")], then
in the call to stringSubscript:
s - The character vector of subscripts, here c("C", "D")
ns - length of s, here 2
nx - length of the dimension being subscripted, here 5
names - the dimnames being subscripted. Here, perhaps
c("A", "B", "C", "D", "E")
The definition of "large" and "small" here appears to be such that: 457: Rboolean usehashing = in && ( ((ns > 1000 && nx) || (nx > 1000 && ns)) || (ns * nx > 15*nx + ns) );
The 'in' argument is always TRUE AFAICS so this boils down to: Use hashing for x[i] if either length(x) > 1000 or length(i) > 1000 (and we aren't in the trivial case where either length(x) == 0 or length(i) == 0) OR use hashing if (ns * nx > 15*nx + ns) + seth
On Fri, Jan 1, 2010 at 9:40 PM, Peng Yu <pengyu.ut at gmail.com> wrote:
I'm not complaining that it is not documented.
Yes but you didn't answer my question. When you ask a question on these (or any mailing lists) you should always say what efforts you've made to answer the question. I first looked in the help for '[' but didn't find anything there. If you had already done that and told us then I wouldn't have wasted my time. If you'd gone on to say you'd also looked in the source code, and in which files, then I wouldn't have wasted my time looking in those files. Note that this is all for your benefit as well because we'll be able to help you quicker!
?As Obi-wan Kenobi may have said in Star Wars: "Use the source, Luke!":
Could you explain what ns and nx represent?
No, you need to take Obi-wan's advice (or Seth Falcon's, who got to this before me) and 'Use the source'. I would hope that since you have an understanding of what a hashing algorithm is that you should also have some understanding of programming, and even if not in C it's not too hard to figure out. You could make the effort to look in subscript.c and work it out for yourself. If you are very interested in the low-level working of R then you should compile R with debugging turned on, then run R in a debugger and set a breakpoint in the array subscript function. Then you can inspect the C variables when you run it. Barry