Regression stars

On Feb 12, 2013, at 11:05 AM, Brian Lee Yung Rowe wrote:

            
Not really:
   user  system elapsed 
 13.780   0.444  14.229 
          used (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells  182113  9.8     407500   21.8    337655   18.1
Vcells 5789638 44.2  133982285 1022.3 163019778 1243.8
   user  system elapsed 
 13.201   0.668  13.873 


But your test is bogus, because %in% uses match(), which converts factors to character vectors anyway. So in your case you're just measuring noise in your system; character vectors are always faster in your example.
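
A minimal sketch of that point, assuming made-up data (x_chr, x_fac and keys are hypothetical names, not the vectors from the original test). It only shows that a factor passed to %in% is matched by its labels, i.e. as a character vector:

```r
# Hypothetical data, not the data from the original benchmark.
set.seed(1)
keys  <- c("a", "b", "c")
x_chr <- sample(letters, 1e5, replace = TRUE)
x_fac <- factor(x_chr)

# ?match documents %in% as essentially
#   function(x, table) match(x, table, nomatch = 0) > 0
# and notes that factors are converted to character vectors first.
res_chr <- x_chr %in% keys
res_fac <- x_fac %in% keys

# Same answer either way: the factor was matched by label, not by code.
same_result <- identical(res_chr, res_fac)
print(same_result)

# To compare timings on your own machine, wrap each call in system.time();
# the numbers will vary from run to run.
```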

The reason is that in R strings are hashed, so character vectors are technically very similar to factors, just with faster access (because they don't need to go through the integer indirection). On 32-bit, strings are in theory always faster than factors; on 64-bit they use double the size (8-byte pointers versus 4-byte integer codes), so they may or may not be faster depending on how you hit the cache, etc. Anyway, in modern R versions you're much better off using character vectors than factors for any processing, so stringsAsFactors=FALSE is what I use exclusively.
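
For illustration only, a small sketch of the "integer indirection" and of keeping strings as character vectors. The object and column names (f, df, id, grade) are made up for this example:

```r
# A factor is an integer vector of codes plus a "levels" attribute,
# so getting at a label takes one extra lookup.
f <- factor(c("low", "high", "low"))
print(typeof(f))                 # storage type of the factor
print(unclass(f))                # the integer codes and the levels attribute
labels_back <- levels(f)[as.integer(f)]
print(labels_back)               # labels recovered through the codes

# Keeping strings as plain character vectors in a data frame.
df <- data.frame(id = 1:3,
                 grade = c("low", "high", "low"),
                 stringsAsFactors = FALSE)
print(class(df$grade))
```

The same argument can be passed to read.table() and read.csv(), or set globally with options(stringsAsFactors = FALSE) in the R versions this thread is about.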

Cheers,
Simon