(diverted to r-devel, a source code patch attached)
Wacek Kusnierczyk wrote:
Allan Engelhardt wrote:
> Immaterial, yes, but it is always good to test :) and your solution
> *is* faster and it is even faster if you can assume byte strings:
:)
indeed; though if the speed is immaterial (and in this case it
supposedly was), it's probably not worth risking fixed=TRUE removing
'.tif' from the middle of the name, however unlikely this might be (cf
murphy's laws).
but if you can assume that each string ends with a '.tif' (or any other
\..{3} substring), then substr is marginally faster than sub, even as a
three-pass approach, while avoiding the risk of removing '.tif' from the
middle:
   strings = sprintf('f:/foo/bar//%s.tif',
      replicate(1000, paste(sample(letters, 10), collapse='')))
   library(rbenchmark)
   benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
      substr={
         basenames=basename(strings)
         substr(basenames, 1, nchar(basenames)-4) },
      sub=sub('.tif', '', basename(strings), fixed=TRUE, useBytes=TRUE))
   #     test elapsed
   # 1 substr   3.176
   # 2    sub   3.296
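to make the risk mentioned above concrete, here is a small sketch on made-up file names (they are hypothetical, not taken from the thread). with fixed=TRUE, sub removes the first literal '.tif' wherever it occurs, whereas an anchored regular expression or the substr/nchar approach only touches the end of the string:

```r
# hypothetical file names, chosen so that '.tif' also appears mid-name
files = c('scan.tif', 'a.tiff.tif')

# fixed=TRUE: the first literal '.tif' is removed, wherever it occurs
fixed_result = sub('.tif', '', files, fixed=TRUE)

# anchored regex: only a trailing '.tif' is removed
anchored_result = sub('\\.tif$', '', files)

# substr with nchar: drops the last four characters, whatever they are
substr_result = substr(files, 1, nchar(files)-4)

print(fixed_result)     # "scan"   "af.tif"
print(anchored_result)  # "scan"   "a.tiff"
print(substr_result)    # "scan"   "a.tiff"
```

note that the substr variant silently drops the last four characters even if they are not '.tif', so it is only safe under the assumption stated above.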
btw., i wonder why negative indices default to 1 in substr:
   substr('foobar', -5, 5)
   # "fooba"
   # same as substr('foobar', 1, 5)
   substr('foobar', 2, -2)
   # ""
   # same as substr('foobar', 2, 1)
this does not seem to be documented in ?substr. there are ways to make
negative indices meaningful, e.g., by taking them as indexing from
behind (as in, e.g., perl):
   # hypothetical
   substr('foobar', -5, 5)
   # "ooba"
   # same as substr('foobar', 6-5+1, 5)
   substr('foobar', 2, -2)
   # "ooba"
   # same as substr('foobar', 2, 6-2+1)
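for those who want to try these semantics without rebuilding R, the same index arithmetic can be sketched at the R level. this is only an illustration: substr_neg is a hypothetical helper name, it is not part of base R, and it is not the C-level change discussed in this message. it assumes a negative index i should be read as nchar(x) + i + 1, as in the examples above:

```r
# hypothetical helper, not part of base R and not the C patch itself
substr_neg = function(x, start, stop) {
   n = nchar(x)
   # recycle the indices to the length of x so each string gets its own offset
   start = rep_len(start, length(x))
   stop = rep_len(stop, length(x))
   # a negative index i is taken as n + i + 1, i.e., counted from the end
   neg = start < 0
   start[neg] = n[neg] + start[neg] + 1
   neg = stop < 0
   stop[neg] = n[neg] + stop[neg] + 1
   substr(x, start, stop)
}

substr_neg('foobar', -5, 5)
# "ooba"
substr_neg('foobar', 2, -2)
# "ooba"
substr_neg(c('abc.tif', 'defgh.tif'), 1, -5)
# "abc"   "defgh"
```

non-negative indices are passed through unchanged, so existing calls behave as with plain substr.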
there is a trivial fix to src/main/character.c that gives substr the
extended functionality -- see the attached patch. the patch has been
created and tested as follows:
   svn co https://svn.r-project.org/R/trunk r-devel
   cd r-devel
   # modifications made to src/main/character.c
   svn diff > character.c.diff
   svn revert -R .
   patch -p0 < character.c.diff
   ./configure
   make
   make check-all
   # no problems reported
with the patched substr, the original problem can now be solved more
concisely, using a two-pass approach, with performance still better than
the sub/fixed/bytes one, as follows:
   strings = sprintf('f:/foo/bar//%s.tif',
      replicate(1000, paste(sample(letters, 10), collapse='')))
   library(rbenchmark)
   benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
      substr=substr(basename(strings), 1, -5),
      'substr-nchar'={
         basenames=basename(strings)
         substr(basenames, 1, nchar(basenames)-4) },
      sub=sub('.tif', '', basename(strings), fixed=TRUE, useBytes=TRUE))
   #           test elapsed
   # 1       substr   2.981
   # 2 substr-nchar   3.206
   # 3          sub   3.273
if this sounds interesting, i can update the docs accordingly.
vQ
-------------- next part --------------
A non-text attachment was scrubbed...
Name: character.c.diff
Type: text/x-diff
Size: 597 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20090528/d1381eb7/attachment.bin>