split strings

4 messages · William Dunlap, Wacek Kusnierczyk

#
(diverted to r-devel, a source code patch attached)
Wacek Kusnierczyk wrote:
btw., i wonder why negative indices default to 1 in substr:

    substr('foobar', -5, 5)
    # "fooba"
    # substr('foobar', 1, 5)
    substr('foobar', 2, -2)
    # ""
    # substr('foobar', 2, 1)

this does not seem to be documented in ?substr.  there are ways to make
negative indices meaningful, e.g., by taking them as indexing from
behind (as in, e.g., perl):

    # hypothetical
    substr('foobar', -5, 5)
    # "ooba"
    # substr('foobar', 6-5+1, 5)
    substr('foobar', 2, -2)
    # "ooba"
    # substr('foobar', 2, 6-2+1)
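
the proposed semantics can be tried on an unpatched R with a small wrapper. this is only a sketch: `substr_neg` is a hypothetical name, not part of R and not the attached patch, and it assumes the mapping n + index + 1 for negative indices, as implied by the comments in the examples above.

```r
# hypothetical helper (not the attached patch): map negative indices
# to n + index + 1, i.e. count from the end of each string
substr_neg <- function(x, start, stop) {
    n <- nchar(x)
    # recycle the indices so that each string uses its own length
    start <- rep_len(start, length(x))
    stop <- rep_len(stop, length(x))
    start <- ifelse(start < 0, n + start + 1, start)
    stop <- ifelse(stop < 0, n + stop + 1, stop)
    substr(x, start, stop)
}

substr_neg('foobar', -5, 5)
# "ooba"
substr_neg('foobar', 2, -2)
# "ooba"
substr_neg(c('abc.tif', 'de.tif'), 1, -5)
# "abc" "de"
```

non-negative indices are passed to substr unchanged, so existing calls behave as before.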

there is a trivial fix to src/main/character.c that gives substr the
extended functionality -- see the attached patch.  the patch has been
created and tested as follows:

    svn co https://svn.r-project.org/R/trunk r-devel
    cd r-devel
    # modifications made to src/main/character.c
    svn diff > character.c.diff
    svn revert -R .
    patch -p0 < character.c.diff
   
    ./configure
    make
    make check-all
    # no problems reported

with the patched substr, the original problem can now be solved more
concisely, using a two-pass approach, with performance still better than
the sub/fixed/bytes one, as follows:

    strings = sprintf('f:/foo/bar//%s.tif', replicate(1000,
    paste(sample(letters, 10), collapse='')))
    library(rbenchmark)
    benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
        substr=substr(basename(strings), 1, -5),
        'substr-nchar'={
            basenames=basename(strings)
            substr(basenames, 1, nchar(basenames)-4) },
        sub=sub('.tif', '', basename(strings), fixed=TRUE, useBytes=TRUE))
    #     test elapsed
    # 1       substr   2.981
    # 2 substr-nchar   3.206
    # 3          sub   3.273
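
the timings only matter if the variants agree. a quick check on an unpatched R (so without the negative-index call), as a sketch: the variable names are made up for illustration, useBytes is left out of the sub call, and `tools::file_path_sans_ext` is added as an assumed further point of comparison that was not part of the benchmark.

```r
# same kind of test data as in the benchmark
strings <- sprintf('f:/foo/bar//%s.tif',
    replicate(1000, paste(sample(letters, 10), collapse='')))
basenames <- basename(strings)

# drop the last four characters ('.tif') in three different ways
by_nchar <- substr(basenames, 1, nchar(basenames) - 4)
by_sub <- sub('.tif', '', basenames, fixed=TRUE)
by_tools <- tools::file_path_sans_ext(basenames)

# stops with an error if any variant disagrees
stopifnot(identical(by_nchar, by_sub), identical(by_nchar, by_tools))
```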

if this sounds interesting, i can update the docs accordingly.

vQ
-------------- next part --------------
A non-text attachment was scrubbed...
Name: character.c.diff
Type: text/x-diff
Size: 597 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20090528/d1381eb7/attachment.bin>
#
Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
Would your patched code affect the following
use of regexpr's output as input to substr, to
pull out the matched text from the string?
   > x<-c("ooo","good food","bad")
   > r<-regexpr("o+", x)
   > substring(x,r,attr(r,"match.length")+r-1)
   [1] "ooo" "oo"  ""   
   > substr(x,r,attr(r,"match.length")+r-1)
   [1] "ooo" "oo"  ""   
   > r
   [1]  1  2 -1
   attr(,"match.length")
   [1]  3  2 -1
   > attr(r,"match.length")+r-1
   [1]  3  3 -3
   attr(,"match.length")
   [1]  3  2 -1
#
William Dunlap wrote:
no; the output is the same for both the substring and the substr calls.
for the positive indices there is no change, as you might expect.

if i understand your concern, the issue is that regexpr returns -1 (with
the corresponding attribute -1) where there is no match.  in this case,
you expect "" as the substring. 

if there is no match, we have:

    start = r = -1 (the start index you provide)
    stop = attr(r, "match.length") + r - 1 = -1 + -1 - 1 = -3 (the stop index you provide)

for a string of length n, my patch computes the final indices as follows:

    start' = n + start + 1
    stop' = n + stop + 1

whatever the value of n, stop' - start' = stop - start = -3 - 1 = -4. 
that is, stop' < start', hence an empty string is returned, by virtue of
the original code.  (see the sources for details.)

does this answer your question?

vQ
#
Wacek Kusnierczyk wrote:
except for that stop - start = -3 - -1 = -2, but that's still negative,
i.e., stop' < start'.
silly me, sorry.
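
the corrected arithmetic can be checked on an unpatched R. a sketch only: `remap` is a hypothetical stand-in for what the patch does in C, assuming negative indices are mapped to n + index + 1 as in the 'foobar' examples.

```r
x <- c("ooo", "good food", "bad")
r <- regexpr("o+", x)

# plain integer vectors, attributes dropped
start <- as.vector(r)                                # 1 2 -1
stop <- as.vector(attr(r, "match.length") + r - 1)   # 3 3 -3

# hypothetical stand-in for the patch: negative index -> n + index + 1
remap <- function(i, n) ifelse(i < 0, n + i + 1, i)

n <- nchar(x)
start2 <- remap(start, n)
stop2 <- remap(stop, n)

# for the non-match ("bad"), stop2 - start2 is -2, so stop2 < start2
stop2 - start2
# 2 1 -2
substr(x, start2, stop2)
# "ooo" "oo" ""
```

the same constant is added to both indices, so the difference for a non-match is -2 whatever the length of the string.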

vQ