split strings - R-devel | R Mailing Lists

Thu, May 28, 2009 5:30 AM

(diverted to r-devel, a source code patch attached)

Wacek Kusnierczyk wrote:

btw., i wonder why negative indices default to 1 in substr:

    substr('foobar', -5, 5)
    # "fooba"
    # substr('foobar', 1, 5)
    substr('foobar', 2, -2)
    # ""
    # substr('foobar', 2, 1)

this does not seem to be documented in ?substr.  there are ways to make
negative indices meaningful, e.g., by taking them as indexing from
behind (as in, e.g., perl):

    # hypothetical
    substr('foobar', -5, 5)
    # "ooba"
    # substr('foobar', 6-5+1, 5)
    substr('foobar', 2, -2)
    # "ooba"
    # substr('foobar', 2, 6-2+1)
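
The proposed semantics can be approximated in plain R without touching the C sources. The sketch below is only an illustration: `substr_neg` is a hypothetical helper, not part of base R and not the attached patch, and it assumes that a negative index k refers to position n + k + 1 in a string of n characters, as in the examples above.

```r
# hypothetical helper, not part of base R: translates negative indices
# and then delegates to the stock substr
substr_neg = function(x, start, stop) {
    n = nchar(x)
    # recycle the indices so that each string gets its own translation
    start = rep(start, length.out=length(x))
    stop = rep(stop, length.out=length(x))
    # assumption: a negative index k means position n + k + 1 (-1 is the last character)
    start = ifelse(start < 0, n + start + 1, start)
    stop = ifelse(stop < 0, n + stop + 1, stop)
    substr(x, start, stop)
}

substr_neg('foobar', -5, 5)
# "ooba"
substr_neg('foobar', 2, -2)
# "ooba"
substr_neg(c('a.tif', 'bcd.tif'), 1, -5)
# "a" "bcd"
```

Non-negative indices pass through unchanged, so existing calls would behave as before under this assumption.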

there is a trivial fix to src/main/character.c that gives substr the
extended functionality -- see the attached patch.  the patch has been
created and tested as follows:

    svn co https://svn.r-project.org/R/trunk r-devel
    cd r-devel
    # modifications made to src/main/character.c
    svn diff > character.c.diff
    svn revert -R .
    patch -p0 < character.c.diff
   
    ./configure
    make
    make check-all
    # no problems reported

with the patched substr, the original problem can now be solved more
concisely, using a two-pass approach, with performance still better than
the sub/fixed/bytes one, as follows:

    strings = sprintf('f:/foo/bar//%s.tif', replicate(1000,
        paste(sample(letters, 10), collapse='')))
    library(rbenchmark)
    benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
        substr=substr(basename(strings), 1, -5),
        'substr-nchar'={
            basenames=basename(strings)
            substr(basenames, 1, nchar(basenames)-4) },
        sub=sub('.tif', '', basename(strings), fixed=TRUE, useBytes=TRUE))
    #     test elapsed
    # 1       substr   2.981
    # 2 substr-nchar   3.206
    # 3          sub   3.273

if this sounds interesting, i can update the docs accordingly.

vQ
-------------- next part --------------
A non-text attachment was scrubbed...
Name: character.c.diff
Type: text/x-diff
Size: 597 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20090528/d1381eb7/attachment.bin>

William Dunlap

Thu, May 28, 2009 6:23 AM

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com

-----Original Message-----
From: r-devel-bounces at r-project.org 
[mailto:r-devel-bounces at r-project.org] On Behalf Of Wacek Kusnierczyk
Sent: Thursday, May 28, 2009 5:30 AM
Cc: R help project; r-devel at r-project.org; Allan Engelhardt
Subject: Re: [Rd] [R] split strings

(diverted to r-devel, a source code patch attached)

Wacek Kusnierczyk wrote:

Allan Engelhardt wrote:

Immaterial, yes, but it is always good to test :) and your solution
*is* faster and it is even faster if you can assume byte strings:

:)

indeed;  though if the speed is immaterial (and in this case it
supposedly was), it's probably not worth risking fixed=TRUE removing
'.tif' from the middle of the name, however unlikely this might be (cf
murphy's laws).

but if you can assume that each string ends with a '.tif' (or any other
\..{3} substring), then substr is marginally faster than sub, even as a
three-pass approach, while avoiding the risk of removing '.tif' from the
middle:

    strings = sprintf('f:/foo/bar//%s.tif', replicate(1000,
        paste(sample(letters, 10), collapse='')))
    library(rbenchmark)
    benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
        substr={basenames=basename(strings); substr(basenames, 1,
            nchar(basenames)-4)},
        sub=sub('.tif', '', basename(strings), fixed=TRUE, useBytes=TRUE))

    #     test elapsed
    # 1 substr   3.176
    # 2    sub   3.296
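
The risk mentioned above, of a fixed-string sub removing '.tif' from the middle of a name, can be shown with a small example. The file name is made up for illustration, and the anchored regular expression at the end is an additional alternative that was not benchmarked in this thread.

```r
# hypothetical file name in which '.tif' also occurs before the extension
name = 'scan.tiff.tif'

# fixed=TRUE removes the first literal '.tif', which here is not the extension
sub('.tif', '', name, fixed=TRUE)
# "scanf.tif"

# dropping the last four characters leaves the rest of the name intact
substr(name, 1, nchar(name)-4)
# "scan.tiff"

# an anchored regular expression also touches only the trailing extension
sub('\\.tif$', '', name)
# "scan.tiff"
```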

btw., i wonder why negative indices default to 1 in substr:

    substr('foobar', -5, 5)
    # "fooba"
    # substr('foobar', 1, 5)
    substr('foobar', 2, -2)
    # ""
    # substr('foobar', 2, 1)

this does not seem to be documented in ?substr.

Would your patched code affect the following
use of regexpr's output as input to substr, to
pull out the matched text from the string?
   > x<-c("ooo","good food","bad")
   > r<-regexpr("o+", x)
   > substring(x,r,attr(r,"match.length")+r-1)
   [1] "ooo" "oo"  ""   
   > substr(x,r,attr(r,"match.length")+r-1)
   [1] "ooo" "oo"  ""   
   > r
   [1]  1  2 -1
   attr(,"match.length")
   [1]  3  2 -1
   > attr(r,"match.length")+r-1
   [1]  3  3 -3
   attr(,"match.length")
   [1]  3  2 -1

Wacek Kusnierczyk

Thu, May 28, 2009 6:50 AM

William Dunlap wrote:

no; same output

for the positive indices there is no change, as you might expect.

if i understand your concern, the issue is that regexpr returns -1 (with
the corresponding attribute -1) where there is no match.  in this case,
you expect "" as the substring. 

if there is no match, we have:

    start = r = -1 (the start index you provide)
    stop = attr(r, "match.length") + r - 1 = -1 + -1 - 1 = -3 (the stop index you provide)

for a string of length n, my patch computes the final indices as follows:

    start' = n + start + 1
    stop' = n + stop + 1

whatever the value of n, stop' - start' = stop - start = -3 - 1 = -4. 
that is, stop' < start', hence an empty string is returned, by virtue of
the original code.  (see the sources for details.)

does this answer your question?

vQ

Wacek Kusnierczyk

Thu, May 28, 2009 7:05 AM

Wacek Kusnierczyk wrote:

except for that stop - start = -3 - -1 = -2, but that's still negative,
i.e., stop' < start'.
silly me, sorry.
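
The corrected arithmetic can be checked in unpatched R by translating the indices by hand. This is only a sketch: it assumes the patch maps a negative index k to n + k + 1, consistent with the hypothetical substr examples earlier in the thread, and the names `start2` and `stop2` are introduced here for illustration.

```r
x = c('ooo', 'good food', 'bad')
r = regexpr('o+', x)
start = as.integer(r)
stop = start + attr(r, 'match.length') - 1

# assumed translation of negative indices (n + k + 1), applied to both ends
n = nchar(x)
start2 = ifelse(start < 0, n + start + 1, start)
stop2 = ifelse(stop < 0, n + stop + 1, stop)

# the same offset is added to both ends, so the difference is preserved
stop - start
# 2 1 -2
stop2 - start2
# 2 1 -2

# where there is no match, stop2 < start2, so substr returns ""
substr(x, start2, stop2)
# "ooo" "oo" ""
```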

vQ