
split strings

12 messages · Monica Pisica, ronggui, Gabor Grothendieck +3 more

#
Hi everybody,
 
I have a character vector and I would like to extract certain parts of each element. My vector is named metr_list:
 
[1] "F:/Naval_Live_Oaks/2005/data//BE.tif"  
[2] "F:/Naval_Live_Oaks/2005/data//CH.tif"  
[3] "F:/Naval_Live_Oaks/2005/data//CRR.tif" 
[4] "F:/Naval_Live_Oaks/2005/data//HOME.tif"

And I would like to extract BE, CH, CRR, and HOME into a different vector named "names.id", for example. I read the help files for sub, grep and the like, but I have to admit that I did not understand them. So I've done this (which does the job but is extremely clumsy):
 
b <- strsplit(metr_list, "//")
b <- unlist(b)
d <- strsplit(b, "\\.")
d <- unlist(d)
names.id <- d[c(2, 5, 8, 11)]

Can anybody show what would be the proper way to achieve this with some explanations?
 
Thanks,
 
Monica
#
They look like file paths, so you can use basename() first,
then gsub() to strip the suffix.
[1] "BE" "CH"
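The code from this reply did not survive in the archive; only part of its output did. As a minimal sketch of the approach described (basename() followed by gsub()), assuming the metr_list vector from the original question, it might look like this. The exact pattern is an assumption and not necessarily what was posted:

```r
# The vector from the original question.
metr_list <- c("F:/Naval_Live_Oaks/2005/data//BE.tif",
               "F:/Naval_Live_Oaks/2005/data//CH.tif",
               "F:/Naval_Live_Oaks/2005/data//CRR.tif",
               "F:/Naval_Live_Oaks/2005/data//HOME.tif")

# basename() drops everything up to and including the last path separator.
files <- basename(metr_list)

# Strip a trailing ".tif"; the dot is escaped so that it matches a literal
# dot rather than any character.
names.id <- gsub("\\.tif$", "", files)
print(names.id)
```

Escaping the dot and anchoring with `$` keeps the substitution from touching anything but a literal ".tif" at the end of each name.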

Ronggui

2009/5/26 Monica Pisica <pisicandru at hotmail.com>:

  
    
#
Try this:

sub(".tif$", "", basename(metr_list))

On Tue, May 26, 2009 at 9:27 AM, Monica Pisica <pisicandru at hotmail.com> wrote:
#
Monica Pisica wrote:
one way that seems reasonable is to use sub:

    output = sub('.*//(.*)[.]tif$', '\\1', input)

which says 'from each string remember the substring between the
rightmost double slash and a .tif extension, exclusive, and replace the
whole thing with the captured part'.  if the pattern does not match, you
get the original input back.  for a string that does match:

    sub('.*//(.*)[.]tif$', '\\1', 'f:/foo/bar//buz.tif')
    # buz  
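To make both branches of that description concrete, here is a small sketch that runs the same pattern on one matching and one non-matching string. The second path is a hypothetical input added only for illustration:

```r
pattern <- '.*//(.*)[.]tif$'

# Matching input: the greedy '.*//' consumes everything up to the last '//',
# and '\\1' replaces the whole match with the captured group.
matched <- sub(pattern, '\\1', 'f:/foo/bar//buz.tif')

# Non-matching input (hypothetical path with a different extension):
# sub() finds nothing to replace and returns the string unchanged.
unmatched <- sub(pattern, '\\1', 'f:/foo/bar//buz.png')

print(matched)    # "buz"
print(unmatched)  # "f:/foo/bar//buz.png"
```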

vQ
#
Hi everybody,
 
Thank you for the suggestions, and especially for the explanation Waclaw provided for his code. Maybe one day I will be able to wrap my head around this.
 
Thanks again,
 
Monica

#
Monica Pisica wrote:
you're welcome.  note that if efficiency is an issue, you'd better have
perl=TRUE there:

    output = sub('.*//(.*)[.]tif$', '\\1', input, perl=TRUE)

with perl=TRUE, the one-pass solution is somewhat faster than the
two-pass solution of gabor's -- which, however, is probably easier to
understand;  with perl=FALSE (the default), the performance drops:

    strings = sprintf(
        'f:/foo/bar//%s.tif',
        replicate(1000, paste(sample(letters, 10), collapse='')))
    library(rbenchmark)
    benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
       'one-pass, perl'=sub('.*//(.*)[.]tif$', '\\1', strings, perl=TRUE),
       'two-pass, perl'=sub('.tif$', '', basename(strings), perl=TRUE),
       'one-pass, no perl'=sub('.*//(.*)[.]tif$', '\\1', strings, perl=FALSE),
       'two-pass, no perl'=sub('.tif$', '', basename(strings), perl=FALSE))
    # 1    one-pass, perl   3.391
    # 2    two-pass, perl   4.944
    # 3 one-pass, no perl  18.836
    # 4 two-pass, no perl   5.191

vQ
#
Although speed is really immaterial here this is likely
to be faster than all shown so far:

sub(".tif", "", basename(metr_list), fixed = TRUE)

It does not allow file names with .tif in the middle
of them since it will delete the first occurrence rather
than the last but such a situation is highly unlikely.
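A short sketch of that caveat, using a hypothetical file name that contains ".tif" before the real extension. It compares the fixed = TRUE call with an anchored regular expression:

```r
# Hypothetical file name with ".tif" in the middle (as part of ".tiff").
f <- "scan.tiff_backup.tif"

# fixed = TRUE removes the first literal ".tif", which here is not the suffix.
fixed_result <- sub(".tif", "", f, fixed = TRUE)

# An anchored regular expression only removes ".tif" at the end of the name.
anchored_result <- sub("\\.tif$", "", f)

print(fixed_result)     # "scanf_backup.tif"
print(anchored_result)  # "scan.tiff_backup"
```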


On Tue, May 26, 2009 at 4:24 PM, Wacek Kusnierczyk
<Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:
#
Immaterial, yes, but it is always good to test :) and your solution *is* 
faster and it is even faster if you can assume byte strings:

 > strings = sprintf('f:/foo/bar//%s.tif',
     replicate(1000, paste(sample(letters, 10), collapse='')))
 > library(rbenchmark)
 > benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
   'one-pass, perl'=sub('.*//(.*)[.]tif$', '\\1', strings, perl=TRUE),
   'two-pass, perl'=sub('.tif$', '', basename(strings), perl=TRUE),
   'one-pass, no perl'=sub('.*//(.*)[.]tif$', '\\1', strings, perl=FALSE),
   'two-pass, no perl'=sub('.tif$', '', basename(strings), perl=FALSE),
   'fixed'=sub(".tif", "", basename(strings), fixed=TRUE),
   'fixed, bytes'=sub(".tif", "", basename(strings), fixed=TRUE, useBytes=TRUE))

               test elapsed
1    one-pass, perl   2.946
2    two-pass, perl   3.858
3 one-pass, no perl  15.884
4 two-pass, no perl   3.788
5             fixed   2.264
6      fixed, bytes   1.813
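
The timings only matter if the variants agree on the result. As a sketch that needs nothing beyond base R, the check below builds the same kind of test strings and confirms that the approaches return identical vectors. The seed and the list names are arbitrary choices made for this illustration:

```r
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible
strings <- sprintf('f:/foo/bar//%s.tif',
                   replicate(1000, paste(sample(letters, 10), collapse = '')))

# One entry per approach discussed in the thread.
results <- list(
  one_pass    = sub('.*//(.*)[.]tif$', '\\1', strings, perl = TRUE),
  two_pass    = sub('.tif$', '', basename(strings)),
  fixed       = sub('.tif', '', basename(strings), fixed = TRUE),
  fixed_bytes = sub('.tif', '', basename(strings), fixed = TRUE,
                    useBytes = TRUE)
)

# Compare every other approach against the first one.
all_same <- all(vapply(results[-1], identical, logical(1), results[[1]]))
print(all_same)
```

On input like this, where the only dot is the one in the extension, the variants are interchangeable and the choice comes down to readability and speed.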

Allan
Gabor Grothendieck wrote:
#
Allan Engelhardt wrote:
:)

indeed;  though if the speed is immaterial (and in this case it
supposedly was), it's probably not worth risking fixed=TRUE removing
'.tif' from the middle of the name, however unlikely this might be (cf
murphy's laws).

but if you can assume that each string ends with a '.tif' (or any other
\..{3} substring), then substr is marginally faster than sub, even as a
three-pass approach, while avoiding the risk of removing '.tif' from the
middle:

    strings = sprintf('f:/foo/bar//%s.tif',
        replicate(1000, paste(sample(letters, 10), collapse='')))
    library(rbenchmark)
    benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
       substr={
           basenames=basename(strings)
           substr(basenames, 1, nchar(basenames)-4) },
       sub=sub('.tif', '', basename(strings), fixed=TRUE, useBytes=TRUE))
    #     test elapsed
    # 1 substr   3.176
    # 2    sub   3.296
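
Outside a benchmark, the substr approach can be sketched as follows. It assumes, as stated above, that every name ends in a dot plus three characters; the two paths are hypothetical:

```r
# Hypothetical inputs; the second has ".tif" inside the name as well.
paths <- c('f:/foo/bar//buz.tif', 'f:/foo/bar//my.tiff_copy.tif')

basenames <- basename(paths)

# Drop the last four characters ('.tif'), whatever comes before them.
stems <- substr(basenames, 1, nchar(basenames) - 4)
print(stems)  # "buz" "my.tiff_copy"
```

Because only the position counts, a ".tif" in the middle of a name is left alone, but a name with a longer or shorter extension would be cut in the wrong place.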


vQ
#
Hi,
 
Luckily for me, until now I have not had to do this type of parsing very often, but who knows? Up to now I was pretty happy with strsplit. Anyway, thanks again for all the help; I really appreciate it.
 
Monica

#
(diverted to r-devel, a source code patch attached)
Wacek Kusnierczyk wrote:
btw., i wonder why negative indices default to 1 in substr:

    substr('foobar', -5, 5)
    # "fooba"
    # (same as substr('foobar', 1, 5))
    substr('foobar', 2, -2)
    # ""
    # (same as substr('foobar', 2, 1))

this does not seem to be documented in ?substr.  there are ways to make
negative indices meaningful, e.g., by taking them as indexing from
behind (as in, e.g., perl):

    # hypothetical
    substr('foobar', -5, 5)
    # "ooba"
    # (same as substr('foobar', 6-5+1, 5))
    substr('foobar', 2, -2)
    # "ooba"
    # (same as substr('foobar', 2, 6-2+1))
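
The proposed semantics can also be tried without patching R. The sketch below is a plain-R emulation under the mapping used in the examples above (a negative position -k becomes nchar(x) - k + 1). The function name substr_from_end is hypothetical, and this is not the attached patch to character.c:

```r
# Hypothetical helper, not part of base R: negative positions count from the
# end of the string, so -1 refers to the last character.
substr_from_end <- function(x, start, stop) {
  n <- nchar(x)
  # Recycle the positions to the length of x so the ifelse() calls below
  # work element by element.
  start <- rep_len(start, length(x))
  stop  <- rep_len(stop, length(x))
  start <- ifelse(start < 0, n + start + 1, start)
  stop  <- ifelse(stop  < 0, n + stop  + 1, stop)
  substr(x, start, stop)
}

print(substr_from_end('foobar', -5, 5))   # "ooba"
print(substr_from_end('foobar', 2, -2))   # "ooba"

# Dropping a four-character extension, as in the original problem.
print(substr_from_end(basename('f:/foo/bar//buz.tif'), 1, -5))  # "buz"
```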

there is a trivial fix to src/main/character.c that gives substr the
extended functionality -- see the attached patch.  the patch has been
created and tested as follows:

    svn co https://svn.r-project.org/R/trunk r-devel
    cd r-devel
    # modifications made to src/main/character.c
    svn diff > character.c.diff
    svn revert -R .
    patch -p0 < character.c.diff
   
    ./configure
    make
    make check-all
    # no problems reported

with the patched substr, the original problem can now be solved more
concisely, using a two-pass approach, with performance still better than
the sub/fixed/bytes one, as follows:

    strings = sprintf('f:/foo/bar//%s.tif',
        replicate(1000, paste(sample(letters, 10), collapse='')))
    library(rbenchmark)
    benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
        substr=substr(basename(strings), 1, -5),
        'substr-nchar'={
            basenames=basename(strings)
            substr(basenames, 1, nchar(basenames)-4) },
        sub=sub('.tif', '', basename(strings), fixed=TRUE, useBytes=TRUE))
    #     test elapsed
    # 1       substr   2.981
    # 2 substr-nchar   3.206
    # 3          sub   3.273

if this sounds interesting, i can update the docs accordingly.

vQ
#
Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com

Would your patched code affect the following
use of regexpr's output as input to substr, to
pull out the matched text from the string?
   > x<-c("ooo","good food","bad")
   > r<-regexpr("o+", x)
   > substring(x,r,attr(r,"match.length")+r-1)
   [1] "ooo" "oo"  ""   
   > substr(x,r,attr(r,"match.length")+r-1)
   [1] "ooo" "oo"  ""   
   > r
   [1]  1  2 -1
   attr(,"match.length")
   [1]  3  2 -1
   > attr(r,"match.length")+r-1
   [1]  3  3 -3
   attr(,"match.length")
   [1]  3  2 -1
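
For comparison, here is a sketch of two ways to pull out the matched text that do not depend on how substr treats negative positions. It uses regmatches(), which is in base R in current versions but postdates this thread, and an explicit guard for the non-matches. The variable names are chosen for this illustration:

```r
x <- c("ooo", "good food", "bad")
r <- regexpr("o+", x)

# regmatches() returns only the elements that matched, so the -1 entries
# that regexpr() produces for non-matches never reach substr()/substring().
matched_only <- regmatches(x, r)
print(matched_only)  # "ooo" "oo"

# To keep one result per input, handle the non-matches explicitly.
start <- as.integer(r)             # as.integer() drops the attributes
len   <- attr(r, "match.length")
hit   <- start > 0
out <- character(length(x))        # non-matches stay as ""
out[hit] <- substring(x[hit], start[hit], start[hit] + len[hit] - 1)
print(out)           # "ooo" "oo"  ""
```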