Hi everybody,
I have a character vector and I would like to extract certain parts of each element. My vector is named metr_list:
[1] "F:/Naval_Live_Oaks/2005/data//BE.tif"
[2] "F:/Naval_Live_Oaks/2005/data//CH.tif"
[3] "F:/Naval_Live_Oaks/2005/data//CRR.tif"
[4] "F:/Naval_Live_Oaks/2005/data//HOME.tif"
I would like to extract BE, CH, CRR, and HOME into a different vector named "names.id", for example. I read the help files for sub, grep and the like, but I have to admit that I did not understand them. So I have done this (which does the job but is extremely clumsy):
    b <- strsplit(metr_list, "//")
    b <- unlist(b)
    d <- strsplit(b, "\\.")
    d <- unlist(d)
    names.id <- d[c(2, 5, 8, 11)]
Can anybody show me the proper way to achieve this, with some explanation?
Thanks,
Monica
split strings
12 messages · Monica Pisica, ronggui, Gabor Grothendieck +3 more
They look like file paths, so you can use basename() first and then gsub() to strip the suffix.
x<-c("F:/Naval_Live_Oaks/2005/data//BE.tif","F:/Naval_Live_Oaks/2005/data//CH.tif")
x2<-sapply(x,basename,USE.NAMES=FALSE)
gsub("[.].{1,}$","",x2)
[1] "BE" "CH"
Ronggui
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
HUANG Ronggui, Wincent PhD Candidate Dept of Public and Social Administration City University of Hong Kong Home page: http://asrr.r-forge.r-project.org/rghuang.html
Try this:
sub(".tif$", "", basename(metr_list))
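A minimal sketch of this two-step approach, using the paths from the original question. One detail is an assumption of this sketch rather than part of the reply above: the dot is escaped as "\\.tif$". In the unescaped pattern ".tif$" the dot matches any character, which works for these file names but would also trim a name that merely ends in "tif".

```r
metr_list <- c("F:/Naval_Live_Oaks/2005/data//BE.tif",
               "F:/Naval_Live_Oaks/2005/data//CH.tif",
               "F:/Naval_Live_Oaks/2005/data//CRR.tif",
               "F:/Naval_Live_Oaks/2005/data//HOME.tif")

# Step 1: basename() drops everything up to the last path separator.
files <- basename(metr_list)

# Step 2: remove a literal ".tif" anchored at the end of each name.
names.id <- sub("\\.tif$", "", files)
print(names.id)
# [1] "BE"   "CH"   "CRR"  "HOME"

# Why the escape matters: an unescaped dot matches any character.
unescaped <- sub(".tif$", "", "motif")    # "otif" is removed, leaving "m"
escaped   <- sub("\\.tif$", "", "motif")  # no literal ".tif", so unchanged
```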
Monica Pisica wrote:
Hi everybody, I have a vector of characters and I would like to extract certain parts. My vector is named metr_list:
[1] "F:/Naval_Live_Oaks/2005/data//BE.tif"
[2] "F:/Naval_Live_Oaks/2005/data//CH.tif"
[3] "F:/Naval_Live_Oaks/2005/data//CRR.tif"
[4] "F:/Naval_Live_Oaks/2005/data//HOME.tif"
And I would like to extract BE, CH, CRR, and HOME in a different vector named "names.id"
one way that seems reasonable is to use sub:
    output = sub('.*//(.*)[.]tif$', '\\1', input)
which says 'from each string, remember the substring between the
rightmost double slash and the .tif extension, exclusive, and replace the
whole thing with the captured part'. if the pattern does not match, you
get the original input back unchanged. an example of a match:
    sub('.*//(.*)[.]tif$', '\\1', 'f:/foo/bar//buz.tif')
    # "buz"
vQ
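To make both halves of that explanation concrete, here is a small sketch showing a matching and a non-matching input. The second path is made up for illustration; it is not from the thread.

```r
pattern <- ".*//(.*)[.]tif$"

# Match: the whole string is replaced by capture group 1 (the part in parentheses).
matched <- sub(pattern, "\\1", "f:/foo/bar//buz.tif")
print(matched)
# [1] "buz"

# No match (hypothetical path with a different extension):
# sub() finds nothing to replace and returns its input untouched.
unmatched <- sub(pattern, "\\1", "f:/foo/bar//buz.png")
print(unmatched)
# [1] "f:/foo/bar//buz.png"
```

Because a non-match is returned silently, it can be worth checking with grepl(pattern, input) first when the inputs are not known to be uniform.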
Hi everybody,
Thank you for the suggestions, and especially for the explanation Waclaw provided for his code. Maybe one day I will be able to wrap my head around this.
Thanks again,
Monica
Monica Pisica wrote:
Hi everybody, Thank you for the suggestions and especially the explanation Waclaw provided for his code. Maybe one day I will be able to wrap my head around this. Thanks again,
you're welcome. note that if efficiency is an issue, you'd better have
perl=TRUE there:
output = sub('.*//(.*)[.]tif$', '\\1', input, perl=TRUE)
with perl=TRUE, the one-pass solution is somewhat faster than the
two-pass solution of gabor's -- which, however, is probably easier to
understand; with perl=FALSE (the default), the performance drops:
    strings = sprintf(
        'f:/foo/bar//%s.tif',
        replicate(1000, paste(sample(letters, 10), collapse='')))
    library(rbenchmark)
    benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
        'one-pass, perl'=sub('.*//(.*)[.]tif$', '\\1', strings, perl=TRUE),
        'two-pass, perl'=sub('.tif$', '', basename(strings), perl=TRUE),
        'one-pass, no perl'=sub('.*//(.*)[.]tif$', '\\1', strings, perl=FALSE),
        'two-pass, no perl'=sub('.tif$', '', basename(strings), perl=FALSE))
    #                test elapsed
    # 1    one-pass, perl   3.391
    # 2    two-pass, perl   4.944
    # 3 one-pass, no perl  18.836
    # 4 two-pass, no perl   5.191
vQ
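For readers without the rbenchmark package, the same comparison can be roughly reproduced with base R's system.time(). This is only a sketch: the helper names one_pass and two_pass, the seed and the repetition count are assumptions made here, and timings will differ by machine and R version, so no numbers are shown.

```r
set.seed(1)  # arbitrary seed, only so the test strings are reproducible
strings <- sprintf("f:/foo/bar//%s.tif",
                   replicate(1000, paste(sample(letters, 10), collapse = "")))

# Hypothetical helper names wrapping the two approaches from the thread.
one_pass <- function(x, perl) sub(".*//(.*)[.]tif$", "\\1", x, perl = perl)
two_pass <- function(x) sub("[.]tif$", "", basename(x))

# Before timing anything, confirm the approaches agree on this input.
stopifnot(identical(one_pass(strings, TRUE), two_pass(strings)))

reps <- 50
timings <- c(
  one_pass_perl    = system.time(for (i in seq_len(reps)) one_pass(strings, TRUE))[["elapsed"]],
  one_pass_no_perl = system.time(for (i in seq_len(reps)) one_pass(strings, FALSE))[["elapsed"]],
  two_pass         = system.time(for (i in seq_len(reps)) two_pass(strings))[["elapsed"]]
)
print(timings)
```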
Although speed is really immaterial here, this is likely
to be faster than everything shown so far:
    sub(".tif", "", basename(metr_list), fixed = TRUE)
It does not cope with file names that have .tif in the middle,
since it deletes the first occurrence rather than the last,
but such a situation is highly unlikely.
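The caveat about the first occurrence can be seen with a contrived file name (made up here, not from the thread). The sketch also shows two alternatives that only touch the end of the name: an anchored regular expression, and tools::file_path_sans_ext(), which ships with R and strips whatever the last extension is.

```r
tricky <- "F:/data//my.tiff.tif"   # hypothetical name containing ".tif" twice
file <- basename(tricky)

# fixed = TRUE removes the FIRST literal ".tif", which sits inside ".tiff".
fixed_result <- sub(".tif", "", file, fixed = TRUE)
print(fixed_result)
# [1] "myf.tif"

# An anchored pattern can only match the extension at the end.
anchored_result <- sub("[.]tif$", "", file)
print(anchored_result)
# [1] "my.tiff"

# Alternative from the tools package: drop the final extension.
sans_ext <- tools::file_path_sans_ext(file)
print(sans_ext)
# [1] "my.tiff"
```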
Immaterial, yes, but it is always good to test :) and your solution *is*
faster and it is even faster if you can assume byte strings:
> strings = sprintf('f:/foo/bar//%s.tif', replicate(1000,
+     paste(sample(letters, 10), collapse='')))
> library(rbenchmark)
> benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
+     'one-pass, perl'=sub('.*//(.*)[.]tif$', '\\1', strings, perl=TRUE),
+     'two-pass, perl'=sub('.tif$', '', basename(strings), perl=TRUE),
+     'one-pass, no perl'=sub('.*//(.*)[.]tif$', '\\1', strings, perl=FALSE),
+     'two-pass, no perl'=sub('.tif$', '', basename(strings), perl=FALSE),
+     'fixed'=sub(".tif", "", basename(strings), fixed=TRUE),
+     'fixed, bytes'=sub(".tif", "", basename(strings), fixed=TRUE,
+         useBytes=TRUE))
               test elapsed
1    one-pass, perl   2.946
2    two-pass, perl   3.858
3 one-pass, no perl  15.884
4 two-pass, no perl   3.788
5             fixed   2.264
6      fixed, bytes   1.813
Allan
Allan Engelhardt wrote:
Immaterial, yes, but it is always good to test :) and your solution *is* faster and it is even faster if you can assume byte strings:
:)
indeed; though if the speed is immaterial (and in this case it
supposedly was), it's probably not worth risking fixed=TRUE removing
'.tif' from the middle of the name, however unlikely this might be (cf
murphy's laws).
but if you can assume that each string ends with a '.tif' (or any other
\..{3} substring), then substr is marginally faster than sub, even as a
three-pass approach, while avoiding the risk of removing '.tif' from the
middle:
    strings = sprintf('f:/foo/bar//%s.tif', replicate(1000,
        paste(sample(letters, 10), collapse='')))
    library(rbenchmark)
    benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
        substr={basenames=basename(strings); substr(basenames, 1, nchar(basenames)-4)},
        sub=sub('.tif', '', basename(strings), fixed=TRUE, useBytes=TRUE))
    #     test elapsed
    # 1 substr   3.176
    # 2    sub   3.296
vQ
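Spelled out on the original file names, the substr approach looks like the sketch below. The grepl() guard is an addition made here, not part of the message above; it makes the "every string ends with .tif" assumption explicit, since substr would otherwise silently cut four characters off any name.

```r
metr_list <- c("F:/Naval_Live_Oaks/2005/data//BE.tif",
               "F:/Naval_Live_Oaks/2005/data//CH.tif",
               "F:/Naval_Live_Oaks/2005/data//CRR.tif",
               "F:/Naval_Live_Oaks/2005/data//HOME.tif")

basenames <- basename(metr_list)

# Guard (added for this sketch): fail early if any name lacks the extension.
stopifnot(all(grepl("[.]tif$", basenames)))

# Keep everything except the last 4 characters, i.e. the ".tif" suffix.
names.id <- substr(basenames, 1, nchar(basenames) - 4)
print(names.id)
# [1] "BE"   "CH"   "CRR"  "HOME"
```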
Hi,
Luckily for me, until now I have not had to do this type of parsing very often, but who knows? Up to now I was pretty happy with strsplit. Anyway, thanks again for all the help, I really appreciate it.
Monica
(diverted to r-devel, a source code patch attached)
btw., i wonder why negative indices default to 1 in substr:
    substr('foobar', -5, 5)
    # "fooba"
    # same as substr('foobar', 1, 5)
    substr('foobar', 2, -2)
    # ""
    # same as substr('foobar', 2, 1)
this does not seem to be documented in ?substr. there are ways to make
negative indices meaningful, e.g., by taking them as indexing from
the end of the string (as in, e.g., perl):
    # hypothetical
    substr('foobar', -5, 5)
    # "ooba"
    # same as substr('foobar', 6-5+1, 5)
    substr('foobar', 2, -2)
    # "ooba"
    # same as substr('foobar', 2, 6-2+1)
there is a trivial fix to src/main/character.c that gives substr the
extended functionality -- see the attached patch. the patch has been
created and tested as follows:
svn co https://svn.r-project.org/R/trunk r-devel
cd r-devel
# modifications made to src/main/character.c
svn diff > character.c.diff
svn revert -R .
patch -p0 < character.c.diff
./configure
make
make check-all
# no problems reported
with the patched substr, the original problem can now be solved more
concisely, using a two-pass approach, with performance still better than
the sub/fixed/bytes one, as follows:
    strings = sprintf('f:/foo/bar//%s.tif', replicate(1000,
        paste(sample(letters, 10), collapse='')))
    library(rbenchmark)
    benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
        substr=substr(basename(strings), 1, -5),
        'substr-nchar'={
            basenames=basename(strings)
            substr(basenames, 1, nchar(basenames)-4) },
        sub=sub('.tif', '', basename(strings), fixed=TRUE, useBytes=TRUE))
    #           test elapsed
    # 1       substr   2.981
    # 2 substr-nchar   3.206
    # 3          sub   3.273
if this sounds interesting, i can update the docs accordingly.
vQ
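The proposed semantics can be tried out without patching R by wrapping substr in a small helper. This is only a sketch of the idea: substr_neg is a hypothetical name introduced here, it is not the attached C patch, and it leaves positive indices and zero untouched.

```r
# Hypothetical wrapper: negative positions count from the end of each string,
# so -k maps to nchar(x) - k + 1 (the mapping used in the examples above).
substr_neg <- function(x, start, stop) {
  n <- nchar(x)
  # Recycle scalar indices so they line up with the per-string lengths.
  start <- rep_len(start, length(x))
  stop  <- rep_len(stop, length(x))
  start <- ifelse(start < 0, n + start + 1, start)
  stop  <- ifelse(stop  < 0, n + stop  + 1, stop)
  substr(x, start, stop)
}

print(substr_neg("foobar", -5, 5))
# [1] "ooba"
print(substr_neg("foobar", 2, -2))
# [1] "ooba"

# The original task: drop the 4-character ".tif" suffix.
print(substr_neg(c("buz.tif", "HOME.tif"), 1, -5))
# [1] "buz"  "HOME"
```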
Bill Dunlap TIBCO Software Inc - Spotfire Division wdunlap tibco.com
-----Original Message-----
From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-project.org] On Behalf Of Wacek Kusnierczyk
Sent: Thursday, May 28, 2009 5:30 AM
Cc: R help project; r-devel at r-project.org; Allan Engelhardt
Subject: Re: [Rd] [R] split strings
(diverted to r-devel, a source code patch attached)
Wacek Kusnierczyk wrote:
btw., i wonder why negative indices default to 1 in substr:
substr('foobar', -5, 5)
# "fooba"
# substr('foobar', 1, 5)
substr('foobar', 2, -2)
# ""
# substr('foobar', 2, 1)
this does not seem to be documented in ?substr.
Would your patched code affect the following
use of regexpr's output as input to substr, to
pull out the matched text from the string?
> x<-c("ooo","good food","bad")
> r<-regexpr("o+", x)
> substring(x,r,attr(r,"match.length")+r-1)
[1] "ooo" "oo" ""
> substr(x,r,attr(r,"match.length")+r-1)
[1] "ooo" "oo" ""
> r
[1] 1 2 -1
attr(,"match.length")
[1] 3 2 -1
> attr(r,"match.length")+r-1
[1] 3 3 -3
attr(,"match.length")
[1] 3 2 -1
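The idiom above depends on substr returning "" when regexpr reports no match (start and length both -1). A version that does not rely on how substr treats negative positions might look like the sketch below; it handles the no-match case explicitly, and also shows regmatches(), which returns only the elements that matched.

```r
x <- c("ooo", "good food", "bad")
r <- regexpr("o+", x)

start <- as.integer(r)           # -1 where there was no match
len   <- attr(r, "match.length")
hit   <- start > 0

# Fill in matches only where regexpr found one; non-matches stay "".
matched <- character(length(x))
matched[hit] <- substr(x[hit], start[hit], start[hit] + len[hit] - 1)
print(matched)
# [1] "ooo" "oo"  ""

# regmatches() drops the non-matching elements instead of returning "".
only_hits <- regmatches(x, r)
print(only_hits)
# [1] "ooo" "oo"
```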