help with regexpr in gsub

Wed, Jan 17, 2007 5:44 PM

Thanks for six ways to skin this cat! I am just beginning to learn about
the power of regular expressions, and I appreciate the many examples of
how they can be used in this context. This knowledge will come in handy
the next time the number of characters varies both before and after the
dot. On my machine and for my particular example, however, Seth is
correct that substr is by far the fastest. I had forgotten that
substr is vectorized.
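
As a minimal sketch of what "vectorized" buys here: a single substr() call trims every element at once, while a regex handles the case where the part before the dot is not a fixed width. The sample vector below is hypothetical, modelled on the IDs shown further down, and the sub() pattern is an illustration rather than one of the solutions actually posted in the thread.

```r
# Hypothetical sample in the same "GO:nnnnnnn.EVIDENCE" form as the data in this thread
go.ids <- c("GO:0006091.NA", "GO:0008104.ISS", "GO:0006091.NAS")

# substr() is vectorized: one call keeps characters 1-10 of every element
by.position <- substr(go.ids, 1, 10)

# Regex alternative that does not assume a fixed-width prefix:
# delete everything from the first literal dot to the end of the string
by.pattern <- sub("\\..*$", "", go.ids)

by.position
# [1] "GO:0006091" "GO:0008104" "GO:0006091"
identical(by.position, by.pattern)
# [1] TRUE
```

The substr() form only works because every ID here has exactly 10 characters before the dot; the sub() form keeps working when that width changes.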

Below are the output of my speed trials and my sessionInfo, in case
anyone is curious. I artificially made the go.ids vector 10X its normal
length to magnify the differences. I also verified that each solution
produced the expected result.
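
A comparison of this kind can be sketched roughly as follows. The three candidate functions and the repeated sample vector are assumptions for illustration; they are not the six solutions that were actually timed, and the numbers will differ by machine and R version.

```r
# Build a long test vector by recycling a small hypothetical sample
sample.ids <- c("GO:0006091.NA", "GO:0008104.ISS", "GO:0006091.NAS")
go.ids <- rep(sample.ids, length.out = 79750)

# Illustrative candidates (assumed, not the exact code from the thread)
methods <- list(
  substr   = function(x) substr(x, 1, 10),
  sub      = function(x) sub("\\..*$", "", x),
  strsplit = function(x) sapply(strsplit(x, ".", fixed = TRUE),
                                function(parts) parts[1])
)

# Check that every candidate returns the same answer before timing it
results <- lapply(methods, function(f) f(go.ids))
all.agree <- all(sapply(results, identical, results[[1]]))

# Elapsed seconds for each candidate on the full vector
timings <- sapply(methods, function(f) system.time(f(go.ids))[["elapsed"]])

all.agree
timings
```

Checking agreement first matters: a fast method that returns something different is not a valid comparison.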

Thanks again for your generous help, Mark

length(go.ids)
[1] 79750

[1] "GO:0006091.NA"  "GO:0008104.ISS" "GO:0008104.ISS" "GO:0006091.NA"  "GO:0006091.NAS"

[1] 0.47 0.00 0.47   NA   NA

[1] 0.56 0.00 0.56   NA   NA

[1] 1.08 0.00 1.09   NA   NA

[1] 1.03 0.00 1.03   NA   NA

[1] 0.49 0.00 0.48   NA   NA

[1] 0.02 0.00 0.01   NA   NA

R version 2.4.1 (2006-12-18) 
i386-pc-mingw32 

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

attached base packages:
[1] "splines"   "stats"     "graphics"  "grDevices" "datasets"  "utils"     "tools"     "methods"   "base"

other attached packages:
        rat2302 xlsReadWritePro          qvalue   affycoretools         biomaRt
       "1.14.0"         "1.0.6"         "1.8.0"         "1.6.0"         "1.8.1"
          RCurl             XML         GOstats        Category      genefilter
        "0.8-0"         "1.2-0"         "2.0.4"         "2.0.3"        "1.12.0"
       survival            KEGG            RBGL        annotate              GO
         "2.30"        "1.14.1"        "1.10.0"        "1.12.1"        "1.14.1"
          graph         RWinEdt           limma            affy          affyio
       "1.12.0"         "1.7-5"         "2.9.1"        "1.12.2"         "1.2.0"
        Biobase
       "1.12.2"

Mark W. Kimpel MD
(317) 490-5129 Work, & Mobile
(317) 663-0513 Home (no voice mail please)
1-(317)-536-2730 FAX


-----Original Message-----
From: Marc Schwartz [mailto:marc_schwartz at comcast.net] 
Sent: Wednesday, January 17, 2007 8:11 PM
To: Seth Falcon
Cc: Kimpel, Mark William; r-help at stat.math.ethz.ch
Subject: Re: [R] help with regexpr in gsub

On Wed, 2007-01-17 at 16:46 -0800, Seth Falcon wrote:


I think that some of the overhead here in using sub() is due to the
effective partitioning of the source vector, a more complex regex and
then just returning the first element.

This can be shortened to:

# Note that I have 12 elements here

 [1] "GO:0000001.ISS" "GO:0000002.ISS" "GO:0000003.ISS" "GO:0000004.ISS"
 [5] "GO:0000005.ISS" "GO:0000006.ISS" "GO:0000007.ISS" "GO:0000008.ISS"
 [9] "GO:0000009.ISS" "GO:0000010.ISS" "GO:0000011.ISS" "GO:0000012.ISS"

[1] 0 0 0 0 0

 [1] "GO:0000001" "GO:0000002" "GO:0000003" "GO:0000004" "GO:0000005"
 [6] "GO:0000006" "GO:0000007" "GO:0000008" "GO:0000009" "GO:0000010"
[11] "GO:0000011" "GO:0000012"


Which would appear to be quicker than using substr().
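
To illustrate the point about regex complexity, here is a small sketch contrasting a capture-group pattern with a simpler one that just deletes the suffix. Both patterns, and the way the 12-element vector is generated, are assumptions for illustration and are not the expressions posted in this thread.

```r
# Twelve IDs in the same form as the example vector above (generated, not copied)
ids <- sprintf("GO:%07d.ISS", 1:12)

# More complex form: match the whole string, capture the prefix,
# and substitute the backreference for the match
with.group <- sub("^(GO:[0-9]+)\\..*$", "\\1", ids)

# Simpler form: match only the dot and what follows, replace with nothing
simple <- sub("\\..*", "", ids)

identical(with.group, simple)
# [1] TRUE
head(simple, 3)
# [1] "GO:0000001" "GO:0000002" "GO:0000003"
```

With only 12 elements any timing difference is too small to measure reliably, so a much longer vector is needed to compare the two forms.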

HTH,

Marc Schwartz
