Is there a better way to parse strings than this?
Thanks for the suggestions, they were all exactly what I was looking for. (I knew there had to be a more elegant way than my brute force method.)

One question though. I was playing around with strsplit but couldn't get it to work; I realised my problem was that I was using "." as the string. I was trying strsplit(string,"\.\.\.") as per the suggestion in Venables and Ripley's book to "(use '\.' to match '.')", which is in the Regular expressions section. I noticed that in the suggestions sent to me people used:

strsplit(test,"\\.\\.\\.")

Could anyone please explain why I should have used "\\.\\.\\." rather than "\.\.\."?

Chris Howden
Founding Partner
Tricky Solutions
Tricky Solutions 4 Tricky Problems
Evidence Based Strategic Development, IP Commercialisation and Innovation, Data Analysis, Modelling and Training
(mobile) 0410 689 945
(fax / office) (+618) 8952 7878
chris at trickysolutions.com.au

-----Original Message-----
From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
Sent: Wednesday, 13 April 2011 10:55 PM
To: Chris Howden
Cc: r-help at r-project.org
Subject: Re: [R] Is there a better way to parse strings than this?

On Wed, Apr 13, 2011 at 12:07 AM, Chris Howden <chris at trickysolutions.com.au> wrote:
Hi Everyone,

I needed to parse some strings recently. The code I've wound up using seems rather clunky, and I was wondering if anyone had any suggestions on a better way?

Basically I do the following:

1) Use substr() to do the parsing
2) Use regexpr() to find the location of the string I want to parse on, which I then pass on to substr()
3) Use nchar() as the stop input to substr() where necessary

I've got a simple example of the parsing code I used below. It takes questionnaire variable names that include the question and the brand it was answered for, and then parses them so the variable name and the brand are in separate columns. I then use this to restructure the data from unstacked to stacked, but that's another story.
# this is the data set
test
[1] "A5.Brands.bought...Dulux"
[2] "A5.Brands.bought...Haymes"
[3] "A5.Brands.bought...Solver"
[4] "A5.Brands.bought...Taubmans.or.Bristol"
[5] "A5.Brands.bought...Wattyl"
[6] "A5.Brands.bought...Other"
# Where do I want to parse?
break1 <- regexpr('...', test, fixed=TRUE)
break1
[1] 17 17 17 17 17 17
attr(,"match.length")
[1] 3 3 3 3 3 3
# Put Variable name in a variable
str1 <- substr(test,1,break1-1)
str1
[1] "A5.Brands.bought" "A5.Brands.bought" "A5.Brands.bought" "A5.Brands.bought"
[5] "A5.Brands.bought" "A5.Brands.bought"
# Put Brand name in a variable
str2 <- substr(test,break1+3, nchar(test))
str2
[1] "Dulux"               "Haymes"              "Solver"
[4] "Taubmans.or.Bristol" "Wattyl"              "Other"
Try this:
x <- c("A5.Brands.bought...Dulux", "A5.Brands.bought...Haymes",
       "A5.Brands.bought...Solver")
do.call(rbind, strsplit(x, "...", fixed = TRUE))
     [,1]               [,2]
[1,] "A5.Brands.bought" "Dulux"
[2,] "A5.Brands.bought" "Haymes"
[3,] "A5.Brands.bought" "Solver"
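On the "\\." versus "\." question raised in this thread, here is a minimal sketch of what seems to be going on, assuming default R string parsing. The backslash is an escape character in R string literals as well as in regular expressions, so "\\." in source code becomes the two characters \. by the time the regex engine sees it, which is the form the book describes. A bare "\." is rejected by the R parser as an unrecognized escape before strsplit is ever called. The names parts_fixed and parts_regex are just illustrative.

```r
x <- c("A5.Brands.bought...Dulux", "A5.Brands.bought...Haymes",
       "A5.Brands.bought...Solver")

# Literal match: fixed = TRUE turns off regex syntax, so no escaping is needed.
parts_fixed <- do.call(rbind, strsplit(x, "...", fixed = TRUE))

# Regex match: each "\\." in the source is the two characters \. once parsed,
# which the regex engine reads as "a literal dot".
parts_regex <- do.call(rbind, strsplit(x, "\\.\\.\\."))

# Both calls split at the same place.
same_result <- identical(parts_fixed, parts_regex)

# The parsed pattern is six characters long: \.\.\.
pattern_length <- nchar("\\.\\.\\.")
```

With fixed = TRUE there is nothing to escape at all, which is probably why it is the simpler choice when the separator is a plain string.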
# or
xa <- sub("...", "\1", x, fixed = TRUE)
read.table(textConnection(xa), sep = "\1", as.is = TRUE)
                V1     V2
1 A5.Brands.bought  Dulux
2 A5.Brands.bought Haymes
3 A5.Brands.bought Solver

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
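A note on the "\1" in the second approach, as a hedged sketch rather than a statement of the poster's intent: in an R string literal "\1" is an octal escape for the single character with code 1, not a regex backreference (that would be written "\\1"). It appears to be used here as a separator that is unlikely to occur in the data. The names sep_char and df are illustrative, and read.table(text = ...) is swapped in for textConnection(), assuming an R version that has the text argument.

```r
x <- c("A5.Brands.bought...Dulux", "A5.Brands.bought...Haymes",
       "A5.Brands.bought...Solver")

# "\1" is one character (code point 1), used as a throwaway separator.
sep_char <- "\1"
sep_code <- utf8ToInt(sep_char)

# Replace only the first literal "..." in each string with the separator.
xa <- sub("...", sep_char, x, fixed = TRUE)

# Read the separated strings into a two-column data frame (V1, V2).
df <- read.table(text = xa, sep = sep_char, as.is = TRUE)
```

Because sub() replaces only the first match, later dots in a brand name such as "Taubmans.or.Bristol" are left alone.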