Is there a better way to parse strings than this?

9 messages · Chris Howden, Dennis Murphy, Hadley Wickham +2 more

#
Hi Everyone,


I needed to parse some strings recently.

The code I've wound up using seems rather clunky, and I was wondering if
anyone had any suggestions on a better way?

Basically I do the following:

1) Use substr() to do the parsing
2) Use regexpr() to find the location of the string I want to parse on,
which I then pass on to substr()
3) Use nchar() as the stop input to substr() where necessary



I've got a simple example of the parsing code I used below. It takes
questionnaire variable names that include the question and the brand it
was answered for, and parses them so that the question and the brand are
in separate columns. I then use this to restructure the data from
unstacked to stacked, but that's another story.
[1] "A5.Brands.bought...Dulux"
[2] "A5.Brands.bought...Haymes"
[3] "A5.Brands.bought...Solver"
[4] "A5.Brands.bought...Taubmans.or.Bristol"
[5] "A5.Brands.bought...Wattyl"
[6] "A5.Brands.bought...Other"
[1] 17 17 17 17 17 17
attr(,"match.length")
[1] 3 3 3 3 3 3
[1] "A5.Brands.bought" "A5.Brands.bought" "A5.Brands.bought" "A5.Brands.bought"
[5] "A5.Brands.bought" "A5.Brands.bought"
[1] "Dulux"               "Haymes"              "Solver"
[4] "Taubmans.or.Bristol" "Wattyl"              "Other"
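
The commands that produced the output above did not survive in this archive copy. A minimal sketch of the substr()/regexpr()/nchar() approach described in the numbered steps, which may differ from the code originally posted, could look like this (the variable names strings, pos, question and brand are assumptions):

```r
# Hypothetical reconstruction; the original commands are missing from the archive.
strings <- c("A5.Brands.bought...Dulux",
             "A5.Brands.bought...Haymes",
             "A5.Brands.bought...Solver",
             "A5.Brands.bought...Taubmans.or.Bristol",
             "A5.Brands.bought...Wattyl",
             "A5.Brands.bought...Other")

# Locate the literal "..." separator in each string.
pos <- regexpr("...", strings, fixed = TRUE)
start <- as.integer(pos)            # position where the separator begins
len <- attr(pos, "match.length")    # length of the separator

# Everything before the separator is the question ...
question <- substr(strings, 1, start - 1)
# ... and everything after it, up to the end of the string, is the brand.
brand <- substr(strings, start + len, nchar(strings))

print(question)
print(brand)
```

With these inputs the separator starts at position 17 in every string and has length 3, which matches the regexpr() output shown above.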



Thanks for any and all suggestions


Chris Howden
Founding Partner
Tricky Solutions
Tricky Solutions 4 Tricky Problems
Evidence Based Strategic Development, IP Commercialisation and Innovation,
Data Analysis, Modelling and Training
(mobile) 0410 689 945
(fax / office) (+618) 8952 7878
chris at trickysolutions.com.au
#
On Wed, Apr 13, 2011 at 5:18 AM, Dennis Murphy <djmuser at gmail.com> wrote:
Or with stringr:

library(stringr)
str_split_fixed(strings, fixed("..."), n = 2)

# or maybe
str_match(strings, "(..).*\\.\\.\\.(.*)")

Hadley
#
On Wed, Apr 13, 2011 at 12:07 AM, Chris Howden
<chris at trickysolutions.com.au> wrote:
Try this:
+ "A5.Brands.bought...Solver")
     [,1]               [,2]
[1,] "A5.Brands.bought" "Dulux"
[2,] "A5.Brands.bought" "Haymes"
[3,] "A5.Brands.bought" "Solver"
                V1     V2
1 A5.Brands.bought  Dulux
2 A5.Brands.bought Haymes
3 A5.Brands.bought Solver
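
Most of the commands in this reply were lost in the archive copy; only the continuation line and the printed results remain. One way to obtain a two-column matrix and a data frame of this shape in base R is sketched below. It is an assumption, not necessarily what was posted, and the names x, parts, m and df are hypothetical.

```r
# Hypothetical sketch; the original commands are missing from the archive.
x <- c("A5.Brands.bought...Dulux",
       "A5.Brands.bought...Haymes",
       "A5.Brands.bought...Solver")

# Split each string on the literal "..." (fixed = TRUE turns off regex matching).
parts <- strsplit(x, "...", fixed = TRUE)

# Bind the pieces row-wise into a character matrix with two columns.
m <- do.call(rbind, parts)
print(m)

# A matrix without column names becomes a data frame with columns V1 and V2.
df <- as.data.frame(m, stringsAsFactors = FALSE)
print(df)
```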
1 day later
#
Thanks for the suggestions, they were all exactly what I was looking for.
(I knew there had to be a more elegant way than my brute-force method.)

One question though.

I was playing around with strsplit but couldn't get it to work. I realised
my problem was that I was using "." as the string.

I was trying strsplit(string,"\.\.\.") as per the suggestion in Venables
and Ripley's book to "(use '\.' to match '.')", which is in the Regular
expressions section.

I noticed that in the suggestions sent to me people used:
strsplit(test,"\\.\\.\\.")


Could anyone please explain why I should have used "\\.\\.\\." rather than
"\.\.\."?



Chris Howden


-----Original Message-----
From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
Sent: Wednesday, 13 April 2011 10:55 PM
To: Chris Howden
Cc: r-help at r-project.org
Subject: Re: [R] Is there a better way to parse strings than this?


--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
#
Basically,

 * you want to match .
 * so the regular expression you need is \.
 * and the way you represent that in a string in R is \\.

Hadley
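
The three steps above can be checked directly. The following is a small sketch using one of the variable names from the original question (the names x, unescaped and escaped are assumptions):

```r
x <- "A5.Brands.bought...Dulux"

# An unescaped "." is a regex metacharacter that matches any character,
# so every character is treated as a delimiter and only empty strings remain.
unescaped <- strsplit(x, ".")[[1]]
print(all(unescaped == ""))

# The regex \. matches a literal dot; in R source code it is written "\\.".
escaped <- strsplit(x, "\\.\\.\\.")[[1]]
print(escaped)
```

The second call returns the two pieces "A5.Brands.bought" and "Dulux".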
#
On Thu, Apr 14, 2011 at 8:28 PM, Chris Howden
<chris at trickysolutions.com.au> wrote:
"\\.\\.\\." is the string \.\.\.   For example, try this
\.\.\.
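
The command that printed the line above is missing from this archive copy. As a sketch, and not necessarily the command that was posted, writeLines() shows the characters a string actually contains, while print() shows it as it would be typed in source code (the name pattern is an assumption):

```r
pattern <- "\\.\\.\\."

# Six characters: three backslash-dot pairs.
print(nchar(pattern))

# writeLines() displays the raw contents of the string: \.\.\.
writeLines(pattern)

# print() displays the escaped, source-code form: "\\.\\.\\."
print(pattern)
```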
#
not everything has to be done in R.

awk and sed are some of the best tools on a linux/unix box.

quick refs:
http://www.pement.org/awk/awk1line.txt
http://sed.sourceforge.net/sed1line.txt

-Whit


On Wed, Apr 13, 2011 at 12:07 AM, Chris Howden
<chris at trickysolutions.com.au> wrote:
3 days later
#
Thanks for the explanation,

I think I understand it now. So to paraphrase all your explanations:

To match a literal "..." the regular expression \.\.\. needs to be passed
to the function; each \ escapes the special meaning of ".". But in order
to get a \ into an R string I also need to escape its special meaning, so
in code I need to write "\\.\\.\\."
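
A short sketch of this summary, assuming one of the variable names from the original question as input: R's parser rejects "\." outright, and there are a couple of ways to avoid the double backslashes altogether (the names res, x, via_escape, via_fixed and via_class are assumptions).

```r
# The source text "\." is not a valid R string: the parser stops with an
# "unrecognized escape" error, so the code never reaches the regex engine.
res <- tryCatch(parse(text = '"\\."'),
                error = function(e) conditionMessage(e))
print(is.character(res))   # TRUE means parsing failed and we caught the message

x <- "A5.Brands.bought...Dulux"

# Three equivalent ways to split on a literal "...":
via_escape <- strsplit(x, "\\.\\.\\.")[[1]]          # escaped dots
via_fixed  <- strsplit(x, "...", fixed = TRUE)[[1]]  # no regex at all
via_class  <- strsplit(x, "[.]{3}")[[1]]             # dot inside a character class

print(identical(via_escape, via_fixed) && identical(via_fixed, via_class))
```

On R 4.0.0 or later, a raw string such as r"(\.\.\.)" is another way to write the pattern without doubling the backslashes.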



Chris Howden


-----Original Message-----
From: h.wickham at gmail.com [mailto:h.wickham at gmail.com] On Behalf Of Hadley
Wickham
Sent: Friday, 15 April 2011 11:07 AM
To: Chris Howden
Cc: r-help at r-project.org
Subject: Re: [R] Is there a better way to parse strings than this?