Help with text separation

Hi,

On Mon, Nov 14, 2011 at 4:20 AM, Michael Griffiths wrote:
Good morning R list,

My apologies if this has *already* been answered elsewhere, but I have not found
the answer that I am looking for.

I have a character string, i.e.

form<-c('~ A + B + C + C / D + E + E / F + G + H + I + J + K + L * M')

Now, my aim is to find the position of all those instances of '*' and to
remove said '*'. However, I would also like to remove the preceding
variable name before the '*', the math operator preceding this, and also
the variable name after the '*'. So, here I would like to remove '+L*M'

You just want to get rid of them? gsub() it is.

I've changed your formula a little bit to better demonstrate what's going on:
form<-c('~ A + B * C + C / D + E + E / F * G + H + I + J + K + L * M')
gsub(" \\+ [A-Z] \\* [A-Z]", "", form)
[1] "~ A + C / D + E + E / F * G + H + I + J + K"

That regular expression will take out a
space
+
space
any capital letter
space
*
space
any capital letter.

It will take out all occurrences of that sequence, but won't take out
occurrences of * not in that sequence.

If you don't want the spaces, you don't need them. Just take them out
of the regular expression as well.
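
A minimal sketch of that suggestion, assuming the variable names are still single capital letters as in the example above. Rather than deleting the spaces from the pattern outright, this version makes them optional with \\s*, so one pattern covers both layouts. The helper name strip_star_terms and the compact test string are made up for illustration.

```r
# Hypothetical helper (not from the thread): drop every "+ X * Y" term,
# whether or not the operators are surrounded by spaces.
strip_star_terms <- function(x) {
  # \\s* = optional whitespace; [A-Z] = exactly one capital letter
  gsub("\\s*\\+\\s*[A-Z]\\s*\\*\\s*[A-Z]", "", x)
}

spaced  <- '~ A + B * C + C / D + E + E / F * G + H + I + J + K + L * M'
compact <- '~A+B*C+C/D+E+E/F*G+H+I+J+K+L*M'

strip_star_terms(spaced)
# [1] "~ A + C / D + E + E / F * G + H + I + J + K"
strip_star_terms(compact)
# [1] "~A+C/D+E+E/F*G+H+I+J+K"
```

As with the original pattern, 'F * G' survives because it is preceded by '/' rather than '+'.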

Not that strsplit() was remotely the right tool here, but you can
split into characters without a separator:
form <- 'abcd'
strsplit(form, '')
[[1]]
[1] "a" "b" "c" "d"

Sarah
So far, I have come up with the following code:

parts<-strsplit(form,' ')
index<-which(unlist(parts)=="*")
for (i in seq_along(index)){
   # blank out the '*', the names on either side, and the preceding operator
   parts[[1]][index[i]]<-list(NULL)
   parts[[1]][index[i]+1]<-list(NULL)
   parts[[1]][index[i]-1]<-list(NULL)
   parts[[1]][index[i]-2]<-list(NULL)
}
new.form<-unlist(parts)

# paste the remaining pieces back together without separators
form<-new.form[0]
for (i in seq_along(new.form)){
   form<-paste(form,new.form[i], sep="")
}

However, as you can see, I have had to use strsplit in what I consider a
rather clumsy manner, as the character string (form) has to be in a certain
format: all variables and maths operators require a space between them in
order for strsplit to work in the manner I require.

I would very much like to accomplish what the above code already does, but
without requiring the initial character string to contain the
aforementioned spaces.

If the list can offer help, I would be most appreciative.

Yours

Mike Griffiths

Sarah Goslee
http://www.functionaldiversity.org
Hi

r-help-bounces at r-project.org wrote on 14.11.2011 14:54:05:
Thank you Sarah,

Your reply was very helpful. I have the added difficulty that I am not only
dealing with single A-Z characters, but quite often have the following
situation:

form<-c('~Sentence+LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help')

and again, I need to remove the '+CTA*help' part of the character string.
However, in another instance I may have

form<-c('~Sentence*LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help')

In this case I would need to remove 'Sentence*LEGAL+' from form.

Can this be accomplished in the same manner?

Hm. I am not at all an expert in regular expressions, but recently I
learned some ways (thanks, Uwe):

sub("^(~)\\+(.+)\\+$", "\\1\\2", gsub("[[:alnum:]]+\\*[[:alnum:]]+", "", form))
[1] "~Intro+Intro/Intro1++benefit+benefit/benefit1+product+action+mean"

Applied to the second form above, this removes every xxxxxx*yyyyy term,
together with a leading + (right after the ~) and a trailing + at the end
of the string.

I wonder whether any automatic process can remove only one of several
xxxxxx*yyyyy substrings.
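
One way to remove only one of several such terms, sketched under the assumption that terms are alphanumeric names joined by '*' and that the term to drop is chosen by its position (first, second, ...). The helper name drop_nth_term is hypothetical; gregexpr() and regmatches() are base R functions.

```r
# Hypothetical helper: remove only the k-th "name*name" term,
# together with the "+" directly in front of it (if there is one).
drop_nth_term <- function(x, k) {
  m    <- gregexpr("\\+?[[:alnum:]]+\\*[[:alnum:]]+", x)
  hits <- regmatches(x, m)[[1]]      # all matched terms, in order
  stopifnot(k >= 1, k <= length(hits))
  hits[k] <- ""                      # blank out the chosen term only
  regmatches(x, m) <- list(hits)     # write the matches back into x
  x
}

form <- '~Sentence*LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help'
drop_nth_term(form, 3)
# [1] "~Sentence*LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean"
```

Dropping the first term (k = 1) would leave "~+Intro...", because its "+" follows the term rather than preceding it, so that case needs one more clean-up step.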

Regards
Petr

PS: It is still not perfect, as one extra + is left in the middle.
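
The leftover "++" can be tidied in a separate pass. This is only a sketch, assuming that every xxxxxx*yyyyy term should go and that the remaining terms are separated by single + signs; it does not handle a term that directly follows a '/'.

```r
# Sketch: remove every "name*name" term, then tidy the leftover "+" signs.
form <- '~Sentence*LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help'

out <- gsub("[[:alnum:]]+\\*[[:alnum:]]+", "", form)  # drop the terms
out <- gsub("\\++", "+", out)                          # collapse "++" to "+"
out <- sub("^~\\+", "~", out)                          # no "+" right after "~"
out <- sub("\\+$", "", out)                            # no trailing "+"
out
# [1] "~Intro+Intro/Intro1+benefit+benefit/benefit1+product+action+mean"
```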
Many thanks, once again, for your help

Mike Griffiths

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide 
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Hi,

On Mon, Nov 14, 2011 at 8:54 AM, Michael Griffiths wrote:
Thank you Sarah,

Your reply was very helpful. I have the added difficulty that I am not only
dealing with single A-Z characters, but quite often have the following
situation:

form<-c('~Sentence+LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help')

and again, I need to remove the '+CTA*help' part of the character string.
However, in another instance I may have

form<-c('~Sentence*LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help')

In this case I would need to remove 'Sentence*LEGAL+' from form.

Can this be accomplished in the same manner?

Regular expressions are *very* powerful, so yes. You should read a good
intro to regular expressions, and pay careful attention to the word-boundary
markers (\< and \>), then take a look at the specifics of R's implementation.

Why do I send you to the help? Because the possible answers all look a
lot like this:
form<-c('~Sentence*LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help')
gsub("\\+\\<\\w*\\>\\*\\<\\w*\\>", "", form)
[1] "~Sentence*LEGAL+Intro+Intro/Intro1+benefit+benefit/benefit1+product+action+mean"

Sarah
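
The pattern above leaves 'Sentence*LEGAL' in place because that term follows '~' rather than '+'. A possible two-step sketch that also handles a leading term, assuming every name*name term should be removed; the helper name drop_star_terms is hypothetical and not from the thread.

```r
# Hypothetical helper: remove every "name*name" term, including one
# that comes first, directly after the "~".
drop_star_terms <- function(x) {
  x <- gsub("\\+\\w+\\*\\w+", "", x)   # terms preceded by "+"
  sub("^~\\w+\\*\\w+\\+?", "~", x)     # a leading term and the "+" after it
}

form1 <- '~Sentence+LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help'
form2 <- '~Sentence*LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help'

drop_star_terms(form1)
# [1] "~Sentence+LEGAL+Intro+Intro/Intro1+benefit+benefit/benefit1+product+action+mean"
drop_star_terms(form2)
# [1] "~Intro+Intro/Intro1+benefit+benefit/benefit1+product+action+mean"
```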

Good morning R list,

My apologies if this has *already* been answered elsewhere, but I have not found
the answer that I am looking for.

I have a character string, i.e.

form<-c('~ A + B + C + C / D + E + E / F + G + H + I + J + K + L * M')

Now, my aim is to find the position of all those instances of '*' and to
remove said '*'. However, I would also like to remove the preceding
variable name before the '*', the math operator preceding this, and also
the variable name after the '*'. So, here I would like to remove '+L*M'

This would be a very narrow implementation that requires the
+/spc/alnum/spc/*/alnum sequence exactly:

 > sub("\\+*\\s*[[:alnum:]]*\\s*\\*.[[:alnum:]]*", "", form)
[1] "~ A + B + C + C / D + E + E / F + G + H + I + J + K "

This is a more general implementation using the "*" operator, which
matches the preceding item 0 or more times.

  form<-c('~ A + B + C + C / D + E + E / F + G + H + I + J + K + L * M',
          '~ A + B + C + C / D + E + E / F + G + H + I + J + K + L*M',
          '~ A + B + C + C / D + E + E / F + G + H + I + J + K +Llll*M')
 > sub("\\+*\\s*[[:alnum:]]*\\s*\\*.[[:alnum:]]*", "", form)
[1] "~ A + B + C + C / D + E + E / F + G + H + I + J + K "
[2] "~ A + B + C + C / D + E + E / F + G + H + I + J + K "
[3] "~ A + B + C + C / D + E + E / F + G + H + I + J + K "
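
To see the difference between an exact-sequence pattern and the quantifier-based one, here is a small comparison on the same three strings. The strict pattern is an assumption written for illustration, not taken from the message; the flexible pattern is the one used above.

```r
form <- c('~ A + B + C + C / D + E + E / F + G + H + I + J + K + L * M',
          '~ A + B + C + C / D + E + E / F + G + H + I + J + K + L*M',
          '~ A + B + C + C / D + E + E / F + G + H + I + J + K +Llll*M')

# Hypothetical strict pattern: "+", space, one character, space, "*", space, one character
strict   <- "\\+ [[:alnum:]] \\* [[:alnum:]]"
# Flexible pattern from the message above: every piece except the "*" is optional
flexible <- "\\+*\\s*[[:alnum:]]*\\s*\\*.[[:alnum:]]*"

grepl(strict, form)
# [1]  TRUE FALSE FALSE
grepl(flexible, form)
# [1] TRUE TRUE TRUE

# trimws() (base R) drops the trailing space that the substitution leaves behind
trimws(sub(flexible, "", form))
# [1] "~ A + B + C + C / D + E + E / F + G + H + I + J + K"
# [2] "~ A + B + C + C / D + E + E / F + G + H + I + J + K"
# [3] "~ A + B + C + C / D + E + E / F + G + H + I + J + K"
```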

---stripped out code---
David Winsemius, MD
West Hartford, CT