Help with text separation
6 messages · Michael Griffiths, PIKAL Petr, Sarah Goslee +1 more
Hi, On Mon, Nov 14, 2011 at 4:20 AM, Michael Griffiths
<griffiths at upstreamsystems.com> wrote:
Good morning R list,
My apologies if this has *already* been answered elsewhere, but I have not found
the answer that I am looking for.
I have a character string, i.e.
form<-c('~ A + B + C + C / D + E + E / F + G + H + I + J + K + L * M')
Now, my aim is to find the position of all those instances of '*' and to
remove said '*'. However, I would also like to remove the preceding
variable name before the '*', the math operator preceding this, and also
the variable name after the '*'. So, here I would like to remove '+L*M'
You just want to get rid of them? gsub() it is.
I've changed your formula a little bit to better demonstrate what's going on:
form<-c('~ A + B * C + C / D + E + E / F * G + H + I + J + K + L * M')
gsub(" \\+ [A-Z] \\* [A-Z]", "", form)
[1] "~ A + C / D + E + E / F * G + H + I + J + K"
That regular expression will take out the sequence: space, +, space, any capital letter, space, *, space, any capital letter. It will take out all occurrences of that sequence, but won't take out occurrences of * not in that sequence.
If you don't want the spaces, you don't need them. Just take them out of the regular expression as well.
Not that strsplit() was remotely the right tool here, but you can split into characters without a separator:
form <- 'abcd'
strsplit(form, '')
[[1]]
[1] "a" "b" "c" "d"
Sarah
So far, I have come up with the following code:
parts<-strsplit(form,' ')
index<-which(unlist(parts)=="*")
for (i in 1:length(index)){
  parts[[1]][index[i]]<-list(NULL)
  parts[[1]][index[i]+1]<-list(NULL)
  parts[[1]][index[i]-1]<-list(NULL)
  parts[[1]][index[i]-2]<-list(NULL)
}
new.form<-unlist(parts)
form<-new.form[0]
for (i in 1: length(new.form)){
  form<-paste(form,new.form[i], sep="")
}
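For comparison, the same token-based idea can be written without the two loops. This is only a sketch of the approach above, not a recommendation over a regular expression; it still assumes every token is separated by a single space, and the names `drop`, `keep` and `new_form` are introduced here for illustration.

```r
form <- '~ A + B + C + C / D + E + E / F + G + H + I + J + K + L * M'

# Split on single spaces; [[1]] pulls the character vector out of the list.
parts <- strsplit(form, ' ')[[1]]

# Positions of every "*" token.
index <- which(parts == '*')

# For each "*", mark the operator and name before it, the "*" itself,
# and the name after it (positions i-2 to i+1).
drop <- unlist(lapply(index, function(i) (i - 2):(i + 1)))

# Keep everything that is not marked. setdiff() also copes with the case
# where no "*" is present, whereas 1:length(index) would loop over c(1, 0).
keep <- setdiff(seq_along(parts), drop)
new_form <- paste(parts[keep], collapse = '')
print(new_form)
```

Using paste() with collapse = '' replaces the second loop, which glued the pieces together one at a time.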
However, as you can see, I have had to use strsplit in, what I consider a
rather clumsy manner, as the character string (form) has to be in a certain
format. All variables and maths operators require a space between them in
order for strsplit to work in the manner I require.
I would very much like to accomplish what the above code already does, but
without the need for the initial character string having the need for the
aforementioned spaces.
If the list can offer help, I would be most appreciative.
Yours
Mike Griffiths
Sarah Goslee http://www.functionaldiversity.org
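Sarah's remark about the spaces can be taken one step further: instead of choosing between a pattern with spaces and one without, `\\s*` (zero or more whitespace characters) makes them optional. The following is a sketch under that assumption only; the name `pattern` is introduced here for illustration, and variable names are still restricted to single capital letters, as in her example.

```r
# Two spellings of the same formula string: one spaced, one compact.
form <- c('~ A + B * C + C / D + E + E / F * G + H + I + J + K + L * M',
          '~A+B*C+C/D+E+E/F*G+H+I+J+K+L*M')

# "+ X * Y" with optional whitespace before and between the pieces.
pattern <- "\\s*\\+\\s*[A-Z]\\s*\\*\\s*[A-Z]"

# gsub() removes every occurrence; "F * G" survives because it follows "/".
result <- gsub(pattern, "", form)
print(result)
```

Both strings lose `+ B * C` and `+ L * M`, while `E / F * G` is kept because it is not preceded by a `+`.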
Hi
r-help-bounces at r-project.org wrote on 14.11.2011 14:54:05:
Thank you Sarah,
Your reply was very helpful. I have the added difficulty that I am not only
dealing with single A-Z characters, but quite often have the following
situation:
form<-c('~Sentence+LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help')
and again, I need to remove the +'CTA*help' part of the character string.
However, in another instance I may have
form<-c('~Sentence*LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help')
In this case I would need to remove 'Sentence*LEGAL+' from form.
Can this be accomplished in the same manner?
Hm. I am not at all an expert in regular expressions, but recently I
learned some ways (thanks, Uwe):
sub("^(~)\\+(.+)\\+$", "\\1\\2", gsub("[[:alnum:]]+\\*[[:alnum:]]+", "", form))
[1] "~Intro+Intro/Intro1++benefit+benefit/benefit1+product+action+mean"
This will remove all xxxxxx*yyyyy terms from your form, together with
the leading and trailing +.
I wonder if any automatic process can remove only one of several
xxxxxx*yyyyy substrings.
Regards
Petr
PS: it is still not perfect, as one extra + is left in the middle.
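The leftover `++` mentioned in the PS can be tidied in a few extra passes. This is a sketch, assuming that every xxxxxx*yyyyy term should go (which may be more than Mike wants) and that `+` is the only operator left dangling; the names `step1` to `step3` and `result` are illustrative.

```r
form <- '~Sentence*LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help'

# Remove every name*name term, leaving stray "+" signs behind.
step1 <- gsub("[[:alnum:]]+\\*[[:alnum:]]+", "", form)

# Collapse any run of "+" into a single "+".
step2 <- gsub("\\++", "+", step1)

# Drop a "+" directly after the "~" and a "+" at the very end.
step3 <- sub("^~\\+", "~", step2)
result <- sub("\\+$", "", step3)
print(result)
```

Splitting the clean-up into separate calls keeps each pattern short enough to read at a glance.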
Many thanks, once again, for your help
Mike Griffiths
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Hi, On Mon, Nov 14, 2011 at 8:54 AM, Michael Griffiths
<griffiths at upstreamsystems.com> wrote:
Thank you Sarah,
Your reply was very helpful. I have the added difficulty that I am not only
dealing with single A-Z characters, but quite often have the following
situation:
form<-c('~Sentence+LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help')
and again, I need to remove the +'CTA*help' part of the character string.
However, in another instance I may have
form<-c('~Sentence*LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help')
In this case I would need to remove 'Sentence*LEGAL+' from form.
Can this be accomplished in the same manner?
Regular expressions are *very* powerful, so yes. You should read a good intro to regular expressions, and pay careful attention to the word markers, then take a look at the specifics of R's implementation. Why do I send you to the help? Because the possible answers all look a lot like this:
form<-c('~Sentence*LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help')
gsub("\\+\\<\\w*\\>\\*\\<\\w*\\>", "", form)
[1] "~Sentence*LEGAL+Intro+Intro/Intro1+benefit+benefit/benefit1+product+action+mean"
Sarah
-- Sarah Goslee http://www.functionaldiversity.org
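In the pattern above, `\\<` and `\\>` mark the start and end of a word and `\\w` matches a word character, so it removes `+name*name` but leaves a term that comes straight after the `~`. A possible second pass for that leading case is sketched below. It assumes, hypothetically, that a leading `name*name+` should be dropped as well; the names `no_inner` and `result` are introduced for illustration, and `\\w+` is used instead of `\\w*` so that empty names are not matched.

```r
form <- '~Sentence*LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help'

# Pass 1: remove every "+name*name" term.
no_inner <- gsub("\\+\\w+\\*\\w+", "", form)

# Pass 2: remove a "name*name+" term that directly follows the "~".
result <- sub("^~\\w+\\*\\w+\\+", "~", no_inner)
print(result)
```

A formula consisting of a single `~name*name` term is not handled by this sketch, because there is no `+` on either side to anchor on.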
On Nov 14, 2011, at 4:20 AM, Michael Griffiths wrote:
Good morning R list,
My apologies if this has *already* been answered elsewhere, but I have
not found the answer that I am looking for.
I have a character string, i.e.
form<-c('~ A + B + C + C / D + E + E / F + G + H + I + J + K + L * M')
Now, my aim is to find the position of all those instances of '*'
and to remove said '*'. However, I would also like to remove the preceding
variable name before the '*', the math operator preceding this, and
also the variable name after the '*'. So, here I would like to remove
'+L*M'
This would be a very narrow implementation that requires the
+/spc/alnum/spc/*/alnum sequence exactly:
> sub("\\+*\\s*[[:alnum:]]*\\s*\\*.[[:alnum:]]*", "", form)
[1] "~ A + B + C + C / D + E + E / F + G + H + I + J + K "
This is a more general implementation using the "*" operator that
matches each of the preceding item 0 or more times.
form<-c('~ A + B + C + C / D + E + E / F + G + H + I + J + K + L * M',
'~ A + B + C + C / D + E + E / F + G + H + I + J + K + L*M',
'~ A + B + C + C / D + E + E / F + G + H + I + J + K +Llll*M'
)
> sub("\\+*\\s*[[:alnum:]]*\\s*\\*.[[:alnum:]]*", "", form)
[1] "~ A + B + C + C / D + E + E / F + G + H + I + J + K "
[2] "~ A + B + C + C / D + E + E / F + G + H + I + J + K "
[3] "~ A + B + C + C / D + E + E / F + G + H + I + J + K "
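Two details of the calls above are worth noting: sub() replaces only the first match in each string, and the `.` after `\\*` consumes one arbitrary character (the space in `L * M`, the `M` in `L*M`). A slightly stricter variant is sketched below, assuming that a term always starts with exactly one `+` and that both names have at least one character; `pattern` and `result` are illustrative names, and trimws() requires R 3.2.0 or later.

```r
form <- c('~ A + B + C + C / D + E + E / F + G + H + I + J + K + L * M',
          '~ A + B + C + C / D + E + E / F + G + H + I + J + K + L*M',
          '~ A + B + C + C / D + E + E / F + G + H + I + J + K +Llll*M')

# One "+", a name of one or more alphanumerics, "*", another name;
# whitespace is optional between the pieces.
pattern <- "\\+\\s*[[:alnum:]]+\\s*\\*\\s*[[:alnum:]]+"

# gsub() removes every match; trimws() drops the space left before the match.
result <- trimws(gsub(pattern, "", form))
print(result)
```

Replacing `.` with `\\s*` means the pattern no longer swallows a character it was not asked to match.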
---stripped out code---
David Winsemius, MD West Hartford, CT