removing characters from a string
8 messages · Vivek Rao, Marc Schwartz, John Fox +5 more
On Tue, 2005-04-12 at 05:54 -0700, Vivek Rao wrote:
Is there a simple way in R to remove all characters from a string other than those in a specified set? For example, I want to keep only the digits 0-9 in a string. In general, I have found the string handling abilities of R a bit limited. (Of course it's great for stats in general). Is there a good reference on this? Or should R programmers dump their output to a text file and use something like Perl or Python for sophisticated text processing? I am familiar with the basic functions such as nchar, substring, as.integer, print, cat, sprintf etc.
Something like the following should work:
x <- paste(sample(c(letters, LETTERS, 0:9), 50, replace = TRUE),
           collapse = "")
x
[1] "QvuuAlSJYUFpUpwJomtCir8TfvNQyV6O7W7TlXSXlLHocCdtnV"
gsub("[^0-9]", "", x)
[1] "8677"
The use of gsub() here replaces every character NOT in 0-9 with "", leaving only the digits. See ?gsub for more information.
HTH,
Marc Schwartz
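The original question asked about an arbitrary set of characters, not only digits. As a minimal sketch of how the same negated character class could be generalized, the helper below builds the pattern from a bracket-expression body. The name keep_chars and its set argument are hypothetical, introduced here for illustration only; they are not part of base R.

```r
# Hypothetical helper (not part of base R): keep only the characters that
# belong to the bracket-expression body given in `set`, e.g. "0-9".
keep_chars <- function(x, set = "0-9") {
  pattern <- paste0("[^", set, "]")  # negated class: anything NOT in the set
  gsub(pattern, "", x)               # delete every such character
}

keep_chars("XKa0&*1jk2")             # "012"
keep_chars(c("a1b2", "c3"), "0-9")   # "12" "3" (gsub is vectorised over x)
keep_chars("XKa0&*1jk2", "A-Za-z")   # letters only
```

Because set is pasted straight into a bracket expression, characters that are special inside brackets (such as ] or a leading ^) would need care; this sketch assumes simple ranges and literals.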
Dear Vivek,
Actually, I think R has reasonably good facilities for manipulating strings.
See ?gsub etc.; for example:
gsub("[^0-9]", "", "XKa0&*1jk2")
[1] "012"
I hope this helps,
John
--------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario
Canada L8S 4M4
905-525-9140x23604
http://socserv.mcmaster.ca/jfox
--------------------------------
"Vivek" == Vivek Rao <rvivekrao at yahoo.com>
on Tue, 12 Apr 2005 05:54:55 -0700 (PDT) writes:
Vivek> Is there a simple way in R to remove all characters
Vivek> from a string other than those in a specified set? For
Vivek> example, I want to keep only the digits 0-9 in a
Vivek> string.
Vivek> In general, I have found the string handling abilities
Vivek> of R a bit limited. (Of course it's great for stats in
Vivek> general). Is there a good reference on this? Or should
Vivek> R programmers dump their output to a text file and use
Vivek> something like Perl or Python for sophisticated text
Vivek> processing?
Vivek> I am familiar with the basic functions such as nchar,
Vivek> substring, as.integer, print, cat, sprintf etc.
It depends on your "etc":
The above is pretty trivial using gsub(),
but since you sound sophisticated enough to proclaim missing R
abilities, I leave the exercise to you.
Martin
Look at ?gsub, e.g.,
string <- "ab03def10-523rtf"
string
gsub("[^0-9]", "", string)  # keep only the digits
gsub("[0-9]", "", string)   # remove the digits instead
I hope it helps.
Best,
Dimitris
----
Dimitris Rizopoulos
Ph.D. Student
Biostatistical Centre
School of Public Health
Catholic University of Leuven
Address: Kapucijnenvoer 35, Leuven, Belgium
Tel: +32/16/336899
Fax: +32/16/337015
Web: http://www.med.kuleuven.ac.be/biostat/
http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm
______________________________________________ R-help at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Hi
Try
gsub("[^0-9]","","1111af-456utaDFasswe34534%^&%*$h567890ersdfg")
[1] "111145634534567890"
HTH
rksh
--
Robin Hankin
Uncertainty Analyst
Southampton Oceanography Centre
European Way, Southampton SO14 3ZH, UK
tel 023-8059-7743
Using help.start() and searching on keyword "character" or using help.search(keyword="character") will show you what you have missed. As others have pointed out, you have missed the power of regular expressions (despite that being how these things are done in Perl). Also, strsplit() can be very powerful.
On Tue, 12 Apr 2005, Vivek Rao wrote:
Is there a simple way in R to remove all characters from a string other than those in a specified set? For example, I want to keep only the digits 0-9 in a string. In general, I have found the string handling abilities of R a bit limited.
Your exploration of them seems more than a bit limited.
(Of course it's great for stats in general). Is there a good reference on this? Or should R programmers dump their output to a text file and use something like Perl or Python for sophisticated text processing? I am familiar with the basic functions such as nchar, substring, as.integer, print, cat, sprintf etc.
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
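The message above mentions strsplit() as an alternative to regular expressions. One way this might be applied to the original question is sketched below: split the string into single characters and keep those found in an explicit set. The variable names (allowed, chars, kept) are illustrative only.

```r
x <- "ab03def10-523rtf"

allowed <- as.character(0:9)     # explicit set of characters to keep
chars   <- strsplit(x, "")[[1]]  # split into single characters
kept    <- paste(chars[chars %in% allowed], collapse = "")
kept                             # "0310523"
```

Compared with gsub(), the set here is an ordinary character vector, so it can hold characters such as "]" or "^" without any regular-expression escaping. For long vectors of strings the gsub() approach is likely the more convenient one.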
Martin Maechler wrote:
It depends on your "etc":
The above is pretty trivial using gsub(),
but since you sound sophisticated enough to proclaim missing R
abilities, I leave the exercise to you.
Part of the problem here is our help system. gsub is documented within the grep topic, so when you look at the keyword==character topics, you don't see it explicitly. (You do see "pattern matching and replacement", which should have been a hint.) And if you were looking for "string handling" under the programming category, you're completely out of luck.
Another reason some people might see R's string handling as limited is that it is sometimes more cumbersome to manipulate strings in R than in other languages. For example, I vaguely recall that there's a good reason why R doesn't use "+" to concatenate strings, but I can't remember what it is. And sometimes I'd like to strip whitespace or pad things to a given width; I generally need to define my own functions to do that each time. R is capable of concatenation, stripping and padding, but is sometimes a little obscure in how it does them.
Duncan Murdoch
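For readers wondering how the three operations mentioned above (concatenation, stripping and padding) can be done with base functions, here is a minimal sketch. The variable names are illustrative, and this is only one of several possible approaches, not necessarily the one the poster had in mind.

```r
# Concatenation: paste() with an empty separator
joined <- paste("abc", "def", sep = "")
joined         # "abcdef"

# Stripping leading and trailing whitespace with a POSIX character class
raw      <- "  some text \t"
stripped <- gsub("^[[:space:]]+|[[:space:]]+$", "", raw)
stripped       # "some text"

# Padding to a fixed width of 8 with sprintf()
left_padded  <- sprintf("%8s", "42")   # right-justified, spaces on the left
right_padded <- sprintf("%-8s", "42")  # left-justified, spaces on the right
nchar(left_padded)                     # 8
```

More recent versions of R also provide trimws() for the stripping step, which was not available when this thread was written.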