
string-to-number

11 messages · Charles Annis, P.E., Marc Schwartz, Peter Dalgaard +4 more

#
Greetings, Amigos:

I have been trying without success to convert a character string,
[1] "3,6,10"

into c(3,6,10) for subsequent use.

as.numeric(repeated.measures.columns) doesn't work (likely because of the
commas)
[1] NA
Warning message:
NAs introduced by coercion

I've tried many things including 
strsplit(repeated.measures.columns, split = ",")

which produces a list with only one element, viz:
[[1]]
[1] "3"  "6"  "10"

as.numeric() doesn't like that either.

Clearly: 1) I cannot be the first person to attempt this, and 2) I've made
this WAY harder than it is.

Would some kind soul please instruct me (and perhaps subsequent searchers)
how to convert the elements of a string into numbers?

Thank you.


Charles Annis, P.E.

Charles.Annis at StatisticalEngineering.com
phone: 561-352-9699
eFax:  614-455-3265
http://www.StatisticalEngineering.com
#
On Sat, 2006-08-19 at 07:58 -0400, Charles Annis, P.E. wrote:
One more step:

> as.numeric(unlist(strsplit(repeated.measures.columns, ",")))
[1]  3  6 10

Use unlist() to take the output of strsplit() and convert it to a
vector, before coercing to numeric.

HTH,

Marc Schwartz
#
"Charles Annis, P.E." <Charles.Annis at statisticalengineering.com> writes:
3) you're almost there, just not realizing it:

> x <- "3,6,10"
> as.numeric(strsplit(x, ",")[[1]])
[1]  3  6 10

or for that matter

> scan(textConnection(x), sep=",")
Read 3 items
[1]  3  6 10

although that leaves you with a dangling open connection.
#
On Sat, 19 Aug 2006, Charles Annis, P.E. wrote:

            
repeated.measures.columns is a vector. Consider:

repeated.measures.columns <- c("3,6,10", "5,4,9")
lst <- strsplit(repeated.measures.columns, split = ",")
lapply(lst, as.numeric)

which is why strsplit() returns a list - one list component for each 
repeated.measures.columns element. Just pick off the one you want with 
[[]]:

as.numeric(strsplit(repeated.measures.columns, split = ",")[[1]])
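The difference between unlist() and [[1]] only shows up when the input holds more than one string. A minimal sketch, assuming the two-string input from the example above (the names cols, pieces, all_values and first_only are illustrative, not from the thread):

```r
# Illustrative input with two strings (same values as the example above)
cols <- c("3,6,10", "5,4,9")
pieces <- strsplit(cols, split = ",")

# unlist() flattens every list component into one long vector
all_values <- as.numeric(unlist(pieces))

# [[1]] keeps only the values that came from the first string
first_only <- as.numeric(pieces[[1]])

print(all_values)
print(first_only)
```

With a single input string the two forms give the same result, which is why either works for the original question.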

  
    
#
On Sat, 19 Aug 2006, Marc Schwartz wrote:

            
Or, more simply, use [[1]] as in

as.numeric(strsplit(repeated.measures.columns, ",")[[1]])

Also,

eval(parse(text=paste("c(", repeated.measures.columns, ")")))

looks competitive, and is quite a bit more general (e.g. allows spaces, 
works with complex numbers), or you can use scan() from an anonymous file 
or a textConnection.
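
As a rough illustration of the "more general" point, the sketch below feeds a string with spaces and a string with a complex number through the parse/eval route. The helper name parse_vec and both input strings are assumptions made for this example; they do not appear in the thread.

```r
# Hypothetical helper wrapping the eval(parse()) idiom shown above.
# Note: eval(parse()) runs whatever R code the string contains,
# so this is only suitable for trusted input.
parse_vec <- function(s) eval(parse(text = paste("c(", s, ")")))

spaced_str <- "3, 6, 10"   # illustrative input with spaces
cplx_str   <- "1+2i,3"     # illustrative input with a complex value

spaced_vals <- parse_vec(spaced_str)
cplx_vals   <- parse_vec(cplx_str)

# For comparison: as.numeric() cannot convert "1+2i", so it yields NA
# (with a coercion warning, suppressed here)
split_vals <- suppressWarnings(as.numeric(strsplit(cplx_str, ",")[[1]]))

print(spaced_vals)
print(cplx_vals)
print(split_vals)
```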
#
On Sat, 2006-08-19 at 13:30 +0100, Prof Brian Ripley wrote:
I would say more than competitive:

> repeated.measures.columns <- paste(1:100000, collapse = ",")
> str(repeated.measures.columns)
 chr "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,4"| __truncated__

> system.time(res1 <-
    as.numeric(unlist(strsplit(repeated.measures.columns, ","))))
[1] 24.238  0.192 26.200  0.000  0.000

> system.time(res2 <- as.numeric(strsplit(repeated.measures.columns,
    ",")[[1]]))
[1] 24.313  0.196 26.471  0.000  0.000

> system.time(res3 <- eval(parse(text=paste("c(",
    repeated.measures.columns, ")"))))
[1] 0.328 0.004 0.395 0.000 0.000

> str(res1)
 num [1:100000] 1 2 3 4 5 6 7 8 9 10 ...
> str(res2)
 num [1:100000] 1 2 3 4 5 6 7 8 9 10 ...
> str(res3)
 num [1:100000] 1 2 3 4 5 6 7 8 9 10 ...

> all(res1 == res2)
[1] TRUE
> all(res1 == res3)
[1] TRUE


Best regards,

Marc
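
For anyone wanting to repeat the comparison, here is a minimal sketch of the same three-way timing. The names n, s, r1, r2, r3 and the t_* objects are assumptions for this example, and the input is deliberately smaller than the 100000 values used above. Timings depend heavily on the R version and machine, so the figures above should not be expected to reproduce.

```r
# Smaller input than in the post above, to keep the run short
n <- 10000
s <- paste(seq_len(n), collapse = ",")

# Time each of the three approaches discussed in the thread
t_unlist <- system.time(r1 <- as.numeric(unlist(strsplit(s, ","))))
t_index  <- system.time(r2 <- as.numeric(strsplit(s, ",")[[1]]))
t_parse  <- system.time(r3 <- eval(parse(text = paste("c(", s, ")"))))

# Show user, system and elapsed times side by side
print(rbind(unlist = t_unlist, index = t_index, parse = t_parse)[, 1:3])

# All three approaches should return the same values
stopifnot(all(r1 == r2), all(r1 == r3))
```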
#
Much gratitude to Professor Ripley, Peter Dalgaard, Marc Schwartz, and Roger
Bivand. 
__________________

Roger Bivand wrote that ... strsplit() returns a list - one list component
for each repeated.measures.columns element. Just pick off the one you want
with [[]]:

as.numeric(strsplit(repeated.measures.columns, split = ",")[[1]])

which had stumped me, since that syntax fails without the [[1]]
specification.
__________________
Peter Dalgaard, who also suggested the [[1]] specification, pointed out that

scan(textConnection(x), sep=",")

will work, although that leaves you with a dangling open connection.
__________________
Marc Schwartz advised to ...
Use unlist() to take the output of strsplit() and convert it to a vector,
before coercing to numeric.

as.numeric(unlist(strsplit(repeated.measures.columns, ",")))
____________________________________
Brian D. Ripley suggested that the following looks competitive, and is quite
a bit more general (e.g. allows spaces, works with complex numbers)
 
eval(parse(text=paste("c(", repeated.measures.columns, ")")))

and Marc Schwartz showed that Professor Ripley's suggestion is much faster
than the competition with some system.time trials.
____________________________________

Many thanks to all.
 

Charles Annis, P.E.

Charles.Annis at StatisticalEngineering.com
phone: 561-352-9699
eFax:  614-455-3265
http://www.StatisticalEngineering.com
 

#
On 8/19/06, Charles Annis, P.E.
<Charles.Annis at statisticalengineering.com> wrote:
You do this:

scan(textConnection(x), sep = ",")
closeAllConnections()

Now the following shows that none are open:

showConnections()

Alternatively, you could close it explicitly:

scan(con <- textConnection(x), sep = ",")
close(con)
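
Two further ways to avoid a dangling connection, offered as a sketch rather than as anything proposed in the thread: scan() has a text argument in later R versions that takes the string directly, and a small wrapper can close the connection with on.exit(). The helper name scan_string is hypothetical.

```r
x <- "3,6,10"

# scan() can read from a string through its text argument,
# so no connection has to be managed by the caller
vals <- scan(text = x, sep = ",", quiet = TRUE)

# Hypothetical helper: the connection is closed even if scan() fails
scan_string <- function(s) {
  con <- textConnection(s)
  on.exit(close(con))
  scan(con, sep = ",", quiet = TRUE)
}
vals2 <- scan_string(x)

print(vals)
print(vals2)
```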
#
Wow.  New respect for parse/eval.

Do you think this is a special case of a more general principle?  I
suppose the cost is memory, but from time to time a speedup like this
would be very beneficial.

Any hints about how R programmers could recognize such cases would, I
am sure, be of value to the list in general.

Many thanks for your efforts, Marc!

Regards,

Mike
On 8/19/06, Marc Schwartz <MSchwartz at mn.rr.com> wrote:

  
    
#
On Sat, 2006-08-19 at 10:25 -0600, Mike Nielsen wrote:
Mike,

I think that one needs to consider where the time is being spent and
then adjust accordingly. Once you understand that, you can develop some
insight into what may be a more efficient approach. R provides good
profiling tools that facilitate this process.

In this case, almost all of the time in the first two examples, which use
strsplit(), is spent in that function:
$by.self
                    self.time self.pct total.time total.pct
"strsplit"              23.68     99.7      23.68      99.7
"as.double.default"      0.06      0.3       0.06       0.3
"as.numeric"             0.00      0.0      23.74     100.0
"unlist"                 0.00      0.0      23.68      99.7

$by.total
                    total.time total.pct self.time self.pct
"as.numeric"             23.74     100.0      0.00      0.0
"strsplit"               23.68      99.7     23.68     99.7
"unlist"                 23.68      99.7      0.00      0.0
"as.double.default"       0.06       0.3      0.06      0.3

$sampling.time
[1] 23.74


Contrast that with Prof. Ripley's approach:
$by.self
        self.time self.pct total.time total.pct
"parse"      0.42     87.5       0.42      87.5
"eval"       0.06     12.5       0.48     100.0

$by.total
        total.time total.pct self.time self.pct
"eval"        0.48     100.0      0.06     12.5
"parse"       0.42      87.5      0.42     87.5

$sampling.time
[1] 0.48
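
Summaries like the two above come from Rprof() and summaryRprof(). The sketch below shows one way such a profile might be collected; it is not the code used in this post. The names s, prof_file and prof, the input size, and the one-second loop (there only to make sure the sampler collects some data) are all assumptions.

```r
# Illustrative input; size chosen arbitrarily
s <- paste(seq_len(50000), collapse = ",")
prof_file <- tempfile(fileext = ".out")

# Start sampling, run the code of interest for about a second, stop sampling
Rprof(prof_file, interval = 0.01)
start <- proc.time()[["elapsed"]]
while (proc.time()[["elapsed"]] - start < 1) {
  vals <- as.numeric(unlist(strsplit(s, ",")))
}
Rprof(NULL)

# Summarise the samples; by.self and by.total are the tables shown above
prof <- summaryRprof(prof_file)
print(head(prof$by.self))
```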


To some extent, one could argue that my initial timing examples are
contrived, in that they specifically demonstrate a worst case scenario
using strsplit().  Real world examples may or may not show such gains.

For example with Charles' initial query, the initial vector was rather
short:

  > repeated.measures.columns
  [1] "3,6,10"

So if this was a one-time conversion, we would not see such significant
gains.

However, what if we had a long list of shorter entries:
[1] "1,2,3,4,5,6,7,8,9,10"
[[1]]
[1] "1,2,3,4,5,6,7,8,9,10"

[[2]]
[1] "1,2,3,4,5,6,7,8,9,10"

[[3]]
[1] "1,2,3,4,5,6,7,8,9,10"

[[4]]
[1] "1,2,3,4,5,6,7,8,9,10"

[[5]]
[1] "1,2,3,4,5,6,7,8,9,10"

[[6]]
[1] "1,2,3,4,5,6,7,8,9,10"
as.numeric(unlist(strsplit(x, ","))))))
[1] 1.972 0.044 2.411 0.000 0.000
 num [1:10000, 1:10] 1 1 1 1 1 1 1 1 1 1 ...
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    2    3    4    5    6    7    8    9    10
[2,]    1    2    3    4    5    6    7    8    9    10
[3,]    1    2    3    4    5    6    7    8    9    10
[4,]    1    2    3    4    5    6    7    8    9    10
[5,]    1    2    3    4    5    6    7    8    9    10
[6,]    1    2    3    4    5    6    7    8    9    10



Now use Prof. Ripley's approach:
eval(parse(text=paste("c(", x, ")"))))))
[1] 1.676 0.012 1.877 0.000 0.000
 num [1:10000, 1:10] 1 1 1 1 1 1 1 1 1 1 ...
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    2    3    4    5    6    7    8    9    10
[2,]    1    2    3    4    5    6    7    8    9    10
[3,]    1    2    3    4    5    6    7    8    9    10
[4,]    1    2    3    4    5    6    7    8    9    10
[5,]    1    2    3    4    5    6    7    8    9    10
[6,]    1    2    3    4    5    6    7    8    9    10
[1] TRUE


We do see a notable reduction in time with strsplit() and a notable
increase in time with eval(parse()), even though we are converting the
same total number of values (100,000).

Much of the increase with eval(parse()) is of course due to the overhead
of sapply() and navigating the list.


Let's increase the size of the list components to 1000:
as.numeric(unlist(strsplit(x, ",")))))
[1] 33.270  0.744 37.163  0.000  0.000
eval(parse(text=paste("c(", x, ")"))))))
[1] 15.893  0.928 18.139  0.000  0.000


So we see here that as the size of the list components increases, there
continues to be an advantage to Prof. Ripley's approach over using
strsplit().
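
The timing calls in this part of the post are truncated, so the following is only a sketch of what a list-based comparison along these lines might look like. The names n_rows, one_row, str_list, m_split and m_parse are assumptions, as is the way the list is built; sapply() and t() are used because the output above is a 10000 x 10 matrix and sapply() is mentioned as part of the overhead.

```r
# Build a list of identical short strings (sizes are illustrative)
n_rows  <- 1000
one_row <- paste(1:10, collapse = ",")
str_list <- rep(list(one_row), n_rows)

# Convert each list component, then transpose so each string becomes a row
t_split <- system.time(
  m_split <- t(sapply(str_list, function(x) as.numeric(unlist(strsplit(x, ",")))))
)
t_parse <- system.time(
  m_parse <- t(sapply(str_list, function(x) eval(parse(text = paste("c(", x, ")")))))
)

print(rbind(strsplit = t_split, parse = t_parse)[, 1:3])
print(dim(m_split))
```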

Again, one needs to develop an understanding of where the time is spent
in the processing by profiling and then consider how to introduce
efficiencies, which in some cases may very well require the use of
compiled C/FORTRAN as may be appropriate if times become too long.

HTH,

Marc Schwartz
#
Marc,

Thanks very much for this.  I hadn't really looked at Rprof in the
past; now I have a new toy to play with!

I have formulated a hypothesis that the reason parse/eval is quicker
lies in the pattern-matching code:  strsplit is using regular
expressions, whereas perhaps parse is using some more clever (but
possibly less general) matching algorithm.  It will be interesting to
inspect the source code to get to the bottom of it.
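
One quick way to probe the regular-expression hypothesis, short of reading the source, would be to compare strsplit() with and without fixed = TRUE, which switches off regular-expression matching. This is a sketch only: the names s, v_regex, v_fixed, t_regex and t_fixed are assumptions, and a similar timing for both calls would not by itself settle where the time goes.

```r
# Illustrative input, same size as in the earlier timing example
s <- paste(seq_len(100000), collapse = ",")

# Default: split is treated as a regular expression
t_regex <- system.time(v_regex <- strsplit(s, ",")[[1]])

# fixed = TRUE: split is matched as a literal string
t_fixed <- system.time(v_fixed <- strsplit(s, ",", fixed = TRUE)[[1]])

print(rbind(regex = t_regex, fixed = t_fixed)[, 1:3])

# Both calls should produce the same pieces
stopifnot(identical(v_regex, v_fixed))
```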

Thanks again for your interest and efforts in this, and for pointing out Rprof!

Regards,

Mike Nielsen
On 8/20/06, Marc Schwartz <MSchwartz at mn.rr.com> wrote: