
strsplit question

6 messages · Erin Hodgess, Remko Duursma, Joshua Wiley +3 more

#
Dear R People:

I have the following set of data
[1] "5600-5699" "6100-6199" "9700-9799" "9400-9499" "8300-8399"

and I want to split at the dash:
[[1]]
[1] "5600" "5699"

[[2]]
[1] "6100" "6199"

[[3]]
[1] "9700" "9799"

[[4]]
[1] "9400" "9499"

[[5]]
[1] "8300" "8399"
What is the best way to extract the pieces that are to the left of the
dash, please?

Thanks,
Erin
#
unlist(strsplit(Block[1:5], "-.+$"))

If you are going to want the other pieces later, the most efficient
way depends on the assumptions you can make about your data.  If there
are always two elements from the split:

matrix(unlist(strsplit(Block[1:5], "-")), ncol = 2, byrow = TRUE)
## or
do.call("rbind", strsplit(Block[1:5], "-"))
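For anyone who wants to run the lines above, here is a minimal sketch. It assumes Block holds the five range strings shown in the question; the names left and parts are only illustrative.

```r
# Assumption: Block contains the range strings printed in the question
Block <- c("5600-5699", "6100-6199", "9700-9799", "9400-9499", "8300-8399")

# Left-hand pieces only: the pattern matches the dash and everything after it
left <- unlist(strsplit(Block, "-.+$"))

# Both pieces as a 5 x 2 character matrix, one row per range
parts <- do.call("rbind", strsplit(Block, "-"))

left        # "5600" "6100" "9700" "9400" "8300"
parts[, 2]  # "5699" "6199" "9799" "9499" "8399"
```

The matrix form only works cleanly when every string splits into exactly two pieces; otherwise rbind recycles the shorter rows with a warning.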

The first option, which drops everything after the dash, is marginally
more efficient, followed by the matrix technique.  A clunkier option
(in my view) would be:

unlist(strsplit(Block[1:5], "-"))[seq(from = 1, to = 2 * length(Block[1:5]), by = 2)]

Or, very flexible in terms of extracting the first element (regardless
of how many pieces there are), but computationally less efficient:

sapply(strsplit(Block[1:5], "-"), `[[`, 1)

However, this is only slightly less efficient: tested on a simple
character vector of length 10^8, it still completed in less than 1
second on a 1.66 GHz dual core running R devel r57214 on Windows x64.
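To compare the approaches on your own machine, something like the following rough sketch may help. The test vector x is hypothetical and much smaller than the one mentioned above, and timings will vary by machine and R version, so no figures are shown.

```r
# Hypothetical test data: 1e5 range strings of the form "lo-hi"
set.seed(1)
lo <- sample(1000:9000, 1e5, replace = TRUE)
x <- paste(lo, lo + 99, sep = "-")

# Time each way of extracting the piece before the dash
t_regex  <- system.time(a <- unlist(strsplit(x, "-.+$")))
t_matrix <- system.time(
  b <- matrix(unlist(strsplit(x, "-")), ncol = 2, byrow = TRUE)[, 1]
)
t_sapply <- system.time(d <- sapply(strsplit(x, "-"), `[[`, 1))

# All three should return the same character vector
stopifnot(identical(a, b), identical(a, d))

# Elapsed seconds for each approach
c(regex  = t_regex[["elapsed"]],
  matrix = t_matrix[["elapsed"]],
  sapply = t_sapply[["elapsed"]])
```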

Cheers,

Josh
On Tue, Oct 11, 2011 at 10:20 PM, Erin Hodgess <erinm.hodgess at gmail.com> wrote:

  
    
#
On Oct 12, 2011, at 1:20 AM, Erin Hodgess wrote:

            
> sub("\\-.*$", "", c("5600-5699", "6100-6199", "9700-9799", "9400-9499", "8300-8399"))
[1] "5600" "6100" "9700" "9400" "8300"
#
On Wed, Oct 12, 2011 at 1:20 AM, Erin Hodgess <erinm.hodgess at gmail.com> wrote:
Try this:
[1] "5600" "6100" "9700" "9400" "8300"
[1] "5699" "6199" "9799" "9499" "8399"
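The commands that produced these two vectors are not preserved in this copy of the thread. As a sketch only, and not necessarily what the poster ran, the same output can be obtained in base R with sub(); the names x, before and after are assumptions.

```r
# Assumption: x holds the range strings from the original question
x <- c("5600-5699", "6100-6199", "9700-9799", "9400-9499", "8300-8399")

# Delete the dash and everything after it, leaving the left-hand piece
before <- sub("-.*$", "", x)

# Delete everything up to and including the dash, leaving the right-hand piece
after <- sub("^.*-", "", x)

before  # "5600" "6100" "9700" "9400" "8300"
after   # "5699" "6199" "9799" "9499" "8399"
```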

and here is another approach:
[,1]   [,2]   [,3]   [,4]   [,5]
[1,] "5600" "6100" "9700" "9400" "8300"
[2,] "5699" "6199" "9799" "9499" "8399"

Now m[1, ] and m[2, ] are the vectors of digits before and after the
dash.  Note that c in the strapply call can be replaced with
as.numeric if you want a numeric matrix instead.
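The strapply call itself is missing from this copy of the thread. As a stand-in, the sketch below builds a matrix of the same shape with base R's strsplit and sapply rather than strapply, so it illustrates the result, not the poster's exact code. The names x, m and m_num are assumptions.

```r
# Assumption: x holds the range strings from the original question
x <- c("5600-5699", "6100-6199", "9700-9799", "9400-9499", "8300-8399")

# Each split yields two pieces; sapply binds them as columns (2 x 5 matrix)
m <- sapply(strsplit(x, "-", fixed = TRUE), c)

# Swapping c for as.numeric gives a numeric matrix instead
m_num <- sapply(strsplit(x, "-", fixed = TRUE), as.numeric)

m[1, ]      # "5600" "6100" "9700" "9400" "8300"
m_num[2, ]  # 5699 6199 9799 9499 8399
```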