-----Original Message-----
From: Wacek Kusnierczyk [mailto:Waclaw.Marcin.Kusnierczyk at idi.ntnu.no]
Sent: December-04-08 5:05 AM
To: John Fox
Cc: R help
Subject: Re: [R] Strplit code
John Fox wrote:
By coincidence, I have a version of strsplit() that I've used to
illustrate recursion:
Strsplit <- function(x, split){
    if (length(x) > 1) {
        return(lapply(x, Strsplit, split)) # vectorization
    }
    result <- character(0)
    if (nchar(x) == 0) return(result)
    posn <- regexpr(split, x)
    if (posn <= 0) return(x)
    c(result, substring(x, 1, posn - 1),
      Recall(substring(x, posn+1, nchar(x)), split)) # recursion
}
well, it is both inefficient and wrong.
inefficient because of the non-tail recursion and recursive
concatenation, which is justified for the purpose of showing
recursion, but for practical purposes you'd rather use gregexpr.
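To illustrate the gregexpr point, here is a minimal sketch. A single call reports every match position and its length, so no recursion is needed to locate the separators. The input string is only an illustrative example, not taken from the thread.

```r
# gregexpr() finds all matches in one pass and returns a list with one
# element per input string.
x <- "a--b--c"
m <- gregexpr("--", x)[[1]]

starts  <- as.integer(m)             # start position of each match
lengths <- attr(m, "match.length")   # length of each match

print(starts)    # 2 5
print(lengths)   # 2 2
```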
wrong because of how you pick the remaining part of the string to be
split -- it works only under the assumption that the pattern is a single
character:
Strsplit("hello-dolly,--sweet", "--")
# the pattern is *two* hyphens
# [1] "hello-dolly," "-sweet"
Strsplit("hello dolly", "")
# the pattern is the empty string
# [1] "" "" "" "" "" "" "" "" "" "" ""
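One possible way to repair the recursive version, as a sketch only: skip the whole match by using the "match.length" attribute that regexpr returns, instead of always advancing by one character. The name Strsplit2 is hypothetical, and treating any zero-length match as a request to split into single characters is a simplifying assumption.

```r
# Hypothetical name Strsplit2; a sketch, not the original poster's code.
Strsplit2 <- function(x, split) {
    if (length(x) > 1) {
        return(lapply(x, Strsplit2, split))   # vectorization, as before
    }
    if (nchar(x) == 0) return(character(0))
    posn <- regexpr(split, x)
    len  <- attr(posn, "match.length")        # how many characters matched
    if (posn <= 0) return(x)                  # no match: nothing left to split
    if (len == 0) {
        # zero-length match (e.g. split = ""): assume one piece per character
        return(substring(x, 1:nchar(x), 1:nchar(x)))
    }
    c(substring(x, 1, posn - 1),
      Recall(substring(x, posn + len, nchar(x)), split))  # skip the full match
}

res_hyphens <- Strsplit2("hello-dolly--sweet", "--")
res_chars   <- Strsplit2("hello dolly", "")
print(res_hyphens)   # "hello-dolly" "sweet"
print(res_chars)     # one element per character
```

This is still non-tail-recursive and still concatenates at every step, so it addresses only the correctness complaint, not the efficiency one.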
here's a quick rewrite -- i haven't tested it on extreme cases, it may
not be perfect, and there's a hidden source of inefficiency here as well:
strsplit =
function(strings, split) {
    positions = gregexpr(split, strings)
    lapply(1:length(strings), function(i)
        substring(strings[[i]],
                  c(1, positions[[i]] + attr(positions[[i]], "match.length")),
                  c(positions[[i]] - 1, nchar(strings[[i]]))))
}
n = 1000; m = 100
strings = replicate(n, paste(sample(c(letters, " "), 100, replace=TRUE),
collapse=""))
system.time(replicate(m, strsplit(strings, " ")))
system.time(replicate(m, Strsplit(strings, " ")))
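One extreme case that seems worth checking in the gregexpr-based rewrite above is a string that contains no separator at all. gregexpr then reports -1 for both the position and the match length, and the substring arithmetic appears to produce an extra empty string in front of the input. The sketch below guards against that. The name split_at_matches is hypothetical, and zero-length patterns such as "" are deliberately left unhandled.

```r
# Hypothetical helper; a sketch under the assumptions stated above.
split_at_matches <- function(strings, split) {
    positions <- gregexpr(split, strings)
    mapply(function(s, pos) {
        if (pos[1] == -1) return(s)          # no separator: keep the string whole
        len <- attr(pos, "match.length")     # look the attribute up once per string
        # pieces run from just after each match to just before the next one
        substring(s, c(1, pos + len), c(pos - 1, nchar(s)))
    }, strings, positions, SIMPLIFY = FALSE, USE.NAMES = FALSE)
}

res <- split_at_matches(c("a b c", "abc", "hello-dolly--sweet"), " ")
print(res)
```

Unlike base strsplit, this sketch keeps a trailing empty string when the input ends with the separator.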
vQ