Skip to content

Strplit code

8 messages · John Fox, Gabor Grothendieck, Wacek Kusnierczyk

#
Dear Brian (and original poster),

My apologies -- I didn't notice the original posting.

By coincidence, I have a version of strsplit() that I've used to
illustrate recursion:

Strsplit <- function(x, split){
    if (length(x) > 1) {
        return(lapply(x, Strsplit, split))  # vectorization
        }
    result <- character(0)
    if (nchar(x) == 0) return(result)
    posn <- regexpr(split, x)
    if (posn <= 0) return(x)
    c(result, substring(x, 1, posn - 1), 
        Recall(substring(x, posn+1, nchar(x)), split))  # recursion
    }

Illustration:
+              "Whether t'is nobler,in the mind",
+              "to suffer the slings:and arrows:of outrageous fortune",
+              "or to take arms;against;a sea of troubles")
[[1]]
[1] "To be"                "or not to be"         "that is the
question"

[[2]]
[1] "Whether t'is nobler" "in the mind"        

[[3]]
[1] "to suffer the slings"  "and arrows"            "of outrageous
fortune"

[[4]]
[1] "or to take arms"   "against"           "a sea of troubles"



I think that this should work in S-PLUS, though I haven't tried it.

Regards,
 John

On Wed, 3 Dec 2008 20:46:56 +0000 (GMT)
Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:
--------------------------------
John Fox, Professor
Department of Sociology
McMaster University
Hamilton, Ontario, Canada
http://socserv.mcmaster.ca/jfox/
#
John Fox wrote:
well, it is both inefficient and wrong.

inefficient because of the non-tail recursion and recursive
concatenation, which is justified for the sake the purpose of showing
recursion, but for practical purposes you'd rather use gregexepr.

wrong because of how you pick the remaining part of the string to be
split -- it works just under the assumption the pattern is a single
character:

Strsplit("hello-dolly,--sweet", "--")
# the pattern is *two* hyphens
# [1] "hello-dolly" "-sweet"

Strsplit("hello dolly", "")
# the pattern is the empty string
#  [1] "" "" "" "" "" "" "" "" "" "" ""


here's a quick rewrite -- i haven't tested it on extreme cases, it may
not be perfect, and there's a hidden source of inefficiency here as well:

strsplit =
function(strings, split) {
    positions = gregexpr(split, strings)
    lapply(1:length(strings), function(i)
        substring(strings[[i]], c(1, positions[[i]] +
attr(positions[[i]], "match.length")), c(positions[[i]]-1,
nchar(strings[[i]]))))
}


n = 1000; m = 100
strings = replicate(n, paste(sample(c(letters, " "), 100, replace=TRUE),
collapse=""))
system.time(replicate(m, strsplit(strings, " ")))
system.time(replicate(m, Strsplit(strings, " ")))


vQ
#
Dear Wacek,

"Wrong" is a bit strong, I think -- limited to single-pattern characters is
more accurate. Moreover, it isn't hard to make the function work with
multiple-character matches as well:

Strsplit <- function(x, split){
    if (length(x) > 1) {
        return(lapply(x, Strsplit, split))  # vectorization
        }
    result <- character(0)
    if (nchar(x) == 0) return(result)
    posn <- regexpr(split, x)
    if (posn <= 0) return(x)
    c(result, substring(x, 1, posn - 1), 
        Recall(substring(x, posn + attr(posn, "match.length"), 
          nchar(x)), split))  # recursion
    }

On the other hand, your function is much more efficient.

Regards,
 John 

------------------------------
John Fox, Professor
Department of Sociology
McMaster University
Hamilton, Ontario, Canada
web: socserv.mcmaster.ca/jfox
#
John Fox wrote:
nothing is ever wrong if seen from an appropriate perspective.  for
example, there is nothing wrong in that many core functions in r deparse
some, but not all, of the argument expressions, without any obvious
pattern -- when you get used to it and learn each single case by heart,
it's perfectly correct.
which you probably should have done before posting the flawed version.
just one order of magnitude in my tests.  might not be completely fool
proof, though. 

vQ
#
R does not support tail recursion so not using it would
not matter.

On Thu, Dec 4, 2008 at 5:04 AM, Wacek Kusnierczyk
<Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:
#
Gabor Grothendieck wrote:
this is a feature, i guess.

another issue is that the recursive calls receive substrings as
arguments, which means copying the whole remaining content, and the
copies would not be deallocated until after the recursive calls return,
am i right?

vQ
#
Dear Wacek,

I've thought a bit more about this problem, and recall that I originally
wrote Strsplit() [and replacements for sub() and gsub(), which were not then
in S-PLUS] for the version of the car package that I released for S-PLUS,
because other functions in the package used these. The strings involved were
small, so performance issues weren't that important, although of course it's
better to have a more efficient solution.

Although I no longer have an installed copy of S-PLUS to confirm this, I
believe that gregexepr() is still not present in S-PLUS (though I think that
strsplit() is in the latest version). If that's the case, then your function
wouldn't work at all in the context of the original posting, which asked for
a solution in S-PLUS. You could make your code work in S-PLUS, and probably
still have it more efficient than mine, by writing a replacement for
gregexpr().
is
Indeed. Had I anticipated the possibility of multiple-character splits I
would have done so.

John
#
John Fox wrote:
right.  the speedup is not due to any substantial algorithmic
difference, but rather in that your code is r code, while mine uses
gregexpr, which is i assume is precompiled from c code or the like.

about 'wrong' and 'flawed', again:  what i meant is that as a suggestion
for how strsplit, in general, could be written, it doesn't meet the
challenge, and should at least issue a warning if the split pattern
specifies non-1-length splits.  otherwise, it could be perfectly fit for
a student's exercise.
possibly, and then it's my code that is wrong and flawed ;)
i haven't used s-plus for ages, jumping to the global frame instead of
lexical scoping scared me away.

vQ