Split String in regex while Keeping Delimiter
On Wed, 12 Apr 2023 08:29:50 +0000
Emily Bakker <emilybakker at outlook.com> wrote:
Some example data:

“leucocyten + gramnegatieve staven +++ grampositieve staven ++”
“leucocyten – grampositieve coccen +”

I want to split the strings such that I get the following result:

c(“leucocyten +”, “gramnegatieve staven +++”, “grampositieve staven ++”)
c(“leucocyten –”, “grampositieve coccen +”)

I have tried strsplit with a regular expression with a positive
lookahead, but I am not able to achieve the results that I want.

It sounds like you need positive look-behind, not look-ahead: split on
spaces only if they _follow_ one to three of '+' or '-'. Unfortunately,
variable-length repetition such as {n,m} or + is generally not
supported inside look-behind expressions in PCRE (nor in Perl itself).
As a workaround, you can use \K: everything matched to the left of \K
must be present but is left out of the reported match, so it behaves
like a variable-length look-behind:

x <- c(
'leucocyten + gramnegatieve staven +++ grampositieve staven ++',
'leucocyten - grampositieve coccen +'
)
strsplit(x, '[+-]{1,3}+\\K ', perl = TRUE)
# [[1]]
# [1] "leucocyten +"             "gramnegatieve staven +++"
# [3] "grampositieve staven ++"
#
# [[2]]
# [1] "leucocyten -" "grampositieve coccen +"
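
For this particular data a plain look-behind may also be enough, since
only the single character right before the space has to be checked and
a one-character class has a fixed length. A minimal sketch, assuming
the same example strings as above (repeated so the snippet runs on its
own; `res` is just an illustrative name):

```r
# Same example data as above, repeated so this snippet is self-contained.
x <- c(
  'leucocyten + gramnegatieve staven +++ grampositieve staven ++',
  'leucocyten - grampositieve coccen +'
)
# Split on a space only when the character just before it is '+' or '-'.
# The look-behind (?<=[+-]) always has a length of exactly one character.
res <- strsplit(x, '(?<=[+-]) ', perl = TRUE)
res
```

This should give the same pieces as the \K version for the strings
shown here. The \K form remains the more general tool when the part
that has to precede the split point is of variable length.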

Best regards,
Ivan

P.S. It looks like your e-mail client has transformed every quote
character into typographically correct Unicode quotes (“ ”) and every
minus into an en dash (–), which makes it slightly harder to work with
your code, since typographic Unicode quotes are not R string
delimiters. Is it really – that you'd like to split upon, or is it -?
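
If the data do turn out to contain real en dashes, one option might be
to replace them with the ASCII minus before splitting. A minimal
sketch, assuming the hypothetical string `y` below stands in for the
real data (`y_ascii` and `parts` are illustrative names as well):

```r
# Hypothetical input in which the "minus" is really an en dash (U+2013).
y <- 'leucocyten \u2013 grampositieve coccen +'
# Replace every en dash with an ASCII hyphen-minus before splitting.
y_ascii <- gsub('\u2013', '-', y, fixed = TRUE)
# Then split exactly as before, on a space that follows 1 to 3 signs.
parts <- strsplit(y_ascii, '[+-]{1,3}+\\K ', perl = TRUE)
parts
```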