Regex Split?

Dear Avi,

Punctuation marks are used in various NLP language models, so preserving 
the "," is useful in such scenarios. Regular expressions are a good way to 
accomplish this (especially if you have sufficient experience with 
them).

I observed only one odd behaviour, and only when using strsplit. The example 
string below is constructed, but it is always wise to test a regular 
expression against various scenarios: it is usually hard to predict which 
special cases will occur in a specific corpus.

strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl = TRUE)
# "a" "bc" "," "def" "," "" "adef" "," "," "gh"

stringi::stri_split("a bc,def, adef ,,gh", regex = " |(?=,)|(?<=,)(?![ ])")
# "a" "bc" "," "def" "," "adef" "" "," "," "gh"

stringi::stri_split("a bc,def, adef ,,gh", regex = " |(?<! )(?=,)|(?<=,)(?![ ])")
# "a" "bc" "," "def" "," "adef" "," "," "gh"

# Expected:
# "a" "bc" "," "def" "," "adef" "," "," "gh"
# see 2nd instance of stringi::stri_split
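
To follow the advice about testing against various scenarios, the last pattern can be run over a few more strings. This is only a rough sketch: apart from the first string, the test strings are hypothetical cases I made up (leading, trailing and space-padded commas), and I list no expected results for them, since what counts as correct depends on the corpus.

```r
# Sketch only: the extra test strings are hypothetical, not from a real corpus.
library(stringi)

# Last pattern from above: split on a space, before a comma that is not
# preceded by a space, or after a comma that is not followed by a space.
rx <- " |(?<! )(?=,)|(?<=,)(?![ ])"

tests <- c(
  "a bc,def, adef ,,gh",  # original example
  ",a",                   # leading comma
  "a,",                   # trailing comma
  "a , b",                # comma surrounded by spaces
  "a,  b"                 # comma followed by two spaces
)

# stri_split_regex() returns one character vector of tokens per input string.
results <- stri_split_regex(tests, rx)
names(results) <- tests

# Print each input next to its quoted tokens, so empty strings stay visible.
for (s in tests) {
  cat(sprintf("%-22s ->", s), sprintf('"%s"', results[[s]]), "\n")
}
```

Any empty token ("") in the printed output points to a case where the pattern probably needs another adjustment.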


Sincerely,


Leonard
On 5/5/2023 11:20 PM, avi.e.gross at gmail.com wrote: