Unicode Text Segmentation Algorithms already implemented in R?

2 messages · Sascha Wolfer, Ista Zahn

#
Hello list members,

I am looking for an implementation of Unicode text segmentation (word boundary detection) algorithms in R. You can find information about the algorithms here: http://www.unicode.org/reports/tr29/#Word_Boundaries

The help page for the function "casefuns" from the excellent "Unicode" package says: "Other methods will be added eventually (once the Unicode text segmentation algorithm is implemented for detecting word boundaries)." My simple question is: Are these algorithms already implemented in an R package? I didn't find anything on the web, but I am counting on the power of this list. My Stata-using colleague is already picking at me (in Stata, the function "ustrword" does exactly what I want to do in R).

Thanks for your help, have a good day, you all!
Sascha W.
#
You searched, but did not tell us what you found, nor why it was unsuitable
for your undescribed use case. So all we can do is guess; my guess is
http://docs.rexamine.com/R-man/stringi/stringi-search-boundaries.html
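
For illustration, here is a minimal sketch of word boundary detection with stringi, assuming the package is installed and that ICU's word break rules (which follow UAX #29) are what is wanted. The sample string and the helper nth_word() are hypothetical; the helper only loosely imitates Stata's ustrword and may not match it in every edge case.

```r
library(stringi)

x <- "The quick (brown) fox can't jump 32.3 feet, right?"

# Split at every word boundary found by ICU's break iterator.
# The pieces include whitespace and punctuation segments.
pieces <- stri_split_boundaries(x, type = "word")[[1]]

# Keep only the word-like segments (whitespace and punctuation are skipped).
words <- stri_extract_all_words(x)[[1]]
print(words)

# Hypothetical helper: return the n-th word of a single string,
# or NA if the string has fewer than n words.
nth_word <- function(s, n) {
  w <- stri_extract_all_words(s)[[1]]
  if (n >= 1 && n <= length(w)) w[n] else NA_character_
}

print(nth_word("alpha beta gamma", 2))
```

stri_split_boundaries() also accepts a locale through its opts_brkiter argument, which may matter for languages with their own segmentation rules.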

Best,
Ista
On Mar 3, 2016 8:14 AM, "Sascha Wolfer" <wolfer at ids-mannheim.de> wrote: