Question on Stopword Removal from a Cyrillic (Bulgarian) Text

Wed, Apr 10, 2013 1:31 PM

I just wanted to confirm that Milan's suggestion about adding (*UCP) like in 
the example below:

gsub(sprintf("(*UCP)\\b(%s)\\b", "?????"), "", "?????", perl=TRUE)

solved all problems (under openSUSE Linux 12.3 64-bit, R 2.15.2). I re-encoded 
the input files and the stop word list as UTF-8, and now stop words are 
properly removed using the suggested syntax:

sme.corpus<-tm_map(sme.corpus,removeWords.PlainTextDocument,stoplist)

where:
removeWords.PlainTextDocument <- function (x, words)
  gsub(sprintf("(*UCP)\\b(%s)\\b", paste(words, collapse = "|")), "", x,
       perl = TRUE)

and stoplist is a character vector of stop words.
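
For anyone who wants to try the same pattern outside tm, here is a minimal
sketch. The function name remove_stopwords, the two-word stop list and the
sample sentence are illustrative assumptions, not the data from this thread.
The strings are written with \u escapes so that the example does not depend on
the encoding of the script file, and it assumes that R's PCRE was built with
Unicode property support.

```r
# Hypothetical stand-alone helper mirroring the removeWords override above.
remove_stopwords <- function(x, words) {
  # (*UCP) makes \b and \w use Unicode properties, so Cyrillic letters
  # count as word characters instead of as word boundaries.
  pattern <- sprintf("(*UCP)\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

# Illustrative stop words (assumed, not the thread's list): "и" and "на"
stoplist <- c("\u0438", "\u043d\u0430")

# Illustrative sentence (assumed): "котка и куче на дивана"
text <- "\u043a\u043e\u0442\u043a\u0430 \u0438 \u043a\u0443\u0447\u0435 \u043d\u0430 \u0434\u0438\u0432\u0430\u043d\u0430"

cleaned <- remove_stopwords(text, stoplist)
# Collapse the whitespace left behind by the removed words.
cleaned <- gsub("\\s+", " ", cleaned)
print(cleaned)
```

Note that the last word of the sample sentence ends in the letters of one of
the stop words; with (*UCP) there is no word boundary inside it, so it should
be left intact.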

The wordcloud function now also accepts the preprocessed corpus without 
warnings or errors. Now, if only I could do stemming in Bulgarian, that would 
be priceless!

Thanks again, this has been tremendous help indeed!
Vince

On Wednesday 10 April 2013 20:43:27 Milan Bouchet-Valat wrote:

On Wednesday 10 April 2013 at 13:17 +0200, Ingo Feinerer wrote:

On Wed, Apr 10, 2013 at 10:29:27AM +0200, Milan Bouchet-Valat wrote:

Thanks for the reproducible example. Indeed, it does not work here
either (Linux with UTF-8 locale). The problem seems to be in the call to
gsub() in removeWords: the pattern "\\b" does not match anything when
perl=TRUE. With perl=FALSE, it works.

The \b versus perl versus UTF-8 issue seems to be known, and it is
advised to use perl = TRUE with \b. See e.g. the warning in the gsub
help page (?gsub):

---8<--------------------------------------------------------------------------
Warning:

POSIX 1003.2 mode of 'gsub' and 'gregexpr' does not work correctly with
repeated word-boundaries (e.g. 'pattern = "\b"').  Use 'perl = TRUE' for
such matches (but that may not work as expected with non-ASCII inputs,
as the meaning of 'word' is system-dependent).
---8<--------------------------------------------------------------------------

Thanks for the pointer. Indeed, this allowed me to discover the
existence of the PCRE_UCP (Unicode Character Properties) flag, which
changes matching behavior so that non-ASCII Unicode letters and digits
count as word characters instead of being treated as word boundaries.

This flag should probably be used by R when calling pcre_compile() in
gsub() and friends. At the moment, R's behavior is inconsistent across
platforms:
- on Fedora 18, R 2.15.3:
gsub("\\bt\\b", "", "télégramme", perl=TRUE)
[1] "élégramme"

- on Windows 2008, R 2.15.1 and 3.0.0:
gsub("\\bt\\b", "", "télégramme", perl=TRUE)
[1] "télégramme"
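
Since the behavior differs between platforms, it can help to check how the
PCRE library behind a given R build was compiled. The sketch below uses
pcre_config(), which exists in current R releases but, as far as I know, not
in the versions mentioned above. The wording of the messages is only
illustrative.

```r
# pcre_config() reports how the PCRE library used by this R build was compiled.
cfg <- pcre_config()
print(cfg)

# (*UCP) relies on Unicode property support being compiled into PCRE.
has_ucp <- "Unicode properties" %in% names(cfg) &&
  isTRUE(cfg[["Unicode properties"]])

if (has_ucp) {
  message("PCRE has Unicode property support; (*UCP) should be usable.")
} else {
  message("PCRE lacks Unicode property support; (*UCP) may not work.")
}
```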


Luckily, the bug can be fixed at tm's level by adding (*UCP) at the
beginning of the pattern. This works for our examples:

gsub(sprintf("\\b(%s)\\b", "?????"), "", "?????", perl=TRUE)

[1] "?????"

gsub(sprintf("(*UCP)\\b(%s)\\b", "?????"), "", "?????", perl=TRUE)

[1] ""

gsub("\\bt\\b", "", "télégramme", perl=TRUE)
[1] "élégramme"
gsub("(*UCP)\\bt\\b", "", "télégramme", perl=TRUE)
[1] "télégramme"
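
The comparison above can be reproduced independently of the file or terminal
encoding by writing the non-ASCII characters as \u escapes. This is only a
sketch: the Cyrillic word used here ("думи") is an illustrative assumption and
not the word from the original example, and the result without (*UCP) is
deliberately not asserted because it depends on the platform.

```r
# Illustrative Cyrillic word (assumed): "думи", written with \u escapes.
word <- "\u0434\u0443\u043c\u0438"

plain <- sprintf("\\b(%s)\\b", word)
ucp   <- sprintf("(*UCP)\\b(%s)\\b", word)

# Without (*UCP) the outcome depends on the platform's PCRE behavior.
res_plain <- gsub(plain, "", word, perl = TRUE)
# With (*UCP) the whole word should be matched and removed.
res_ucp <- gsub(ucp, "", word, perl = TRUE)

cat("without (*UCP): ", nchar(res_plain), " characters left\n", sep = "")
cat("with (*UCP):    ", nchar(res_ucp), " characters left\n", sep = "")

# Same check for the French example: with (*UCP), "é" is a word character,
# so there is no word boundary after the leading "t" and nothing is removed.
tele <- "t\u00e9l\u00e9gramme"
res_tele <- gsub("(*UCP)\\bt\\b", "", tele, perl = TRUE)
print(res_tele == tele)
```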


Regards
