Question on Stopword Removal from a Cyrillic (Bulgarian) Text

I just wanted to confirm that Milan's suggestion about adding (*UCP) like in 
the example below:

gsub(sprintf("(*UCP)\\b(%s)\\b", "?????"), "", "?????", perl=TRUE)

solved all problems (under openSUSE Linux 12.3 64-bit, R 2.15.2). I re-encoded 
the input files and the stop word list in UTF-8, and stop words are now 
properly removed using the suggested syntax:

sme.corpus<-tm_map(sme.corpus,removeWords.PlainTextDocument,stoplist)

where:
removeWords.PlainTextDocument <- function (x, words)
  gsub(sprintf("(*UCP)\\b(%s)\\b", paste(words, collapse = "|")), "", x,
       perl = TRUE)

and stoplist is a character vector of stop words.
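
For anyone who wants to try the pattern outside of tm, here is a minimal 
standalone sketch of the same idea. The function name remove_stopwords, the 
three stop words and the sample sentence are made up for illustration and are 
not from my actual corpus; it also assumes the script is saved as UTF-8 and 
run in a UTF-8 locale:

```r
# Hypothetical stop words and sample text, for illustration only.
stoplist <- c("и", "на", "в")
text <- "котката и кучето са на двора"

# Same regular expression as above: (*UCP) makes \b use Unicode
# properties, so Cyrillic letters count as word characters.
remove_stopwords <- function(x, words) {
  pattern <- sprintf("(*UCP)\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

cleaned <- remove_stopwords(enc2utf8(text), enc2utf8(stoplist))

# Collapse the gaps left behind by the removed words.
cleaned <- trimws(gsub("\\s+", " ", cleaned))
print(cleaned)
```

Note that the "в" inside "двора" should be left alone, since \b only matches 
at the edges of a whole word.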

The wordcloud function now also accepts the preprocessed corpus without 
warnings or errors. Now, if only I could do stemming in Bulgarian, that would 
be priceless!

Thanks again, this has been tremendous help indeed!
Vince
On Wednesday 10 April 2013 20:43:27 Milan Bouchet-Valat wrote: