Question on Stopword Removal from a Cyrillic (Bulgarian) Text

Hi,

Thanks for taking the time. Here is a more reproducible example of the 
entire process:

# Creating a vector source - stupid text in the Bulgarian language
bg<-c('???? ? ????? ? ??????? ???, ? ????? ?????? ????? ?? ????? 
?????.','???? ?? ???? ??? ??-????? ???.')

# Converting strings from the vector source to UTF-8. Without this step
# in my setup, I don't see Cyrillic letters, even if I set the default
# code page to CP1251.
bg<-iconv(bg,to='UTF-8')

# Load the tm library
library(tm)

# Create the corpus from the vector source
corp<-Corpus(VectorSource(bg),readerControl=list(language='bulgarian'))

# Create a custom stop list based on the example vector source
# Converting to UTF-8
stoplist<-c('?','?','?','?????','??????','??','?????','?????','??','????','???')
stoplist<-iconv(stoplist,to='UTF-8')

# Preprocessing
corp<-tm_map(corp,removePunctuation)
corp<-tm_map(corp,removeNumbers)
corp<-tm_map(corp,tolower)
corp<-tm_map(corp,removeWords,stoplist)

# End of code here
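
A note on the iconv step above: called without from=, iconv converts from the session's native encoding, so the result depends on the locale. Below is a minimal sketch that states the source encoding explicitly. It assumes the input bytes are CP1251 and that the platform's iconv recognises that name; the single byte 0xE8 is only an illustrative input, not part of the original example.

```r
# Build a one-byte string holding 0xE8, the Cyrillic letter "и" in CP1251
# (illustrative input only).
raw_cp1251 <- rawToChar(as.raw(0xE8))

# Name the source encoding instead of relying on the session locale.
utf8 <- iconv(raw_cp1251, from = "CP1251", to = "UTF-8")

# Encoding() reports how R has marked the converted string.
enc_mark <- Encoding(utf8)

# Compare against the same letter written as a Unicode escape (U+0438).
is_expected <- identical(utf8, "\u0438")
```

If is_expected is FALSE on a given setup, the bytes were probably not in the assumed source encoding.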

Now, if I run inspect(corp), I still see all the stop words intact 
inside the corpus, and I can't figure out why. I tried experimenting with 
file encodings, with and without explicit encoding statements, and it 
never works. As far as I can tell, my code is not wrong: the 
function stopwords('language') returns a character vector, so just 
replacing it with a different character vector should do the trick. Alas, 
no list of stop words for the Bulgarian language is available as part of 
the tm package (not surprisingly).
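
The idea of swapping in a custom character vector can be checked outside tm. The sketch below is not tm's code. It assumes that stop-word removal amounts to a regular-expression substitution on word boundaries, and it uses hypothetical sample words written as \u escapes so that the strings are UTF-8 regardless of how the script file is saved. The (*UCP) prefix asks PCRE to treat Cyrillic letters as word characters when evaluating \b.

```r
# Hypothetical sample text: "или котка и куче" ("or cat and dog").
txt <- "\u0438\u043b\u0438 \u043a\u043e\u0442\u043a\u0430 \u0438 \u043a\u0443\u0447\u0435"

# Hypothetical stop list containing the single word "и" ("and").
stops <- c("\u0438")

# Alternation of stop words, anchored on Unicode-aware word boundaries.
pattern <- paste0("(*UCP)\\b(", paste(stops, collapse = "|"), ")\\b")

# Remove the stop words, then collapse the doubled space left behind.
cleaned <- gsub(pattern, "", txt, perl = TRUE)
cleaned <- gsub(" +", " ", cleaned)
```

The sample is chosen so that the stop word also occurs inside a longer word; if the boundaries are handled correctly, only the standalone occurrence should disappear.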

In the above example, I also tried to read in the list of stop words 
from a file using the scan function, per the example in my original 
message. It also fails to remove stop words, without any warnings or 
error messages.
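
When a stop list is read from a file, it helps to confirm that the strings come back marked as UTF-8 before passing them on. A minimal sketch, assuming the file is saved as UTF-8 with one word per line; the temporary file and the three words are hypothetical stand-ins for the real stop list:

```r
# Hypothetical stop words, written as Unicode escapes: "и", "в", "на".
words <- c("\u0438", "\u0432", "\u043d\u0430")

# Write the raw UTF-8 bytes to a temporary file, one word per line.
tmp <- tempfile(fileext = ".txt")
writeLines(words, tmp, useBytes = TRUE)

# Read them back, declaring the strings as UTF-8.
stoplist_from_file <- scan(tmp, what = character(), encoding = "UTF-8",
                           quiet = TRUE)

# Check the encoding marks and that the round trip preserved the words.
marks <- Encoding(stoplist_from_file)
round_trip_ok <- identical(stoplist_from_file, words)
unlink(tmp)
```

If marks shows "unknown" for non-ASCII words, the strings are being interpreted in the native encoding, which may not match the encoding of the corpus text.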

An alternative I tried was to convert to a term-document matrix, and 
then loop through the words inside and remove those that are also on the 
stop list. That only partially works, for two reasons. First, the TDM is 
actually a list, and I am not sure how to keep the underlying indices in 
sync when I delete words. Second, some of the words still don't get 
removed even though they are in the list. But that is another issue 
altogether...
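
On the index problem: dropping whole rows by logical subsetting, rather than deleting entries in a loop, keeps terms and counts aligned. The sketch below uses a plain base-R matrix as a stand-in for the term-document matrix, with hypothetical terms and counts. Whether tm's own TermDocumentMatrix objects can be subset with [ in the same way is an assumption that is not checked here.

```r
# Plain matrix standing in for a term-document matrix:
# rows are terms, columns are documents (hypothetical data).
terms <- c("\u0438", "\u043a\u043e\u0442\u043a\u0430", "\u043a\u0443\u0447\u0435")
tdm <- matrix(c(2L, 1L, 0L,    # counts in doc1
                1L, 0L, 3L),   # counts in doc2
              nrow = 3, dimnames = list(terms, c("doc1", "doc2")))

# Hypothetical stop list.
stops <- c("\u0438")

# Keep only the rows whose term is not a stop word; drop = FALSE
# preserves the matrix shape even if a single row remains.
kept <- tdm[!(rownames(tdm) %in% stops), , drop = FALSE]
```

Because rows are removed as a unit, the row names and the counts cannot drift apart.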

Thanks for your attention and for your help!
Vince
On 9.4.2013, 22:55, Milan Bouchet-Valat wrote: