
Help using "tm" text mining package - preprocessing

Is there a way to use "tm" with the Slovene characters?  I ask, 
because I was hoping to use "tm" with languages like Arabic, Urdu, 
Farsi, and Hebrew.  If you need to translate Slovene characters, it 
could create problems with using the software for the desired purpose in 
many languages, including Russian, which is on the list of languages 
currently supported.


       The package includes two vignettes, the first of which cites two 
2008 papers by Feinerer et al. in the Journal of Statistical Software 
and R News.  Both those papers are freely downloadable.  Have you looked 
at those?


       I have not studied the "tm" documentation carefully, but the 
package includes a function "stopwords", which returns the stopwords 
for an indicated language.  Languages are identified per the Internet 
Engineering Task Force (IETF;  www.ietf.org):  the documentation says 
"their IETF language tags may be used."  Slovene is not among the 
languages currently supported.  I have not used the package, but you 
can supply your own list of stopwords for Slovene, similar to the 
following silly example:


 > stopwords <- function(language = 'duh') {
 +     # return the custom list for 'duh'; otherwise defer to tm
 +     if (language == 'duh') c('duh', 'hud') else tm::stopwords(language)
 + }
 > stopwords('duh')
[1] "duh" "hud"


       This may not be all you need to do to use "tm" with Slovene, but 
it might help you with "stopwords".
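
       A slightly fuller sketch of the same idea follows.  It assumes 
"tm" is installed.  The Slovene word list is a hand-picked illustration, 
not a vetted stopword list, and the names "slovene_stopwords" and 
"my_stopwords" are hypothetical.  I have not checked how every "tm" 
version treats non-ASCII characters, so test it on your own text.

```r
library(tm)

# Illustrative Slovene stopwords only; replace with a proper list.
slovene_stopwords <- c("in", "je", "na", "za", "se", "da", "ki")

# Return the custom list for Slovene; otherwise defer to tm's own lists.
my_stopwords <- function(language = "en") {
  if (language %in% c("sl", "slovene")) {
    slovene_stopwords
  } else {
    tm::stopwords(language)
  }
}

# Sample text; "\u0161" is the UTF-8 escape for the Slovene letter s-caron.
txt <- "To je primer besedila in \u0161e en stavek za test."

# removeWords() deletes each listed word where it occurs as a whole word.
cleaned <- removeWords(txt, my_stopwords("sl"))

# Collapse the runs of spaces left behind by the removed words.
cleaned <- gsub("\\s+", " ", cleaned)
print(cleaned)
```

       The wrapper has its own name so that it does not mask 
tm's "stopwords".  The same word list can be passed to 
tm_map(corpus, removeWords, my_stopwords("sl")) for a whole corpus.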


       Hope this helps.
       Spencer Graves
On 5/21/2011 5:59 AM, Matevž Pavlič wrote: