
Help with stemDocument

Alekseiy, I tried your recommendation with several variations, but it still
does not run. I think the problem has to do with R 2.15 and the updated tm
package. Everything runs under R 2.10 with the following code:

library(tm)  # provides Corpus, tm_map and the transformations below
a <- Corpus(VectorSource(df$text))  # create corpus object
a <- tm_map(a, removePunctuation)
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removeWords, stopwords("english"))
a <- tm_map(a, stripWhitespace)
a <- tm_map(a, stemDocument, language = "english")


The same code run under R 2.15 results in:
1. removeWords working sometimes, and sometimes not.
2. stemDocument not working at all.

Both error out. removeWords always stops reading in the stopword list at
the same place (I have added and removed words, with no difference). The
errors and session info are below:
Error in gsub(sprintf("\\b(%s)\\b", paste(words, collapse = "|")), "",  : 
  invalid regular expression
'\b(a|about|above|across|after|again|against|all|almost|alone|along|already|also|although|always|am|among|an|and|another|any|anybody|anyone|anything|anywhere|are|area|areas|aren't|around|as|ask|asked|asking|asks|at|away|b|back|backed|backing|backs|be|became|because|become|becomes|been|before|began|behind|being|beings|below|best|better|between|big|both|but|by|c|came|can|cannot|can't|case|cases|certain|certainly|clear|clearly|come|could|couldn't|d|did|didn't|differ|different|differently|do|does|doesn't|doing|done|don't|down|downed|downing|downs|during|e|each|early|either|end|ended|ending|ends|enough|even|evenly|ever|every|everybody|everyone|everything|everywhere|f|face|faces|fact|facts|far|felt|few|find|finds|first|for|four|from|full|fully|further|furthered|furthering|furthers|g|gave|general|generally|get|gets|give|given|gives|go|going|good|goods|got|great|greater|greatest|group|grouped|grouping|groups|h|had|hadn't|has|hasn't|have|haven't|having|he|he
Error in .jnew(name) : java.lang.ClassNotFoundException

SessionInfo:
R version 2.15.0 (2012-03-30)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    grid      stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] topicmodels_0.1-5 slam_0.1-23       modeltools_0.2-19 lasso2_1.2-12    
 [5] pvclust_1.2-2     stringr_0.6       plyr_1.7.1        Snowball_0.0-8   
 [9] rJava_0.9-3       ggplot2_0.9.0     tm_0.5-7.1        twitteR_0.99.19  
[13] rjson_0.2.8       RCurl_1.91-1.1    bitops_1.0-4.1   

loaded via a namespace (and not attached):
 [1] colorspace_1.1-1   dichromat_1.2-4    digest_0.5.2       MASS_7.3-17       
 [5] memoise_0.1        munsell_0.3        proto_0.3-9.2      RColorBrewer_1.0-5
 [9] reshape2_1.2.1     RWeka_0.4-11       RWekajars_3.7.5-1  scales_0.2.0

Hi Triss,

If you need to stem just one text in the Corpus, use
a[[n]] <- stemDocument(a[[n]])

Best,
-Alex
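
A possible workaround for the removeWords error, offered only as a sketch:
the error message above shows a single gsub() call whose pattern joins the
whole stopword list with "|". If the failure is related to the size of that
pattern (an assumption, not confirmed in this thread and not tested on
R 2.15), removing the words in small batches with perl = TRUE keeps each
regular expression short. The helper name remove_words_chunked and its
chunk_size argument are made up for illustration and are not part of tm.

```r
# Hypothetical helper (not part of tm): remove words in small batches so
# that no single regular expression grows very large.
remove_words_chunked <- function(x, words, chunk_size = 50L) {
  # Split the word list into consecutive batches of at most chunk_size words.
  chunks <- split(words, ceiling(seq_along(words) / chunk_size))
  for (chunk in chunks) {
    # Same pattern shape as in the error message, but matched with PCRE.
    pattern <- sprintf("\\b(%s)\\b", paste(chunk, collapse = "|"))
    x <- gsub(pattern, "", x, perl = TRUE)
  }
  x
}

# Small base-R example on a plain character vector.
txt <- c("this is a test of the stopword removal", "she can't stop now")
out <- remove_words_chunked(txt, c("a", "is", "of", "the", "this", "can't"),
                            chunk_size = 2L)
# Collapse the whitespace left behind by the removed words.
out <- gsub("^ +| +$", "", gsub(" +", " ", out))
```

Since tm_map passes extra arguments on to the transformation, something like
tm_map(a, remove_words_chunked, stopwords("english")) may work in place of
the removeWords step, but whether a plain function is accepted there
depends on the tm version.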