package "tm" fails to remove "the" with remove stopwords
On Thu, Nov 12, 2009 at 11:29:50AM -0500, Mark Kimpel wrote:
I am using code that previously worked to remove stopwords using package "tm".
Thanks for reporting. This is a bug in the removeWords() function in tm version 0.5-1 available from CRAN:
require(tm)
myDocument <- c("the rain in Spain", "falls mainly on the plain", "jack and jill ran up the hill", "to fetch a pail of water")
text.corp <- Corpus(VectorSource(myDocument))
#########################
text.corp <- tm_map(text.corp, stripWhitespace)
text.corp <- tm_map(text.corp, removeNumbers)
text.corp <- tm_map(text.corp, removePunctuation)
## text.corp <- tm_map(text.corp, stemDocument)
text.corp <- tm_map(text.corp, removeWords, c("the", stopwords("english")))
dtm <- DocumentTermMatrix(text.corp)
dtm
dtm.mat <- as.matrix(dtm)
dtm.mat
    Terms
Docs falls fetch hill jack jill mainly pail plain rain ran spain the water
   1     0     0    0    0    0      0    0     0    1   0     1   1     0
   2     1     0    0    0    0      1    0     1    0   0     0   0     0
   3     0     0    1    1    1      0    0     0    0   1     0   0     0
   4     0     1    0    0    0      0    1     0    0   0     0   0     1
The function removeWords() fails to remove patterns at the beginning or at the end of a line. This bug is fixed in the latest development version on R-Forge, and the fix will be included in the next CRAN release.

Please see https://r-forge.r-project.org/plugins/scmsvn/viewcvs.php/pkg/inst/NEWS?root=tm&view=markup for a list of all bug fixes and changes between each tm version.

Best regards,
Ingo Feinerer

Ingo Feinerer
Vienna University of Technology
http://www.dbai.tuwien.ac.at/staff/feinerer
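
For anyone who cannot upgrade right away, here is a minimal base R sketch of how a word at the beginning or end of a line can be missed, and of one possible workaround. It is not tm's actual code: remove_words_naive() and remove_words_boundary() are hypothetical helpers, and the whitespace-delimited pattern is only an assumption about how such a bug could arise.

```r
# Hypothetical helpers, base R only; this is NOT tm's implementation.
docs <- c("the rain in Spain", "falls mainly on the plain")

# Assumed faulty approach: the pattern requires whitespace on both sides,
# so a word at the very start or end of a string is never matched.
remove_words_naive <- function(x, words) {
  pattern <- paste0("\\s(", paste(words, collapse = "|"), ")\\s")
  gsub(pattern, " ", x, perl = TRUE)
}

# Word-boundary approach: \b also matches at the start and end of a string.
remove_words_boundary <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, "", x, perl = TRUE)
}

naive <- remove_words_naive(docs, "the")
fixed <- remove_words_boundary(docs, "the")

print(naive)  # the leading "the" in the first document is still present
print(fixed)  # "the" is gone from both documents
```

The word-boundary version leaves extra whitespace behind where a word was removed, so running stripWhitespace after the removal step (rather than before it) may give cleaner terms.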