I am running R version 2.10.1 Patched (2010-01-07 r50940) in 64-bit
mode under Mac OS X 10.5.8 on a MacBook Pro with 8GB RAM.
I am encountering the following error in RWeka:
Error in .jcall("weka/core/tokenizers/Tokenizer", "[S", "tokenize",
.jcast(tokenizer, : java.lang.StringIndexOutOfBoundsException:
String index out of range: 1
Here is the code that is causing the problem:
library(rJava)
(.jinit(parameters = "-Xmx3000m"))
library(RWeka)
wctrl <- Weka_control(min = 1, max = 4)
lseg.4gram <- lapply(lseg, NGramTokenizer, control = wctrl)
lseg is a list of 965193 sentences, each of which consists of one or
more segments. For example, lseg[[1]] is
[[1]]
[1] "calculation of results xxxx activity is defined as the increase in radioactivity "
[2] "in dpm"
[3] "in the pellet "
[4] "xxx"
[5] ""
[6] "caused by the addition of xx xxxx"
lapply should build 1-, 2-, 3- and 4-grams of each sentence segment.
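To make the expected result concrete, here is a minimal base-R sketch of what "1- to 4-grams of a segment" means. The helper name word_ngrams is hypothetical and is not part of RWeka. It splits on whitespace only, so Weka's NGramTokenizer may differ in its delimiters and in the order of its output.

```r
# Hypothetical helper (base R only, no Java): word n-grams of orders
# min_n..max_n for a single character string.
word_ngrams <- function(x, min_n = 1, max_n = 4) {
  # Strip leading/trailing whitespace, then split on runs of whitespace.
  x <- gsub("^[[:space:]]+|[[:space:]]+$", "", x)
  words <- strsplit(x, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  out <- character(0)
  for (n in min_n:max_n) {
    # A segment with fewer than n words has no n-grams of order n or higher.
    if (length(words) < n) break
    starts <- seq_len(length(words) - n + 1)
    grams <- unlist(lapply(starts, function(i) {
      paste(words[i:(i + n - 1)], collapse = " ")
    }))
    out <- c(out, grams)
  }
  out
}

word_ngrams("in the pellet ")
# "in" "the" "pellet" "in the" "the pellet" "in the pellet"
```

Under this definition a one-word segment such as "xxx" yields just "xxx", and the empty segment "" yields an empty character vector.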
Is there any way to solve or circumvent the error? In Java
Preferences on the Mac, I have ordered the Java versions for
applications as Java SE 6 64-bit, then J2SE 5.0 64-bit, ahead of the
32-bit versions.
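One way to narrow the problem down, offered only as a sketch: skip empty or blank segments and record which remaining segments fail, rather than letting the first exception abort the whole lapply. Whether the empty segment (such as lseg[[1]][5]) is what triggers the exception is an assumption and has not been confirmed. The helper name tokenize_safely and the "tokenize_failure" class are hypothetical.

```r
# Hypothetical helper: apply `tokenizer` to every non-blank segment and
# return a failure record instead of stopping at the first error.
tokenize_safely <- function(segments, tokenizer, ...) {
  # Assumption: empty or whitespace-only segments can simply be skipped.
  keep <- nzchar(gsub("[[:space:]]+", "", segments))
  lapply(segments[keep], function(s) {
    tryCatch(
      tokenizer(s, ...),
      error = function(e) {
        structure(list(segment = s, message = conditionMessage(e)),
                  class = "tokenize_failure")
      }
    )
  })
}

# Intended use with the objects from this post (needs RWeka and a JVM):
# lseg.4gram <- lapply(lseg, tokenize_safely,
#                      tokenizer = NGramTokenizer, control = wctrl)
```

Each result is then a list with one entry per non-blank segment, and entries of class "tokenize_failure" show exactly which input strings the tokenizer rejects.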
(Side remark: I'm surprised that it only produces n-grams for the
first and last segments of the first sentence. Admittedly, the other
segments have fewer than four words, but that should not stop it from
building the shorter n-grams!)
Thanks,
Richard
------
Richard R. Liu
Dittingerstr. 33
CH-4053 Basel
Switzerland
Tel.: +41 61 331 10 47
Email: richard.liu at pueo-owl.ch