R 2.10.0: Error in gsub/calloc
Bert, Thanks for the tip. Yes, strsplit works, and works fast! For me, white-space tokenization means splitting at the white spaces, so the "^" and the outermost square brackets should/can be omitted. Regards ... from Basel to South San Francisco, Richard
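For readers following along, here is a minimal sketch of the white-space split described above, together with the per-sentence token statistics mentioned later in the thread. The vector d below is a small made-up stand-in for the real data, and the names tokens, gaps, token_lengths, mean_len and max_len are chosen only for illustration.

```r
# Toy stand-in for the character vector d from the thread (hypothetical data).
d <- c("This is a real sentence.", "xx xdfg; *&^%kk ", "a  b\tc")

# Split AT runs of white space, so the pieces returned are the tokens.
tokens <- strsplit(d, "[[:space:]]+")

# The negated pattern splits at runs of NON-blank characters instead,
# so the pieces returned are the white-space separators.
gaps <- strsplit(d[2], "[^[:blank:]]+")[[1]]

# Per-sentence token statistics: mean and maximum token length.
token_lengths <- lapply(tokens, nchar)
mean_len <- sapply(token_lengths, mean)
max_len <- sapply(token_lengths, max)

print(tokens[[2]])
print(data.frame(mean_len, max_len))
```

Note that strsplit() returns an empty string as the first token when an element starts with white space, so such elements may need trimming or filtering before the statistics are computed.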
On Nov 3, 2009, at 22:03 , Bert Gunter wrote:
Try:

tokens <- strsplit(d, "[^[:space:]]+")

This splits each "sentence" in your vector into a vector of groups of whitespace characters that you can then play with as you described, I think. (The result is a list of such vectors -- see strsplit().)

## example:
x <- "xx xdfg; *&^%kk "
strsplit(x,"[^[:blank:]]+")
[[1]]
[1] ""  " " " " " "

You might have to use perl = TRUE and "\\w+" depending on your locale and what "[:space:]" does there. If this works, it should be way faster than strapply() and should not have any memory allocation issues either.

HTH.

Bert Gunter
Genentech Nonclinical Biostatistics

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Richard R. Liu
Sent: Tuesday, November 03, 2009 11:32 AM
To: Uwe Ligges
Cc: r-help at r-project.org
Subject: Re: [R] R 2.10.0: Error in gsub/calloc

I apologize for not being clear. d is a character vector of length 158908. Each element in the vector has been designated by sentDetect (package: openNLP) as a sentence. Some of these are really sentences. Others are merely groups of meaningless characters separated by white space.

strapply is a function in the package gsubfn. It applies the regular expression (second argument) to each element of the first argument. Every match is then sent to the designated function (third argument, in my case missing, hence the identity function). Thus, with strapply I am simply performing a white-space tokenization of each sentence. I am doing this in the hope of being able to distinguish true sentences from false ones on the basis of mean length of token, maximum length of token, or similar.

Richard R. Liu
Dittingerstr. 33
CH-4053 Basel
Switzerland
Tel.: +41 61 331 10 47
Email: richard.liu at pueo-owl.ch

On Nov 3, 2009, at 18:30 , Uwe Ligges wrote:
richard.liu at pueo-owl.ch wrote:
I'm running R 2.10.0 under Mac OS X 10.5.8; however, I don't think this is a Mac-specific problem. I have a very large (158,908 possible sentences, ca. 58 MB) plain text document d which I am trying to tokenize: t <- strapply(d, "\\w+", perl = T). I am encountering the following error:
What is strapply() and what is d? Uwe Ligges
Error in base::gsub(pattern, rs, x, ...) :
  Calloc could not allocate (-1398215180 of 1) memory

This happens regardless of whether I run in 32- or 64-bit mode. The machine has 8 GB of RAM, so I can hardly believe that RAM is a problem.

Thanks,
Richard
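As an aside, the "\\w+" extraction in the strapply() call can also be sketched in base R without gsubfn. This assumes a newer R than the 2.10.0 used in the thread, one in which regmatches() is available; the data and the name words are made up for illustration, and the thread does not establish whether this route avoids the allocation error on the full 58 MB input.

```r
# Toy stand-in for d (hypothetical data, not the document from the thread).
d <- c("Hello, world 42!", "xx xdfg; *&^%kk ")

# gregexpr() locates every run of word characters in each element;
# regmatches() then extracts the matched substrings as a list of vectors.
words <- regmatches(d, gregexpr("\\w+", d, perl = TRUE))

print(words)
```

Unlike a split at white space, this keeps only letters, digits and underscores, so punctuation such as ";" or "*&^%" is dropped from the tokens.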
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.