R 2.10.0: Error in gsub/calloc
14 messages · richard.liu at pueo-owl.ch, Uwe Ligges, Richard R. Liu +5 more

I'm running R 2.10.0 under Mac OS X 10.5.8; however, I don't think this is a Mac-specific problem. I have a very large (158,908 possible sentences, ca. 58 MB) plain text document d which I am trying to tokenize:

t <- strapply(d, "\\w+", perl = T)

I am encountering the following error:

Error in base::gsub(pattern, rs, x, ...) :
  Calloc could not allocate (-1398215180 of 1) memory

This happens regardless of whether I run in 32- or 64-bit mode. The machine has 8 GB of RAM, so I can hardly believe that RAM is a problem.

Thanks,
Richard
richard.liu at pueo-owl.ch wrote: [...]

What is strapply() and what is d?

Uwe Ligges

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
I apologize for not being clear. d is a character vector of length 158908. Each element of the vector has been designated by sentDetect (package: openNLP) as a sentence. Some of these are really sentences; others are merely groups of meaningless characters separated by white space. strapply is a function in the package gsubfn. It applies the regular expression (second argument) to each element of the first argument; every match is then passed to the designated function (third argument, in my case missing, hence the identity function). Thus, with strapply I am simply performing a white-space tokenization of each sentence. I am doing this in the hope of being able to distinguish true sentences from false ones on the basis of mean token length, maximum token length, or similar.

Richard R. Liu
Dittingerstr. 33
CH-4053 Basel
Switzerland
Tel.: +41 61 331 10 47
Email: richard.liu at pueo-owl.ch
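[A minimal sketch of the tokenization and token-length statistics described above; variable names here are hypothetical stand-ins for the poster's data:]

```r
library(gsubfn)  # provides strapply

sentences <- c("The cat sat on the mat.", "xx @@ ## yy")  # stand-in for d

# Tokenize each element on word characters; with FUN missing,
# strapply returns the matches themselves (the tokens).
tokens <- strapply(sentences, "\\w+")

# Per-sentence token statistics for separating sentences from noise
mean_len <- sapply(tokens, function(tk) mean(nchar(tk)))
max_len  <- sapply(tokens, function(tk) max(nchar(tk)))
```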
Try the patched version... Maybe it is the same problem I had with a large database when using gsub(). HTH

Kenneth Roy Cabrera Torres
Try:

tokens <- strsplit(d, "[^[:space:]]+")

This splits each "sentence" in your vector into a vector of groups of whitespace characters that you can then play with as you described, I think. (The result is a list of such vectors -- see strsplit().)

## example:
x <- "xx  xdfg; *&^%kk    "
strsplit(x, "[^[:blank:]]+")
[[1]]
[1] ""     "  "   " "    "    "

You might have to use perl = TRUE and "\\w+" depending on your locale and what "[:space:]" does there. If this works, it should be way faster than strapply() and should not have any memory allocation issues either.

HTH.

Bert Gunter
Genentech Nonclinical Biostatistics
Kenneth,

Thanks for the hint. I downloaded and installed the latest patch, but to no avail. I can reproduce the error on a single sentence, the longest in the document. It contains 743,393 characters. It isn't a true sentence, but since it is more than three standard deviations longer than the mean sentence length, I might be able to use the mean and the standard deviation as a way of weeding out the really evident "non-sentences" before I take into account the characteristics of the tokens.

Regards,
Richard
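[That screening rule might be sketched as follows; variable names are hypothetical, with d standing for the sentence vector from the earlier posts:]

```r
len <- nchar(d)                    # length of each candidate sentence
cutoff <- mean(len) + 3 * sd(len)  # three standard deviations above the mean

# Keep only candidates of plausible length before token-level checks
plausible <- d[len <= cutoff]
```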
Here is a more self-contained way to reproduce the problem in 2.10.0 using the prebuilt Windows executable. Putting a trace on gsub in the call to strapply showed that it died in the first call to gsub when the replacement included "\\1" and the string was about 900000 characters long (and included 150000 "words"). It looks like it dies if the string is >= 731248 characters.

d <- substring(paste(collapse=" ", sapply(1:150000, function(i) "abcde")), 1, 731248)
nchar(d)
[1] 731248
substring(d, nchar(d)-10)
[1] " abcde abcd"
p <- gsub("([[:alpha:]]+)", "\\1", d, perl=FALSE)
Error in gsub("([[:alpha:]]+)", "\\1", d, perl = FALSE) :
  Calloc could not allocate (-2146542248 of 1) memory
In addition: Warning messages:
1: In gsub("([[:alpha:]]+)", "\\1", d, perl = FALSE) :
  Reached total allocation of 1535Mb: see help(memory.size)
2: In gsub("([[:alpha:]]+)", "\\1", d, perl = FALSE) :
  Reached total allocation of 1535Mb: see help(memory.size)
p <- gsub("([[:alpha:]]+)", "\\1", d, perl=TRUE)
Error in gsub("([[:alpha:]]+)", "\\1", d, perl = TRUE) :
  Calloc could not allocate (-2146542248 of 1) memory
In addition: Warning messages:
1: In gsub("([[:alpha:]]+)", "\\1", d, perl = TRUE) :
  Reached total allocation of 1535Mb: see help(memory.size)
2: In gsub("([[:alpha:]]+)", "\\1", d, perl = TRUE) :
  Reached total allocation of 1535Mb: see help(memory.size)

Make d one character shorter and it succeeds with either perl=TRUE or perl=FALSE.

version
               _
platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          10.0
year           2009
month          10
day            26
svn rev        50208
language       R
version.string R version 2.10.0 (2009-10-26)

sessionInfo()
R version 2.10.0 (2009-10-26)
i386-pc-mingw32

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] tcltk_2.10.0

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
Note that you don't need perl = T, since by default strapply uses tcl regular expressions and they support \w. What happens if you omit the perl = T? Also, please specify the version of gsubfn you are using, and if it's not the latest, try it with the latest version.
This seems to be simply integer overflow in a calculation. Changed in R-patched to use doubles. The issue I patched for Kenneth Roy Cabrera was for perl = FALSE only.
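[The negative sizes in the error messages are consistent with such an overflow: a byte count larger than 2^31 - 1 wraps to a negative value when stored in a signed 32-bit integer. A small sketch of the arithmetic; the actual internal calculation in gsub is not shown here:]

```r
# A requested allocation just over the signed 32-bit range...
requested <- 2^32 - 1398215180   # i.e. 2896752116 bytes, > 2^31 - 1

# ...wraps around modulo 2^32 to the negative value
# reported in the error message:
requested - 2^32
# [1] -1398215180
```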
On Tue, 3 Nov 2009, William Dunlap wrote:
Here is a more self-contained way to reproduce the problem in 2.10.0 using the prebuilt Windows executable. Putting a trace on gsub in the call to strapply showed that it died in the first call to gsub when the replacement included "\\1" and the string was about 900000 characters long (and included 150000 "words"). It looks like it dies if the string is >= 731248 characters.
d<-substring(paste(collapse=" ", sapply(1:150000,function(i)"abcde")), 1, 731248) nchar(d)
[1] 731248
substring(d, nchar(d)-10)
[1] " abcde abcd"
p<-gsub("([[:alpha:]]+)", "\\1", d, perl=FALSE)
Error in gsub("([[:alpha:]]+)", "\\1", d, perl = FALSE) :
Calloc could not allocate (-2146542248 of 1) memory
In addition: Warning messages:
1: In gsub("([[:alpha:]]+)", "\\1", d, perl = FALSE) :
Reached total allocation of 1535Mb: see help(memory.size)
2: In gsub("([[:alpha:]]+)", "\\1", d, perl = FALSE) :
Reached total allocation of 1535Mb: see help(memory.size)
p<-gsub("([[:alpha:]]+)", "\\1", d, perl=TRUE)
Error in gsub("([[:alpha:]]+)", "\\1", d, perl = TRUE) :
Calloc could not allocate (-2146542248 of 1) memory
In addition: Warning messages:
1: In gsub("([[:alpha:]]+)", "\\1", d, perl = TRUE) :
Reached total allocation of 1535Mb: see help(memory.size)
2: In gsub("([[:alpha:]]+)", "\\1", d, perl = TRUE) :
Reached total allocation of 1535Mb: see help(memory.size)
Make d one character shorter and it succeeds with either
perl=TRUE or perl=FALSE.
version
_ platform i386-pc-mingw32 arch i386 os mingw32 system i386, mingw32 status major 2 minor 10.0 year 2009 month 10 day 26 svn rev 50208 language R version.string R version 2.10.0 (2009-10-26)
sessionInfo()
R version 2.10.0 (2009-10-26) i386-pc-mingw32 locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base loaded via a namespace (and not attached): [1] tcltk_2.10.0 Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Richard R. Liu Sent: Tuesday, November 03, 2009 3:00 PM To: Kenneth Roy Cabrera Torres Cc: r-help at r-project.org; Uwe Ligges Subject: Re: [R] R 2.10.0: Error in gsub/calloc Kenneth, Thanks for the hint. I downloaded and installed the latest patch, but to no avail. I can reproduce the error on a single sentence, the longest in the document. It contains 743,393 characters. It isn't a true sentence, but since it is more than three standard deviations longer than the mean sentence length, I might be able to use the mean and the standard deviation as a way of weeding ot the really evident "non-sentences" before I take into account the characteristics of the the tokens. Regards, Richard On Nov 3, 2009, at 20:44 , Kenneth Roy Cabrera Torres wrote:
Try the patch version... Maybe is the same problem I had with large database when using gsub() HTH El mar, 03-11-2009 a las 20:31 +0100, Richard R. Liu escribi?:
I apologize for not being clear. d is a character vector of length 158908. Each element in the vector has been designated by
sentDetect
(package: openNLP) as a sentence. Some of these are really sentences. Others are merely groups of meaningless characters separated by white space. strapply is a function in the package gosubfn. It applies to each element of the first argument the regular expression (second argument). Every match is then sent to the designated function (third argument, in my case missing, hence the identity function). Thus, with strapply I am simply performing a white-space tokenization of each sentence. I am doing this in the hope of being able to distinguish true sentences from false ones on the basis of mean length of token, maximum length of token, or similar. Richard R. Liu Dittingerstr. 33 CH-4053 Basel Switzerland Tel.: +41 61 331 10 47 Email: richard.liu at pueo-owl.ch On Nov 3, 2009, at 18:30 , Uwe Ligges wrote:
richard.liu at pueo-owl.ch wrote:
I'm running R 2.10.0 under Mac OS X 10.5.8; however, I
don't think
this is a Mac-specific problem. I have a very large (158,908 possible sentences, ca. 58 MB) plain text document d which I am trying to tokenize: t <- strapply(d, "\\w+", perl = T). I am encountering the following error:
What is strapply() and what is d? Uwe Ligges
Error in base::gsub(pattern, rs, x, ...) : Calloc could not allocate (-1398215180 of 1) memory This happens regardless of whether I run in 32- or
64-bit mode.
The machine has 8 GB of RAM, so I can hardly believe that RAM is a problem. Thanks, Richard
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. --Apple-Mail-8--203371287-- ______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595
I am using gsubfn 0.5-0. When I do not specify perl = TRUE I now get the following error on the same document:

Error in structure(.External("dotTcl", ..., PACKAGE = "tcltk"), class = "tclObj") :
  [tcl] bad index "1e+05": must be integer?[+-]integer? or end?[+-]integer?.

Regards,
Richard
Bert,

Thanks for the tip. Yes, strsplit works, and works fast! For me, white-space tokenization means splitting at the white spaces, so the "^" and the outermost square brackets should/can be omitted.

Regards ... from Basel to South San Francisco,
Richard
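[A minimal sketch of that adjusted call: splitting on runs of whitespace returns the tokens themselves rather than the whitespace groups.]

```r
x <- "The cat  sat on   the mat."

# Splitting on runs of whitespace yields the tokens directly
tokens <- strsplit(x, "[[:space:]]+")[[1]]
tokens
# [1] "The" "cat" "sat" "on" "the" "mat."
```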
Note that strapply without perl = TRUE runs an order of magnitude faster than with perl = TRUE and accepts nearly the same set of regular expressions anyway, since its default is tcl regular expressions. strsplit should still be fastest where it applies, since splitting is its only purpose.
Gabor,

What about the error message that I got with strapply? That seemed to be the same kind of problem (i.e., integer overflow of an index) as with gsub.

Regards,
Richard
Note that strapply without perl = TRUE runs an order of magnitude faster than with perl = TRUE and takes nearly the same set of regular expressions anyways since its default is tcl regular expressions. strsplit should still be fastest where it applies since splitting is its only purpose. On Fri, Nov 6, 2009 at 1:43 AM, Richard R. Liu <richard.liu at pueo- owl.ch> wrote:
Bert, Thanks for the tip. ?Yes, strsplit works, and works fast! ?For me, white-space tokenization means splitting at the white spaces, so the "^" and the outermost square brackets should/can be omitted. Regards ... from Basel to South San Francisco, Richard On Nov 3, 2009, at 22:03 , Bert Gunter wrote:
Try:

tokens <- strsplit(d, "[^[:space:]]+")

This splits each "sentence" in your vector into a vector of groups of whitespace characters that you can then play with as you described, I think. (The result is a list of such vectors -- see ?strsplit.)

## example:
x <- "xx  xdfg; *&^%kk    "
strsplit(x, "[^[:blank:]]+")
[[1]]
[1] ""     "  "   " "    "    "

You might have to use perl = TRUE and "\\w+" depending on your locale and what "[:space:]" does there. If this works, it should be way faster than strapply() and should not have any memory allocation issues either. HTH.

Bert Gunter
Genentech Nonclinical Biostatistics

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Richard R. Liu
Sent: Tuesday, November 03, 2009 11:32 AM
To: Uwe Ligges
Cc: r-help at r-project.org
Subject: Re: [R] R 2.10.0: Error in gsub/calloc

I apologize for not being clear. d is a character vector of length 158908. Each element in the vector has been designated by sentDetect (package: openNLP) as a sentence. Some of these are really sentences; others are merely groups of meaningless characters separated by white space. strapply is a function in the package gsubfn. It applies to each element of the first argument the regular expression (second argument). Every match is then passed to the designated function (third argument; in my case missing, hence the identity function). Thus, with strapply I am simply performing a white-space tokenization of each sentence. I am doing this in the hope of being able to distinguish true sentences from false ones on the basis of mean length of token, maximum length of token, or similar.

Richard R. Liu
Dittingerstr. 33
CH-4053 Basel
Switzerland
Tel.: +41 61 331 10 47
Email: richard.liu at pueo-owl.ch

On Nov 3, 2009, at 18:30, Uwe Ligges wrote:
richard.liu at pueo-owl.ch wrote:
I'm running R 2.10.0 under Mac OS X 10.5.8; however, I don't think this is a Mac-specific problem. I have a very large (158,908 possible sentences, ca. 58 MB) plain text document d which I am trying to tokenize: t <- strapply(d, "\\w+", perl = T). I am encountering the following error:
What is strapply() and what is d? Uwe Ligges
Error in base::gsub(pattern, rs, x, ...) : Calloc could not allocate (-1398215180 of 1) memory This happens regardless of whether I run in 32- or 64-bit mode. The machine has 8 GB of RAM, so I can hardly believe that RAM is a problem. Thanks, Richard
-- Richard R. Liu Dittingerstr. 33 CH-4053 Basel Switzerland Tel.: +41 61 331 10 47 Email: richard.liu at pueo-owl.ch
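The whitespace-splitting approach settled on above leads directly to the per-sentence statistics Richard mentions (mean and maximum token length). A minimal sketch, with a toy sentence vector standing in for the real data:

```r
## Tokenize each "sentence" at runs of whitespace, then compute the
## per-sentence token-length statistics used to screen out non-sentences.
d <- c("This is a real sentence.", "x@#  qq")   # toy stand-in
tokens <- strsplit(d, "[[:space:]]+")
mean_len <- sapply(tokens, function(tk) mean(nchar(tk)))
max_len  <- sapply(tokens, function(tk) max(nchar(tk)))
```

A threshold on mean_len or max_len (e.g. flagging elements whose "tokens" are implausibly long) could then separate real sentences from character noise.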
I will have a look at it this weekend if you can give me sufficient info to reproduce it. I noticed there was an attachment on one of your emails and it seems to be some sort of binary file with no accompanying description.
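If the overflow really is triggered by the size of the whole vector in a single C-level call, one possible workaround (untested here against the original data) is to process the sentence vector in chunks, so that no single call sees all 158,908 elements at once:

```r
## Hedged workaround sketch: tokenize the sentence vector in chunks.
## Chunk size and data are illustrative, not tuned.
d <- sprintf("sentence number %d with words", 1:10)   # stand-in data
chunks <- split(d, ceiling(seq_along(d) / 4))         # chunks of <= 4 sentences
tokens <- unlist(lapply(chunks, strsplit, "[[:space:]]+"),
                 recursive = FALSE, use.names = FALSE)
```

Each element of tokens is then the token vector of one sentence, in the original order.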