
R 2.10.0: Error in gsub/calloc

Thu, Nov 5, 2009 10:43 PM

Bert,

Thanks for the tip.  Yes, strsplit works, and works fast!  For me,  
white-space tokenization means splitting at the white spaces, so the  
"^" and the outermost square brackets should/can be omitted.

Regards ... from Basel to South San Francisco,
Richard
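
A minimal sketch of the whitespace split described above. It assumes the intended pattern is "[[:space:]]+": a POSIX class such as [:space:] is only valid inside a bracket expression, so one pair of brackets has to stay once the "^" is dropped. The sample string is the one from Bert's example below.

```r
# Sample string taken from the example further down in this thread
x <- "xx  xdfg; *&^%kk    "

# Split AT runs of whitespace, so the pieces returned are the tokens themselves
tokens <- strsplit(x, "[[:space:]]+")

# strsplit() returns a list with one character vector per input element
print(tokens[[1]])
```

Trailing whitespace does not produce an empty final token, but leading whitespace would produce an empty first one.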

On Nov 3, 2009, at 22:03 , Bert Gunter wrote:

Try:

tokens <- strsplit(d,"[^[:space:]]+")

This splits each "sentence" in your vector into a vector of groups of
whitespace characters that you can then play with as you described, I think.
(The result is a list of such vectors -- see strsplit().)

## example:

x <- "xx  xdfg; *&^%kk    "

strsplit(x,"[^[:blank:]]+")

[[1]]
[1] ""     "  "   " "    "    "


You might have to use perl = TRUE and "\\w+" depending on your locale and
what "[:space:]" does there.

If this works, it should be way faster than strapply() and should not have
any memory allocation issues either.
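
For comparison, a small sketch of how the perl = TRUE variants behave on the same sample string. The "\\S+" pattern is an assumption added here as the PCRE counterpart of "[^[:space:]]+"; it was not suggested in the message. Note that "\\w+" is not equivalent: it also treats punctuation as part of the gaps between tokens.

```r
x <- "xx  xdfg; *&^%kk    "

# Default regex engine with a POSIX character class (pattern from the message)
gaps_posix <- strsplit(x, "[^[:space:]]+")[[1]]

# PCRE engine; \S is the Perl shorthand for "not whitespace" (assumed equivalent)
gaps_perl <- strsplit(x, "\\S+", perl = TRUE)[[1]]

# PCRE engine with \w+: only letters, digits and "_" count as token characters,
# so punctuation ends up in the pieces between the matches
gaps_word <- strsplit(x, "\\w+", perl = TRUE)[[1]]

print(gaps_posix)
print(gaps_perl)
print(gaps_word)
```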

HTH.

Bert Gunter
Genentech Nonclinical Biostatistics



-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Richard R. Liu
Sent: Tuesday, November 03, 2009 11:32 AM
To: Uwe Ligges
Cc: r-help at r-project.org
Subject: Re: [R] R 2.10.0: Error in gsub/calloc

I apologize for not being clear.  d is a character vector of length
158908.  Each element in the vector has been designated by sentDetect
(package: openNLP) as a sentence.  Some of these are really
sentences.  Others are merely groups of meaningless characters
separated by white space.  strapply is a function in the package
gsubfn.  It applies the regular expression (second argument) to each
element of the first argument.  Every match is then sent to the
designated function (third argument, in my case missing, hence the
identity function).  Thus, with strapply I am simply performing a
white-space tokenization of each sentence.  I am doing this in the
hope of being able to distinguish true sentences from false ones on
the basis of mean length of token, maximum length of token, or
similar.
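
The per-sentence statistics mentioned above could be computed roughly as follows. This is only a sketch: the three-element vector d is a hypothetical stand-in for the real 158908-element vector, the column names are made up for illustration, and the whitespace pattern "[[:space:]]+" is an assumption.

```r
# Hypothetical stand-in for the real character vector d
d <- c("This is a real sentence.", "x q 7 ; zz k", "   ")

# Token lengths per sentence; nzchar() drops the "" that leading blanks produce
token_lengths <- lapply(strsplit(d, "[[:space:]]+"), function(tok) {
  nchar(tok[nzchar(tok)])
})

# One row per candidate sentence; NA where no tokens were found
stats <- data.frame(
  n_tokens = vapply(token_lengths, length, integer(1)),
  mean_len = vapply(token_lengths,
                    function(n) if (length(n)) mean(n) else NA_real_,
                    numeric(1)),
  max_len  = vapply(token_lengths,
                    function(n) if (length(n)) as.numeric(max(n)) else NA_real_,
                    numeric(1))
)
print(stats)
```

Any threshold on mean_len or max_len for separating true sentences from false ones would still have to be chosen by inspecting the real data.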

Richard R. Liu
Dittingerstr. 33
CH-4053 Basel
Switzerland

Tel.:  +41 61 331 10 47
Email:  richard.liu at pueo-owl.ch


On Nov 3, 2009, at 18:30 , Uwe Ligges wrote:


richard.liu at pueo-owl.ch wrote:

I'm running R 2.10.0 under Mac OS X 10.5.8; however, I don't think this
is a Mac-specific problem.
I have a very large (158,908 possible sentences, ca. 58 MB) plain text
document d which I am trying to tokenize:  t <- strapply(d, "\\w+", perl = T).
I am encountering the following error:


What is strapply() and what is d?

Uwe Ligges

Error in base::gsub(pattern, rs, x, ...) :
Calloc could not allocate (-1398215180 of 1) memory
This happens regardless of whether I run in 32- or 64-bit mode.  The
machine has 8 GB of RAM, so I can hardly believe that RAM is a problem.
Thanks,
Richard
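
For readers without gsubfn, the same kind of "\\w+" match extraction can be sketched in base R with gregexpr() and regmatches(). This is offered only as an illustration of what the strapply() call computes, not as a fix confirmed anywhere in this thread. regmatches() was, as far as I know, added in a later release than the R 2.10.0 discussed here, and the two-element vector d is a hypothetical stand-in.

```r
# Hypothetical stand-in for the real character vector d
d <- c("xx  xdfg; *&^%kk    ", "Another sentence here.")

# Locate every run of word characters, then pull the matched substrings out;
# the result is a list with one character vector of matches per element of d
words <- regmatches(d, gregexpr("\\w+", d, perl = TRUE))

print(words)
```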

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
