Memory management in R
On Oct 9, 2010, at 4:23 PM, Lorenzo Isella wrote:
My suggestion is to explore other alternatives. (I will admit that I don't yet fully understand the test that you are applying.)
Hi, I am trying to partially implement the Lempel Ziv compression algorithm. The point is that compressibility and entropy of a time series are related, hence my final goal is to evaluate the entropy of a time series. You can find more at http://bit.ly/93zX4T http://en.wikipedia.org/wiki/LZ77_and_LZ78 http://bit.ly/9NgIFt The two that
have occurred to me are Biostrings which I have already mentioned and rle() which I have illustrated the use of but not referenced as an avenue. The Biostrings package is part of bioConductor (part of the R universe) although you should be prepared for a coffee break when you install it if you haven't gotten at least bioClite already installed. When I installed it last night it had 54 other package dependents also downloaded and installed. It seems to me that taking advantage of the coding resources in the molecular biology domain that are currently directed at decoding the information storage mechanism of life might be a smart strategy. You have not described the domain you are working in but I would guess that the "digest" package might be biological in primary application? So forgive me if I am preaching to the choir. The rle option also occurred to me but it might take a smarter coder than I to fully implement it. (But maybe Holtman would be up to it. He's a _lot_ smarter than I.) In your example the long "x" string is faithfully represented by two aligned vectors, each 197 characters in length. The long repeat sequence that broke the grepl mechanism are just one pair of values.
rle(x)
Run Length Encoding lengths: int [1:197] 1 1 2 1 1 4 1 9 1 1 ... values : chr [1:197] "5d64d58a" "ac76183b" "202fbcc4" "78087f5e" ... So maybe as soon as you got to a bundle that was greater than 1/2 the overall length (as happened in the "x" case) you could stop, since it could not have "occurred before".
I doubt that rle() can be deployed to replace Lempel-Ziv (LZ)
algorithm in a trivial way. As a less convoluted example, consider
the series
x <- c("d","a","b","d","a","b","e","z")
If i=4 and therefore the i-th element is the second 'd' in the
series, the shortest series starting from i=4 that I do not see in
the past of 'd' is
"d","a","b","e", whose length is equal to 4 and that is the value
returned by the function below.
The frustrating thing is that I already have the tools I need, just
they crash for reasons beyond my control on relatively short series.
If anyone can make the function below more robust, that is really a
big help for me.
I already offered the Biostrings package. It provides more robust methods for string matching than does grepl. Is there a reason that you choose not to?
David.
> Cheers
>
> Lorenzo
>
> ###########################################################
> entropy_lz <- function(x,i){
>
> past <- x[1:i-1]
>
> n <- length(x)
>
> lp <- length(past)
>
> future <- x[i:n]
>
> go_on <- 1
>
> count_len <- 0
>
> past_string <- paste(past, collapse="#")
>
> while (go_on>0){
>
> new_seq <- x[i:(i+count_len)]
>
> fut_string <- paste(new_seq, collapse="#")
>
> count_len <- count_len+1
>
> if (grepl(fut_string,past_string)!=1){
>
> go_on <- -1
>
> }
> }
> return(count_len)
>
> }
>
> x <- c("c","a","b","c","a","b","e","z")
>
> S <- entropy_lz(x,4)
David Winsemius, MD
West Hartford, CT