Memory management in R

Sat, Oct 9, 2010 4:57 PM

On Oct 9, 2010, at 4:23 PM, Lorenzo Isella wrote:

My suggestion is to explore other alternatives. (I will admit that I
don't yet fully understand the test that you are applying.)

Hi,
I am trying to partially implement the Lempel Ziv compression  
algorithm.
The point is that compressibility and entropy of a time series are  
related, hence my final goal is to evaluate the entropy of a time  
series.
You can find more at

http://bit.ly/93zX4T
http://en.wikipedia.org/wiki/LZ77_and_LZ78
http://bit.ly/9NgIFt




The two that

have occurred to me are Biostrings which I have already mentioned and
rle() which I have illustrated the use of but not referenced as an
avenue. The Biostrings package is part of bioConductor (part of the R
universe) although you should be prepared for a coffee break when you
install it if you haven't gotten at least bioClite already installed.
When I installed it last night it had 54 other package dependents  
also
downloaded and installed. It seems to me that taking advantage of the
coding resources in the molecular biology domain that are currently
directed at decoding the information storage mechanism of life  
might be
a smart strategy. You have not described the domain you are working  
in
but I would guess that the "digest" package might be biological in
primary application? So forgive me if I am preaching to the choir.

The rle option also occurred to me but it might take a smarter coder
than I to fully implement it. (But maybe Holtman would be up to it.  
He's
a _lot_ smarter than I.) In your example the long "x" string is
faithfully represented by two aligned vectors, each 197 characters in
length. The long repeat sequence that broke the grepl mechanism are  
just
one pair of values.

rle(x)

Run Length Encoding
lengths: int [1:197] 1 1 2 1 1 4 1 9 1 1 ...
values : chr [1:197] "5d64d58a" "ac76183b" "202fbcc4" "78087f5e" ...

So maybe as soon as you got to a bundle that was greater than 1/2 the
overall length (as happened in the "x" case) you could stop, since it
could not have "occurred before".

I doubt that rle() can be deployed to replace Lempel-Ziv (LZ)  
algorithm in a trivial way. As a less convoluted example, consider  
the series

x <- c("d","a","b","d","a","b","e","z")

If i=4 and therefore the i-th element is the second 'd' in the  
series, the shortest series starting from i=4 that I do not see in  
the past of 'd' is

"d","a","b","e", whose length is equal to 4 and that is the value  
returned by the function below.
The frustrating thing is that I already have the tools I need, just  
they crash for reasons beyond my control on relatively short series.
If anyone can make the function below more robust, that is really a  
big help for me.

I already offered the Biostrings package. It provides more robust  
methods for string matching than does grepl. Is there a reason that  
you choose not to?

David.
> Cheers
>
> Lorenzo
>
> ###########################################################
> entropy_lz <- function(x,i){
>
> past <- x[1:i-1]
>
> n <- length(x)
>
> lp <- length(past)
>
> future <- x[i:n]
>
> go_on <- 1
>
> count_len <- 0
>
> past_string <- paste(past, collapse="#")
>
> while (go_on>0){
>
> new_seq <- x[i:(i+count_len)]
>
> fut_string <- paste(new_seq, collapse="#")
>
> count_len <- count_len+1
>
> if (grepl(fut_string,past_string)!=1){
>
> go_on <- -1
>
> }
> }
> return(count_len)
>
> }
>
> x <- c("c","a","b","c","a","b","e","z")
>
> S <- entropy_lz(x,4)

David Winsemius, MD
West Hartford, CT

Memory management in R

Thread (20 messages)