Compressing String in R
Since you only have 4 distinct characters, you can create a table of all 256 (4^4) combinations of 4 of them; each group of 4 characters then reduces to one byte instead of 4. This is fine if you just want to store them.
x <- expand.grid(c("A", "C", "G", "T"),
                 c("A", "C", "G", "T"),
                 c("A", "C", "G", "T"),
                 c("A", "C", "G", "T"))
gene.table <- apply(x, 1, paste, collapse='')

# convert the string (right now this assumes its length is a multiple of 4;
# more logic is needed if it is not)
gene <- "ACGATACGGCGACCACCGAGATCTACACTCTTCCCC"

# break into 4 character strings
start <- seq(1, by=4, to=nchar(gene))
strings <- mapply(substr, gene, start, start+3)

# create new compressed string
comp <- as.raw(match(strings, gene.table) - 1)

# convert back
paste(gene.table[as.integer(comp) + 1], collapse='')
[1] "ACGATACGGCGACCACCGAGATCTACACTCTTCCCC"
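
The code above assumes the string length is a multiple of 4. One possible way to handle other lengths, sketched here as an assumption rather than as part of the original answer, is to pad the string up to the next multiple of 4 and keep the original length so the padding can be dropped again. The helper names compress_gene() and decompress_gene() are hypothetical, padding with "A" is an arbitrary choice, and the sketch assumes a non-empty input containing only A, C, G and T.

```r
# Sketch only: compress_gene() and decompress_gene() are hypothetical helpers,
# not functions from the original reply or from any package.
bases <- c("A", "C", "G", "T")
gene.table <- apply(expand.grid(bases, bases, bases, bases), 1, paste, collapse = "")

compress_gene <- function(gene) {
  n <- nchar(gene)
  pad <- (4 - n %% 4) %% 4                   # bases needed to reach a multiple of 4
  padded <- paste0(gene, strrep("A", pad))   # "A" as filler is an arbitrary choice
  start <- seq(1, nchar(padded), by = 4)
  chunks <- substring(padded, start, start + 3)
  # one byte per 4-base chunk, plus the true length for decoding
  list(bytes = as.raw(match(chunks, gene.table) - 1), length = n)
}

decompress_gene <- function(x) {
  full <- paste(gene.table[as.integer(x$bytes) + 1], collapse = "")
  substr(full, 1, x$length)                  # drop the padding again
}

gene <- "ACGATACGGCGACCACCGAGATCTACACTCTTCC"  # 34 characters, from the question
comp <- compress_gene(gene)
length(comp$bytes)                            # number of bytes used
identical(decompress_gene(comp), gene)        # round-trip check
```

Storing the length next to the bytes costs a little extra space per string; with billions of strings of one fixed, known length, it could be kept once instead of per string.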
On Wed, Dec 24, 2008 at 10:26 AM, Gundala Viswanath <gundalav at gmail.com> wrote:
Dear all,

What's the R way to compress the string into smaller 2~3 char/digit length.

In particular I want to compress string of length >=30 characters, e.g.

ACGATACGGCGACCACCGAGATCTACACTCTTCC

The reason I want to do that is because, there are billions of such string I want to print out. And I need to save disk space.

- Gundala Viswanath
Jakarta - Indonesia
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?