Skip to content

[Bioc-devel] [Rd] Read a text file into R with .Call()

1 message · Hervé Pagès

#
Hi Ge,
On 06/28/2013 04:45 AM, Ge Tan wrote:
So you have a text file with one DNA sequence per line. I wonder what
kind of file is that. Wouldn't it be easier if you managed to store
your DNA sequences in FASTA format so you could just reuse
readDNAStringSet() to load it? If you leave the record headers
empty, the FASTA file won't be significantly bigger than the text file
with one DNA sequence per line.
Chars stored in a DNAStringSet need to be encoded.

The encoding scheme is:

   - DNA base letters A, C, G, and T are associated to bits 0, 1, 2,
     and 3, respectively. So A -> 0x1, C -> 0x2, G -> 0x4, T -> 0x8.

   - IUPAC ambiguity codes are represented by a bit pattern that
     combines the bits of the base letters that they are associated
     with. For example, R is associated with A and G so is represented
     by bits 0 and 2 set to 1 and all other bits set to 0. So R -> 0x5.

   - Special letters - and + are represented by 0x10 and 0x20,
     respectively.

This encoding scheme is summarized by the following lookup table:

   > lkup <- get_seqtype_conversion_lookup("B", "DNA")
   > lkup
     [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 
NA NA NA NA
    [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 32 NA 16 
NA NA NA NA
    [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA  1 14  2 13 NA NA 
  4 11 NA NA
    [76] 12 NA  3 15 NA NA NA  5  6  8 NA  7  9 NA 10 NA NA NA NA NA NA 
NA  1 14  2
   [101] 13 NA NA  4 11 NA NA 12 NA  3 15 NA NA NA  5  6  8 NA  7  9 NA 10

'lkup' is an integer vector that allows fast encoding from ASCII to
DNA (accepts lower-case DNA letters):

   > lkup[as.integer(charToRaw("ACGTRa-+Xz")) + 1L]
   [1]  1  2  4  8  5  1 16 32 NA NA

To use this lookup table at the C level, just pass it as an extra arg
to your .Call entry point. Then, at the C level, encode the incoming
bytes with something like:

     int lkup_len, key, val;
     char encoded_byte;

     ...
     lkup_len = LENGTH(lkup);
     key = (unsigned char) incoming_byte;
     if (key >= lkup_len || (val = INTEGER(lkup)[key]) == NA_INTEGER) {
         /* Make sure to UNPROTECT() whetever is currently PROTECTed,
            and to free() whatever user-controlled memory is currently
            allocated before you call error(). */
         error("invalid char: '%c'", incoming_byte);
     }
     encoded_byte = (char) val;
     ...

HTH,
H.