Skip to content

[Bioc-devel] Creating Biostrings DNAStringSet from Rcpp/C++

3 messages · Sean Davis, Michael Lawrence, Hervé Pagès

#
Forgive the really naive questions, but my C/C++ and Biostrings skills 
are pretty rough.  I have been playing with Rcpp to interface to an 
external library for reading sequence formats.  I am reading DNA 
sequences one-at-a-time and would like to package a set of those up as a 
DNAStringSet. I have at least two options:

1) Return a character vector back to R and then create a DNAStringSet in 
R.  This works now and works fine.
2) Create the DNAStringSet in the cpp code.

Are there advantages to using choice #2?  If so, where would I start to 
get the necessary Rcpp/Biostrings glue in place? I know how to make new 
S4 objects in Rcpp, so it is really a question of knowing what class 
hierarchy I need to instantiate and, roughly, how.  Any pointers or code 
examples?

Thanks,
Sean
#
You're going to want to use the Biostrings C API. You can see
TwoBitFile_read() in src/twoBit.c in rtracklayer for an example. It's
not trivial.

Michael
On Tue, Jan 5, 2016 at 5:21 AM, Sean Davis <seandavi at gmail.com> wrote:
#
Hi Sean,

The advantage of using choice #2 is better performance (memory
and speed).

In addition to Michael's pointer, there is this post

   https://stat.ethz.ch/pipermail/r-devel/2013-June/066890.html

on R-devel that provides some details and pseudo-code. Unfortunately
a few things have changed since 2013:

   1) You need to replace:

        cachedXVectorList        with   XVectorList_holder
        cachedCharSeq            with   Chars_holder
        cache_XVectorList        with   hold_XVectorList
        get_cachedXRawList_elt   with   get_elt_from_XRawList_holder

   2) For readability you probably want to rename the variables
      accordingly. So replace:

        cached_ans               with   ans_holder
        cached_ans_elt           with   ans_elt_older

   3) The 'seq' member of a Chars_holder struct was renamed 'ptr'.
      So replace:

        cached_ans_elt->seq      with   ans_elt_older->ptr

Also since you want the result in a DNAStringSet object, you'll need
to replace any occurrence of BStringSet with DNAStringSet. One extra
difficulty with DNAStringSet objects is that the incoming string data
needs to be encoded before it's stored in the object. This encoding
can be performed by replacing

   memcpy((char *) cached_ans_elt->seq, foo,
          INTEGER(width)[i] * sizeof(char));

with

   Ocopy_bytes_to_i1i2_with_lkup(0, ans_elt_holder.length - 1,
         (char *) ans_elt_holder.ptr, ans_elt_holder.length,
         foo, INTEGER(width)[i] * sizeof(char),
         INTEGER(lkup), LENGTH(lkup));

where 'lkup' is an integer vector that you need to prepare at the
R level with:

   lkup <- get_seqtype_conversion_lookup("B", "DNA")

and that you then need to pass all the way down to the
read_text_file_in_DNAStringSet() function as an extra argument:

SEXP read_text_file_in_DNAStringSet(FILE *FN, SEXP lkup)
{
     ...
}

Let me know if you need further help with this.

H.
On 01/05/2016 05:31 AM, Michael Lawrence wrote: