[Bioc-devel] C library or C package API for regular expressions

8 messages · Jiří Hon, Charles Determan, Martin Morgan +2 more

#
Dear package developers,

I would like to ask you for advice. What is the most seamless way to use
regular expressions in the C/C++ code of an R/Bioconductor package? Is it
allowed to bundle a C/C++ library for that (like PCRE or Boost.Regex)? Or
is there an existing C API in some package that I can depend on and
import?

Thank you very much for your attention, and please have a nice day :)

Jiri Hon
#
Hi Jiri,

I believe you can use the BH package. It contains most of the Boost headers.

Regards,
Charles
On Saturday, January 23, 2016, Jiří Hon <xhonji01 at stud.fit.vutbr.cz> wrote:

1 day later
#
Hi Charles,

Thank you very much for the helpful hint. There is still one thing I'm
not sure about: the Boost manual says that Boost.Regex is not header-only
[1]. Since the BH package contains only headers, I would have to bundle
the Boost.Regex library into the package code anyway. Am I right?

Jiri

[1] http://www.boost.org/doc/libs/1_60_0/more/getting_started/unix-variants.html#header-only-libraries

#
There is discussion at

    http://stackoverflow.com/questions/23556205/using-boost-regex-with-rcpp

pointing to

    http://gallery.rcpp.org/articles/boost-regular-expressions/

There is a Bioconductor example that bundles the regex library, in flowCore/src/

    https://github.com/Bioconductor-mirror/flowCore

A second example is in the mzR package.

A real question is, do you really need this functionality at the C level?

A secondary question: if several packages are using this functionality, then perhaps the library could be bundled separately and made available just once. zlibbioc does something like this (sort of; zlib is only needed on Windows). The flowCore and mzR maintainers (cc'd) might be a valuable resource in this regard.

Martin
#
Hi Martin,

On 25.1.2016 at 13:08, Morgan, Martin wrote:
Thank you for pointing me to the flowCore and mzR packages; these
examples are really helpful.

I think it's unavoidable in my case for performance reasons. I'm trying
to detect all possible overlapping motifs in DNA that are composed of
elements matching some regular expression.

Efficient regexp algorithms seem useful to me for solving many
bioinformatics problems, so it would be natural to have a package with a
C API to the most efficient regexp libraries.

On 23.1.2016 at 13:35, Charles Determan wrote:
#
The static library is available for linking once flowCore is installed,
but the headers come from the BH package. So you need to add BH to the
'LinkingTo' field of the DESCRIPTION file and point to flowCore's compiled
static library through 'PKG_LIBS' in your Makevars file. Use
'flowCore:::LdFlags()' to generate the linking path automatically. See
the example in 'flowWorkspace/src/Makevars.in'.

Once Bioconductor upgrades gcc to 4.9, all of this will be unnecessary.
See https://github.com/RGLab/flowWorkspace/issues/160.

Mike
On 01/25/2016 04:08 AM, Morgan, Martin wrote:
#
Hi Jiri,
On 01/25/2016 09:40 AM, Jiří Hon wrote:
I think Martin's question is: are you sure you need this at the C
level? What makes you think that calling a regex engine from C will
perform better than calling it from R?

Note that using a regex for finding motifs in a DNA sequence has 2
fundamental problems:

(1) It doesn't always find all the matches. For example if 2 matches
     are overlapping, it only returns the 1st of the 2 matches:

   > library(Biostrings)

   > matchPattern("ATAAT", "CCATAATAATGATAAT")
     Views on a 16-letter BString subject
   subject: CCATAATAATGATAAT
   views:
       start end width
   [1]     3   7     5 [ATAAT]
   [2]     6  10     5 [ATAAT]
   [3]    12  16     5 [ATAAT]

   > gregexpr("ATAAT", "CCATAATAATGATAAT")[[1]]
   [1]  3 12
   attr(,"match.length")
   [1] 5 5
   attr(,"useBytes")
   [1] TRUE

(2) It's inefficient on a long DNA sequence:

   > library(BSgenome.Hsapiens.UCSC.hg19)
   > chr1 <- BSgenome.Hsapiens.UCSC.hg19$chr1
   > system.time(m1 <- matchPattern("ATAAT", chr1))
      user  system elapsed
     0.946   0.000   0.940
   > chr1c <- as.character(chr1)
   > system.time(m2 <- gregexpr("ATAAT", chr1c)[[1]])
      user  system elapsed
     4.109   0.000   4.109

This was actually the very first motivating use case for developing
the Biostrings package. It's important to realize that using the regex
engine at the C level wouldn't make much difference.
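
Regarding (1): a regex engine can still be made to report overlapping
hits by restarting the search one position after the start of each match
rather than after its end. Below is a minimal sketch in C++, assuming
std::regex from C++11 (not one of the libraries proposed in this thread,
which are PCRE and Boost.Regex). find_overlapping is a hypothetical
helper name, and the sketch assumes a pattern without ^ anchors or
look-behind, since those would be affected by restarting mid-string.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: return the 1-based start positions of all matches
// of `pattern` in `subject`, including matches that overlap each other.
std::vector<std::size_t> find_overlapping(const std::string& subject,
                                          const std::regex& pattern) {
    std::vector<std::size_t> starts;
    std::smatch m;
    auto from = subject.cbegin();
    while (from != subject.cend() &&
           std::regex_search(from, subject.cend(), m, pattern)) {
        // m[0].first is an iterator to the start of the match in `subject`.
        auto match_start = m[0].first;
        // An empty match at the very end cannot be stepped past.
        if (match_start == subject.cend()) break;
        starts.push_back(
            static_cast<std::size_t>(match_start - subject.cbegin()) + 1);
        // Resume one character after the match START (not its end), so a
        // hit overlapping the current one is still found.
        from = match_start + 1;
    }
    return starts;
}
```

For the subject and pattern used in (1), this reports starts 3, 6 and 12,
the same positions as matchPattern(). It restarts the scan after every
hit, so it is not expected to help with (2).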

matchPattern() and family don't support regex though. However when
working with DNA motifs, the motifs can often be described with IUPAC
ambiguity letters. For example, instead of describing the motif
with the regular expression AT(A|G|T)T(A|C)GG.G, you can describe it
with ATDTMGGNG. Then you can use matchPattern() on this pattern with
fixed=FALSE to find all the matches. Additionally you can use the
'max.mismatch' and/or 'with.indels' arguments to allow a small number
of mismatches and/or indels. See ?matchPattern for more information
and examples.
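
To make the correspondence between the two notations concrete, here is a
small sketch that expands IUPAC ambiguity letters into regex character
classes. It is written in C++ for illustration only: iupac_to_regex is a
hypothetical helper, not part of Biostrings, and it assumes an upper-case
DNA pattern. The letter mapping is the standard IUPAC nucleotide code.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: expand a DNA pattern written with IUPAC ambiguity
// letters into an equivalent regular expression of character classes.
std::string iupac_to_regex(const std::string& motif) {
    std::string out;
    for (char c : motif) {
        switch (c) {
            case 'A': case 'C': case 'G': case 'T':
                out += c;          break;  // unambiguous base, copied as-is
            case 'R': out += "[AG]";   break;
            case 'Y': out += "[CT]";   break;
            case 'S': out += "[CG]";   break;
            case 'W': out += "[AT]";   break;
            case 'K': out += "[GT]";   break;
            case 'M': out += "[AC]";   break;
            case 'B': out += "[CGT]";  break;
            case 'D': out += "[AGT]";  break;
            case 'H': out += "[ACT]";  break;
            case 'V': out += "[ACG]";  break;
            case 'N': out += "[ACGT]"; break;
            default:
                throw std::invalid_argument(
                    std::string("not an IUPAC DNA letter: ") + c);
        }
    }
    return out;
}
```

iupac_to_regex("ATDTMGGNG") gives AT[AGT]T[AC]GG[ACGT]G. Note that N is
expanded to [ACGT] here, which is narrower than the '.' used in the
regular expression above.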

Of course this has its own limitations: you can only do this for a
subclass of regular expressions. For example regular expressions that
use * or + to allow for repetitions cannot be replaced by a sequence
with just IUPAC codes, so the string matching tools in Biostrings
cannot be used in that case.

Cheers,
H.

  
    
#
On 25.1.2016 at 23:34, Hervé Pagès wrote:
Thank you, Hervé, for your tips. I'm aware of the limited power of regular
expressions, but using matchPattern doesn't solve my problem. The
reason for using a regexp library at the C level is that I plan to call it
millions of times (on short DNA fragments), and I suppose it would be
better to avoid the R-level call and for-loop overhead. That is why I
wanted to get an idea of which regex C APIs I can use, or whether the
library is usually bundled.
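
The idea of applying one pattern to a very large number of short
fragments without leaving compiled code can be sketched as follows. This
is only an illustration, assuming std::regex from C++11 as a stand-in for
whichever library ends up being used; count_matches is a hypothetical
name and this is not code from any package mentioned in the thread.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: count the non-overlapping matches of one pattern
// in each of many short fragments, staying inside a single compiled loop.
std::vector<std::size_t> count_matches(
        const std::vector<std::string>& fragments,
        const std::string& pattern_text) {
    // Compile the pattern once, outside the loop, and reuse it.
    const std::regex pattern(pattern_text,
                             std::regex::ECMAScript | std::regex::optimize);
    std::vector<std::size_t> counts;
    counts.reserve(fragments.size());
    const std::sregex_iterator end;  // default-constructed: end of sequence
    for (const std::string& frag : fragments) {
        std::sregex_iterator it(frag.begin(), frag.end(), pattern);
        counts.push_back(static_cast<std::size_t>(std::distance(it, end)));
    }
    return counts;
}
```

From R, such a function would be called once with the whole vector of
fragments, so the per-call and for-loop overhead is paid once rather than
once per fragment.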

Jirka