[Rcpp-devel] add new components to list without specifying list size initially
Hi Walrus,
While I'm a huge fan of Rcpp, I think you'll be a bit better served
(for the time being) to read up on some of the bioconductor packages
that are suited for these types of things.
In particular I am thinking about the IRanges and GenomicRanges
packages. They R wrappers to what is basically an IntervalTree that
you can annotate, and then use to perform fast overlap/intersection
queries.
For instance:
R> library(GenomicRanges)
R> probes <- GRanges('chr1', IRanges(c(81,85), c(85,100)), strand='*')
R> genes <- GRanges(c('chr1', 'chr1', 'chr2'), IRanges(c(11, 111, 11),
c(90, 190, 90)), strand='*',
name=c('g1', 'g2', 'g3'))
R> genes
GRanges with 3 ranges and 1 elementMetadata value
seqnames ranges strand | name
<Rle> <IRanges> <Rle> | <character>
[1] chr1 [ 11, 90] * | g1
[2] chr1 [111, 190] * | g2
[3] chr2 [ 11, 90] * | g3
## How many probes does each gene have land in it?
R> countOverlaps(genes, probes)
[1] 2 0 0
## Which probes are these?
R> subsetByOverlaps(probes, genes)
GRanges with 2 ranges and 0 elementMetadata values
seqnames ranges strand |
<Rle> <IRanges> <Rle> |
[1] chr1 [81, 85] * |
[2] chr1 [85, 100] * |
## and much more stuff
There's a mess load of functionality in IRanges, GenomicRanges,
Biostrings packages that you'll likely find very useful, and efficient
(much of the core of these packages are written in C) if you're doing
a lot of bioinformatics/genomics work. So, taking some time to get
familiar with those will be useful -- you'll find that you'll also
need to drop into Rcpp for other stuff (as I do, too), so it will
still be useful for you in the future.
That's just my 2 cents.
-steve
On Thu, Aug 11, 2011 at 9:44 PM, Walrus Foolhill
<walrus.foolhill at gmail.com> wrote:
Ok, thanks for your answer, but I wasn't clear enough. So here are more
details of what I want to do.
I have one list named "probes":
probes <- list(chr1=data.frame(name=c("p1","p2"),
???????????????? start=c(81,95),
???????????????? end=c(85,100),
???????????????? stringsAsFactors=FALSE))
I also have one list named "genes":
genes <- list(chr1=data.frame(name=c("g1","g2"), start=c(11,111),
end=c(90,190)),
??????????????? chr2=data.frame(name="g3", start=11, end=90))
I need to compare those two lists in order to obtain the following list
which contains, for each gene, the name of the probes included in it:
links <- list(chr1=list(g1=c("p1")))
Here is my R function (assuming that the probes are sorted based on their
start and end coordinates):
fun.l <- function(genes, probes){
? links <- lapply(names(genes), function(chr.name){
??? if(! chr.name %in% names(probes))
????? return(NULL)
??? res <- list()
??? genes.c <- genes[[chr.name]]
??? probes.c <- probes[[chr.name]]
??? for(gene.name in genes.c$name){
????? gene <- genes.c[genes.c$name == gene.name,]
????? res[[gene.name]] <- vector()
????? for(probe.name in probes.c$name){
??????? probe <- probes.c[probes.c$name == probe.name,]
??????? if(probe$start >= gene$start && probe$end <= gene$end)
????????? res[[gene.name]] <- append(res[[gene.name]], probe.name)
??????? else if(probe$start > gene$end)
????????? break
????? }
????? if(length(res[[gene.name]]) == 0)
??????? res[[gene.name]] <- NULL
??? }
??? if(length(res) == 0)
????? res <- NA
??? return(res)
? })
? names(links) <- names(genes)
? links <- Filter(function(links.c){!is.null(links.c)}, links)
? return(links)
}
And here is the beginning of my attempt using Rcpp:
src <- '
using namespace Rcpp;
List genes = List(genes_in);
int genes_nb_chr = genes.length();
std::vector<std::string> genes_chr = genes.names();
List probes = List(probes_in);
int probes_nb_chr = probes.length();
std::vector< std::vector<std::string> > links;
// the main task is performed in this loop
for(int chrnum=0; chrnum<genes_nb_chr; ++chrnum){
? DataFrame genes_c = DataFrame(genes[chrnum]);
? // ... add code to map probes on genes, that is fill "links" ...
}
return wrap(links);
'
funC <- cxxfunction(signature(genes_in="list",
??????????????????????????????? probes_in="list"),
????????????????????? body=src, plugin="Rcpp")
The problem starts quite early: when I compile this piece of code, I get
"error: call of overloaded ?DataFrame(Rcpp::internal::generic_proxy<19>)? is
ambiguous".
What should I do to go through the "probes" and "genes" lists given as
input? Maybe more generically, how can we go through a list of lists (of
lists...) with Rcpp?
2nd (small) question, I don't manage to use Rprintf when using inline, for
instance Rprintf("%d\n", i);, it complains about the quotes. What should I
do to print statement from within the for loop?
Thanks in advance. As my question is very long, I won't mind if you tell me
to find another way by myself. But maybe one of you can put me on the good
track.
On Thu, Aug 11, 2011 at 7:00 AM, Dirk Eddelbuettel <edd at debian.org> wrote:
On 11 August 2011 at 03:06, Walrus Foolhill wrote:
| Hello,
| I need to create a list and then fill it sequentially by adding
components in a
| for loop. Here is an example that works:
|
| library(inline)
| src <- '
| Rcpp::List mylist(2);
| for(int i=0; i<2; ++i)
| ? mylist[i] = i;
| mylist.names() = CharacterVector::create("a","b");
| return mylist;
| '
| fun <- cxxfunction(body=src, plugin="Rcpp")
| print(fun())
|
| But what I really want is to create an empty list and then fill it, that
is
| without specifying its number of components before hand... This is
because I
| don't know in advance at which step of the for loop I will need to
create a new
| component. Here is an example, that obviously doesn't work, but that
should
| show what I am looking for:
|
| Rcpp::List mylist;
| CharacterVector names = CharacterVector::create("a", "b");
If you know how long names is, you know how long mylist going to be ....
| for(int i=0; i<2; ++i){
| ? mylist.add(names[i], IntegerVector::create());
| ? mylist[names[i]].push_back(i);
I don't understand what that is trying to do.
| }
| return mylist;
|
| Do you know how I could achieve this? Thanks.
Rcpp::List is an alias for Rcpp::GenericVector, and derives from Vector.
You
can look at the public member functions -- there are things like
? ?push_back()
? ?push_front()
? ?insert()
etc that behave like STL functions __but are inefficient as we (almost
always) need to copy the whole object__ so they are not recommended.
When I had to deal with 'unknown quantities of data' returning I was
mostly
able to either turn it into a 'fixed or known columns, unknow rows'
problem
(easy, just grow row-wise) or I 'cached' in a C++ data structure first
before
returning to R via Rcpp structures -- and then I knew the dimensions for
the
to-be-created object too.
Dirk
--
Two new Rcpp master classes for R and C++ integration scheduled for
New York (Sep 24) and San Francisco (Oct 8), more details are at
http://dirk.eddelbuettel.com/blog/2011/08/04#rcpp_classes_2011-09_and_2011-10
_______________________________________________ Rcpp-devel mailing list Rcpp-devel at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel
Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact