Skip to content
Prev 273368 / 398506 Next

Assigning genes to CBS segmented output:

On 10/04/2011 02:44 PM, Angel Russo wrote:
Hi Angel -- In Bioconductor

   http://bioconductor.org

for some model organism create a data frame of known Entrez genes and 
their begin / end locations. Start by installing necessary software and 
data packages

   source('http://bioconductor.org/biocLite.R")
   biocLite(c('org.Hs.eg.db', "GenomicRanges'))

then load the library with annotations about genic coordinates

   library(org.Hs.eg.db)
   anno = merge(toTable(org.Hs.egCHRLOC), toTable(org.Hs.egCHRLOCEND))

leading to

 > head(anno)
     gene_id Chromosome start_location end_location
1     10000          1     -243666483   -244006553
2     10000          1     -243666483   -244006553
3     10000          1     -243651534   -244006553
4     10000          1     -243651534   -244006553
5 100008586          X       49217770     49223847
6 100008586          X       49217770     49332715

For the simple question 'which genes are located on chromosome A 
starting at X and going to Y' you could

   subset(geno, Chromosome=="A" &
                  abs(start_location) > X &
                    abs(end_location) < Y)

This could also be done through the 'biomaRt' package or GenomicFeatures 
/ TxDb packages. To get this for many segments filter 'anno' to remove 
funky genes, e.g., those that have negative length(!)

   idx = with(anno, abs(start_location) > abs(end_location))
   anno = anno[!idx,]

manipulate this to a GRanges object;

   library(GenomicRanges)
   gr = with(anno, GRanges(Chromosome,
                     IRanges(abs(start_location), abs(end_location)),
                     names=gene_id))

convert your CBS result into a GRanges

   seg = with(CBS, GRanges(Chromosome, IRanges(Start, End)))

then find overlaps

   olap = findOverlaps(gr, seg)

the 'gr' is called the 'query', 'seg' is called the 'subject'. 
queryHits(olap) and subjectHits(olap) give equal-length vectors 
describing which queries overlap which subjects. You could group gene 
names by segment with

   split(names(gr)[queryHits(olap)], subjectHits(olap))

An important issue is to use the same genome build for annotations as 
you used for segmentation. Hope that helps / provides some hints for 
getting from A to B.

Martin