Assigning genes to CBS segmented output:
On 10/04/2011 02:44 PM, Angel Russo wrote:
Hi All, I have an CBS segmentation algorithm output for 10 tumor samples each from 2 different tumors. Now, I am in an urgent need to assign gene (followed by all genes present) that belong to a particular segment after I removed all the CNVs from segment data. The format of the data is: Sample Chromosome Start End Num_Probes Segment_Mean Sample1A-TA 1 51598 76187 15 -1.115
Hi Angel -- In Bioconductor http://bioconductor.org for some model organism create a data frame of known Entrez genes and their begin / end locations. Start by installing necessary software and data packages source('http://bioconductor.org/biocLite.R") biocLite(c('org.Hs.eg.db', "GenomicRanges')) then load the library with annotations about genic coordinates library(org.Hs.eg.db) anno = merge(toTable(org.Hs.egCHRLOC), toTable(org.Hs.egCHRLOCEND)) leading to > head(anno) gene_id Chromosome start_location end_location 1 10000 1 -243666483 -244006553 2 10000 1 -243666483 -244006553 3 10000 1 -243651534 -244006553 4 10000 1 -243651534 -244006553 5 100008586 X 49217770 49223847 6 100008586 X 49217770 49332715 For the simple question 'which genes are located on chromosome A starting at X and going to Y' you could subset(geno, Chromosome=="A" & abs(start_location) > X & abs(end_location) < Y) This could also be done through the 'biomaRt' package or GenomicFeatures / TxDb packages. To get this for many segments filter 'anno' to remove funky genes, e.g., those that have negative length(!) idx = with(anno, abs(start_location) > abs(end_location)) anno = anno[!idx,] manipulate this to a GRanges object; library(GenomicRanges) gr = with(anno, GRanges(Chromosome, IRanges(abs(start_location), abs(end_location)), names=gene_id)) convert your CBS result into a GRanges seg = with(CBS, GRanges(Chromosome, IRanges(Start, End))) then find overlaps olap = findOverlaps(gr, seg) the 'gr' is called the 'query', 'seg' is called the 'subject'. queryHits(olap) and subjectHits(olap) give equal-length vectors describing which queries overlap which subjects. You could group gene names by segment with split(names(gr)[queryHits(olap)], subjectHits(olap)) An important issue is to use the same genome build for annotations as you used for segmentation. Hope that helps / provides some hints for getting from A to B. Martin
Could anyone suggest an R library or code or method that I can quickly use to get the genes assigned to CBS output. Thanks so much, Angel [[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793