Hi, We are the maintainers of the Bioconductor Rsubread package. We are trying to add a new gene annotation file (RefSeq GRCm39/mm39) to the Rsubread package for users who have switched to the mm39 reference genome for their analyses. We have built the annotation file, but it turned out to be too large (~9 MB), exceeding the 5 MB limit. As a result, Git refused to push the file to the Rsubread (devel) repository, with the error message "Error: file larger than 5 Mb". Would it be possible to have an exemption so that we can add the mm39 annotation file to the Rsubread package? All the best, Yang
[Bioc-devel] Adding a new annotation file (size ~ 9MB) into the Bioconductor Rsubread package
7 messages · Yang Liao, Shepherd, Lori, Hervé Pagès
Exceptions to file size are not permitted. We would prefer the data be downloaded and distributed through the AnnotationHub, as we are moving away from traditional data packages.
Please see HubPub and the vignette on how to create a hub package: https://bioconductor.org/packages/devel/bioc/vignettes/HubPub/inst/doc/CreateAHubPackage.html
In this case, since it is a single file and creating an entirely separate annotation package seems overkill and unnecessary overhead, we would advise using the annotationhub directory in Rsubread.
You may choose to host the data file yourself on a publicly accessible and reliable server (institutional level, AWS bucket, data lakes, Zenodo); private servers and hosting data on GitHub are not allowed by Bioconductor standards. If you are not able to host the data yourself, you may upload the data file to the Bioconductor Azure Data Lake as described in the vignette linked above.
Minimally, Rsubread would need to add the metadata.csv file that provides the necessary metadata information in inst/extdata, and add the biocViews term AnnotationHubSoftware.
Please let us know when these files and changes are available and we can further assist with adding the data officially to the AnnotationHub.
Cheers,
Lori Shepherd - Kern
Bioconductor Core Team
Roswell Park Comprehensive Cancer Center
Department of Biostatistics & Bioinformatics
Elm & Carlton Streets
Buffalo, New York 14263
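As a rough illustration of the metadata step described above, the following R sketch writes a one-row metadata.csv using base R only. The column names follow the fields that the hub tooling is understood to expect; every value (title, URLs, data path, maintainer address, versions) is a placeholder assumption and is not taken from Rsubread. The HubPub vignette remains the authoritative list of required fields and valid values.

```r
# Hypothetical metadata for a single hub resource.
# All values below are placeholders chosen for illustration only.
meta <- data.frame(
  Title              = "mm39 RefSeq exon annotation (SAF format)",
  Description        = "Exon-level RefSeq gene annotation for mouse genome build mm39",
  BiocVersion        = "3.15",
  Genome             = "mm39",
  SourceType         = "TXT",
  SourceUrl          = "https://example.org/source",
  SourceVersion      = "2022-04",
  Species            = "Mus musculus",
  TaxonomyId         = 10090,
  Coordinate_1_based = TRUE,
  DataProvider       = "NCBI RefSeq",
  Maintainer         = "Maintainer Name <maintainer@example.org>",
  RDataClass         = "data.frame",
  DispatchClass      = "FilePath",
  Location_Prefix    = "https://example.org/",
  RDataPath          = "Rsubread/mm39_RefSeq_exon.txt",
  Tags               = "mm39:RefSeq:exon",
  stringsAsFactors   = FALSE
)

# Written under a temporary directory here; in a package checkout the
# target would be inst/extdata/metadata.csv.
out_dir <- file.path(tempdir(), "inst", "extdata")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
metadata_path <- file.path(out_dir, "metadata.csv")
write.csv(meta, file = metadata_path, row.names = FALSE)
```

The file is written to a temporary directory so the sketch can be run anywhere without touching a real package tree.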
Also just a reminder that RefSeq exons for mm39 are already available
thru the TxDb.Mmusculus.UCSC.mm39.refGene package:
  library(TxDb.Mmusculus.UCSC.mm39.refGene)
  txdb <- TxDb.Mmusculus.UCSC.mm39.refGene
  mm39_exons <- sort(unlist(exonsBy(txdb, by="gene")))
  mcols(mm39_exons) <- DataFrame(GeneID=names(mm39_exons))
  names(mm39_exons) <- NULL
  mm39_exons
  # GRanges object with 243976 ranges and 1 metadata column:
  #                    seqnames          ranges strand |      GeneID
  #                       <Rle>       <IRanges>  <Rle> | <character>
  #        [1]             chr1 4878046-4878205      + |       18777
  #        [2]             chr1 4878678-4878709      + |       18777
  #        [3]             chr1 4898807-4898872      + |       18777
  #        [4]             chr1 4900491-4900538      + |       18777
  #        [5]             chr1 4902534-4902604      + |       18777
  #        ...              ...             ...    ... .         ...
  #   [243972] chrUn_JH584304v1     55112-55248      - |       66776
  #   [243973] chrUn_JH584304v1     55465-55701      - |       66776
  #   [243974] chrUn_JH584304v1     56986-57151      - |       66776
  #   [243975] chrUn_JH584304v1     58564-58835      - |       66776
  #   [243976] chrUn_JH584304v1     59592-59689      - |       66776
  #   -------
  #   seqinfo: 61 sequences (1 circular) from mm39 genome
so there should be no need to add anything to AnnotationHub or to
Rsubread itself.
Dump the exons to a tab-delimited file similar to
Rsubread/inst/annot/mm10_RefSeq_exon.txt with:
  df <- as.data.frame(mm39_exons)
  df <- cbind(df[ , "GeneID", drop=FALSE],
              df[ , setdiff(colnames(df), c("GeneID", "width"))])
  stopifnot(identical(colnames(df),
                      c("GeneID", "seqnames", "start", "end", "strand")))
  colnames(df) <- c("GeneID", "Chr", "Start", "End", "Strand")
  write.table(df, file="mm39_RefSeq_exon.txt", quote=FALSE, sep="\t",
              row.names=FALSE)
The entire process of obtaining the exons and dumping them to the file
takes about 2 seconds on my laptop ;-)
H.
Hervé Pagès
Bioconductor Core Team
hpages.on.github at gmail.com
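For readers who want to use a file like the one written above, here is a minimal sketch of reading it back and handing it to featureCounts as an external annotation. The helper name read_saf and the BAM file name are hypothetical; the featureCounts call is left commented out because it needs Rsubread and real alignment files, and it relies on annot.ext accepting a data frame in this five-column layout.

```r
# Read a tab-delimited exon table with columns GeneID, Chr, Start, End, Strand.
# read_saf is a hypothetical helper name, not part of Rsubread.
read_saf <- function(path) {
  saf <- read.delim(
    path,
    stringsAsFactors = FALSE,
    colClasses = c("character", "character", "integer", "integer", "character")
  )
  # Fail early if the layout is not the expected five columns.
  stopifnot(identical(colnames(saf),
                      c("GeneID", "Chr", "Start", "End", "Strand")))
  saf
}

# Hypothetical usage ("sample.bam" is a placeholder file name):
# saf <- read_saf("mm39_RefSeq_exon.txt")
# fc  <- Rsubread::featureCounts(files = "sample.bam",
#                                annot.ext = saf,
#                                isGTFAnnotationFile = FALSE)
```

Forcing GeneID to character keeps Entrez identifiers such as 18777 from being silently converted to numbers.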
Thank you, Hervé and Lori!
Indeed, the RefSeq mm39 annotation is available in TxDb, but in our case we built a special version that was specifically treated and tested for RNA-seq analysis, so we still hope to use the inbuilt version in Rsubread. Maybe we will use a public server to host the annotation files, or further compress the file to fit the 5 MB limit.
Thanks again for the very detailed and timely answers, and the example code!
All the best,
Yang
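The compression option mentioned above can be tried quickly in base R. This is only a sketch on a toy table: the file names are placeholders, and whether Rsubread can read a gzip-compressed inbuilt annotation directly is an assumption that has not been checked here.

```r
# Toy exon table in the five-column layout used in this thread.
saf <- data.frame(
  GeneID = c("18777", "18777", "66776"),
  Chr    = c("chr1", "chr1", "chrUn_JH584304v1"),
  Start  = c(4878046L, 4878678L, 55112L),
  End    = c(4878205L, 4878709L, 55248L),
  Strand = c("+", "+", "-"),
  stringsAsFactors = FALSE
)

plain_path <- file.path(tempdir(), "toy_exon.txt")
gz_path    <- paste0(plain_path, ".gz")

# Plain tab-delimited copy, for size comparison.
write.table(saf, file = plain_path, quote = FALSE, sep = "\t", row.names = FALSE)

# Gzip-compressed copy written through a gzfile() connection.
con <- gzfile(gz_path, open = "wt")
write.table(saf, file = con, quote = FALSE, sep = "\t", row.names = FALSE)
close(con)

# Sizes in bytes: plain first, compressed second.
sizes <- file.size(c(plain_path, gz_path))

# R detects gzip compression when reading, so the file can be read back as is.
back <- read.delim(
  gz_path,
  colClasses = c("character", "character", "integer", "integer", "character")
)
```

On a three-row table the compressed file is not expected to be smaller; the size comparison only becomes meaningful on the full annotation.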
On 08/04/2022 09:26, Yang Liao wrote:
> Thank you, Hervé and Lori! Indeed, the RefSeq mm39 annotation is available in TxDb, but in our case we built a special version that was specifically treated and tested for RNA-seq analysis,
Would be good to know what that means exactly. If Rsubread uses a subset of RefSeq exons, the curation process should be documented somewhere, for the sake of reproducibility.
Best,
H.
Hervé Pagès
Bioconductor Core Team
hpages.on.github at gmail.com
Thanks for the reply! We used the flattenGTF function in Rsubread to merge the overlapping exons in each gene; this procedure is documented in the manual page of featureCounts. We also checked whether the annotation contains "tricky" genes that need extra care or special treatment (e.g. some genes can span multiple chromosomes and/or strands). It is hard to automate all of these checks reliably. Also, I think it helps the reproducibility of DGE analyses to have a relatively stable version of the gene annotation that does not change whenever the RefSeq annotation changes between builds. All the best, Yang
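To make the merging step concrete, here is a small base-R sketch that collapses overlapping exons within each gene. It illustrates the general idea only and is not the flattenGTF implementation; the function name merge_exons and the toy table are hypothetical, and the choice to keep merely adjacent exons separate is an assumption.

```r
# Merge overlapping exons within each (GeneID, Chr, Strand) group.
# merge_exons is a hypothetical helper written for illustration.
merge_exons <- function(saf) {
  # Grouping by chromosome and strand as well as gene keeps exons of genes
  # that span several chromosomes or strands from being merged together.
  groups <- split(saf, list(saf$GeneID, saf$Chr, saf$Strand), drop = TRUE)
  merged <- lapply(groups, function(g) {
    g <- g[order(g$Start, g$End), ]
    # A new block starts when an exon begins after the furthest end seen so far.
    furthest_end <- cummax(g$End)
    new_block <- c(TRUE, g$Start[-1] > furthest_end[-nrow(g)])
    block <- cumsum(new_block)
    data.frame(
      GeneID = g$GeneID[1],
      Chr    = g$Chr[1],
      Start  = as.integer(tapply(g$Start, block, min)),
      End    = as.integer(tapply(g$End, block, max)),
      Strand = g$Strand[1],
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, merged)
  rownames(out) <- NULL
  out
}

# Toy input: gene "A" has two overlapping exons and one separate exon.
toy <- data.frame(
  GeneID = c("A", "A", "A", "B"),
  Chr    = c("chr1", "chr1", "chr1", "chr2"),
  Start  = c(100L, 150L, 300L, 10L),
  End    = c(200L, 250L, 400L, 20L),
  Strand = c("+", "+", "+", "-"),
  stringsAsFactors = FALSE
)
flat <- merge_exons(toy)
```

In the toy input, the first two exons of gene "A" overlap and collapse into one interval, while its third exon and the single exon of gene "B" are left unchanged.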
From: Herv? Pag?s <hpages.on.github at gmail.com>
Sent: Saturday, 9 April 2022 2:45 AM
To: Yang Liao <Yang.Liao at onjcri.org.au>; Kern, Lori <Lori.Shepherd at RoswellPark.org>; bioc-devel at r-project.org <bioc-devel at r-project.org>
Subject: Re: [Bioc-devel] Adding a new annotation file (size ~ 9MB) into the Bioconductor Rsubread package
Sent: Saturday, 9 April 2022 2:45 AM
To: Yang Liao <Yang.Liao at onjcri.org.au>; Kern, Lori <Lori.Shepherd at RoswellPark.org>; bioc-devel at r-project.org <bioc-devel at r-project.org>
Subject: Re: [Bioc-devel] Adding a new annotation file (size ~ 9MB) into the Bioconductor Rsubread package
This message originated from outside your organisation. Please be careful while clicking links, opening attachments, or replying to this email.
________________________________
On 08/04/2022 09:26, Yang Liao wrote:
Thank you, Herv? and Lori!
Indeed, the RefSeq mm39 annotation is available in TxDB, but in our case, we built a special version that were specifically treated and tested for RNA-seq analysis,
Would be good to know what that means exactly. If Rsubread uses a subset of RefSeq exons, the curation process should be documented somewhere, for the sake of reproducibility.
Best,
H.
so we still hope to use the inbuilt version in Rsubread. Maybe we will use a public server to host the annotation files, or further compress it to fit the 5MB file limit.
Thanks again for the very detailed and timely answers, and the example code!
All the best,
Yang
From: Herv? Pag?s<mailto:hpages.on.github at gmail.com>
Sent: Saturday, 9 April 2022 2:21 AM
To: Kern, Lori<mailto:Lori.Shepherd at RoswellPark.org>; Yang Liao<mailto:Yang.Liao at onjcri.org.au>; bioc-devel at r-project.org<mailto:bioc-devel at r-project.org>
Subject: Re: [Bioc-devel] Adding a new annotation file (size ~ 9MB) into the Bioconductor Rsubread package
This message originated from outside your organisation. Please be careful while clicking links, opening attachments, or replying to this email.
Also just a reminder that RefSeq exons for mm39 are already available
thru the TxDb.Mmusculus.UCSC.mm39.refGene package:
library(TxDb.Mmusculus.UCSC.mm39.refGene)
txdb <- TxDb.Mmusculus.UCSC.mm39.refGene
mm39_exons <- sort(unlist(exonsBy(txdb, by="gene")))
mcols(mm39_exons) <- DataFrame(GeneID=names(mm39_exons))
names(mm39_exons) <- NULL
mm39_exons
# GRanges object with 243976 ranges and 1 metadata column:
# seqnames ranges strand | GeneID
# <Rle> <IRanges> <Rle> | <character>
# [1] chr1 4878046-4878205 + | 18777
# [2] chr1 4878678-4878709 + | 18777
# [3] chr1 4898807-4898872 + | 18777
# [4] chr1 4900491-4900538 + | 18777
# [5] chr1 4902534-4902604 + | 18777
# ... ... ... ... . ...
# [243972] chrUn_JH584304v1 55112-55248 - | 66776
# [243973] chrUn_JH584304v1 55465-55701 - | 66776
# [243974] chrUn_JH584304v1 56986-57151 - | 66776
# [243975] chrUn_JH584304v1 58564-58835 - | 66776
# [243976] chrUn_JH584304v1 59592-59689 - | 66776
# -------
# seqinfo: 61 sequences (1 circular) from mm39 genome
so there should be no need to add anything to AnnotationHub or to
Rsubread itself.
Dump the exons to a tab-delimited file similar to
Rsubread/inst/annot/mm10_RefSeq_exon.txt with:
df <- as.data.frame(mm39_exons)
df <- cbind(df[ , "GeneID", drop=FALSE], df[ , setdiff(colnames(df),
c("GeneID", "width"))])
stopifnot(identical(colnames(df), c("GeneID", "seqnames", "start",
"end", "strand")))
colnames(df) <- c("GeneID", "Chr", "Start", "End", "Strand")
write.table(df, file="mm39_RefSeq_exon.txt", quote=FALSE, sep="\t",
row.names=FALSE)
The entire process of obtaining the exons and dumping them to the file
takes about 2 seconds on my labtop ;-)
H.
On 08/04/2022 06:00, Kern, Lori wrote:
> Exceptions to file size are not permitted. We would prefer the data be downloaded and distributed through the AnnotationHub as we are moving away from traditional data packages.
>
> Please see HubPub and the vignette on how to create a hub package https://bioconductor.org/packages/devel/bioc/vignettes/HubPub/inst/doc/CreateAHubPackage.html<https://protect-au.mimecast.com/s/zdqFCP7LL2SZr7GUzsH7F?domain=bioconductor.org>
>
> It this case, since it is a single file and creating an entirely separate annotation package seems over kill and unnecessary overhead, we would advise using the annotationhub directory in Rsubread.
> You may choose to host the data file yourself on a public accessible and reliable server (institutional level, AWS bucket, data lakes, zenodo); private servers and hosting data on github are not allowed by Bioconductor standards. If you are not able to host the data yourself, you may upload the data file to the Bioconductor Azure Data Lake as described in the vignette link above.
> Minimally, Rsubread would need to add the metadata.csv file that provides the necessary metadata information in inst/extdata. And add the biocViews term AnnotationHubSoftware.
>
> Please let us know when these files and changes are available and we can further assist adding the data officially to the AnnotationHub.
>
> Cheers,
>
>
>
> Lori Shepherd - Kern
>
> Bioconductor Core Team
>
> Roswell Park Comprehensive Cancer Center
>
> Department of Biostatistics & Bioinformatics
>
> Elm & Carlton Streets
>
> Buffalo, New York 14263
>
> ________________________________
> From: Bioc-devel <bioc-devel-bounces at r-project.org><mailto:bioc-devel-bounces at r-project.org> on behalf of Yang Liao <Yang.Liao at onjcri.org.au><mailto:Yang.Liao at onjcri.org.au>
> Sent: Friday, April 8, 2022 3:16 AM
> To: bioc-devel at r-project.org<mailto:bioc-devel at r-project.org> <bioc-devel at r-project.org><mailto:bioc-devel at r-project.org>
> Subject: [Bioc-devel] Adding a new annotation file (size ~ 9MB) into the Bioconductor Rsubread package
>
> Hi,
>
> We are the maintainers of the Bioconductor Rsubread package. We are trying to add a new gene annotation file (RefSeq GRCm39/mm39) into the Rsubread package for the users who have switched to the mm39 reference genome for their analyses.
>
> We have built the annotation file, but we found that it was a little too large (~ 9 MBytes), larger than the 5MB limit. Hence the Git command refused to submit the file to the Rsubread (devel) repository, with an error message : "Error: file larger than 5 Mb".
>
> Is it possible if we can have an exemption to add the mm39 annotation file into the Rsubread package?
>
> All the best,
> Yang
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel at r-project.org<mailto:Bioc-devel at r-project.org> mailing list
> https://secure-web.cisco.com/1XvFzPMqQ7a8frH2_BceXyuYj9is7brCO5-zg5uMSvJEVkQ-vn8jVY7aTBnhRHXMpf7N68ZG2mVNzwI6Rrmb3HpVNtA6QrhPRMFqgJChaVipNIzFSsRr7AdXe95BoUi_rZOe43Aab2uHHlU4EC8Z27tzewixRcZAJOr6BkJoybxJeP18ksprpZEslRgiyKXCOBmgzfyS3vSmgT0_qriyw0e7FPh8lnZogFMieHtbPzs5uA_RvIZBo7ujAPEXmXx7L8j-iR2VXa_EGfGQSuDl_As3nEpBZn9N1Zr60_oMr1LaPW2Ld830p3AChnze_zVmY/https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fbioc-devel<https://protect-au.mimecast.com/s/SmQyCQnMM3I9E5ATPlJPe?domain=secure-web.cisco.com>
--
Hervé Pagès
Bioconductor Core Team
hpages.on.github at gmail.com
On 08/04/2022 10:02, Yang Liao wrote:
Thanks for the reply! We used the /flattenGTF/ function in Rsubread to merge the overlapping exons in each gene; this procedure is documented in the manual page of /featureCounts/. We also checked and tested whether the annotation contains "tricky" genes that need extra care (e.g. some genes can span multiple chromosomes and/or strands). It is hard to automate all of these checks reliably. Also, I think it helps the reproducibility of DGE analyses to have a relatively stable version of the gene annotation, one that does not change when the RefSeq annotation changes between builds.
I see. Thanks for clarifying. Best, H.
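To make the merging step concrete, here is a minimal base-R sketch of collapsing overlapping exons within each gene. It is only an illustration of the idea and not Rsubread's implementation of flattenGTF; the toy exon table, the helper name merge_group, and the choice to merge only strictly overlapping intervals are assumptions made for the example. The column layout (GeneID, Chr, Start, End, Strand) follows the inbuilt annotation format discussed in this thread.

```r
# Toy exon table (hypothetical values) in GeneID/Chr/Start/End/Strand layout.
exons <- data.frame(
  GeneID = c("g1", "g1", "g1", "g2"),
  Chr    = c("chr1", "chr1", "chr1", "chr2"),
  Start  = c(100L, 150L, 400L, 10L),
  End    = c(200L, 250L, 500L, 50L),
  Strand = c("+", "+", "+", "-"),
  stringsAsFactors = FALSE
)

# Merge overlapping intervals within one gene/chromosome/strand group.
# merge_group is a hypothetical helper, not part of any package.
merge_group <- function(d) {
  d <- d[order(d$Start, d$End), ]
  # A new block starts when an exon begins after the largest End seen so far.
  running_end <- cummax(d$End)
  new_block <- c(TRUE, d$Start[-1] > head(running_end, -1))
  block <- cumsum(new_block)
  data.frame(
    GeneID = d$GeneID[1],
    Chr    = d$Chr[1],
    Start  = as.integer(tapply(d$Start, block, min)),
    End    = as.integer(tapply(d$End, block, max)),
    Strand = d$Strand[1],
    stringsAsFactors = FALSE
  )
}

# Grouping by chromosome and strand as well as gene keeps genes that span
# several chromosomes or strands from being merged across them.
groups <- split(exons, list(exons$GeneID, exons$Chr, exons$Strand), drop = TRUE)
merged <- do.call(rbind, lapply(groups, merge_group))
rownames(merged) <- NULL
print(merged)
```

In this toy input the first two exons of g1 overlap and collapse into one interval, while the third exon of g1 and the single exon of g2 are left unchanged.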
All the best,
Yang
------------------------------------------------------------------------
*From:* Hervé Pagès <hpages.on.github at gmail.com>
*Sent:* Saturday, 9 April 2022 2:45 AM
*To:* Yang Liao <Yang.Liao at onjcri.org.au>; Kern, Lori <Lori.Shepherd at RoswellPark.org>; bioc-devel at r-project.org
*Subject:* Re: [Bioc-devel] Adding a new annotation file (size ~ 9MB) into the Bioconductor Rsubread package
------------------------------------------------------------------------
On 08/04/2022 09:26, Yang Liao wrote:
Thank you, Hervé and Lori! Indeed, the RefSeq mm39 annotation is available in TxDb, but in our case, we built a special version that was specifically treated and tested for RNA-seq analysis,
Would be good to know what that means exactly. If Rsubread uses a subset of RefSeq exons, the curation process should be documented somewhere, for the sake of reproducibility. Best, H.
so we still hope to use the inbuilt version in Rsubread. Maybe we will use a public server to host the annotation files, or further compress the file to fit the 5MB file limit.
Thanks again for the very detailed and timely answers, and the example code!
All the best,
Yang
*From:* Hervé Pagès <hpages.on.github at gmail.com>
*Sent:* Saturday, 9 April 2022 2:21 AM
*To:* Kern, Lori <Lori.Shepherd at RoswellPark.org>; Yang Liao <Yang.Liao at onjcri.org.au>; bioc-devel at r-project.org
*Subject:* Re: [Bioc-devel] Adding a new annotation file (size ~ 9MB) into the Bioconductor Rsubread package
Also just a reminder that RefSeq exons for mm39 are already available
thru the TxDb.Mmusculus.UCSC.mm39.refGene package:
  library(TxDb.Mmusculus.UCSC.mm39.refGene)
  txdb <- TxDb.Mmusculus.UCSC.mm39.refGene
  # Exons grouped by gene, flattened and sorted; keep the gene ID as a column
  mm39_exons <- sort(unlist(exonsBy(txdb, by="gene")))
  mcols(mm39_exons) <- DataFrame(GeneID=names(mm39_exons))
  names(mm39_exons) <- NULL
  mm39_exons
  # GRanges object with 243976 ranges and 1 metadata column:
  #                    seqnames          ranges strand |      GeneID
  #                       <Rle>       <IRanges>  <Rle> | <character>
  #        [1]             chr1 4878046-4878205      + |       18777
  #        [2]             chr1 4878678-4878709      + |       18777
  #        [3]             chr1 4898807-4898872      + |       18777
  #        [4]             chr1 4900491-4900538      + |       18777
  #        [5]             chr1 4902534-4902604      + |       18777
  #        ...              ...             ...    ... .         ...
  #   [243972] chrUn_JH584304v1     55112-55248      - |       66776
  #   [243973] chrUn_JH584304v1     55465-55701      - |       66776
  #   [243974] chrUn_JH584304v1     56986-57151      - |       66776
  #   [243975] chrUn_JH584304v1     58564-58835      - |       66776
  #   [243976] chrUn_JH584304v1     59592-59689      - |       66776
  #   -------
  #   seqinfo: 61 sequences (1 circular) from mm39 genome
so there should be no need to add anything to AnnotationHub or to
Rsubread itself.
Dump the exons to a tab-delimited file similar to
Rsubread/inst/annot/mm10_RefSeq_exon.txt with:
  df <- as.data.frame(mm39_exons)
  # Put GeneID first and drop the width column
  df <- cbind(df[ , "GeneID", drop=FALSE],
              df[ , setdiff(colnames(df), c("GeneID", "width"))])
  stopifnot(identical(colnames(df),
                      c("GeneID", "seqnames", "start", "end", "strand")))
  colnames(df) <- c("GeneID", "Chr", "Start", "End", "Strand")
  write.table(df, file="mm39_RefSeq_exon.txt", quote=FALSE, sep="\t",
              row.names=FALSE)
The entire process of obtaining the exons and dumping them to the file
takes about 2 seconds on my laptop ;-)
H.
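Since compressing the file to get under the 5MB limit was raised as an option, the following base-R sketch shows one way to write such a tab-delimited table through a gzip connection, compare the file sizes, and read the compressed file back. It uses a small made-up table and temporary files rather than the real annotation, and it does not establish whether Rsubread can load a gzip-compressed inbuilt annotation; that would need to be checked separately.

```r
# Toy table standing in for the real annotation (hypothetical values).
df <- data.frame(
  GeneID = rep(c(18777L, 66776L), each = 500),
  Chr    = rep(c("chr1", "chrUn_JH584304v1"), each = 500),
  Start  = seq_len(1000) * 100L,
  End    = seq_len(1000) * 100L + 50L,
  Strand = rep(c("+", "-"), each = 500),
  stringsAsFactors = FALSE
)

plain <- tempfile(fileext = ".txt")
gz    <- tempfile(fileext = ".txt.gz")

# Uncompressed tab-delimited file, same options as in the message above.
write.table(df, file = plain, quote = FALSE, sep = "\t", row.names = FALSE)

# Same table written through a gzip connection.
con <- gzfile(gz, open = "wt")
write.table(df, file = con, quote = FALSE, sep = "\t", row.names = FALSE)
close(con)

# Compare sizes against the 5MB limit mentioned in this thread.
sizes <- file.size(c(plain, gz))
limit <- 5 * 1024^2
print(data.frame(file = c("plain", "gzip"), bytes = sizes,
                 under_limit = sizes < limit))

# read.delim() reads gzip-compressed files transparently.
back <- read.delim(gz, stringsAsFactors = FALSE)
```

How much a real 9MB annotation shrinks depends on its content, so the sizes printed here say nothing about the actual mm39 file.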
On 08/04/2022 06:00, Kern, Lori wrote:
Exceptions to file size are not permitted. We would prefer the data be downloaded and distributed through the AnnotationHub as we are moving away from traditional data packages.
Please see HubPub and the vignette on how to create a hub package:
https://bioconductor.org/packages/devel/bioc/vignettes/HubPub/inst/doc/CreateAHubPackage.html
In this case, since it is a single file and creating an entirely separate annotation package seems overkill and unnecessary overhead, we would advise using the annotationhub directory in Rsubread.
You may choose to host the data file yourself on a publicly accessible and reliable server (institutional level, AWS bucket, data lakes, zenodo); private servers and hosting data on github are not allowed by Bioconductor standards. If you are not able to host the data yourself, you may upload the data file to the Bioconductor Azure Data Lake as described in the vignette link above.
Minimally, Rsubread would need to add the metadata.csv file that provides the necessary metadata information in inst/extdata, and add the biocViews term AnnotationHubSoftware.
Please let us know when these files and changes are available and we can further assist adding the data officially to the AnnotationHub.
Cheers,
Lori Shepherd - Kern
Bioconductor Core Team
Roswell Park Comprehensive Cancer Center
Department of Biostatistics & Bioinformatics
Elm & Carlton Streets
Buffalo, New York 14263
________________________________
From: Bioc-devel <bioc-devel-bounces at r-project.org> on behalf of Yang Liao <Yang.Liao at onjcri.org.au>
Sent: Friday, April 8, 2022 3:16 AM
To: bioc-devel at r-project.org <bioc-devel at r-project.org>
Subject: [Bioc-devel] Adding a new annotation file (size ~ 9MB) into the Bioconductor Rsubread package
Hi,
We are the maintainers of the Bioconductor Rsubread package. We are trying to add a new gene annotation file (RefSeq GRCm39/mm39) into the Rsubread package for the users who have switched to the mm39 reference genome for their analyses.
We have built the annotation file, but we found that it was a little too large (~ 9 MBytes), larger than the 5MB limit. Hence the Git command refused to submit the file to the Rsubread (devel) repository, with an error message : "Error: file larger than 5 Mb".
Is it possible if we can have an exemption to add the mm39 annotation file into the Rsubread package?
All the best,
Yang
--
Hervé Pagès
Bioconductor Core Team
hpages.on.github at gmail.com