[Bioc-devel] Best practices to load data for vignette/tests - Bioc-devel

Julien Wollbrett

Tue, Jan 22, 2019 5:57 AM

Hi everyone,

I am currently working on an R package called BgeeCall, which automatically
generates present/absent expression calls from any RNA-Seq fastq files, as
long as the species is present in Bgee (https://bgee.org/).
The package is almost ready, and I am currently writing the vignette and
some tests.

The package can be seen as a workflow that takes as input one transcriptome
and at least one fastq file.

My question is: how can I import these two files to run the vignette and tests?
They are too big to be part of my package.
Can I download them directly from SRA and Ensembl (or from my own
server)? Do I need to create a dataset that will be loaded by my package
for this kind of raw, publicly available data?
Do you know whether I could reuse an existing dataset? I am
interested in any best-practice information.
Thank you for your answers.

Best Regards,

Julien

Shepherd, Lori

Tue, Jan 22, 2019 6:13 AM

You could see if there is any existing data already in Bioconductor for use with your package. That would be preferable.

http://bioconductor.org/packages/release/BiocViews.html#___Software

Searching for fastq, you could see what data ShortRead, seqTools, and FastqCleaner use. Similarly, you could search for RNA-seq packages to see if any of their data is appropriate.

There are also a number of experiment data packages that may provide the data format you need. You could search here as well:

http://bioconductor.org/packages/release/BiocViews.html#___ExperimentData

Lastly, Bioconductor has an ExperimentHub for storing large data files. You can search it interactively in R or through the web API interface here:

https://experimenthub.bioconductor.org/

If none of those locations provides data currently in Bioconductor that is suitable for your package, you can submit your own data to the ExperimentHub.

http://bioconductor.org/packages/devel/bioc/vignettes/ExperimentHub/inst/doc/CreateAnExperimentHubPackage.html

You could download directly, but this could be time consuming depending on internet connections and download speeds. The Bioconductor hubs provide a caching mechanism, so a file is downloaded only once and the hub then remembers where the file is on the system for later use.
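As a rough illustration of the interactive search mentioned above, here is a minimal sketch. It assumes the ExperimentHub and AnnotationHub packages are installed; the helper name find_hub_records and the search terms in the example call are placeholders, not part of any package.

```r
# Hypothetical helper: search ExperimentHub for records matching all terms.
# Requires the ExperimentHub package and network access on first use.
find_hub_records <- function(terms) {
    stopifnot(is.character(terms), length(terms) > 0)
    eh <- ExperimentHub::ExperimentHub()
    # query() keeps the records whose metadata match every search term
    AnnotationHub::query(eh, terms)
}

# Example call (search terms are illustrative only):
# hits <- find_hub_records(c("RNA-Seq", "Caenorhabditis"))
# hits              # prints the matching records and their EH ids
# x <- hits[[1]]    # downloads once; later calls reuse the local cache
```

Calls are namespace-qualified so the helper can live inside a package without attaching the hub packages.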


Cheers,




Lori Shepherd

Bioconductor Core Team

Roswell Park Cancer Institute

Department of Biostatistics & Bioinformatics

Elm & Carlton Streets

Buffalo, New York 14263

Julien Wollbrett

Thu, Jan 24, 2019 3:29 AM

Hello,

Thank you for your helpful answer, Lori.
I will create an ExperimentHub package that will contain one fastq file.

I also need to access GTF and transcriptome cDNA files from Ensembl.
On your website I read that publicly available data such as GTF and transcriptome.fa files should not be added to the ExperimentHub, because they can be accessed through the AnnotationHub.

I easily got the path of the transcriptome file I was interested in using these lines of code:

library(AnnotationHub)
ah <- AnnotationHub()
# query the annotation hub
transcriptome_datasets <- query(ah, c("FaFile", "Ensembl", "Caenorhabditis elegans", "Caenorhabditis_elegans.WBcel235.cdna.all.fa"))
# get the local path of the transcriptome dataset
user@transcriptome_path <- transcriptome_datasets[["AH49057"]]$path

I tried to do the same for the annotation GTF file, but I cannot retrieve the local path of the file once it is downloaded.
Instead, I get the content of the file directly:

ah <- AnnotationHub()
# query the annotation hub
annotation_datasets <- query(ah, c("GTF", "Ensembl", "Caenorhabditis elegans", "Caenorhabditis_elegans.WBcel235.84"))
# retrieve the dataset locally and keep the path to the local file
user@annotation_path <- annotation_datasets[["AH50789"]]$path

I have two questions:
- How can I get the path of each file downloaded from the AnnotationHub using its AnnotationHub ID?
- Is it normal that I did not find a C. elegans transcriptome more recent than Ensembl version 81?

Cheers,

Julien


Shepherd, Lori

Thu, Jan 24, 2019 4:39 AM

Thank you for your interest.

Feel free to email me off the devel list if you have any more general questions as you start developing your ExperimentHub package, as I am the core team member responsible for assisting with these.

The GTF files in AnnotationHub are a special case: because of the amount of data, we do a conversion "on the fly" from the Ensembl file rather than downloading it directly to a user's system. If you really need to have the raw file locally, I suggest downloading that file directly (and managing caching through BiocFileCache, http://bioconductor.org/packages/3.9/bioc/html/BiocFileCache.html).
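A minimal sketch of the direct-download route with caching, assuming the BiocFileCache package is installed and R >= 4.0 (for tools::R_user_dir). The helper name cached_download and the cache location are illustrative assumptions, and the caller supplies the URL.

```r
# Hypothetical helper: download a remote file once and reuse the cached copy.
# The default cache directory is an assumption chosen for illustration.
cached_download <- function(url,
                            cache_dir = tools::R_user_dir("BgeeCall", which = "cache")) {
    stopifnot(is.character(url), length(url) == 1L, nzchar(url))
    bfc <- BiocFileCache::BiocFileCache(cache_dir, ask = FALSE)
    # bfcrpath() returns the local path and downloads the file only if the
    # resource name (here, the URL) is not already in the cache.
    unname(BiocFileCache::bfcrpath(bfc, url))
}

# Example call (gtf_url stands for the address of the raw Ensembl GTF file):
# gtf_path <- cached_download(gtf_url)
```

The first call downloads the file. Later calls with the same URL should return the cached path without a new download.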


Why are the raw files necessary rather than processed ones? Keep in mind that we like to see data in standardized Bioconductor formats as well, so make sure your package uses standard classes (https://bioconductor.org/developers/how-to/commonMethodsAndClasses/) and ideally integrates with other packages that analyze similar data.

Bioconductor provides some annotation resources by default, and we rely on outside contributors to provide and maintain the rest.

I recommend splitting the "caenorhabditis" and "elegans" terms in your query to get more hits. Ensembl 94 "on the fly" GTF files are available (these are managed by Bioconductor, but we have not added release 95 yet):

  AH64532 | Caenorhabditis_elegans.WBcel235.94.abinitio.gtf
  AH64533 | Caenorhabditis_elegans.WBcel235.94.gtf
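The term-splitting suggestion can be sketched as follows. The helper split_terms is hypothetical, and the commented hub query assumes AnnotationHub is installed and a network connection is available.

```r
# Hypothetical helper: split multi-word search terms into single words,
# so "Caenorhabditis elegans" becomes "Caenorhabditis" and "elegans".
split_terms <- function(terms) {
    unlist(strsplit(terms, "[[:space:]]+"), use.names = FALSE)
}

terms <- split_terms(c("GTF", "Ensembl", "Caenorhabditis elegans", "94"))
terms
# "GTF" "Ensembl" "Caenorhabditis" "elegans" "94"

# With AnnotationHub installed and network access:
# ah <- AnnotationHub::AnnotationHub()
# AnnotationHub::query(ah, terms)
```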


As for the FASTA to FaFile records, I would have to look into who maintains them and how they were added. For resources that aren't maintained by Bioconductor, we only have what contributors provide.


Cheers,



Lori Shepherd


_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

This email message may contain legally privileged and/or confidential information. If you are not the intended recipient(s), or the employee or agent responsible for the delivery of this message to the intended recipient(s), you are hereby notified that any disclosure, copying, distribution, or use of this email message is prohibited. If you have received this message in error, please notify the sender immediately by e-mail and delete this email message from your computer. Thank you.




Shepherd, Lori

Thu, Jan 24, 2019 5:44 AM

The transcriptome datasets switched to 2bit files. We do provide the updated TwoBitFiles in the AnnotationHub (again, we have not yet added 95 but do have 94).

AnnotationHub with 4 records
# snapshotDate(): 2019-01-14
# $dataprovider: Ensembl
# $species: Caenorhabditis elegans
# $rdataclass: TwoBitFile
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH65579"]]'

            title
  AH65579 | Caenorhabditis_elegans.WBcel235.cdna.all.2bit
  AH65580 | Caenorhabditis_elegans.WBcel235.dna_rm.toplevel.2bit
  AH65581 | Caenorhabditis_elegans.WBcel235.dna_sm.toplevel.2bit
  AH65582 | Caenorhabditis_elegans.WBcel235.ncrna.2bit





Also, to get the path of a downloaded AnnotationHub resource, please use the format


cache(ah["AH50789"])


instead of


ah[["AH50789"]]$path
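A minimal sketch of wrapping the cache() form shown above, assuming the AnnotationHub package is installed. The helper name hub_local_path is hypothetical, and the ID in the example call is taken from the listing earlier in this message.

```r
# Hypothetical helper: return the local file path of one AnnotationHub record.
# 'hub' is an AnnotationHub object; 'id' is a record id such as "AH65579".
hub_local_path <- function(hub, id) {
    stopifnot(is.character(id), length(id) == 1L, grepl("^AH[0-9]+$", id))
    # Single brackets keep a one-record hub (double brackets would load the
    # resource instead); cache() is then called on that one-record hub.
    unname(AnnotationHub::cache(hub[id]))
}

# Example call (needs network access on first use):
# ah <- AnnotationHub::AnnotationHub()
# twobit_path <- hub_local_path(ah, "AH65579")
```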



Cheers,



Lori Shepherd

