[Bioc-devel] Sequences from non-disk sources
Thank you Martin.
I checked, your method works:
s <- read.AAStringSet(file("stdin"), "fasta")
At the C level, a developer can read files non-sequentially,
skipping to various places in the file. This would cause a C-level
error if the input is coming from stdin, because stdin is
a sequential stream.
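The difference between seekable and sequential input can be checked from R itself. This is only a minimal sketch in base R: can_random_access() is a hypothetical helper, not part of Biostrings, and the text connection merely stands in for a forward-only source such as stdin.

```r
# Hypothetical helper (not part of Biostrings): TRUE when the connection
# supports seek(), i.e. a parser could jump around in it.
can_random_access <- function(con) {
  stopifnot(inherits(con, "connection"))
  isSeekable(con)
}

# An ordinary file on disk is seekable.
path <- tempfile(fileext = ".fasta")
writeLines(c(">seq1", "MKV"), path)
disk_con <- file(path, open = "r")
disk_ok <- can_random_access(disk_con)
close(disk_con)
unlink(path)

# A text connection is used here as a stand-in for a forward-only stream.
stream_con <- textConnection(c(">seq1", "MKV"))
stream_ok <- can_random_access(stream_con)
close(stream_con)
```

A parser that checks this up front could fall back to a single sequential pass instead of failing partway through the input.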
It would make Biostrings more robust if an official argument
(e.g. "" for file) were documented under ?read.AAStringSet and
the other read functions, because C-level developers would then
avoid reading files non-sequentially.
Alex
On Wed, Aug 5, 2009 at 9:59 AM, Martin Morgan<mtmorgan at fhcrc.org> wrote:
Michael Lawrence <mflawren at fhcrc.org> writes:
On Mon, Aug 3, 2009 at 7:17 PM, Aleksandr Levchuk <alevchuk at gmail.com>wrote:
Dear BioC developers,
Some of my sequences come from non-disk sources:
- Network
- Un-compressors
- Other tools arranged as pipelines
I was able to stream such sources into R without touching the disk:
=========================
#!/usr/bin/env Rscript
library(Biostrings)
s <- read.AAStringSet("/dev/stdin", "fasta")
#
# operate on s
#
=========================
Assuming the above file is called my.R, I can run:
  chmod +x my.R
  cat my.fasta.gz | gzip -dc | ./my.R
Very powerful and flexible.
But I would like to eliminate my "hackish" /dev/stdin fifo approach.
Hi Alex
from ?stdin it would appear that your hackish approach is close to R's
recommendation; file("stdin") is documented to access the C-level
stdin. For other connections on linux it seems like one needs, e.g.,
gzfile("/dev/stdin"); I don't know about other OS.
The reason this works for things like read.AAStringSet is that at its
root it uses R's built-in functions like 'scan', 'read.table', and
'readLines'. These make use of connections (the thing returned by
file()) without any additional effort on the part of the package
developer.
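To illustrate the point, here is a minimal sketch of a parser built only on readLines(). It is an assumption-laden toy, not the Biostrings implementation: read_fasta_lines() is a hypothetical name, and it handles only simple FASTA text. Because readLines() accepts either a file name or a connection, the same function works on a text connection, a plain file, or a compressed file without extra code.

```r
# Hypothetical helper, not the Biostrings implementation: parse FASTA text
# from anything readLines() accepts (a file name or a connection).
read_fasta_lines <- function(con) {
  lines <- readLines(con)
  lines <- lines[nzchar(lines)]                 # drop blank lines
  is_header <- startsWith(lines, ">")
  headers <- sub("^>", "", lines[is_header])
  record_id <- cumsum(is_header)                # record index of every line
  # Group sequence lines by record; lines before the first header are dropped.
  body <- split(lines[!is_header],
                factor(record_id[!is_header], levels = seq_along(headers)))
  seqs <- vapply(body, paste, character(1), collapse = "")
  names(seqs) <- headers
  seqs
}

# The source here is an in-memory text connection, not a file on disk.
con <- textConnection(c(">sp1 first", "MKVL", "AAG", ">sp2", "GGH"))
seqs <- read_fasta_lines(con)
close(con)
```

The caller opens and closes the connection, so the parser never needs to know where the bytes come from.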
Most package developers write parsers that expect a character
string naming a file, and then use C's fopen or the like to connect
to a simple file. This is partly because the C-level interface to
'connection' objects is not developer friendly. The two challenges are
thus a) connections are not generally available for all parsers and b)
developers are not likely to be in a position to implement them even
if provided a good use case (and your use case is a really nice
illustration that it would be useful for this to work). These are
general statements, and there might be tweaks to existing code that
would allow more flexible use of connections.
If I'm mistaken and there really is an easy way to use connections in
C, then please correct me!
Martin
What about R connections? There's a gzfile() connection that would handle
the case above, as well as network connections, url().
Just as an untested example:
s <- read.AAStringSet(gzfile("my.fasta.gz"), "fasta")
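The decompression step can be tried in base R alone, independently of Biostrings. The sketch below is only an illustration: it writes a small gzip-compressed file to a temporary path (the file contents are made up) and reads it back through gzfile(), which decompresses transparently.

```r
# Write a tiny FASTA record through a gzip-compressing connection.
path <- tempfile(fileext = ".fasta.gz")
out <- gzfile(path, open = "w")
writeLines(c(">seq1", "MKVLAAG"), out)
close(out)

# Read it back; gzfile() decompresses on the fly.
inp <- gzfile(path, open = "r")
fasta_lines <- readLines(inp)
close(inp)

# The file on disk really is gzip data: it starts with the bytes 1f 8b.
magic <- readBin(path, what = "raw", n = 2L)
unlink(path)
```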
I noticed that functions 'write.XStringSet' and 'write.XStringViews' have
an official documented way that allows writing to standard output. Would
it be difficult to add an argument to the Biostrings read functions to
allow reading sequences from standard input?
Alex
--
---------------------------------------------------------------
Aleksandr Levchuk
Bioinformatic Systems and Databases
University of California, Riverside
Institute for Integrative Genome Biology
_______________________________________________ Bioc-devel at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
--------------------------------------------------------------- Aleksandr Levchuk Bioinformatic Systems and Databases Cell Phone: (951) 368-0004 Institute for Integrative Genome Biology University of California, Riverside