Currently, the FASTA read functions in Biostrings handle spaces in sequences inconsistently. This applies to spaces within or at the end of sequence strings. Because of this, users often conclude that Biostrings cannot handle their sequence data and give up on it, which I find unfortunate. For instance, given this sequence stored in "test.fasta":
>123
AATTTAAA GGGG

read.DNAStringSet fails to import this sequence, which is the least desirable outcome:
read.DNAStringSet("test.fasta")
Error in .Call2("read_fasta_in_XStringSet", efp_list, nrec, skip, use.names, :
key 32 (char ' ') not in lookup table
However, read.AAStringSet imports it but keeps the space:
read.AAStringSet("test.fasta")
  A AAStringSet instance of length 1
    width seq              names
[1]    13 AATTTAAA GGGG    123
Wouldn't it make the most sense to remove or ignore spaces during the import?
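In the meantime, one workaround is to clean the file before importing it. Below is a minimal sketch in base R, assuming header lines start with ">" and everything else is sequence. The helper name strip_fasta_spaces is made up for illustration; it is not a Biostrings function.

```r
# Hypothetical helper (not part of Biostrings): remove all whitespace from
# sequence lines while leaving header lines (starting with ">") untouched.
strip_fasta_spaces <- function(lines) {
  is_header <- grepl("^>", lines)
  lines[!is_header] <- gsub("[[:space:]]+", "", lines[!is_header])
  lines
}

# The record from this post, given inline so the example is self-contained.
cleaned <- strip_fasta_spaces(c(">123", "AATTTAAA GGGG"))

# Write the cleaned record to a temporary file for the FASTA reader.
tmp <- tempfile(fileext = ".fasta")
writeLines(cleaned, tmp)
```

With a real file, the input would come from readLines("test.fasta"), and the cleaned temporary file could then be passed to read.DNAStringSet(tmp). This only hides the problem on the user side, so I would still prefer the read functions to deal with spaces themselves.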
Thomas
sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Biostrings_2.24.1  IRanges_1.14.2     BiocGenerics_0.2.0

loaded via a namespace (and not attached):
[1] stats4_2.15.0 tools_2.15.0