Message-ID: <20120522175847.GA730@genomics-59-108.bulk.ucr.edu>
Date: 2012-05-22T17:58:47Z
From: Thomas Girke
Subject: [Bioc-devel] read.XStringSet with spaces in or at end of sequence

Currently, the FASTA read functions in Biostrings handle spaces in
sequences inconsistently. This applies to spaces within or at the end
of sequence strings. Because of this, users often conclude that
Biostrings cannot handle their sequence data and give up on it, which
I find unfortunate.

For instance, given this sequence stored in "test.fasta":
>123
AATTTAAA GGGG

read.DNAStringSet fails to import this sequence, which is the
least desirable outcome:

> read.DNAStringSet("test.fasta")
Error in .Call2("read_fasta_in_XStringSet", efp_list, nrec, skip, use.names,  : 
  key 32 (char ' ') not in lookup table

However, read.AAStringSet imports it but retains the space:

> read.AAStringSet("test.fasta")
  A AAStringSet instance of length 1
    width seq                                               names
[1]    13 AATTTAAA GGGG                                     123

Wouldn't it make most sense to remove/ignore spaces during the import?
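
Until then, one possible workaround is to strip whitespace from the
sequence lines before importing. The following is only a minimal sketch
in base R: strip_fasta_spaces is a hypothetical helper, not a Biostrings
function, and it assumes that header lines start with ">" and that
every other line is sequence data.

```r
# Hypothetical helper (not part of Biostrings): remove all whitespace
# from the sequence lines of a FASTA file. Header lines, assumed to
# start with ">", are left untouched.
strip_fasta_spaces <- function(infile, outfile = tempfile(fileext = ".fasta")) {
  lines <- readLines(infile)
  is_header <- grepl("^>", lines)
  # Drop spaces, tabs and other whitespace inside or at the end of sequences
  lines[!is_header] <- gsub("[[:space:]]+", "", lines[!is_header])
  writeLines(lines, outfile)
  outfile
}

# Recreate the example record (with an extra trailing space) in a temp file
infile <- tempfile(fileext = ".fasta")
writeLines(c(">123", "AATTTAAA GGGG "), infile)

clean <- strip_fasta_spaces(infile)
cleaned_lines <- readLines(clean)
print(cleaned_lines)
```

The cleaned file can then be passed to the import function, e.g.
read.DNAStringSet(clean). This costs an extra pass over the file, which
is why handling spaces inside the reader itself would be preferable.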

Thomas

> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Biostrings_2.24.1  IRanges_1.14.2     BiocGenerics_0.2.0

loaded via a namespace (and not attached):
[1] stats4_2.15.0 tools_2.15.0