[Bioc-devel] read.XStringSet with spaces in or at end of sequence

Hi Thomas,
On 05/22/2012 10:58 AM, Thomas Girke wrote:
Note that this doesn't fail because the letters in an AAStringSet
object can be anything right now, but it's on my TODO list to change
this, i.e. it will become an error to try to store a letter in an
AAStringSet that doesn't belong to the Amino Acid alphabet (stored
in predefined constant AA_ALPHABET).

So the import function to use when one doesn't want to enforce a
particular alphabet is read.BStringSet():

   > read.BStringSet("test.fasta")
     A BStringSet instance of length 1
       width seq                                               names
   [1]    13 AATTTAAA GGGG                                      123

The other functions in the family (i.e. read.DNAStringSet,
read.RNAStringSet, and read.AAStringSet) will fail if the FASTA file
contains letters that are not in DNA_ALPHABET, RNA_ALPHABET, or
AA_ALPHABET, respectively.

According to Wikipedia

   http://en.wikipedia.org/wiki/FASTA_format

yes, spaces and any other invalid characters should be ignored. My concern
with this behavior though is that removing/ignoring letters in the input 
will shift the positions of all the remaining letters, which for
some use cases is not desirable (maybe everything is fine because all
the letters end up at the right position anyway, but maybe not, hard
to tell without knowing why a space was inserted in the file in the
first place).
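
To make the position shift concrete, here is a small base-R sketch (no
Biostrings needed) using the sequence from the test file above. The
variable names are only for illustration.

```r
# Sequence as it appears in the FASTA record, including the embedded space.
raw_seq <- "AATTTAAA GGGG"

# Drop the space, as a strict reading of the FASTA format would.
clean_seq <- gsub(" ", "", raw_seq, fixed = TRUE)

# 1-based position of the first "G" before and after the removal.
pos_before <- as.integer(regexpr("G", raw_seq, fixed = TRUE))
pos_after  <- as.integer(regexpr("G", clean_seq, fixed = TRUE))

nchar(raw_seq)    # 13
nchar(clean_seq)  # 12
pos_before        # 10
pos_after         # 9
```

Every letter after the removed space moves one position to the left,
so any coordinates computed against the original record no longer
line up.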

Note that we have special letters in the DNA/RNA/AA alphabets that
could be used as a replacement for invalid chars:

   > DNA_ALPHABET
    [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+"
   > RNA_ALPHABET
    [1] "A" "C" "G" "U" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+"
   > AA_ALPHABET
    [1] "A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S" "T" "W" "Y"
   [20] "V" "U" "B" "Z" "X" "*" "-" "+"

"-" stands for "gap" and "+" is used for hard masking. IMO they are
both reasonable candidates. I propose to add an extra arg (e.g.
if.invalid.char) to read.DNAStringSet, read.RNAStringSet, and
read.AAStringSet to let the user choose what the substitution letter
should be, e.g. if.invalid.char="+", or if.invalid.char="" (for
removing the invalid letters).

Now should we set its default to "" (and strictly follow the FASTA
spec), or should we set it to NA so by default an error would still
be raised if the file contains invalid chars? I prefer the latter
because I think it's good to let the user know that there is something
uncommon (at best) or potentially wrong with the file.
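
The three behaviors under discussion can be sketched in base R as
follows. This is an illustration only: handle_invalid_chars is a
hypothetical helper, not part of Biostrings, and dna_letters simply
copies the DNA_ALPHABET values printed above so the example runs
without the package.

```r
# Copy of the DNA_ALPHABET values shown above, so no package is required.
dna_letters <- c("A", "C", "G", "T", "M", "R", "W", "S", "Y", "K",
                 "V", "H", "D", "B", "N", "-", "+")

# Hypothetical helper illustrating the proposed 'if.invalid.char' semantics:
#   NA  -> raise an error if any letter is outside 'alphabet'
#   ""  -> drop invalid letters (shifts the positions of what follows)
#   "+" -> replace each invalid letter, preserving positions
handle_invalid_chars <- function(seq, alphabet, if.invalid.char = NA) {
  letters_in_seq <- strsplit(seq, "", fixed = TRUE)[[1]]
  invalid <- !(letters_in_seq %in% alphabet)
  if (!any(invalid))
    return(seq)
  if (is.na(if.invalid.char))
    stop("sequence contains ", sum(invalid), " invalid letter(s)")
  # Assigning "" and then collapsing effectively removes the letter.
  letters_in_seq[invalid] <- if.invalid.char
  paste(letters_in_seq, collapse = "")
}

handle_invalid_chars("AATTTAAA GGGG", dna_letters, if.invalid.char = "+")
handle_invalid_chars("AATTTAAA GGGG", dna_letters, if.invalid.char = "")
```

With "+" the result keeps its original width of 13, with "" it shrinks
to 12, and with the NA default the call stops with an error.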

Thanks for your feedback,
H.