[Bioc-devel] ShortRead readFasta UniProt Incorrect Import

Good day,

If I have a FASTA file that contains
sp|Q9NYW0|T2R10_HUMAN Taste receptor type 2 member 10 OS=Homo sapiens GN=TAS2R10 PE=1 SV=3
MLRVVEGIFIFVVVSESVFGVLGNGFIGLVNCIDCAKNKLSTIGFILTGLAISRIFLIWI
IITDGFIQIFSPNIYASGNLIEYISYFWVIGNQSSMWFATSLSIFYFLKIANFSNYIFLW
LKSRTNMVLPFMIVFLLISSLLNFAYIAKILNDYKTKNDTVWDLNMYKSEYFIKQILLNL
GVIFFFTLSLITCIFLIISLWRHNRQMQSNVTGLRDSNTEAHVKAMKVLISFIILFILYF
IGMAIEISCFTVRENKLLLMFGMTTTAIYPWGHSFILILGNSKLKQASLRVLQQLKCCEK
RKNLRVT

readFasta fails to import it with the warning

proteins <- readFasta('.', "test.fasta")

Warning message:
In .Call2("fasta_index", filexp_list, nrec, skip, seek.first.rec,  :
  reading FASTA file test.fasta: ignored 129 invalid one-letter sequence codes

Also, the amino acid sequence is incomplete. There are 308 amino acids, but
width(proteins)
[1] 178

It's undesirable for users that some amino acids are discarded. Hopefully, they notice the warning message before proceeding with the analysis.

Admittedly, readFasta is in ShortRead, so is designed to work with high througput sequencing reads. But, perhaps it would be better suited to a infrastructure package such as Biobase and generalised to correctly import any FASTA file. There's even a Bioconductor workflow at https://www.bioconductor.org/help/workflows/sequencing/ which has a section titled "DNA/amino acid sequence from FASTA files" and demonstrates the use of readFasta.

I used version 1.34.2 of ShortRead which is the newest one.

--------------------------------------
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia
Good day,

If I have a FASTA file that contains

sp|Q9NYW0|T2R10_HUMAN Taste receptor type 2 member 10 OS=Homo sapiens GN=TAS2R10 PE=1 SV=3
MLRVVEGIFIFVVVSESVFGVLGNGFIGLVNCIDCAKNKLSTIGFILTGLAISRIFLIWI
IITDGFIQIFSPNIYASGNLIEYISYFWVIGNQSSMWFATSLSIFYFLKIANFSNYIFLW
LKSRTNMVLPFMIVFLLISSLLNFAYIAKILNDYKTKNDTVWDLNMYKSEYFIKQILLNL
GVIFFFTLSLITCIFLIISLWRHNRQMQSNVTGLRDSNTEAHVKAMKVLISFIILFILYF
IGMAIEISCFTVRENKLLLMFGMTTTAIYPWGHSFILILGNSKLKQASLRVLQQLKCCEK
RKNLRVT

readFasta fails to import it with the warning

proteins <- readFasta('.', "test.fasta")

Warning message:
In .Call2("fasta_index", filexp_list, nrec, skip, seek.first.rec,  :
   reading FASTA file test.fasta: ignored 129 invalid one-letter sequence codes

Also, the amino acid sequence is incomplete. There are 308 amino acids, but

width(proteins)
[1] 178

It's undesirable for users that some amino acids are discarded. Hopefully, they notice the warning message before proceeding with the analysis.

Admittedly, readFasta is in ShortRead, so is designed to work with high througput sequencing reads. But, perhaps it would be better suited to a infrastructure package such as Biobase and generalised to correctly import any FASTA file. There's even a Bioconductor workflow at https://www.bioconductor.org/help/workflows/sequencing/ which has a section titled "DNA/amino acid sequence from FASTA files" and demonstrates the use of readFasta.
See Biostrings::readAAStringSet (and friends).
I used version 1.34.2 of ShortRead which is the newest one.

--------------------------------------
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

This email message may contain legally privileged and/or...{{dropped:2}}
Hi,

I just modified the Sequence Data workflow to suggest the
use of readDNAStringSet() and family to read in a FASTA file.

Cheers,
H.
On 10/18/2017 01:00 AM, Dario Strbenac wrote:
Good day,

If I have a FASTA file that contains

sp|Q9NYW0|T2R10_HUMAN Taste receptor type 2 member 10 OS=Homo sapiens
GN=TAS2R10 PE=1 SV=3
MLRVVEGIFIFVVVSESVFGVLGNGFIGLVNCIDCAKNKLSTIGFILTGLAISRIFLIWI
IITDGFIQIFSPNIYASGNLIEYISYFWVIGNQSSMWFATSLSIFYFLKIANFSNYIFLW
LKSRTNMVLPFMIVFLLISSLLNFAYIAKILNDYKTKNDTVWDLNMYKSEYFIKQILLNL
GVIFFFTLSLITCIFLIISLWRHNRQMQSNVTGLRDSNTEAHVKAMKVLISFIILFILYF
IGMAIEISCFTVRENKLLLMFGMTTTAIYPWGHSFILILGNSKLKQASLRVLQQLKCCEK
RKNLRVT

readFasta fails to import it with the warning

proteins <- readFasta('.', "test.fasta")

Warning message:
In .Call2("fasta_index", filexp_list, nrec, skip, seek.first.rec,  :
   reading FASTA file test.fasta: ignored 129 invalid one-letter
sequence codes

Also, the amino acid sequence is incomplete. There are 308 amino
acids, but

width(proteins)
[1] 178

It's undesirable for users that some amino acids are discarded.
Hopefully, they notice the warning message before proceeding with the
analysis.

Admittedly, readFasta is in ShortRead, so is designed to work with
high througput sequencing reads. But, perhaps it would be better
suited to a infrastructure package such as Biobase and generalised to
correctly import any FASTA file. There's even a Bioconductor workflow
at
https://urldefense.proofpoint.com/v2/url?u=https-3A__www.bioconductor.org_help_workflows_sequencing_&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=PQf2b4ItE83XXwu2NechRSckpuQ_eISbAf4B017Xrp4&s=zjUnVdFrQnCYVhojezx1OQ3ulJox_FLqiv8GAl_gzsg&e=
which has a section titled "DNA/amino acid sequence from FASTA files"
and demonstrates the use of readFasta.
See Biostrings::readAAStringSet (and friends).

I used version 1.34.2 of ShortRead which is the newest one.

--------------------------------------
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia

_______________________________________________
Bioc-devel at r-project.org mailing list
https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=PQf2b4ItE83XXwu2NechRSckpuQ_eISbAf4B017Xrp4&s=i0gBwUFsMcadakXB1QgRHhPyK-ovrJcS-9_s06Vf0dc&e=

This email message may contain legally privileged and/or...{{dropped:2}}

_______________________________________________
Bioc-devel at r-project.org mailing list
https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=PQf2b4ItE83XXwu2NechRSckpuQ_eISbAf4B017Xrp4&s=i0gBwUFsMcadakXB1QgRHhPyK-ovrJcS-9_s06Vf0dc&e=

Herv? Pag?s

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319