Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GFF module can write unreadable GFF3 sequences. #99

Open
hexylena opened this issue May 29, 2015 · 1 comment
Open

GFF module can write unreadable GFF3 sequences. #99

hexylena opened this issue May 29, 2015 · 1 comment

Comments

@hexylena
Copy link

Discovered that GFF module supports writing out sequences with IDs that don't parse correctly on reading. If you submit a record to GFF for writing which has an ID which includes a space (e.g. gi|564292986| some protein) then it fails to parse that upon trying to read the same file, because the code is looking for an integer there. Alternatively if the record ID contained numbers (gi|564292986| 324 5348) then you could trick it into finding an incorrect sequence start/end value.

Traceback (most recent call last):
  File "example.py", line 18, in <module>
    for rec in GFF.parse(data):
  File "/home/users/cpt/cpt/esr/Projects/cpt/dev/software/python-galaxy-tools/.venv/local/lib/python2.7/site-packages/BCBio/GFF/GFFParser.py", line 738, in parse
    target_lines):
  File "/home/users/cpt/cpt/esr/Projects/cpt/dev/software/python-galaxy-tools/.venv/local/lib/python2.7/site-packages/BCBio/GFF/GFFParser.py", line 328, in parse_in_parts
    cur_dict = self._results_to_features(cur_dict, results)
  File "/home/users/cpt/cpt/esr/Projects/cpt/dev/software/python-galaxy-tools/.venv/local/lib/python2.7/site-packages/BCBio/GFF/GFFParser.py", line 370, in _results_to_features
    base = self._add_directives(base, results.get('directive', []))
  File "/home/users/cpt/cpt/esr/Projects/cpt/dev/software/python-galaxy-tools/.venv/local/lib/python2.7/site-packages/BCBio/GFF/GFFParser.py", line 389, in _add_directives
    val = (val[0], int(val[1]) - 1, int(val[2]))
ValueError: invalid literal for int() with base 10: 'conserved'

Code to reproduce the problem:

import StringIO
from BCBio import GFF
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

data = StringIO.StringIO()
record = SeqRecord(Seq("ACTG"), id="gi|564292986|ref|YP_008873682.1| conserved hypothetical protein [Staphylococcus phage Sb-1]")
GFF.write([record], data)
print "Wrote: ", data.getvalue()
# Open the GFF3 file
data.seek(0)
for rec in GFF.parse(data):
    print rec.id

IMO, it would be best if the GFF module stripped everything after a space in record IDs when writing.

@Swati6783
Copy link

I am using the script glimmergff_to_proteins.py but I am getting this error. Can you suggest me how to troubleshoot since you are familiar with the code.

Traceback (most recent call last):
File "1.py", line 60, in
main(*sys.argv[1:])
File "1.py", line 29, in main
SeqIO.write(protein_recs(glimmer_file, ref_recs), out_handle, "fasta")
File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/init.py", line 463, in write
count = writer_class(fp).write_file(sequences)
File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/Interfaces.py", line 266, in write_file
count = self.write_records(records)
File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/Interfaces.py", line 250, in write_records
for record in records:
File "1.py", line 35, in protein_recs
for rec in glimmer_predictions(in_handle, ref_recs):
File "1.py", line 53, in glimmer_predictions
for rec in GFF.parse(in_handle, target_lines=1000, base_dict=ref_recs):
File "/home/swatis/Downloads/bcbb-master/gff/BCBio/GFF/GFFParser.py", line 739, in parse
target_lines):
File "/home/swatis/Downloads/bcbb-master/gff/BCBio/GFF/GFFParser.py", line 322, in parse_in_parts
for results in self.parse_simple(gff_files, limit_info, target_lines):
File "/home/swatis/Downloads/bcbb-master/gff/BCBio/GFF/GFFParser.py", line 343, in parse_simple
for results in self._gff_process(gff_files, limit_info, target_lines):
File "/home/swatis/Downloads/bcbb-master/gff/BCBio/GFF/GFFParser.py", line 634, in _gff_process
for out in self._lines_to_out_info(line_gen, limit_info, target_lines):
File "/home/swatis/Downloads/bcbb-master/gff/BCBio/GFF/GFFParser.py", line 664, in _lines_to_out_info
results = self._map_fn(line, params)
File "/home/swatis/Downloads/bcbb-master/gff/BCBio/GFF/GFFParser.py", line 177, in _gff_line_map
assert len(parts) >= 8, line
AssertionError: xyz_132|quiver GlimmerHMM mRNA 4568 5678 . - . ID=xyz_132|quiver.path1.gene1;Name=xyz_132|quiver.path1.gene1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants