[pygenomeworks] evaluate_paf script is too slow to be practical for very large PAF files #571

edawson · 2020-09-23T17:34:33Z

Despite recent updates to how the evaluate_paf script handles queries, it is still too slow for large-scale CI jobs.

One solution is to drop the interval tree data structure and rely on sorted PAF input instead. For large PAF files this may still take a significant amount of time, but it should substantially reduce memory usage: only two PAF records would need to be kept in memory at a time, whereas currently all truth-set records are held in memory.
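As a rough illustration of the sorted-input idea, here is a minimal sketch of a streaming merge over two PAF files. It is not the existing evaluate_paf code. It assumes both files are sorted by (query name, target name) in plain byte order (for example with `LC_ALL=C sort -k1,1 -k6,6`), that each query/target pair appears at most once per file, and that a match means all four coordinates agree within a tolerance. The names `Overlap`, `read_paf`, and `evaluate_sorted` are hypothetical.

```python
import io
from typing import Iterator, NamedTuple, Optional, TextIO, Tuple


class Overlap(NamedTuple):
    """The subset of PAF columns needed for the comparison."""
    query: str
    q_start: int
    q_end: int
    target: str
    t_start: int
    t_end: int


def read_paf(handle: TextIO) -> Iterator[Overlap]:
    """Lazily yield one record per non-empty PAF line."""
    for line in handle:
        if not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        # PAF columns (0-based): 0 query name, 2/3 query start/end,
        # 5 target name, 7/8 target start/end.
        yield Overlap(f[0], int(f[2]), int(f[3]), f[5], int(f[7]), int(f[8]))


def positions_match(a: Overlap, b: Overlap, tolerance: int) -> bool:
    """Assumed matching rule: every coordinate agrees within `tolerance`."""
    return (abs(a.q_start - b.q_start) <= tolerance
            and abs(a.q_end - b.q_end) <= tolerance
            and abs(a.t_start - b.t_start) <= tolerance
            and abs(a.t_end - b.t_end) <= tolerance)


def evaluate_sorted(truth: TextIO, test: TextIO,
                    tolerance: int = 0) -> Tuple[int, int, int]:
    """Return (true_pos, false_pos, false_neg) for two sorted PAF streams.

    Only one truth record and one test record are held at any time.
    """
    truth_iter, test_iter = read_paf(truth), read_paf(test)
    t: Optional[Overlap] = next(truth_iter, None)
    c: Optional[Overlap] = next(test_iter, None)
    tp = fp = fn = 0
    while t is not None and c is not None:
        t_key, c_key = (t.query, t.target), (c.query, c.target)
        if t_key == c_key:
            if positions_match(t, c, tolerance):
                tp += 1
            else:
                # Same read pair, different coordinates: the test call is
                # wrong and the truth record is missed.
                fp += 1
                fn += 1
            t, c = next(truth_iter, None), next(test_iter, None)
        elif t_key < c_key:
            fn += 1  # truth record with no counterpart in the test set
            t = next(truth_iter, None)
        else:
            fp += 1  # test record with no counterpart in the truth set
            c = next(test_iter, None)
    # Whatever remains in either stream is unmatched.
    if t is not None:
        fn += 1 + sum(1 for _ in truth_iter)
    if c is not None:
        fp += 1 + sum(1 for _ in test_iter)
    return tp, fp, fn
```

Run time stays linear in the number of records, so very large files will still take a while, but memory use no longer grows with the size of the truth set. Handling repeated query/target pairs would require buffering each group of equal keys, which this sketch leaves out.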

Another option would be to provide random access to bgzipped PAF files, either through tabix or some other API.

edawson changed the title ~~[pygenomeworks] evaluate_paf script is too slow to be practical~~ [pygenomeworks] evaluate_paf script is too slow to be practical for very large PAF files Sep 23, 2020
