[pygenomeworks] evaluate_paf script is too slow to be practical for very large PAF files #571

Open
edawson opened this issue Sep 23, 2020 · 0 comments

edawson commented Sep 23, 2020

Even after the update that improved query handling in the evaluate_paf script, the script is still too slow for large-scale CI jobs.

One solution is to drop the interval tree data structure and instead require sorted PAF input. This may still take a significant amount of time on large PAF files, but it should greatly reduce memory usage: only two PAF records would need to be held in memory at a time, whereas currently all truth set records are kept in memory.
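A rough sketch of what the sorted-input approach could look like, to make the idea concrete. This is not the current evaluate_paf code. It assumes both files are sorted by (query name, target name) in the same byte-wise order (for example with `LC_ALL=C sort -k1,1 -k6,6`), that each query/target pair appears at most once per file, and that a "match" means the query and target intervals both overlap. The names `PafRecord`, `read_paf`, `intervals_overlap`, and `evaluate_sorted` are hypothetical.

```python
import io
from typing import Iterator, NamedTuple, TextIO, Tuple


class PafRecord(NamedTuple):
    """The subset of PAF columns needed for comparison (hypothetical helper)."""
    query: str
    query_start: int
    query_end: int
    target: str
    target_start: int
    target_end: int

    @property
    def key(self) -> Tuple[str, str]:
        # Sort/merge key; ASSUMPTION: both inputs are sorted by this key.
        return (self.query, self.target)


def read_paf(handle: TextIO) -> Iterator[PafRecord]:
    """Lazily yield records from a PAF stream, one line at a time."""
    for line in handle:
        if not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        # PAF columns (1-based): 1 query, 3/4 query start/end,
        # 6 target, 8/9 target start/end.
        yield PafRecord(f[0], int(f[2]), int(f[3]), f[5], int(f[7]), int(f[8]))


def intervals_overlap(a: PafRecord, b: PafRecord) -> bool:
    """ASSUMPTION: a match needs overlap on both query and target."""
    return (a.query_start < b.query_end and b.query_start < a.query_end
            and a.target_start < b.target_end and b.target_start < a.target_end)


def evaluate_sorted(truth: TextIO, test: TextIO) -> Tuple[int, int, int]:
    """Merge-join two sorted PAF streams, holding one record from each.

    Returns (true_positives, false_positives, false_negatives).
    """
    truth_iter, test_iter = read_paf(truth), read_paf(test)
    t, p = next(truth_iter, None), next(test_iter, None)
    tp = fp = fn = 0
    while t is not None and p is not None:
        if t.key == p.key:
            if intervals_overlap(t, p):
                tp += 1
            else:
                # Same read pair but disjoint coordinates: counted as both
                # a missed truth record and a spurious test record.
                fp += 1
                fn += 1
            t, p = next(truth_iter, None), next(test_iter, None)
        elif t.key < p.key:
            fn += 1  # truth record with no counterpart in the test set
            t = next(truth_iter, None)
        else:
            fp += 1  # test record with no counterpart in the truth set
            p = next(test_iter, None)
    # Whatever remains in either stream is unmatched.
    while t is not None:
        fn += 1
        t = next(truth_iter, None)
    while p is not None:
        fp += 1
        p = next(test_iter, None)
    return tp, fp, fn


# Tiny made-up inputs, for illustration only.
TRUTH = (
    "r1\t100\t0\t50\t+\tr2\t100\t10\t60\t50\t50\t60\n"
    "r1\t100\t0\t50\t+\tr3\t100\t0\t50\t50\t50\t60\n"
    "r2\t100\t0\t40\t+\tr3\t100\t0\t40\t40\t40\t60\n"
)
TEST = (
    "r1\t100\t5\t50\t+\tr2\t100\t15\t60\t45\t45\t60\n"
    "r1\t100\t0\t50\t+\tr4\t100\t0\t50\t50\t50\t60\n"
    "r2\t100\t60\t90\t+\tr3\t100\t60\t90\t30\t30\t60\n"
)
counts = evaluate_sorted(io.StringIO(TRUTH), io.StringIO(TEST))
print(counts)
```

Handling several records per query/target pair would need a small buffer for the current key group rather than a single record per stream, so the two-record memory bound only holds under the uniqueness assumption above.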

Another option would be to provide random access to bgzipped PAF files, either through tabix or some other API.

@edawson edawson changed the title [pygenomeworks] evaluate_paf script is too slow to be practical [pygenomeworks] evaluate_paf script is too slow to be practical for very large PAF files Sep 23, 2020