TEPHRA TODO

This file logs feature requests and bugs during development. Keeping a single list should make it easier to track proposed changes. It would be nice to rank the items in order to prioritize tasks. Note that this list is for development purposes and may go away once a stable release is made.

Command `tephra classifyltrs`

Classify 'best' LTR-RT elements into superfamilies based on domain content and organization
Classify elements into families based on cluster organization; Generate FASTA for each family
Create singletons file of ungrouped LTR-RT sequences
Create FASTA files of exemplars for each LTR-RT family
clean up logs and intermediate files from vmatch (dbluster-*)
combine domain organization from both strands (if the same)?
add family classifications to GFF Name attribute
incorporate legacy annotations from input GFF/reference to family classification
merge overlapping hits in chain of protein matches, and concatenate the rest for each element
mark unclassified elements with no protein domains as LARDs
combine exemplars for efficiently comparing to a reference set
identify fragmented elements with refined full-length elements (handled in v0.08.0+ in 'getfragments' command)
include measure of similarity within/between families
use BLAST role to run searches for 'search_unclassified' method in Tephra::Classify::LTRSfams
investigate why UBN2* domains are being used to classify Gypsy (modified regex in v0.17.7 should solve the problem; need to do full-genome test to confirm)
add DIRS and PLE so we are describing all orders in Wicker's scheme
LARD annotation method not working for GFF3 as of v0.11.0
Family number in FASTA/GFF3 not aligned with that in domain organization file
Domain order is incorrect in family-level domain classification file as of v0.11.0
Add tests for Tephra::Annotation::MakeExemplars class. Currently it is not evaluated (requires larger families to test). This will mean tests will take much longer, but the only way to test this feature is to use the Dev test data for all tests, including the 'findltrs' and 'classifyltrs' commands.
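
Note on merging overlapping hits in a chain of protein matches: a minimal sketch of the interval-merging step, written in Python for illustration only (Tephra is written in Perl). The function name and the `(start, end)` tuple representation are hypothetical and not taken from the Tephra code.

```python
def merge_overlapping_hits(hits):
    """Merge overlapping (start, end) hits into sorted, disjoint intervals.

    `hits` is a hypothetical list of (start, end) tuples for one element;
    hits that do not overlap anything are returned unchanged.
    """
    merged = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

The remaining disjoint intervals could then be concatenated in coordinate order for each element.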

Command `tephra findtirs`

Find all non-overlapping TIR elements passing thresholds
Generate combined GFF3 of high-quality TIRs
Check for index (if given)
Add optional test for the presence of coding domains similar to 'LTRRefine' class. This should reduce the number of DTX elements. Add this to the configuration file for the 'all' command, as is done for LTRs.
Mark short elements with no coding potential as MITEs
Output FASTA along with GFF3 like other commands
Split input genome by chromosome to parallelize this (the most time-consuming) part of the analysis
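
Note on splitting the genome by chromosome: a rough sketch of how a multi-FASTA could be split into one file per sequence so that each can be searched in a separate process. This is Python for illustration only; the function names and output naming scheme are assumptions, not Tephra's API.

```python
from pathlib import Path


def read_fasta_records(lines):
    """Yield (id, sequence) pairs from an iterable of FASTA lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            fields = line[1:].split()
            # Use the first whitespace-delimited token as the sequence ID.
            seq_id, chunks = (fields[0] if fields else ""), []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def split_by_chromosome(fasta_path, outdir):
    """Write one FASTA file per sequence; assumes IDs are valid file names."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    paths = []
    with open(fasta_path) as fh:
        for seq_id, seq in read_fasta_records(fh):
            path = outdir / f"{seq_id}.fasta"
            path.write_text(f">{seq_id}\n{seq}\n")
            paths.append(path)
    return paths
```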

Command `tephra sololtr`

Create HMM of LTRs for each LTR-RT
Search masked ref with LTR HMM
Create GFF with SO terms of solo-LTRs
parallelize hmmsearch to speed things up; this is likely faster than using multiple CPUs for one model at a time
check if input directory exists
make sure to set path to correct version of hmmer
add family name to GFF output (the family name is now in the Parent tag)
add option to pick only the top 20 families to speed up execution
consider preprocessing all LTR files so we don't block on one superfamily waiting for threads to finish
if the soloLTR sequence file is empty, delete all other files and warn no soloLTRs were found
evaluate search results as the process completes so the number of (potentially empty) files does not grow too large
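
Note on parallel searches and evaluating results as they complete: one possible shape for this, sketched in Python with the standard `concurrent.futures` module. The `run_search` callable is a hypothetical stand-in for whatever runs hmmsearch on one LTR model and returns its hits; nothing here reflects how Tephra currently does it.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def search_models(models, run_search, threads=4):
    """Run run_search(model) for every model and keep only non-empty results.

    Results are inspected as each search finishes, so empty results can be
    discarded immediately instead of accumulating until the end.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = {pool.submit(run_search, model): model for model in models}
        for future in as_completed(futures):
            hits = future.result()
            if hits:  # skip searches that found nothing
                results[futures[future]] = hits
    return results
```

Threads are used here on the assumption that each search is an external process, so the workers mostly wait on I/O.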

Command `tephra classifytirs`

Classify 'best' TIR elements into superfamilies based on domain content, TSD, and/or motif
Group TIR elements into families based on TIR similarity and/or cluster-based method used for LTR-RT classification
in tests, skip if the output is empty (none found). This is not a good test; a new reference is needed
write fasta of each superfamily, and combined library
identify fragmented elements with refined full-length elements
report domain architecture, as for LTR elements
add MITE annotation to GFF3
add MITE annotation to FASTA

Command `tephra findltrs`

Find all non-overlapping LTR-RTs under strict and relaxed conditions
Filter elements by quality score, retaining the best elements
Generate combined GFF3 of high-quality LTR-RTs
Check for index (if given)
change header format to be ">id_source_range"
adjust filtering command to not increment if element has been deleted (inflated filtering stats)
report superfamilies after the LTR search? Probably better to do that at the classification stage
add options for LTR size parameters
add LTR_Finder (caveat: seems too slow in preliminary tests, probably better to continue refining the current methods)
add config file to handle the multitude of LTR-RT constraints
clean up ltrharvest and ltrdigest intermediate files
Add optional test for the presence of coding domains to 'LTRRefine' class. This should reduce the number of RLX elements.
flag suspicious compound elements somehow
adjust domain organization file to allow referencing a specific element or family (perhaps do the domain summary on each family and combine the results)
adding to the above, a final HTML file with family-level identity and domain organization would be useful
Domain matches
- adjust duplicate domain filtering to consider strand and range of matches
- fix reporting of overlapping domain matches by LTRdigest? (issue reported: genometools/genometools#706)
- add e-value threshold option and domain filtering method
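
Note on duplicate domain filtering by strand and range: a sketch of one way to do it, in Python for illustration only. It assumes each match is a dict with `name`, `strand`, `start`, `end`, and `evalue` keys (a hypothetical layout), and treats a match as a duplicate only if a better-scoring match of the same domain overlaps it on the same strand.

```python
def filter_duplicate_domains(matches):
    """Keep the best-scoring match among same-domain, same-strand overlaps."""
    kept = []
    # Visit matches from best (lowest) e-value to worst.
    for match in sorted(matches, key=lambda m: m["evalue"]):
        duplicate = any(
            k["name"] == match["name"]
            and k["strand"] == match["strand"]
            and match["start"] <= k["end"]
            and k["start"] <= match["end"]
            for k in kept
        )
        if not duplicate:
            kept.append(match)
    return kept
```

An e-value threshold could be applied before this step by dropping matches above the cutoff.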

Command `tephra findhelitrons`

Find helitrons in reference sequences with HelitronScanner
Generate GFF3 of full-length helitrons
Annotate coding domains in helitrons and include domains in GFF
Adjust header for full length elements to match output of other commands
Remove strand from FASTA header for consistency with other commands

Command `tephra findtrims`

Find all non-overlapping TRIMs under strict and relaxed conditions
Filter elements by quality score, retaining the best elements
Generate combined GFF3 of high-quality TRIMs
Create a feature type called 'TRIM_retrotransposon' to distinguish these elements from other LTR-RTs
create developer tests that operate on a larger data set to positively identify elements rather than just checking that the command runs

Command `tephra findnonltrs`

break chromosomes to reduce memory usage in hmmsearch (only applies to HMMERv3)
check HMMERv2 var and program version
remove backticks and shell exec of hmmer
remove nasty regex parsing in favor of BioPerl reading of report
use list form of system to avoid invoking a shell
run domain searches in parallel
use multiple CPUs (make option) for domain searches
write GFF of results
add verbose option so as to not print progress when there are 5k scaffolds
write combined file of all elements
take a multifasta as input and create directories for input/output to methods
use complete elements to find truncated nonLTRs after masking (do this with complete file at the end on masked genome to get fragments for all types)
use domain/blast based method for classifying elements into families
investigate why most elements are reported on the negative strand and contain many gaps
output protein domain sequences for phylogenetic analyses
refactor methods to use shared indexing and domain mapping methods
switch to using HMMERv3 models/programs from HMMERv2
write GO terms to GFF3 for domains

Command `tephra ltrage`

Calculate age for each LTR-RT
Take substitution rate as an option
check if input directory exists
write age to GFF file
Clean up results if requested
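
Note on the age calculation: the commonly used relation for LTR-RT insertion age is T = K / (2r), where K is the divergence between the two LTRs of an element and r is the substitution rate per site per year. The sketch below (Python, for illustration) assumes that relation; this file does not say which divergence model or default rate Tephra uses, so the example values are placeholders.

```python
def ltr_age(divergence, subs_rate):
    """Estimate insertion age in years as T = K / (2r).

    divergence: substitutions per site between the 5' and 3' LTRs (K).
    subs_rate:  substitutions per site per year (r), supplied by the user.
    """
    if subs_rate <= 0:
        raise ValueError("substitution rate must be positive")
    return divergence / (2 * subs_rate)


# Placeholder values: K = 0.026 and r = 1.3e-8 give roughly one million years.
age_years = ltr_age(0.026, 1.3e-8)
```

The factor of 2 reflects that both LTRs accumulate substitutions independently after insertion.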

Command `tephra maskref`

Generate masked reference from custom repeat library
Add outfile option instead of creating filename
Make some kind of statistical report about masking percentage. It would be helpful to format the output like RepeatMasker to give a global view of what was masked.
Clean up the intermediate folders for each chromosome when masking the genome
Create overlapping windows for masking subsets to solve the issue of reduced representation when generating smaller chunks
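
Note on overlapping windows for masking: a minimal sketch of generating window coordinates that overlap by a fixed amount, so repeats spanning a chunk boundary still fall entirely within at least one chunk. Python for illustration; the function name, 0-based half-open coordinates, and parameters are assumptions.

```python
def overlapping_windows(length, window, overlap):
    """Yield 0-based, half-open (start, end) windows covering [0, length).

    Consecutive windows share `overlap` bases; the last window is truncated
    at the end of the sequence.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("require window > 0 and 0 <= overlap < window")
    step = window - overlap
    start = 0
    while start < length:
        end = min(start + window, length)
        yield start, end
        if end == length:
            break
        start += step
```

Masked intervals reported in the overlapping regions would need to be merged afterwards so they are not counted twice.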

Command `tephra illrecomb`

Add correct sequence IDs to report
Investigate the apparent disagreement between the query/subject string and homology strings
Summarize the stats in a more intuitive way so it is clear what the gap summaries mean
Calculate stats from complete repeat database instead of just with LTRs
Do not write all families to disk before processing; do the analysis iteratively as we read the repeat database

Command `tephra tirage`

Update menu for all available options.
Add 3-letter code to age file IDs
Clean up results if requested
Add method to select the top families instead of --all (requires generating families first)

Command `tephra all`

Allow the user to pass a genome and repeat database, along with a species name instead of configuration file.
Generate summary statistics for TE types (domain content, length distribution, diversity, etc.) See (sesbio/transposon_annotation/count_families.pl) for starters.
Generate HTML output for the 'all' command. Will need to store JSON data for graphs and tables.
Add tirage options to configuration file.
Remove FASTA/GFF3 files of unclassified elements once the classification process is complete.
Consider removing all FASTA/GFF3 files except the final annotated products. Could add a 'splitgff3' command to produce separate FASTA/GFF3 files from a single GFF3 if going back to files split by TE type is of interest.
Add method to filter LTRs/TIRs that appear to be duplicated genes. This method may fit better in the individual TE finding programs since the 'all' command is not the only use case of Tephra.
Remove duplicate header in family-level domain organization file
Fix Parent IDs getting mixed up when combining LTRs and TRIMs
Add final statistic showing full-length:solo-LTR:truncated ratios
Investigate vertical alignment of stats in log. This appears in Docker image in v0.12.1

Command `tephra reannotate`

Add tests!
Log input FASTA and database used for transferring annotations

Docker image

reduce EMBOSS install to only required programs
do not install BerkeleyDB and DB_File (Perl) since they are only recommended now, not required, by BioPerl since v1.7x
get tag from github on build so we are not just pulling the main branch, which may be out of sync with the latest tag
