-
Notifications
You must be signed in to change notification settings - Fork 3
/
Changes
774 lines (649 loc) · 43.9 KB
/
Changes
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
Revision history for Tephra
Version Date Location
0.14.0 04/22/2023 Saskatoon, SK
New features:
- Add 'info' subcommand to print a table of configuration info (Perl and Tephra versions, and versions of all installed
external programs.
Bug fixes:
- Adjust regex for getting divergence from PAML. This is something that unfortunately needs to be adjusted with
updates to PAML.
- Fixed the console messages when configuring and installing required packages so the output is much cleaner.
Misc:
- Update Perl deps when building on v5.30+ (Pod::Find).
- Add required modules for fetching muscle (and other packages) over https (Net::SSLeay and IO::Socket::SSL)
during the initial configuration.
- Change how the tephra command is being called when testing the full pipeline. Now it can be tested in place rather
than expecting to be installed, which is much more desirable (avoids version conflicts and having to install the package
to test the command).
- Pin genometools version to v1.6 to ensure we are working with a stable version.
- Update PAML from v4.8 to v4.10.6.
- Capture the noisy output from EMBOSS compile process. Same with HTSlib and Tephra translate program.
- Pin muscle to v3.8.31 and compile from source because the available binaries have issues in my tests.
- Remove use of Travis-CI and switch to Github Actions for CI workflow.
- Add coverage report from Codecov.
- Add test file for new 'info' subcommand to evaluate results that are returned.
- Update Docker and Github Actions to use cpm, which is much faster than to build than cpanm.
0.13.2 07/20/2020 Vancouver, BC
New features:
- Add correct Sequence Ontology terms for TRIM and LARD elements in GFF3 output. I requested these terms previously
and the changes in this version bring the GFF3 output of Tephra to be in agreement with the latest specifications.
- Misc:
- Update singularity usage method to build locally and avoid permissions issues when pulling pre-build images.
0.13.1 07/03/2020 Vancouver, BC
Bug fixes:
- Fix issue with PATHs to Tephra configured programs (vmatch and mkvtree) not being set in *Annotation::MakeExemplars class.
Misc:
- Updated TODO for this issue because it is hard to test (more details there).
0.13.0 06/30/2020 Vancouver, BC
New features:
- Refactor build system to not used shared libraries and only included required dependency files.
Misc:
- Correct alignment of reporting in log for LTR elements.
0.12.6 04/07/2020 Vancouver, BC
New features:
- Add parallel search of both strands for non-LTR retrotransposons, which reduces
user runtime by about 50% if multiple CPUs are available.
- Use multiple CPUs (if requested) for HMM model searches for non-LTR elements.
- Greatly simplify call to class methods for finding non-LTRs based on the use of
new class attributes (req. fewer method arguments).
- Add resusable for writing/annotating elements (both for families and singleton
elements) in Tephra::Classify::Any.
- Warn now when filtering Helitron/non-LTR elements based on N%.
- Add --debug option for monitoring config/install process.
Bug fixes:
- Adjust non-LTR element N% filtering logic to keep FASTA and GFF3 IDs properly in sync.
- Improve logic for classifying Helitron and non-LTR retrotransposon families by storing
full sequence IDs so no regex or transformations are needed.
- Adjust method for writing elements in Tephra::Classify::Any to include all elements in
a cluster (query and hits) above the threshold.
- As with other family-level classification methods in Tephra, search for families only within
a superfamily for Helitrons and non-LTR elements.
Misc:
- Add role method (*store_seq()) for hashing sequences+IDs that can be used in other classes.
0.12.5 08/26/2019 Vancouver, BC
New features:
- Add method for filtering LTR/TIR/TRIM elements against a user-provided gene set to
remove spurious predictions that are likely tandemly repeated genes.
- Handle compressed input for 'findltrs', 'findtirs', 'findtrims', and 'findfragments'
commands.
Bug fixes:
- Configure all external programs during installation instead of relying on those in PATH.
This fixes the issue of not being able to find programs when running in a container.
Fixes #41 on github.
- Correct the number of elements discovered under different relaxed and strict conditions.
What was being reported was the filtered numbers (correct for the total, but not the correct
number initially identified).
- Remove use of FTP protocol for configuring dependencies (and drop Net::FTP requirement). This
protocol is not supported in all build environments, such as Travis, so it is better to use
wget as a permanent replacement than to keep debugging FTP connection issues.
- Configure install of EMBOSS during testing instead using package manager. Also, remove package
installs of BLAST+ and EMBOSS to speed things up since we configure them locally.
0.12.4 05/27/2019 Vancouver, BC
New features:
- Add protein domain coordinates for non-LTR elements to GFF output.
- Parallelize search for EN and RVT domains in 'findnonltrs' command.
Bug fixes:
- Fix method for how the number of non-LTR and Helitron families are calculated. Previously,
there may have erroneously been single-copy families and singletons that should have been
in families due to how the superfamilies where being indexed and combined.
- In the *Annotation::Util class, there was a bug in getting codes for non-LTR superfamilies
with some codes being redefined.
0.12.3 01/04/2019 Vancouver, BC
New features:
- Add support for compressed input to 'all' command (and add tests with compressed data).
- Add new suite of dev tests for the 'findtirs' and 'classifytirs' commands to
ensure the bugs reported below with respect the element/family numbers are
correct.
- New methods for naming MITEs and LARDs so they are numbered sequentially instead
of taking the number of the original TIR or LTR element.
- Add role for removing repeat regions from feature store that is used in LTR/TIR
classification routines.
- Add simplified method for computing superfamily element counts from all routines,
which involves only logging totals at the family stage after singletons are
processed and classified.
- Add new 'find_mites' method to *Classify::TIRSfams class and search for these elements
prior to the family-level classification steps.
- Store all repeat region features my chromosome source, and sort by chromosome then coordinate
when writing features. This logic is used for all classification and reporting steps and
greatly simplifies the code for ensuring the report is correct. Previously, the features
were stored by region name/number only and sorted on that basis.
Bug fixes:
- Bug fix for 'age' command with LTR/TIR coordinates being redefined and thus
not being processed correctly.
- Bug fix for target_site_duplication being assigned a parent of the TE instead
of repeat_region.
- Bug fix for defining the path to search for index files for 'findltr' and
'findtir' commands.
- Bug fix for 'findtirs' command reporting TIRs the same length of the full
element span (this is bug in 'gt tirvish' but it is handled now).
- Bug fix for 'classifyltrs' and 'classifytirs' commands with the number of elements
per family being incorrectly reported.
- Bug fix with MITE IDs not being updated in the FASTA even though the GFF3 includes
these elements.
- Bug fix for logging long run times (fixes #25 on github).
0.12.2 09/24/2018 Vancouver, BC
- Modify the algoritm for how chromosomes/contigs are processed with the 'maskref'
command to pre-process contigs shorter than the split size. This ensures that
the max number of requested threads is always being used, and greatly reduces
the time required to mask large genomes.
0.12.1 09/11/2018 Vancouver, BC
- Update Dockerfile to git for creating an image from latest builds.
- Add method to get muscle so all dependencies are handled during configuration.
- Bug fix with making combined repeat library with 'all' command and not setting
variable for logging results.
- Add Bio::SearchIO::blastxml dependency to cpanfile. This was split out of BioPerl in
in the 1.7x release.
- Add Docker support to this version, and move core install to separate file.
0.12.0 08/02/2018 Vancouver, BC
New features:
- Refactor LTRStats/TIRStats classes to put common methods in GFF Role or Tephra::Stats::Age class.
- Remove 'ltrage' and 'tirage' commands and create single 'age' command to share refactored/common
methods (updated tests for these changes).
- Add 6 new HMM models to local Pfam database for tryrosine recombinases, endonucleases, and
Helitron_like_N for classifying DIRs, non-LTRs, and Helitron elements, respectively.
- Add methods to now handle unformatted repeat database with 'maskref' command. Previous versions
would print a warning that no classifications could be determined based on the absense of the
3-letter code in the header. Now, the Class, Order, and Repeat will be listed as 'Unknown' and
the number of masked bases will be reported.
- Remove redundant methods for finding LTR exemplars and place in common role under the LTR
namespace (called *LTR::Role::Utils).
- Add method to properly check for and annotate LARD elements. The statistics for these elements
are now logged along with other LTR-RTs.
- Add methods for finding TIR exemplars and place in common role under the TIR
namespace (called *TIR::Role::Utils).
- Refactor the control-flow for the 'age' command option and use new methods for getting exemplars,
or if the --all option is given, use new logic to set output directory for both LTR/TIR types.
- Refactor *SoloLTRSearch class to use more descriptive method for finding exemplars, which is now
in the *LTR::Role::Utils class.
- Add unclassified LTR elements to solo-LTR search instead of only Gypsy and Copia.
- Add location of protein domains to family-level domain classfication report.
- Complete re-write of solo-LTR search method. Now the 'sololtr' command uses nhmmer from HMMERv3
and works straight from the LTR alignment so no model is constructed for the search. In addition, we
now write the files as they are processed instead of writing all LTR FASTA files, to disk, then writing
all alignments, etc. for each step in the process. This saves an enormous amount of time and space, and
the method is more stable now as we're not writing thousands of files to disk.
Bug fixes:
- Bug fix for 'maskref' command not cleaning up intermediate sub-directories. Fixes #24 on github.
- Fix minor bug with header being printed multiple times in famliy-level domain architecture
log file.
- Fix major bug caused by a delete hash key statment in the write_families method in the
*Classify::Fams class that was causing the writing of elements to get out of sync,
and even some elements getting dropped from the combined file.
- Fix bug with element IDs being updated in the domain architecture log and GFF3 after
family-level classification but not FASTA. This caused the ID numbers in the FASTA to differ,
but now they are consistent. This change is in the *Classify::Fams class, but required changes
to both the 'classifyltrs' and 'classifytirs' commands to pass new object references containing
a mapping of updated IDs.
- Modify output in domain architecture log file so the results are printed in descending order
by occurrence for easier interpretation of the results.
- Remove delete key statement from loop when creating domain FASTA files for clustering (this fixes
a bug with the abundance and family numbers getting out of sync).
- Properly sort output of families with respect to abundance, and make sure singletons go into singleltons
files and not families.
- Add fragments to final FASTA repeat datbase so IDs in FASTA and GFF3 are consistent.
- Do not remove singleton LTR sequences in family-level classification step. This allows all elements
to now be considered in solo-LTR search.
- Modify common LTR/TIR finding method to not delete FASTA file of feature parts from singleton elements.
These fix allows us to use all LTR/TIR elements in age calculation and all LTR elements in
solo-LTR search.
- Minor bug fix for family-level classification when concatenating duplicate/split domains for clustering.
Previously, the entire span of the domains was written as the location, which would be inclusive
of other domains. This is corrected in the header now.
- Bug fix for reporting order of protein domains in the family-level domain classification reports. The
logic was correct previously, but the use of "keys" instead of a for-loop over an array
caused the domain order to be randomized in the report.
- We now only report one protein domain in the family-level domain classifcation report when two or more
adjacent domains of the same type have been merged into one span. This is how the classification
works, and hopefully the reporting is more logical now.
0.11.0 04/17/2018 Vancouver, BC
- Modify configuration format to have LTRharvest/LTRdigest options at same level as other
'findltrs' options. This simplifies the parsing and usage of the configuration file.
- Add methods for annotating LARD and MITE elements.
- Add method to write family-level protein domain architecture to make it possible to
quickly look up the domain count/organization for individual elements.
- Modify superfamily-level domain architecture method to take an object instead of a
list of variables.
- Fix syntax with family-level clustering class (*Fams::Cluster) where LTR/TIR GFF3
parsing methods return data results without semicolon to terminate return statements.
- Modify GFF3 annotation method in *Fams class to use proper object-based GFF3 parser from
*Role::GFF3 instead of the previous hand-rolled method. The previous line-based parsing
worked for simple cases but became impractical for annotating elements without coding
domains. The updated code is more flexible, cleaner, and does not suffer from duplication
of code as before.
- Automate fetching data from TAIR (chromosome 1 of TAIR 10, specifically) for running dev tests.
- Modify all tests that have 'development' mode designed to fully test certain methods
to use the aforementioned data from TAIR.
0.10.0 03/27/2018 Vancouver, BC
- Remove redundant 'subseq' method from all classes in favor of reusable 'write_element_parts'
role-based methods.
- Remove redundant 'collate' method from all classes to be now consumed by the *Util role.
- Refactor logging classification and domain organization results from TIRSfams/LTRSfams classes
and 'classifytirs' command to use logging role (*Classify::Role::LogResults).
- Modify logging method for LTR/TIR classification results and domain architecture to use
abstract sprintf-based string formatting instead of manual formatting.
- Expand domain architecture reporting for 'classifytirs' command to also report a combined
summary for both strands.
- Use BLAST role for searching unclassified elements in the *LTRSfams class.
- Add *Alignment::Utils class for importing shared methods for manipulating alignments used
in calculation of LTR/TIR ages.
- Modify 'revcom' method in *TIRStats class to ensure that features are parsed and written in
the correct context.
- Add explicit return to methods for development tests and readability (in particular, there were
no explicit returns from many of the non-LTR classes, which have been now updated).
- Return and log TIR FASTA file from 'findtirs' command. This file was generated previously but
not reported.
- Log output file name of LTR FASTA file from 'findltrs' command.
- Remove unclassifed FASTA/GFF3 files after all classification steps in the 'all' command. This process
is logged to make it clear these files are removed once the final annotated files have been produced.
0.09.9 03/22/2018 Vancouver, BC
- Add Tephra::Genome::Unmask class for unmasking all transposons that were found by searching a
masked genome.
- Modify output of the 'all' command to use the unmasking methods described above for corrected
masked bases in FASTA output.
- Remove custom repeat map building approach from 'maskref' method in favor the reusable method
by the same name that is now in the Tephra::Annotation::Util class.
- Modify logging method for 'findltrs' and 'findtrims' methods to write log in the location of the
data instead of working directory.
- Refactor search for index files in 'findltrs' command to use absolute paths and use map/join for
building search pattern instead of using a for-loop to build patterns.
- Modify how temp files are created throughout the package to use File::Temp functional interface.
This makes creating filehandles cleaner since both file and filehandles are returned from the
function tempfile().
- Add all possible masking options to masking command arguments with the 'all' command.
- Rename variables in *LTRRefine class to not redefine hash keys (the previous approach worked but
was unsafe).
- Modify logic for classifying LTRs/TRIMs in 'all' command. Now, there is no code duplication and
all possibilities of whether both LTRs and TRIMs being found are handled.
- Standardize exemplar file names for LTRs/TIRs and solo-LTR search.
- Add explicit return from methods in *Illrecombination and *FragmentSearch classes.
- Add checks if 'maskref' command returned results from subsequences before attempting to collate
results. This removes warnings that would be raised when there was no results.
- Bug fix for not returning TIR classification directory in 'all' command. This is required by the
'tirage' command.
- Bug fix for not setting correct reference for masking with classified TIRs in 'all' command.
- Modify method for creating combined library to only return the FASTA instead of FASTA and GFF3.
- Modify GFF3 combining method to not take temp GFF3 as an argument.
- Update family-level similarity calculation in *Annotation::Util class to not skip TRIMs since these
are not classified to the family level.
- Modify BLAST search method in *Annotation::Util class to return a string of results instead of writing
results to file.
- Refactor family similarity/length calculation method in *Annotation::Util class to check if there are
multi-member families before opening file. This avoids creating a report with only a header.
- Warn on family similarity/length calculation in *Annotation::Util class if no multi-member families are
found and clean up file if it exists.
- Add check in *Analysis::Pipeline class to see if all GFF3 files are non-empty before sorting and combining.
- Refactor logging method in GT role to use standard interface. Log errors with the data being worked on instead
of the working directory from which the command was launched
- Use standard interface for all methods that use genometools by passing in a hash of arguments, a file,
and a logging object.
0.09.8 03/04/2018 Vancouver, BC
- Add checks to 'findnonltrs' command to test whether any elements were found prior to postprocessing
steps. This will fix the errors (issue #22) that were thrown in the case where no elements are
discovered. Now, warnings are raised but the command proceeds and cleans up intermediate files.
- Redesign 'findnonltrs' tests to always run and test the output, or optionally run on a larger input file
in dev mode. Previously the full command was only tested in dev mode.
0.09.7 02/19/2018 Vancouver, BC
- Modify 'findfragments' command processing to deal with draft genomes that may have hundreds
of thousands of contigs. The output is the same as with the previous version. This satisfies
issue #23.
0.09.6 12/04/2017 Vancouver, BC
- Bug fix for 'findfragments' command with very short intervals not being filtered properly.
- Change 'findfragments' length threshold from 200 bp to 100 bp for detecting small TEs
such as MITEs.
- Refactor overlap detection method for 'findfragments' command to use range-based detection
based on sorted coordinates. This is more efficient and less error-prone than the previous
interval tree approach.
- Add method to automatically generate FASTA of fragments with 'findfragments' command, as
with other commands when the output is GFF3.
- Move 'get_full_sequence' method from *NonLTR::GFFWriter class to *Util role.
0.09.5 11/11/2017 Vancouver, BC
- Add report attribute definition to 'sololtr' command class so the attribute is defined
and set properly.
- Change argument name of input file for 'reannotate' command from 'fasta' to 'infile' to
follow conventions of other commands.
- Fix bug with input arguments to BLAST method in Tephra::Annotation::Transfer class that
was preventing 'reannotate' command from functioning properly.
- Fix argument descriptions and documentation for 'reannotate' command to more accurately
reflect the use of the command.
0.09.4 10/04/2017 Vancouver, BC
- Log final masking results of complete TE library against the reference when the 'all' command
is executed.
- Minor adjustment to final GFF3 sorting to ensure IDs are retained for all elements.
0.09.3 09/09/2017 Vancouver, BC
- Fix Vmatch clean command name for removing index files.
- Add TRIM retrotransposons annotation to family-level classifications for
LTR retrotransposons.
- Remove unclassified LTR retrotransposon FASTA file used for family-level classifications. This is essentially a temp file and keeping this along with the FASTA of classified elements is redundant and confusing.
- Add separate roles for logging, command execution, and Tephra subcommand execution.
- Add a parent class for running all Tephra commands in the 'all' subcommand.
- Add Annotation::Util method to get repeat name and superfamily code from transposon ID.
- Reduce method calls by passing log object to command execution methods instead of generating log
object for each method call.
- Adjust logging method in family-level classification method for LTR/TIR elements in order align results in the log for different transposons types.
- Clean up fai index after test completion.
- Clean up masking log files from 'all' command at completion.
- Document --verbose option for 'findnonltrs' command, and change logging behavior so all output goes
to STDERR.
0.09.2 08/08/2017 Vancouver, BC
- Bug fixes for 'tirage' and 'all' commands to correctly parse family name from the TIR
GFF3 and add these to the summary file. The family-generation method for TIR elements
was added in v0.09.0 and the parser in the Tephra::TIR::TIRStats class was not updated
to parse the 'family' attribute.
0.09.1 07/24/2017 Vancouver, BC
- Bug fix for length/identity summary file with header being printed for each family.
0.09.0 07/10/2017 Vancouver, BC
- Add family-level classification methods for TIR, Helitron, and non-LTR retrotransposons.
These methods are now in the Tephra::Classify::Any class.
- Standarize FASTA and GFF3 ID formats for TIR, Helitron and non-LTR retrotransposons. All TEs
now have the same FASTA header format (">familyNumber_elementID_chromosome_start_end"), and GFF3
ID format ("ID=elementID;family=familyNumber;...").
- Move Tephra::LTR::MakeExemplars to Tephra::Annotation::MakeExemplars and add
some abstraction so the methods now work for finding TIR exemplars also.
- Change Tephra::Classify::LTRFams and subclasses to the more abstract
Tephra::Classify::Fams and modify class methods to work for LTR and TIR elements.
- The --all option to 'tirage' is no longer required since family annotations are
modeled the same way as for LTR retrotransposons. It is now possible to omit the --all
option and only calculate age on the top N families.
- Add method to calculate the global similarity within a TE family and log the results
when the 'all' command is run.
- Add RIL clades to superfamily mapping method in Tephra::Annotation::Util class.
- Adjust IDs of non-LTR and Helitron elements to be SO compliant and not require
and tricks for sorting. This standarizes ID formats for TE types.
- Add method to install and configure Vmatch during the build process.
- Remove development flag for tests involving Vmatch ('classifyltrs', 'classifytirs', and
'maskref'). Now these tests will be run on each build and not just manually.
0.08.1 06/12/2017 Vancouver, BC
- Bug fix for non_LTR_retrotransposon IDs getting dropped in final GFF3.
0.08.0 06/07/2017 Vancouver, BC
- Bug fix for Helitron IDs getting dropped in final GFF3 file.
- Bug fix for sorting unclassified matches in LTRFams class. Hits are now sorted
and scored as expected.
- Bug fix for not using output filename from the configuration file. A custom name was
being generated for the GFF3 and the configuration option was previously ignored.
- Bug fix for leaving empty solo-LTR files when none are discovered. Now a warning is
logged when no solo-LTRs are found and the empty files are removed.
- Bug fix for generating expected help menus with long and short options.
- Bug fix for classifytirs test not testing the correct command for all tests.
- Bug fix for not cleaning up all intermediate results from 'all' command as expected (this
only affected the 'findtrims' and 'ltrage' commands).
- Bug fix for not getting substitution rate from configuration file and using default.
- Bug fix for not cleaning up results from ltrage and tirage commands.
- Modify behavior of 'run_blast' method in Tephra::Role::Run::Blast role. The sort
attribute now takes a string to specify how to sort: either by coordinate or bit score.
Also, the option to set the e-value threshold was added.
- Add 'findfragments' command (and Tephra::Genome::FragmentSearch class with associated
methods) for identifying truncated or fragmented elements.
- Refactor 'trim_search' method in Tephra::TRIM::TRIMSearch class to take the mode for the
search ('relaxed' or 'strict') and set variables accordingly instead of duplicating
code across two separate methods.
- Refactor 'ltr_search' method in Tephra::LTR::LTRSearch class to take the mode for the
search ('relaxed' or 'strict') and set variables accordingly instead of duplicating
code across two separate methods.
- Refactor how genometools commands are run and evaluated.
- Refactor all tests to avoid using the shell for running commands.
- Refactor 'findltrs' and 'findtrims' tests to not retry failed tests.
- Renumber test files so they are sequential with not numbers being skipped.
- Add longer usage description to main program and all subcommands.
- Document default thread level for all commands that can use more than one process.
- Add the option to use more than one processor for 'findltrs', 'findtirs', and 'findtrims'
commands.
- Change the default thread level for tests to 2 (was set much higher, 24, for local
testing).
- Add 'tirage' command results to 'all' subcommand.
- Add method to combine and summarize TIR and LTR age results by family to 'all' command.
- Add Tephra::Annotation::Util class for working with transposon names.
0.07.2 05/11/2017 Vancouver, BC
- Add configuration option for 'findltrs' command to filter elements based on the
lack of coding domains. This reduces the number of unclassifed elements.
- Remove duplicated warning about missing logfile option in Tephra::LTR::LTRRefine
class.
0.07.1 04/03/2017 Vancouver, BC
- Add bug fix for genome attribute not being defined when refining TRIM elements.
- Check if domain matches are defined prior to evaluation in Tephra::NonLTR::Postprocess.
0.07.0 03/06/2017 Vancouver, BC
- Add 'all' command to run all commands and generate combined FASTA and GFF3 output
files.
- Modify configuration for 'findltrs' command and add configuration class (Teprha::Config::Reader)
for running all methods.
- Bug fix for 'classifytirs' command to add 3-letter superfamily code to GFF3 file as an attribute.
- Add logger for all methods.
- Add new tests for all command and logging methods.
- Bug fix for cleaning GT index files (method was incorrectly searching subdirectories).
- Simplify setting HTS index for 'classifytirs' method. Now the index is created once and passed
to individual classification methods instead of recreating indexes for each method.
- Simplify running 'findltrs' command by setting all options in configuration file instead of having
a mix of command line options and configuration file.
- Document --clean option for 'maskref' command.
- Document --debug option for 'classifyltrs' command.
- Correct documentation for 'findtrims' and 'tirage' commands by removing mention of LTR files.
- Change 'percentcov' option for 'sololtr' command to be an integer instead of a fraction to be
consistent with other class options.
- Fix description of 'matchlen' option for 'sololtr' command to reflect proper usage (requires
an integer not a fraction).
- Add repeat database file to output of 'maskref' command for better logging.
- Add better regex for getting GFF3 sequence declarations so as to not capture section dividers.
- Add check and clean up of empty files with 'findtrims' command when no elements are discovered.
- Add check for count of elements with 'classifyltrs' and 'classifytirs' command so as to not
trigger warnings when trying to calculate statistics and format the output.
- Deprecated the percent coverage threshold option for 'sololtr' command. The match length and
percent identity are sufficient to filter weak matches and the coverage option added complexity
to the usage and implementation.
- Change 'ltrage' command to take classified LTRs directory as input and process all superfamilies
instead of taking individual superfamily directories as input.
- Fix bug in 'ltrage' command for getting LTR exemplar seqs. A variable that was not being initialized
was tested against for how to process exemplars.
- Bug fix with trying to resolve the configuration directory if it is not set at installation.
- Bug fix with setting BLAST+ directory by removing version from local directory name.
0.06.1 2/09/17 Vancouver, BC
- Add method to evaluate whether the BLAST-merging step increases the number of elements
in families for 'classifyltrs' command over the domain-based clustering method.
For plants, the BLAST-merging step is routinely used for family-level designations and leads
to fewer singletons. For animals, or species with divergent LTR elements, the BLAST-merging
step will create all singletons and no elements in families, so evaluating both procedures
is the current best general approach to classification of LTR families.
- Bug fix for 'findtrims' command that was exiting when no elements were found under strict
conditions. Now this is handled correctly and output will be written.
- Bug fix that was creating duplicate sequence region declarations in GFF3 for 'findnonltrs' command
on some genomes.
- Add tests to see if hmmsearch report was parsed correctly and all fields defined with 'findnonltrs'
command.
- Modify processing of input with 'findnonltrs' command to return generated files directly.
- Bug fix for 'sololtr' command to use SO compliant terms in GFF3 output.
- Adjust output of 'maskref' command to account for different run lengths when formatting results
table.
- Add some error handling for 'maskref' command when there is an issue with the input (or no input genome
file can be found). Now it should die with a message instead of throwing warnings and generating an
empty table.
0.06.0 1/25/17 Vancouver, BC
- Clean up file paths across in all classes to use absolute paths instead
of relative to avoid problems.
- Modify handling of genometools index files to not assume they are in the working directory.
This ensures they will be cleaned up as expected.
- Reduce output of 'findltrs' and 'findtrims' commands to GFF3 only.
- Add --debug option for 'findtrims' command.
- Add -o output file option for 'findtrims' command to clarify what result files are generated.
- Reduce memory usage with 'findhelitrons' command and clean up intermediate files.
- Add checks for required input files to 'maskref' command.
- Modify output of 'findnonltrs' command to have identical basename for FASTA/GFF3.
- Modify output of 'classifyltrs' command to have identical basename for FASTA/GFF3.
- Modify output of 'findtirs' command to have identical basename for FASTA/GFF3.
- Modify output of 'findtrims' command to have identical basename for FASTA/GFF3.
- Modify output of 'findhelitrons' command to have identical basename for FASTA/GFF3.
- Modify output of 'classifytirs' command to have identical basename for FASTA/GFF3.
- Change short option for input GFF3 to 'classifytirs' command to '-i' for consistency.
- Trim end of translated file with 'findnonltrs' command and silence trivial warnings during
translation.
- Remove alignment files generated by 'illrecomb' command by default.
- Add correct 3-letter code to FASTA output of 'findtrims' command for getting the type
when generating masking report.
- Fix bug that was causing chromosomes to not be indexed with 'findnonltrs' command.
- Fix bug that was causing warnings from 'maskref' command due to alignments not being reported.
0.05.0 1/12/17 Vancouver, BC
- Add test to 'sololtr' command to ensure the expected file of LTR exemplar sequences
exists.
- Add requirement of a filename for --seq option with 'sololtr' command, which makes
testing and setting options more stable.
- Add method to process both superfamilies concurrently with 'sololtr' command to
avoid waiting for one to finish before starting to process the other, while
also handling individual sequences with child processes. This greatly speeds up
processing, especially for large genomes.
- Store LTR family statistics when classifying LTRs to ensure the data for each
superfamily is presented together (not split up as other threads are finishing).
- Properly set threads attribute when making LTR family exemplars (was being set to 1
regardless of options).
- Increase Perl version requirement to 5.14, which is the minimum required to
build Bio::DB::HTS.
- Write all errors to STDERR instead of sometimes going to STDOUT for all classes.
- Add the sum of all deletions to report for 'illrecomb' command to get an idea about
the total DNA removal for an LTR family.
0.04.5 12/12/16 Vancouver, BC
- Modify 'illrecomb' command to take combined LTR family FASTA file instead
of input directory of classified elements. This prevents having to keep intermediate
files or regenerate data to calculate patterns of DNA removal.
0.04.4 12/01/16 Vancouver, BC
- Modify 'maskref' command to now generate overlapping subsets of the genome
(the specifics of the length and overlaps can be changed by the user) for
masking. This allows more accurate estimates to be gained with very large
(i.e, megabase sized) subsets, which greatly speeds up the process.
The overlap size is added to the output table for comparing different parameters.
0.04.3 11/09/16 Vancouver, BC
- Add verbose option to 'findnonltrs' command to monitor progress.
- Add method to remove progress for each chromosome when masking with 'maskref' command
to same disk space.
- Quiet the build of PAML when configuring dependencies. We still check for errors but the
user does not need to see the warnings.
0.04.2 11/05/16 Vancouver, BC
- Add new LTR family joining method that is less stringent. Now families are joined based
on shared clustering patterns as opposed to the previous requirement of elements having to
display the exact same clustering pattern to be included in a family.
- Refactor LTR family-finding methods by splitting out clustering methods into separate role.
Now, the superfamily- and family-finding methods are separated in the command class. We now
only call algorithms once instead of separately for each superfamily. This allows all the
methods to be run in parallel (by default).
- Add 'tirage' command for calculating the age of TIR transposons.
- Fix bug with 'maskref' command with so complete refernce is written to final file instead of
the individual split subsets.
- Adjust the --clean option with 'maskref' command so results may be kept for debugging or other
purposes.
- Add check to 'maskref' to ensure the correct number of subsets are created from the reference.
- Add option to modify the percent identity threshold for matches with the 'maskref' command.
- Add method to not block when masking reference and generating masking report with 'maskref' command.
This change results in about 40-50% reduction in runtimes when masking.
- Add method to avoid hardcoding BLAST+ version so the latest is always fetched properly.
- Add method to run finding exemplars with 'classifyltrs' command in parallel.
0.04.1 10/12/16 Vancouver, BC
- Update BLAST+ release.
0.04.0 10/04/16 Vancouver, BC
- Add options for controlling how the solo-LTR finding works. There is a reasonable default now,
20 families, so that the command will not attempt to do hundreds of alignments and run for
days or weeks.
- Modify the masking algorithm to give finer control over how stringent the masking should
be.
- Parallelize the masking procedure to process chunks of chromosomes (default is 50kb) at a time.
This greatly reduces the memory usage and speeds up the process.
- Add method to write a table of masking results to show what percentage of each repeat type was masked
in the reference.
0.03.9 09/08/16 Vancouver, BC
- Adjust FASTA header in findhelitrons output to match the ID in the GFF output. Fix
variable naming issue in Helsearch class with sequence IDs (though it did not appear to
cause issues it could have because the ID variable was being reused/renamed).
0.03.8 08/30/16 Vancouver, BC
- Configure Tephra to install an HMM library of TE models and a database of tRNA sequences.
This change reduces the dependencies to only a genome as the input to the search programs.
0.03.7 08/18/16 Vancouver, BC
- Parallelize sololtr command to speed up the solo-LTR search.
- Update the sololtr command usage menu which did not reflect all of the available options.
0.03.6 08/09/16 Vancouver, BC
- Increase BLAST+ package version.
- Simplify install instructions to use latest release only.
0.03.5 07/28/16 Vancouver, BC
- Add reannotate command for transferring annotations from a reference set to Tephra annotations.
- Fix minor issue with help command not being recognized with main tephra command.
0.03.4 07/06/16 Vancouver, BC
- Bug fix for formatting output of illrecomb command (fixes #16 on github).
- Refactor illrecomb methods to not write intermediate files.
- Add method to clean up intermediate results from findnonltrs command.
- Adjust muscle command for sololtr and illrecomb commands to not throw away errors.
0.03.3 06/24/16 Vancouver, BC
- Bug fix for ltrage command. The wrong arguments to muscle were introduced in changes for v0.03.2, which
has been updated.
0.03.2 06/21/16 Vancouver, BC
- Bug fix in setting the wrong repeat_region ID for elements when evaluating overlaps (fixes #15 on github).
- Bug fix for parsing Helitron IDs from final FASTA file. A minimal parser (no dependencies) was added to the class
to account for Bio::DB::HTS::Kseq not parsing anything past the first whitespace.
- Remove use of clustalw2 everywhere in favor of muscle. This solves a couple of issues, as clustalw2 was quite noisy and
error-prone when compiling from source.
- Increase default number of matches to consider in LTR family clustering/classification in an arbitrarily large number.
This will keep families from split based on the size in case there is an unusually large number of matches.
0.03.1 06/09/16 Vancouver, BC
- Add --all option to 'ltrage' command for calculating age of all LTR-RTs in a GFF
instead of just the default exemplars. This makes getting the age a bit easier now
by just requiring one command as opposed to one for each superfamily.
0.03.0 06/08/16 Vancouver, BC
- Remove use of SAMtools for indexing/extracting subsequences.
- Remove use of Bio::SeqIO (BioPerl) for parsing sequence files.
- Add Bio::DB::HTS for extracting subsequences and parsing sequences.
- Fix bugs in getting TIR element range (previously off by 1).
- Rework solo-LTR search to look for exemplars and extract LTR features of all
elements if they are not found. This avoids duplicating work done in the exemplar-finding step.
- Add test file of exemplars to speed up testing (also, no exemplars would be found in
the small test set)
- Remove use of Bio::Tools::GFF (BioPerl) in favor of Bio::GFF3::LowLevel for parsing GFF features.
This is a made a huge performance improvement, as well as simplifying how features would be formatted
for writing.
- Add strand to non-LTR output (GFF).
- Add the same transposase filtering method to all steps of LTR classification.
- Adjust gap filtering step in non-LTR search to filter elements only in the final stage of classification.
0.02.7 05/06/16 Vancouver, BC
- Bug fix with HMMERv2 path not being set correctly for 'findnonltrs' command.
- Add threshold for percent of Ns in non-LTR-RTs.
- Bug fix to get --help and --man options working correctly.
- Add method to remove leading or trailing gap sequences in 'findnonltrs' command.
- Expand list of transposase-related models to filter prior to LTR family classification.
0.02.6 04/26/16 Vancouver, BC
- Bug fix with LTR-RT exemplar IDs causing issues with generating the LTR exemplar files.
- Bug fix in LTR-RT refinement step to check that elements exist before trying to summarize features.
- Remove used method for getting TE SO terms (added to the 'Util' role to be used by other classes).
0.02.5 04/21/16 Vancouver, BC
- When classifying LTR-RT families, skip transposase domain matches to avoid artifacts in the family assignment.
- Bug fix in 'findhelitrons' command that was writing the coordinates in the wrong order for elements
on the minus strand.
- Bug fix in the 'make_exemplars' method for the LTR-RT classification, which was writing empty files.
- Standardize IDs with other methods when creating LTR-RT exemplars.
- Add debug option for 'findhelitrons' command to optionally show external commands.
0.02.4 04/07/16 Vancouver, BC
- Add method to standarize FASTA identifiers between GFF and FASTA files for the classifyltrs command.
- Write the element name to the output file of the ltrage command instead of the tree file name. Now the FASTA
identifier from the combined families file will be written to the ltrage output.
- Add LTR family name to GFF output of classifyltrs command.
- Remove the generation of separate GFF files for each LTR superfamily as this information can be easily
taken from the annotated GFF file.
- Bug fix for --clean method of ltrage method. A class attribute was missing for this option so it would have
never worked previously.
- Disable building GUI apps for EMBOSS, which are not used internally.
- Add Perl import for configuration (Net::FTP) which is not available in core for older Perls.
0.02.3 04/04/16 Vancouver, BC
- Bug fix for getting URL of deps from sourceforge (which has changed).
0.02.2 03/17/16 Vancouver, BC
- Fix GO term IDs modified incorrectly during version change.
- Add HMMERv3 to configuration to avoid version issues (HMMERv3 from package manager is too old)
and to simplify installation for the user.
0.02.1 03/17/16 Vancouver, BC
- Changes to fix threading issue with ltrage command.
0.02 03/15/16 Vancouver, BC
- Improve annotation of unclassifed LTR elements.
- Add option for installation/configuration in any location.
- Add configuration for 'findltrs' command to facilitate finding elements in a broader number of species.
- Add sighandler for BLAST processes.
- Add separate Install/Config classes for easily getting configuration information.
- Add options for controlling the thresholds of matches when annotation LTR elements.
- Add debug option for seeing the external commands that were executed.
- Add 'development' option for skipping long-running commands that require software that I cannot distribute.
- Set up Travis-CI for automated testing. This works efficiently because we are now using pre-built binaries and a pre-configured
Tephra root directory.
- Add option for substitution rate for ltrage command.
- Add multiple options for controlling the stringency of matches that are counted as real for the sololtrs command.
- Add output of a single collated report from ltrage command instead of individual reports for each element.
- Change input for findnonltrs command from directory of individual sequences to the more convenient multi-FASTA format.
0.01 12/24/15 Vancouver, BC
- Initial release.