Change Log:
This document lists all changes and refactoring made that either:
add new features, break the old API or fix known bugs. There will
often be many more source code changes that aren't listed if they don't
change the behavior in any public API classes. The order of changes is mostly
chronological so the most important changes may not always be ordered first.
================
Jillion 6.0.3
================
Bug Fixes
-----------
1. GappedSequenceBuilder now allows insertions at the end of reference sequences
API Changes
-----------
1. ResidueSequenceBuilder - added extra insert/append/prepend methods that were on sub-interfaces
2. GappedSequenceBuilder now works on any `ResidueSequence` not just `NucleotideSequence`s.
3. Added interface ReverseComplementable to NucleotideSequence
4. added interface Complementable to Nucleotide
5. added ResidueSequence#reverseIterator() and ResidueSequence#computeUngappedSequence()
6. added ResidueSequenceBuilder#appendGap()
7. GappedReferenceBuilder changed to use any ResidueSequence not just NucleotideSequence.
8. Cigar.Builder now has a buildMerged() method which combines consecutive CigarOperations.
For example `3M3M` will be built into a Cigar String of `6M` using buildMerged.
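To illustrate the merging behavior described for buildMerged(), here is a minimal sketch in plain Java. It does not use Jillion's Cigar classes: the CigarMergeSketch class and its mergeConsecutive method are hypothetical names used for illustration only, and the real implementation may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is NOT Jillion's Cigar.Builder.
final class CigarMergeSketch {
    // One CIGAR element: a length followed by a SAM operation code.
    private static final Pattern OP = Pattern.compile("(\\d+)([MIDNSHPX=])");

    /** Merge adjacent operations of the same type, so "3M3M" becomes "6M". */
    static String mergeConsecutive(String cigar) {
        StringBuilder out = new StringBuilder();
        Matcher m = OP.matcher(cigar);
        char prevOp = 0;   // no previous operation yet
        long prevLen = 0;
        while (m.find()) {
            long len = Long.parseLong(m.group(1));
            char op = m.group(2).charAt(0);
            if (op == prevOp) {
                prevLen += len;            // same operation: extend the run
            } else {
                if (prevLen > 0) {
                    out.append(prevLen).append(prevOp); // flush the finished run
                }
                prevOp = op;
                prevLen = len;
            }
        }
        if (prevLen > 0) {
            out.append(prevLen).append(prevOp);         // flush the last run
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(mergeConsecutive("3M3M"));
        System.out.println(mergeConsecutive("2S3M3M1I4M"));
    }
}
```

Only runs of the same operation are combined; different operations that happen to be adjacent are left untouched.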
================
Jillion 6.0.2
================
Bug Fixes
---------
1. ACGTN-only and ACGT-only NucleotideSequence#getNumberOfGapsUntil was returning the wrong value.
API Changes
-----------
1. Added ResidueSequence#getUngappedOffsetForSafe() and ResidueSequence#toUngappedRangeSafe()
which will not throw index out of bounds exceptions if given parameters go beyond sequence length.
2. Added new class SamVisitorFunctions which has Factory Methods for easy visitor implementations.
3. added INucleotideSequence#getLeftFlankingNonGapOffset() and Right flanking
offset and new expanding and contracting flanking Range.
4. added INucleotideSequence#createRightFlankingNonGapIterator(start)
and INucleotideSequence#createLeftFlankingNonGapIterator(start)
5. Added new FastqDownsampler interface to downsample fastq files and FastqDownsamplers class
with various algorithm implementations.
6. FastqWriter#write(FastqRecord[]) and FastqWriter#write(FastqRecord[], begin, end) methods to bulk write arrays.
7. Added FastqParser.iterator() and FastqParser.iterator(FastqVisitorMemento) which returns
new class FastqSingleVisitIterator.
8. FastqSingleVisitIterator is a new Iterator that lets the user visit a fastq file one record at a time, controlling
when the next call to visit will occur.
9. New FastqDownSampler interface with implementations in FastqDownSamplers.
Works on single fastq files and paired-ends fastq files.
10. added PeekableIterator#advanceIf( predicate) and PeekableIterator#advanceWhile( predicate)
11. SamTransformationService and SamGappedReferenceBuilderVisitor can now take provided NucleotideDataStores
for the references and will lazily build the gapped references when they first encounter a read that
aligns to that reference. This allows mapping parts of the human genome without having to load the whole thing!
12. Added RangeMap#forEach
13. Moved NucleotideSequence#findMatches() method to new `MatchableSequence` interface
14. AssemblyTransformer#refOrConsensus() now passes an `INucleotideSequence<?,?>` instead of `NucleotideSequence`
15. added INucleotideSequenceBuilder#toBuilder(initialCapacity) for when you want to create
a new builder with the same sequence but with larger capacity than the current seq length.
16. GappedReferenceBuilder, SamAlignmentGapInserter and related classes now use INucleotideSequence interface
instead of NucleotideSequence class
17. Added some Jackson annotations to NucleotideSequence and Direction so Jillion can be used
to read and write JSON and YAML formatted data with Jackson without needing additional software.
18. SamTransformationService now has INucleotideSequence and INucleotideSequenceBuilder generics added to the class
signature. These generics refer to the type of the REFERENCE used. Constructors made private and new equivalent #create() factory methods
should now be used which handle the creation of the new generics added.
19. Added additional factory methods to SamTransformationService to use different reference Datastores other than
normal fasta files.
20. SamRecordFilter#ungappedReferenceDataStore() now takes a DataStore<INucleotideSequence> instead of NucleotideFastaDataStore
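The advanceIf(predicate) and advanceWhile(predicate) additions to PeekableIterator can be pictured with the sketch below. It is a stand-alone toy, not Jillion's PeekableIterator: the PeekableSketch class is hypothetical, and the return types (a boolean and a count) are assumptions made for this example.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Hypothetical illustration only; return types are assumptions, not Jillion's API.
final class PeekableSketch<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private T next;
    private boolean hasNext;

    PeekableSketch(Iterator<T> delegate) {
        this.delegate = delegate;
        fetch();
    }

    // Pull one element ahead so peek() can look at it without consuming it.
    private void fetch() {
        hasNext = delegate.hasNext();
        next = hasNext ? delegate.next() : null;
    }

    @Override
    public boolean hasNext() {
        return hasNext;
    }

    public T peek() {
        if (!hasNext) {
            throw new NoSuchElementException();
        }
        return next;
    }

    @Override
    public T next() {
        T ret = peek();
        fetch();
        return ret;
    }

    /** Consume the next element only if it matches; report whether it was consumed. */
    boolean advanceIf(Predicate<? super T> predicate) {
        if (hasNext && predicate.test(next)) {
            fetch();
            return true;
        }
        return false;
    }

    /** Consume elements for as long as they match; return how many were skipped. */
    int advanceWhile(Predicate<? super T> predicate) {
        int count = 0;
        while (advanceIf(predicate)) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        PeekableSketch<Character> it =
                new PeekableSketch<>(List.of('-', '-', 'A', 'C').iterator());
        int skipped = it.advanceWhile(c -> c == '-'); // skip leading gaps
        System.out.println(skipped + " gaps skipped, next is " + it.peek());
    }
}
```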
Performance Improvements
------------------------
1. Some internal methods implementations were rewritten to be easier to maintain and improve performance.
2. performance improvements to calculate flank offsets. This improves performance of applications
that use Jillion with general assembly/gap heavy calculations by about 10%.
3. GappedReferenceBuilder which is used by Sam and Cas Assembly Transformer and Gapped Reference Builders
will now use a sparse matrix to keep track of gap insertions if the input reference is very large
(currently > 1 Mbp)
================
Jillion 6.0.1
================
Bug Fixes
---------
1. Changed VariantNucleotideSequence#getTriplets() to examine the underlying read combinations even
if only one slice has variants. 6.0 would only do this possibly more computationally intensive operation if
multiple slices had variants. This fixes a bug where a mis-assembled read that doesn't span the whole codon
should not be included as a variant triplet.
================
Jillion 6.0
================
Performance Improvements
------------------------
1. Performance improvements on uncompressed NucleotideSequences if they are only ACGT or ACGTN
2. Performance improvements parsing and iterating over large Fasta files
3. Improvements for various internal functions that used gap offsets that previously used boxed
Integers but could be replaced with primitive int iterators.
New Features
------------
1. Moved to Java 11
2. Added lombok support
3. Added Vcf support
4. Added ThrowingSupplier
5. Added Support for XZ compressed files in InputStreamSupplier
6. New methods on NucleotideSequence to get ranges of Ns and Percent Ns.
7. Added Tar Support for InputStreamSupplier
8. FastaFileParser and FastqFileParser can now correctly parse compressed archive formats such as "tar.gz",
assuming only the first file entry should be parsed.
This means that fasta and fastq visitors and datastores will now seamlessly work on tar.gz files.
9. InputStreamSuppliers of formats with multiple entries can read a specific entry instead of just the first.
API Changes
-----------
1. ResidueSequence is now Comparable and the default implementation compares toString() values.
2. new NucleotideSequence creation methods that take single Nucleotide objects.
3. ReferenceMappedNucleotideSequence now has a new method computePolymorphisms() which is similar to
the already existing method getDifferenceMap() except it goes further and denotes insertions vs deletions
and supports consecutive differences grouped together.
4. new method NucleotideSequence#isAllGapsOrBlank()
5. new method NucleotideSequence#isAllNs()
6. new method NucleotideSequenceBuilder#ungap(Predicate<Range>) which will only ungap the
passed in gap ranges if they pass the predicate.
7. new factory method NucleotideFastaRecord.of(File) will parse the given fasta file and return the first
record.
8. New factory method NucleotideFastaRecord.createNewIteratorFor( File) will parse the given fasta file and return
a StreamingIterator to iterate over each record.
9. New methods on TranslationTable that take more options. A new TranslationOptions object was created to help reduce
the explosion of new methods to handle every combination of flags.
10. New methods on NucleotideSequence to get ranges of Gaps
11. New feature to InputStreamSupplier allowing nested decompression to support getting the uncompressed stream
from a tar.gz record for example.
12. new method InputStreamSupplier#get(InputStreamSupplierOptions) which allows for more easily setting
the different possible options for fetching an inputStream, now including start/length and nested decompression,
without having the number of methods explode. All previous get() methods with the different parameters
are still present for backwards compatibility.
13. New InputStreamSupplierRegistry to add custom InputStreamSupplier implementations at runtime. Implementations
must implement the new org.jcvi.jillion.spi.io.InputStreamSupplierFactory interface.
14. New constructors NucleotideSequenceBuilder( NucleotideSequence, Range), NucleotideSequenceBuilder( NucleotideSequence, Range...) and NucleotideSequenceBuilder( NucleotideSequence, Iterable<Range>)
to support more efficient creation of builders that only contain partial ranges of a sequence
in a more efficient manner than performing multiple trim operations.
15. New method NucleotideSequenceBuilder#append(NucleotideSequence, Range) to append sequence in
a more efficient manner than performing trim operation.
16. New method ResidueSequence#hasAmbiguities()
17. New method ProteinSequence#computePercentX()
18. added query shift amount to pairwise alignment builders.
19. added new helper method DataStoreUtil#asDataStoreEntryIterator( StreamingIterator<T>, Function<T, String>)
to wrap a StreamingIterator into a DataStoreIterator
20. SplitFastaWriter objects are now synchronized by default.
21. FastaFileDataStoreBuilders now have onlyIncludeIds(Set<String>) which is similar to filter( Predicate<String> )
except we know how many ids there should be so we can use this information in ITERATION_ONLY datastore implementations
to exit the parsing early if we've already found all the ids we care about but have not yet parsed the whole file.
22. New method ProteinSequenceBuilder#copy(Range) to match similar method in NucleotideSequenceBuilder
23. New method NucleotideSequence#computePercentGC()
24. Added Range#startsAfter( Range) and Range#endsAfter
25. Added method NucleotideSequence#hasGaps() and ProteinSequence#hasGaps()
26. ProteinPairwiseSequenceAlignment now extends ProteinSequenceAlignment
27. NucleotidePairwiseSequenceAlignment now extends NucleotideSequenceAlignment
28. SequenceBuilder now has a delete(Range...ranges) with a varargs of Ranges that will handle deleting
multiple Ranges at once correctly. The method is implemented with a default.
29. SingleThreadAdder is now comparable and implements equals and hashcode
30. ArrayUtil IntegerArrayList now has new method intIterator()
which returns a PrimitiveIterator.OfInt introduced in Java 8.
31. TranslationTables now also translate sequences with Uracil.
32. Cigar.Builder now has a trim(Range) method to further trim the sequence with soft clips.
33. Added Cigar#toBuilder() method.
34. Ranges methods now take Collection<? extends Rangeable> instead of Collection<Range> for greater usage.
35. Range#complement(List<Range>) and Range#union(List<Range>) are now
Range#complement(List<? extends Rangeable>) and Range#union(List<? extends Rangeable>) for same reason.
36. New method Cigar.Builder#trim(Range) will update the Cigar's operations so that
everything beyond the given valid range becomes a soft clip.
37. New VCF Domain Specific Language (DSL) for creating VCF files.
38. Added Reserved VCF Info and Filter objects based on VCF 4.3 spec.
39. Added new DecodingOptions class to use inside NucleotideSequenceBuilder to add more configuration options
for invalid character handling and other common string manipulation such as making all ambiguities Ns.
40. GrowableXArrays added replaceIf(predicate, value)
41. SamRecord and related objects are now Serializable
42. new class SamAlignmentGapInserter which performs the extra gap insertions for read alignments to convert SAM/BAM alignments into proper aligned contigs.
43. Added #toArray() methods to NucleotideSequenceBuilder and ProteinSequenceBuilder.
44. Added #getNumberOfXs() to ProteinSequence
45. Added RangeCollectors class
46. Added Sam and Bam parser visitor options to visit multiple Ranges for a given reference.
Previously could only visit one Range at a time so multiple Ranges would have required multiple parses.
47. Added Range#intersectsOrAbuts(Range) which returns a boolean.
48. Added SingleThreadAdder#set(long)
49. Added new class MultipleNucleotideFastaFileDataStore
50. Added new IlluminaUtil#IlluminaName Matcher
51. Added SamParserFactory#Parameters class and builder for parsing options including whether to use an
index file or not even if present (previously always used index which is now default).
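As an illustration of why deleting several Ranges "at once" (item 28 above) needs care, the sketch below deletes multiple ranges from a plain String. It assumes that all ranges refer to coordinates of the ORIGINAL sequence and are zero-based and inclusive; that interpretation, the MultiRangeDeleteSketch class and its delete method are assumptions for this example, not Jillion's SequenceBuilder code.

```java
import java.util.BitSet;

// Hypothetical illustration only; this is NOT Jillion's SequenceBuilder#delete.
final class MultiRangeDeleteSketch {

    /**
     * Delete every zero-based inclusive {begin, end} range, with all ranges
     * interpreted against the original (pre-deletion) coordinates.
     */
    static String delete(String seq, int[]... ranges) {
        // First mark every doomed position; overlapping ranges simply mark twice.
        BitSet doomed = new BitSet(seq.length());
        for (int[] r : ranges) {
            int begin = Math.max(0, r[0]);
            int endExclusive = Math.min(seq.length(), r[1] + 1);
            if (begin < endExclusive) {
                doomed.set(begin, endExclusive);
            }
        }
        // Then rebuild the sequence from the surviving positions in one pass.
        StringBuilder out = new StringBuilder(seq.length());
        for (int i = 0; i < seq.length(); i++) {
            if (!doomed.get(i)) {
                out.append(seq.charAt(i));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Deleting offsets 0-1 and 4-5 of the original sequence.
        System.out.println(delete("ACGTACGT", new int[] {0, 1}, new int[] {4, 5}));
    }
}
```

Calling a single-range delete twice in a row would give a different answer here, because the first deletion shifts every downstream coordinate before the second one is applied.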
API BREAKING CHANGES
--------------------
1. Requires Java 11
2. TranslationVisitor will now visit all Codons not just the ones found between start and stop. This
is to support translating partial sequences where the start codon is missing.
3. TranslationVisitor now has new method visitVariantCodon(long nucleotideCoordinate, List<Codon> codons)
to support variant sequences
4. TranslationVisitor methods now have an additional long parameter to provide the start and
end coordinates (inclusive) of the nucleotides that contributed to the Codon.
5. FastaWriter#adapt() adapter parameter now takes a ThrowingTriConsumer that consumes
the id, adapted sequence and comment instead of returning a new FastaRecord. If the adapter
decides not to pass on the adapted sequence to the delegate, the implementation should not call
the consumer (previously it returned null). This should improve performance and not require
the FastaRecord object to be created as sometimes some implementations don't have easily
accessible constructors. This should also allow for chaining of consumers.
6. SamRecord#getNextOffset() is now called SamRecord#getNextPosition()
7. AssemblyTransformationService#aligned() and #unaligned() methods now have additional parameter `Object readObject` which is the
actual read object the transformer is transforming for you in the event you need to query the object directly (by downcasting)
to get additional information in your transformer.
8. BamFileParser will now call visitHeader() even when just parsing a specified Range
Bug Fixes
---------
1. Building a ReferenceMappedNucleotideSequence from a NucleotideSequenceBuilder
with compression turned off no longer throws a ClassCastException.
2. ProteinSequenceBuilder#copy() and #copy(Range) now correctly account for gaps and ambiguities.
3. FastaWriters can now have their close() method called more than once without throwing an error.
4. Bam parsing only selected regions no longer throws NullPointerException on unmapped reads.
5. Bam Indexing improvements handling incorrectly formatted unaligned reads.
6. Fixed Serialization issues from caching performance improvements introduced in 5.3
7. Fixed SamRecordFlags#remove() methods which would accidentally ADD the flag if it wasn't already present.
8. NucleotideSequenceBuilder won't throw an exception when using a reference AND later inserting bases. As of now inserting
bases will clear the reference field to make it a non-reference based sequence.
9. SamRecordBuilder now initializes flags correctly.
10. Virtual Offsets inside indexed bam files are now correctly computed even when jumping to a particular index.
11. Truncated Bam files are now detected if file ends before index says it should.
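The change log does not say what caused the SamRecordFlags#remove() bug in item 7. One common way a "remove" ends up adding a flag is toggling the bit with XOR instead of clearing it, and the sketch below shows that difference on a plain int. The FlagBitsSketch class and both methods are hypothetical; only the 0x4 (segment unmapped) and 0x10 (reverse complemented) bit values come from the SAM specification.

```java
// Hypothetical illustration only; not Jillion's SamRecordFlags implementation.
final class FlagBitsSketch {
    static final int READ_UNMAPPED = 0x4;    // SAM FLAG bit: segment unmapped
    static final int REVERSE_STRAND = 0x10;  // SAM FLAG bit: reverse complemented

    /** Toggles the bit: removes it if present, but ADDS it if absent. */
    static int removeByToggle(int bits, int flag) {
        return bits ^ flag;
    }

    /** Clears the bit: a no-op when the flag was not set. */
    static int removeByClear(int bits, int flag) {
        return bits & ~flag;
    }

    public static void main(String[] args) {
        int bits = REVERSE_STRAND; // READ_UNMAPPED is not set
        System.out.println(Integer.toHexString(removeByToggle(bits, READ_UNMAPPED)));
        System.out.println(Integer.toHexString(removeByClear(bits, READ_UNMAPPED)));
    }
}
```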
================
Jillion 5.3
================
New Features
------------
1. Kmer Support - A new Kmer class was added. Each Kmer instance has the kmer sequence
and the offset it came from. NucleotideSequence and ProteinSequence now
have new methods Stream<Kmer> kmers( int k) and Stream<Kmer> kmers( int k, Range r)
to get a Stream of all the kmers of size k from either the whole sequence or a subRange.
2. New simplified way to read and write basic bioinformatics file formats with reduced boilerplate code. New classes and static methods
on interfaces were added to turn common use cases into single lines of code. For example, iterating over
the records in a fastq file can now be done with a single line of code to get back a ThrowingStream<FastqRecord>.
Previously, a FastqFileDataStoreBuilder had to be created, built, and the returned datastore had to have either
its streamingIterator method or its stream records method called. All that boilerplate no longer has to be written
by the user. New classes and methods are detailed in API Changes.
3. New static factory methods on some trimmer classes make trimming Traces easier by allowing
the QualityTrimmer or NucleotideTrimmer implementations to take the entire trace object as input instead of
trace.getQualitySequence() and/or trace.getNucleotideSequence(). This makes the code easier to read and write.
4. Added new adapt( Function<Fastq, Fastq>) method to FastqFileWriterBuilder and static adapt method to FastqWriter that
can modify a FastqRecord before writing it. Useful for abstracting away changing the record ids or performing additional trimming.
5. FastqWriterBuilders and FastaWriterBuilders constructors that take File will now parse the output file's
extension and if it's "gz" or "zip" will automatically compress the output accordingly. Currently does not handle nested
compressions or tar but those may be supported in future versions.
6. NucleotideSequence now supports Uracil. It is possible to also have sequences with both Ts and Us since some
therapeutics cataloged by the FDA have such sequences.
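The Kmer feature in item 1 pairs each k-length subsequence with the offset it came from. A rough stand-alone sketch of that idea over a plain String follows; the KmerSketch class, its nested Kmer type and the "offset:sequence" string form are hypothetical and do not reflect Jillion's actual Kmer class.

```java
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

// Hypothetical illustration only; not Jillion's Kmer class or kmers() methods.
final class KmerSketch {

    /** A k-mer together with the zero-based offset it was taken from. */
    static final class Kmer {
        final int offset;
        final String sequence;

        Kmer(int offset, String sequence) {
            this.offset = offset;
            this.sequence = sequence;
        }

        @Override
        public String toString() {
            return offset + ":" + sequence;
        }
    }

    /** Lazily stream every overlapping k-mer of the sequence, left to right. */
    static Stream<Kmer> kmers(String seq, int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be >= 1");
        }
        // When k is longer than the sequence the range is empty, so no k-mers.
        return IntStream.rangeClosed(0, seq.length() - k)
                .mapToObj(i -> new Kmer(i, seq.substring(i, i + k)));
    }

    public static void main(String[] args) {
        System.out.println(kmers("ACGTA", 3)
                .map(Kmer::toString)
                .collect(Collectors.joining(", ")));
    }
}
```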
API Changes
-----------
1. ResidueSequence - Added more Generics to the class signature to specify
the sub-interface and builder classes used. This change shouldn't
affect normal use of these classes but will cause some incompatibility
if you implement your own implementations of Sequences.
2. NucleotideSequence and ResidueSequence now have emptyBuilder() and emptyBuilder(int capacity) methods
that will return new empty NucleotideSequenceBuilders and ResidueSequenceBuilders respectively.
3. StreamingIterator and DataStore stream() methods now return a new ThrowingStream which has extra methods
that can accept functions/consumers that throw checked exceptions. These exceptions are then propagated
up without having to wrap them in runtime exceptions.
4. SamValidationException now extends IOException instead of just Exception. This simplifies catch blocks
and makes it work better with ThrowingStream.
5. New Pair utility class was added to make returning 2-tuples easier. Pair is closeable so it can be used inside
try-with-resources; when it is closed, it will try to close the elements in the pair if they are closeable.
6. Created new FastqFileReader class with several methods to parse a fastq file and get back a Results
object (subclass of Pair) that has both the ThrowingStream<FastqRecord> and the FastqQualityCodec that was used.
This removes the need to make a datastore and have to remember to specify DataStoreProviderHint.ITERATION_ONLY.
7. Created new Trimmer<T> interface which is now the parent interface to QualityTrimmer and NucleotideTrimmer.
8. Modified FastaWriterBuilder implementations so that all methods on them return the actual concrete Builder type
instead of the abstract parent builder class. This lets us chain multiple class specific methods which
wasn't possible before.
9. Added FastaWriterBuilder.sort(Comparator) method which uses default in memory cache size currently set to 1024 records.
10. Created new NucleotideFastFileReader class with several methods to parse a fastq file and get back a Results
object (subclass of Pair) that has both the ThrowingStream<NucleotideFastaRecord> and the FastqQualityCodec that was used.
This removes the need to make a datastore and have to remember to specify DataStoreProviderHint.ITERATION_ONLY.
11. Created NucleotideFastaFileDataStore interface which has a getFile() method. All file based datastores now implement this interface.
12. Added static helper factory methods to NucleotideFastaFileDataStore to simplify creating datastores using Builders with one liners.
13. Made FastqRecordBuilder an interface. There are now a few implementations but they are all package private. Use
FastqRecordBuilder.create(...) methods to create new instances or the new FastqRecord.toBuilder() method to get the particular
implementation best for that record.
14. Added getters and setters to FastqRecordBuilder.
15. Added trim(Range) method to FastqRecordBuilder to simplify one of the most common modifications.
16. Added DataStore.forEach( BiConsumer<String, T>) that will call the given consumer once for each record in the datastore.
This method will often be more efficient than using Iterators.
17. FastqRecord.getAvgQuality() now returns an OptionalDouble instead of a double.
If the sequence is empty, then the returned Optional is also empty.
Previously an empty string threw an Arithmetic error.
18. QualitySequence.getMinQuality() and getMaxQuality() now return an Optional<PhredQuality> instead of a PhredQuality.
If the sequence is empty, then the returned Optional is also empty.
Previously an empty sequence returned null.
19. QualitySequence.getAvgQuality() now returns an OptionalDouble instead of a double.
If the sequence is empty, then the returned Optional is also empty.
Previously an empty string threw an Arithmetic error.
20. Added new methods to Range forEachValue(LongConsumer) and forEachValue(CoordinateSystem, LongConsumer)
that use a primitive consumer of longs. This should be used in preference to Iterable.forEach(Consumer)
which autoboxes and can only use zero based (array offset) coordinates.
21. Renamed SamRecordFlags to SamRecordFlag (no ending "S").
22. Created new SamRecordFlags (with "S") which stores flag bits as int. This is now cached and used as a flyweight for
better performance over storing duplicate Set<SamRecordFlag> over and over again.
23. Added new SamParserOptions class to specify how to parse the sam/bam file including which
reference, alignment range and if to add memento support or not.
24. Nucleotide enum now has Uracil
25. NucleotideSequence now has new methods isDna() and isRna() to check whether the sequence has exclusively Ts or exclusively Us.
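Items 17 and 19 change the average-quality methods to return an OptionalDouble so that an empty sequence yields an empty result instead of an arithmetic error. The sketch below shows the same pattern with the JDK alone; AvgQualitySketch and its avgQuality method are hypothetical and operate on a raw byte array of Phred scores rather than on Jillion's QualitySequence.

```java
import java.util.OptionalDouble;
import java.util.stream.IntStream;

// Hypothetical illustration only; not Jillion's QualitySequence#getAvgQuality.
final class AvgQualitySketch {

    /** Mean of the scores, or an empty OptionalDouble when there are none. */
    static OptionalDouble avgQuality(byte[] phredScores) {
        // IntStream.average() already returns an empty OptionalDouble
        // for an empty stream, so no division by zero can occur.
        return IntStream.range(0, phredScores.length)
                .map(i -> phredScores[i])
                .average();
    }

    public static void main(String[] args) {
        System.out.println(avgQuality(new byte[] {30, 40}));
        System.out.println(avgQuality(new byte[0]));
    }
}
```

Callers then have to decide explicitly what an empty sequence means, for example with orElse(0) or by skipping the record.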
Performance Improvements
------------------------
1. QualitySequence.getAvgQuality()/ getMinQuality()/ getMaxQuality()
Most implementations now cache the computation and perform all the calculations at once.
Previously the computations were performed separately.
2. BAM reading performance improvements by caching costly computations for sam record flags and sequence storage.
Benchmarks show 30% performance improvements reading BAM files.
Bug Fixes
---------
1. Adapted FastqRecordWriter now fixed to actually write adapted record
2. ProteinSequenceBuilder ungap now correctly ungaps the sequence.
================
Jillion 5.2
================
New Features
------------
1. Added new method to FastqWriter to automatically trim given a Range.
This saves users the trouble of creating SequenceBuilders and trimming themselves.
2. Added new method to FastqRecord to get the average Quality of the quality sequence.
The default implementation calls getQualitySequence().getAvgQuality() but some implementations
use a more efficient version.
3. Added new QualityTrimmer SlidingWindowQualityTrimmer which acts like Trimmomatic's SLIDINGWINDOW option.
4. Added new convenience methods to NucleotideTrimmer and QualityTrimmer that take Builders. This is really useful
when performing multiple trimming operations in serial since some trimmers may be able to save CPU cycles
and work directly from the builders.
5. Added new TrimmerPipeline and TrimmerPipelineBuilder classes which can take multiple NucleotideTrimmers
and QualityTrimmers and combine the trimming results for you.
6. Added SamFileDataStore and SamFileDataStoreBuilder to finally provide a higher level API for
working with sam and bam files without needing to use a low level Visitor.
7. Added Optional<File> getFile() to FastqParser and refactored CasParser
implementations to begin to make it easier to extend cas file parsing.
8. Add lambda hook to CasFileTransformationService to override how fastqDataStore is generated so
users could override to provide their own implementation.
9. Added new ConsensusCollectors class that can take Streams of various sequence inputs and compute a consensus.
10. Added new TraceDirPhdDataStoreBuilder class that can make a PhdDataStore implementation from a folder of sanger trace files.
11. AbiChromatogramParser - Added support for ABI 3500 abi files.
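To give a feel for the sliding-window trimming mentioned in item 3, here is a deliberately simplified sketch. It scans windows from the 5' end and cuts at the start of the first window whose mean quality drops below a threshold. This is neither Trimmomatic's exact rule set nor Jillion's SlidingWindowQualityTrimmer; the SlidingWindowSketch class, the keepLength method and the handling of short trailing windows are all assumptions for this example.

```java
// Hypothetical, simplified illustration; not Trimmomatic's or Jillion's exact algorithm.
final class SlidingWindowSketch {

    /**
     * Return how many leading bases to keep: everything before the first
     * window whose mean quality is below minMeanQuality.
     */
    static int keepLength(int[] quals, int windowSize, double minMeanQuality) {
        if (windowSize < 1) {
            throw new IllegalArgumentException("windowSize must be >= 1");
        }
        for (int start = 0; start < quals.length; start++) {
            // Windows near the 3' end are truncated to the remaining bases.
            int end = Math.min(start + windowSize, quals.length);
            long sum = 0;
            for (int i = start; i < end; i++) {
                sum += quals[i];
            }
            // Compare sums instead of means to avoid a division per window.
            if (sum < minMeanQuality * (end - start)) {
                return start;
            }
        }
        return quals.length; // no window failed, keep everything
    }

    public static void main(String[] args) {
        int[] quals = {30, 30, 30, 30, 10, 10, 10, 10};
        System.out.println(keepLength(quals, 4, 20.0));
    }
}
```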
API Changes
-----------
1. Added Trace.getLength()
2. Added default methods to Rangeable for getLength(), getBegin(), getEnd() and isEmpty() since
those are used the most, so callers don't always have to build a new Range object.
3. Added Range.Builder intersect methods
4. Changed TrimmerPipeline methods to be faster by making fewer Range objects and working off of Range.Builders instead.
5. Added new Range.toString() methods that take lambda expressions so users can make their
own toString implementations. There are several overloaded versions
* toString(RangeToStringFunction)
* toString(RangeToStringFunction, CoordinateSystem)
* toString(RangeAndCoordinateSystemToStringFunction)
* toString(RangeAndCoordinateSystemToStringFunction, CoordinateSystem)
to let users convert to different coordinate systems and to
include that coordinate system in the lambda expression or not.
6. Added toGappedRange( Range) and toUngappedRange( Range) to ResidueSequence
with default implementations and more efficient implementation when the codec
knows it doesn't have gaps. Changed AssemblyUtil to use that instead of its own implementation.
7. Added toUngappedRange( Range) to NucleotideSequenceBuilder
8. DataStoreException now extends IOException
9. Added new StreamingIterator.empty() method
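The gapped/ungapped conversions in items 6 and 7 come down to counting the gaps that precede a coordinate. The sketch below shows that idea on a plain String that uses '-' for gaps. UngappedOffsetSketch and its methods are hypothetical, use zero-based inclusive coordinates, and assume both range endpoints sit on non-gap positions; Jillion's own handling of endpoints that fall on gaps may differ.

```java
// Hypothetical illustration only; not Jillion's ResidueSequence implementation.
final class UngappedOffsetSketch {

    /** Ungapped offset = gapped offset minus the number of gaps before it. */
    static int toUngappedOffset(String gapped, int gappedOffset) {
        int gaps = 0;
        for (int i = 0; i < gappedOffset; i++) {
            if (gapped.charAt(i) == '-') {
                gaps++;
            }
        }
        return gappedOffset - gaps;
    }

    /** Convert an inclusive gapped range whose endpoints are both non-gaps. */
    static int[] toUngappedRange(String gapped, int begin, int end) {
        return new int[] {
            toUngappedOffset(gapped, begin),
            toUngappedOffset(gapped, end)
        };
    }

    public static void main(String[] args) {
        // Gapped offsets: A0 C1 -2 G3 T4 -5 -6 A7; ungapped sequence is ACGTA.
        int[] r = toUngappedRange("AC-GT--A", 3, 7);
        System.out.println(r[0] + ".." + r[1]);
    }
}
```

A codec that knows its sequence has no gaps can skip the counting entirely and return the input range, which is the kind of shortcut item 6 describes.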
Bug Fixes
---------
1. BlastParser - fixed bug in XML Blast Parser when it sometimes accidentally set percent identity to be (1 - percent identity).
================
Jillion 5.1
================
New Features
------------
1. Added new methods to FastaDataStore getSequence( id) which gets just the sequence
and is equivalent to get(id).getSequence().
2. Added new methods to FastaDataStore getSubSequence( id, offset) which gets just the sequence
starting from the given offset.
3. Added new methods to FastaDataStore getSubSequence( id, range) which gets just the sequence
that intersects the given range.
4. Added support for Fasta Index Files (.fai) files to NucleotideFastaDataStore.
The NucleotideFastaFileDataStoreBuilder object can now be given an fai file
or auto-detect one and use that to make a more efficient implementation
to be used with the new getSequence() or getSubSequence() methods.
5. Added support for writing Fasta Index Files (.fai) files to NucleotideFastaWriter using
the createIndex(true) method. This will make an additional file named $outputFasta.fai.
Supports normal, zipped and non-redundant fasta files.
6. Added new class FaiNucleotideWriterBuilder that can create new Fasta Index Files (.fai) for
existing fasta files. The builder object supports full configuration of the fai to be written
including the output path, the end of line character, and the Charset.
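For readers unfamiliar with the .fai format used by items 4 to 6, each index line holds five tab-separated fields: sequence name, sequence length, byte offset of the first base, bases per line, and bytes per line including the line terminator. The sketch below computes those fields for FASTA text held in memory. It is not Jillion's writer: FaiSketch and its index method are hypothetical, and the code assumes ASCII content in which every line ends with a single "\n".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; assumes ASCII text and "\n" after every line.
final class FaiSketch {

    /** Build one "NAME\tLENGTH\tOFFSET\tLINEBASES\tLINEWIDTH" line per record. */
    static List<String> index(String fasta) {
        List<String> entries = new ArrayList<>();
        String name = null;
        long length = 0;
        long seqOffset = 0;
        int lineBases = 0;
        int lineWidth = 0;
        long pos = 0; // byte offset of the start of the current line
        for (String line : fasta.split("\n")) {
            if (line.startsWith(">")) {
                if (name != null) {
                    entries.add(format(name, length, seqOffset, lineBases, lineWidth));
                }
                // The name is the header text up to the first whitespace.
                name = line.substring(1).split("\\s+")[0];
                length = 0;
                lineBases = 0;
                lineWidth = 0;
                seqOffset = pos + line.length() + 1; // first base follows the header line
            } else if (name != null && !line.isEmpty()) {
                if (lineBases == 0) {
                    lineBases = line.length();     // bases on a full line
                    lineWidth = line.length() + 1; // plus the "\n" byte
                }
                length += line.length();
            }
            pos += line.length() + 1;
        }
        if (name != null) {
            entries.add(format(name, length, seqOffset, lineBases, lineWidth));
        }
        return entries;
    }

    private static String format(String name, long length, long offset, int bases, int width) {
        return name + "\t" + length + "\t" + offset + "\t" + bases + "\t" + width;
    }

    public static void main(String[] args) {
        index(">seq1 desc\nACGT\nAC\n>seq2\nGG\n").forEach(System.out::println);
    }
}
```

With these fields a reader can seek directly to any base of a record, which is what makes index-backed getSequence() and getSubSequence() calls cheaper than re-parsing the file.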
API Changes
-----------
1. Created new abstract class AbstractReadCasVisitor which is now the parent class of AbstractAlignedReadCasVisitor.
The new class handles iterating over the input read files to link cas alignments to their read names, sequences and qualities.
Now you can extend that class if you want that extra information without realigning to gapped references.
2. Moved FastaUtil to internal package since it should not be used outside of Jillion classes. Heavily refactored it.
3. Improved Javadoc. Many more classes and methods now have javadoc. Hundreds of javadoc comments have been improved
to fix problems found by the javadoc: lint.
4. BlosumMatrices class added support for Blosum30 and 40.
5. Some classes that were in jillion.internal were moved to jillion.shared since all internal classes can't
be exported by OSGI. These classes should not be considered part of the public API and should only be for internal use.
6. FastqFileParser.canAccept() renamed to canParse() to match the other parsers.
Bug Fixes
---------
1. PositionSequence - fixed off-by-one bug in Sanger Position Sequence.iterator(Range),
which did not include the last base in the range.
2. StreamingIterator - the abstract class that many StreamingIterators extend to use a background thread
to populate the iterator has been improved to fix occasional deadlock issues if the background thread throws exceptions.
3. BlastParser - fixed bug in XML Blast Parser that sometimes set percent identity to (1 - percent identity).
================
Jillion 5.0
================
LICENSE CHANGE
Jillion 5 is now LGPL 2.1. Previous versions of Jillion are GPL 3 and will remain that way.
Jillion 5 now uses the same license as Bio* projects and commercial software may now
use Jillion's jar file in their software.
New Features
------------
1. Added LucyVectorSpliceTrimmer that performs vector splice trimming using
a simplified version of the algorithm that the TIGR program Lucy used.
2. Added new SplitFastaWriter and SplitFastqWriter classes which have 3 factory methods to make
different Writer implementations that split up writing records to different files using different
strategies: roundRobin(), rollover() and deconvolve(). Each method takes a lambda function
to create the new individual writers, and deconvolve() takes a second lambda which determines which
output file the record will go to.
3. FastqWriterBuilder and FastaWriterBuilders for Nucleotide, Protein, Quality and Position files
can now sort records using a Comparator. Sorting entirely in memory and
sorting all the records with the help of temp files are both supported. An additional overloaded
sort() method takes a File object that is the directory to create the temp files in
(default directory is System temp). Using the temp files to help with sorting
allows writing very large sorted output files that would not all fit in memory.
4. FastqFileDataStoreBuilder and FastaFileDataStoreBuilders for Nucleotide, Protein, Quality and Position files
can now filter ids by Predicate<String>. Previously you had to implement the DataStoreFilter interface.
5. FastqFileDataStoreBuilder and FastaFileDataStoreBuilders for Nucleotide, Protein, Quality and Position files
can now filter records by Predicate<FastqRecord> and Predicate<FastaRecord> respectively.
This allows you to very easily include/exclude records from the DataStore using criteria
other than the record id. This also removes a lot of boilerplate code of iterating through the file
multiple times to make a second datastore of only the data you wanted.
For example, to make a NucleotideFastaDatastore where the sequences are all > 1000bp:
    new NucleotideFastaFileDataStoreBuilder(fastaFile)
            .filterRecords(record -> record.getLength() > 1000)
            .build();
6. Changed DataStoreFilter, ReadFilter and SliceElementFilter to now have the parent interface Predicate<T> and changed
all the APIs that use these classes to take a Predicate instead.
This lets you use lambda expressions anywhere a filter was used before which is much easier to read,
is fewer characters to write and allows filters to be more reusable. For example:
    new ContigCoverageMapBuilder<>(contig)
            .filter(read -> read.getDirection() == Direction.FORWARD)
            .build()
will make a CoverageMap object that only contains forward reads.
7. Created new GenomeStatistics class which is a utility class for computing different
statistical measurements about genomes (for example N50). It uses the new Java 8
Collector interface. For example to compute the N50 of all the records in a Fasta file:
    try(NucleotideFastaDataStore datastore = new NucleotideFastaFileDataStoreBuilder(fastaFile)
                                                .hint(DataStoreProviderHint.ITERATION_ONLY)
                                                .build();
        Stream<NucleotideFastaRecord> stream = datastore.iterator().toStream();
    ){
        OptionalInt n50Value = stream
                                .map(fasta -> fasta.getLength())
                                .collect(GenomeStatistics.n50Collector());
        //return value is optional because there might not be any records!
        if(n50Value.isPresent()){
            System.out.println("N50 = " + n50Value.getAsInt());
        }
    }
8. Created new CoverageMapCollectors class which is a utility class for creating
Java 8 Collector objects that create CoverageMap objects. For example,
if you had a contig and wanted a coverage map of the alignment locations of
just the forward reads capped to a max of 200x coverage the code would look like this:
    CoverageMap<Range> forwardCoverageMap200x = contig.reads()
            .filter(read -> read.getDirection() == Direction.FORWARD)
            .map(AssembledRead::asRange)
            .collect(CoverageMapCollectors.toCoverageMap(200));
9. Performance improvements to Fastq file parsing. When not using Mementos
or DataStoreProviderHint.RANDOM_ACCESS_OPTIMIZE_MEMORY (which uses mementos)
parsing time is now improved by 400%!
10. Performance improvements of FastqFileParser and built in FastqWriter implementation for the most common
use-case of parsing a fastq file and writing out the FastqRecord instances as is to a different writer.
New internal classes are now used which don't convert the encoded quality strings into QualitySequence
objects unless getQualitySequence() is called. This takes up slightly more memory per record
which usually isn't an issue because most of the time the files are streamed as ITERATION_ONLY
and so the records will be GC'ed as soon as they are out of scope in the iterator.
When tested on large 25 million read fastq files from 1000genomes project, throughput improved by 25%.
11. Added new DataStoreFilters factory method: containedInDataStore(DataStore datastore) which will only accept ids that are
also contained in the given datastore.
12. Created InputStreamSupplier with support for normal files, zipped and gzipped files.
This lets users parse compressed files multiple times. Previously
compressed files could only be parsed a single time via
InputStream constructors.
13. Performance improvements of parsing BAM files including now using the BAM index if present
to help skip over unnecessary parts of the file.
14. AceFileParser - more lenient Consensus Tag timestamp parsers to support CLC Workbench ace output
which doesn't follow the ace file spec regarding timestamp resolution.
15. Jillion 5 is now OSGI compliant and can now be used in an OSGI container.
All classes except for those under org.jcvi.jillion.internal.* are exported.
16. BtabWriter implementation created by the BtabWriterBuilder can now
format the dates differently by using alternate Locales.
17. Added new default method to SamAttributeValidator: thenComparing(SamAttributeValidator other) which returns
a new SamAttributeValidator that checks both validators in a chain and only passes if both validators pass
the attribute. Uses a similar construction to the new Java 8 Comparator.thenComparing(...) methods.
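For readers unfamiliar with the statistic behind the GenomeStatistics example in item 7, here is a minimal sketch of an N50 computation written as a Java 8 Collector using only the JDK. It is an illustration under assumptions: N50Sketch and its n50Collector() are hypothetical names, and this is not the GenomeStatistics implementation. N50 is taken here as the largest length L such that sequences of length L or more account for at least half of all bases.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.OptionalInt;
import java.util.stream.Collector;
import java.util.stream.Stream;

// Hypothetical illustration, NOT the Jillion GenomeStatistics class.
final class N50Sketch {

    /** Collects sequence lengths and finishes by computing their N50. */
    static Collector<Integer, List<Integer>, OptionalInt> n50Collector() {
        return Collector.<Integer, List<Integer>, OptionalInt>of(
                ArrayList::new,
                List::add,
                (left, right) -> { left.addAll(right); return left; },
                N50Sketch::n50);
    }

    private static OptionalInt n50(List<Integer> lengths) {
        if (lengths.isEmpty()) {
            // no records means there is no N50 to report
            return OptionalInt.empty();
        }
        List<Integer> sorted = new ArrayList<>(lengths);
        sorted.sort(Collections.reverseOrder());
        long total = 0;
        for (int length : sorted) {
            total += length;
        }
        long running = 0;
        for (int length : sorted) {
            running += length;
            // stop at the first length where the running sum reaches half of the total
            if (2 * running >= total) {
                return OptionalInt.of(length);
            }
        }
        return OptionalInt.empty(); // not reached for non-empty input
    }

    public static void main(String[] args) {
        OptionalInt n50 = Stream.of(80, 70, 50, 40, 30, 20).collect(n50Collector());
        n50.ifPresent(value -> System.out.println("N50 = " + value));
    }
}
```

Sorting inside the finisher keeps the accumulator trivial, at the cost of holding every length in memory until the stream ends.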
API Changes
-----------
1. Added Java 8 Lambda Support to many APIs.
2. Moved quality trimming classes that used to be in org.jcvi.jillion.core.qual.trim to new
org.jcvi.jillion.trim package. LucyQualityTrimmer moved to new org.jcvi.jillion.trim.lucy package.
3. Added new FastqFileDataStore interface which is a sub-interface of FastqDataStore and adds one method getQualityCodec()
which returns the FastqQualityCodec object that was used to encode all the fastq records in the file.
FastqFileBuilder objects have changed to return this new interface.
4. Added new method Stream<T> Contig.reads() which returns a new Java 8 Stream of reads of the appropriate type.
The Stream can then be used in any normal Java 8 Stream chain or as input to one of the new Jillion
Collectors described in the New Features section.
5. Added StreamingIterator.toStream() method which returns a Java 8 Stream<T> to easily
convert from Jillion StreamingIterators to the new Java 8 API. The Stream still needs to be closed
so putting it inside a try-with-resource is still recommended.
6. Changed Position Fasta package to conform to the same API as the other Fasta packages for
nucleotides, proteins and qualities. There is now a PositionFastaFileDataStoreBuilder
to make the datastores with the same filter and hint methods as the other similar builders.
All the previous DataStore implementation classes that used to be public are
now package private. Please use the PositionFastaFileDataStoreBuilder only.
7. Made FastqFileParser class package private. Please use the new FastqFileParserBuilder object
instead which has more configuration methods to more easily create FastqParser objects.
This was to avoid the explosion of factory methods for FastqParser to handle all the
possible combinations of inputstreams, Files, compressed Files, comments on defline,
and multiline sequences.
8. Added new method FastqRecord.getLength() which is a convenience method for FastqRecord.getNucleotideSequence().getLength()
but some implementations may use a more optimized implementation. Added as default method.
9. Added new method FastaRecord.getLength() which is a convenience method for FastaRecord.getSequence().getLength()
but some implementations may use a more optimized implementation. Added as default method.
10. Moved experimental code from its own "experimental" and "experimental-test" folders to the
normal src and test folders.
11. Changed package names of the experimental code from org.jcvi.jillion_experimental.*
to org.jcvi.jillion.experimental.*
12. SWITCHED BUILD TOOL FROM ANT TO MAVEN. Use custom configuration to keep original
folder structure.
13. Removed jodatime dependency from ConsedAssemblyTransformerBuilder which was accidentally
put in during testing to allow unit tests to use a fixed phred_date. This caused the jillion
jar to depend on jodatime when it should not have any dependencies other than JDK 8.
Replaced jodatime code with equivalent Java 8 Clock object.
14. Changed FastqFileParser and FastqFileDataStoreBuilder to use
InputStreamProvider objects instead of Function<File, InputStream> which was not only repetitive
but forced users to handle IOException themselves.
15. Changed FastqFileParser and FastqFileDataStoreBuilder constructors that take File objects
to delegate to InputStreamProvider.forFile( file) which handles the detection and decompression for you.
This means you can give the constructors
zipped or gzipped files and it will work as if it was uncompressed.
16. Added zip and gzip support to FastaFileParser and the FastaFileDataStoreBuilders.
18. Changed SamVisitor API to remove visitRecord(Callback, SamRecord) which was only called when visiting SAM files,
and not BAM files. Now all records visited will call
visitRecord(Callback callback, SamRecord record , VirtualFileOffset start, VirtualFileOffset end)
where the start and end parameters will now be null if it's a SAM file and non-null if it's a BAM file.
Previously start and end would never be null but the method was only called when visiting BAM files. This led to a lot
of confusion and duplicated code when dealing with both SAM and BAM files.
Changed SamParser API methods from canAccept() and accept(...) to canParse() and parse(...).
19. Added new method to the SamParser API void parse(String referenceName, SamVisitor visitor)
which will only visit the SamRecords in the file that map to the given reference. Sorted Bam parser implementations
can use the bam index if available to quickly seek right to the part of the bam file where the alignments for the
specified reference are stored.
20. Added new method to the SamParser API void parse(String referenceName, Range alignmentRange, SamVisitor visitor)
which will only visit the SamRecords in the file that map to the given reference and whose alignments intersect the
given alignmentRange. Sorted Bam parser implementations can use the bam index if available to quickly seek right
to the parts of the bam file where the alignments for the specified reference are stored.
21. Added new factory methods to SamParserFactory: createFromBamIndex(File bam, File bamIndex) and
createFromBamIndex(File bam, File bamIndex, SamAttributeValidator validator) that use a more efficient
bam parser that uses the provided index file to randomly access alignment information.
22. Changed SamParserFactory.create(File) to have additional checks to see if the given file
is a coordinate sorted BAM file, and if it is, check to see if it also has a corresponding BAI
file and if it does, then use the new parser implementation that uses the index as if
createFromBamIndex(File bam, File bamIndex) was called.
23. Added new helper method SamRecord.getAlignmentRange()
24. Added SamVisitor Memento support with new SamParser.parse(visitor, SamVisitorMemento) method.
Previously you could create mementos but couldn't use them.
25. SamRecord.Builder is now pulled out into its own class SamRecordBuilder.
26. Refactored SamRecord from a class to an interface. The old SamRecord class is now package private.
All API methods in sam package now use the new SamRecord interface instead of the old class.
27. BtabWriter added new locale(Locale) method to change the Locale for the Date formatting.
If not called, the default Locale is used. Previous versions of Jillion always used the default Locale.
28. Created new SamAttributed interface which has the methods hasAttribute(...) and getAttribute(...).
SamRecord and SamRecordBuilder now both implement this interface.
29. Added additional parameter to SamAttributeValidator to add a SamAttributed instance. This will be
the source that the attribute is from. This allows new validators to be written to check other attributes
from the same source.
30. Created new SamAttributeValidator singleton class NoDuplicateSamAttribute that makes sure the given
SamAttributeKey isn't already used by the record, which would be a violation of the SAM specification.
31. Added new default method to SamAttributeValidator: thenComparing(SamAttributeValidator other) which returns
a new SamAttributeValidator that checks both validators in a chain and only passes if both validators pass the attribute.
Uses a similar construction to the new Java 8 Comparator.thenComparing(...) methods.
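The chaining pattern behind the new default method can be sketched with a small stand-in interface. This is an assumption-laden illustration: AttributeCheck, passes() and thenChecking() are hypothetical names chosen for the sketch and do not reproduce the SamAttributeValidator signatures. It only shows how a default method can return a new instance that passes when both links in the chain pass.

```java
import java.util.Objects;

// Hypothetical stand-in, NOT the Jillion SamAttributeValidator interface.
@FunctionalInterface
interface AttributeCheck {

    /** Returns true if the attribute with the given key and value is acceptable. */
    boolean passes(String key, Object value);

    /** Returns a new check that passes only if this check and the other check both pass. */
    default AttributeCheck thenChecking(AttributeCheck other) {
        Objects.requireNonNull(other);
        return (key, value) -> this.passes(key, value) && other.passes(key, value);
    }
}

final class AttributeCheckDemo {
    public static void main(String[] args) {
        AttributeCheck twoLetterKey = (key, value) -> key.length() == 2;
        AttributeCheck nonNullValue = (key, value) -> value != null;
        AttributeCheck both = twoLetterKey.thenChecking(nonNullValue);
        System.out.println(both.passes("NM", 1));
        System.out.println(both.passes("NM", null));
    }
}
```

Because the combined check is itself an AttributeCheck, further checks can be appended the same way, much like Comparator.thenComparing(...) chains.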
Bug Fixes
----------
1. Generating a 454 Universal Accession number did not
produce a valid id if the location x,y coordinates were very small.
2. Bug Fix in SAM and BAM header writer which incorrectly wrote out the MD5 values of the references as "MD5"
instead of the actual md5 hash value.
3. Bug Fix in SAM and BAM header writer which incorrectly wrote out the URI path to the reference file to be the md5 value
instead of the actual path.
4. Bug Fix in BAM writer which incorrectly computed BAM bin.
5. Bug Fixes in BAM index writer which incorrectly computed BAM bin and intervals.
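For context on what "BAM bin" means in fixes 4 and 5: the SAM specification defines a binning scheme and gives a reference calculation (reg2bin, written in C) for the smallest bin that fully contains an alignment. The Java transcription below is a sketch of that published calculation, not Jillion's code; the class name BamBinSketch is hypothetical. The start coordinate is 0-based inclusive and the end coordinate is 0-based exclusive.

```java
// Transcription of the reg2bin calculation from the SAM specification.
// BamBinSketch is a hypothetical name, NOT a Jillion class.
final class BamBinSketch {

    /**
     * Computes the bin for the region [beg, end), where beg is 0-based
     * inclusive and end is 0-based exclusive.
     */
    static int reg2bin(int beg, int end) {
        --end; // convert to the 0-based position of the last covered base
        if (beg >> 14 == end >> 14) return ((1 << 15) - 1) / 7 + (beg >> 14); // 16 kbp bins
        if (beg >> 17 == end >> 17) return ((1 << 12) - 1) / 7 + (beg >> 17); // 128 kbp bins
        if (beg >> 20 == end >> 20) return ((1 << 9) - 1) / 7 + (beg >> 20);  // 1 Mbp bins
        if (beg >> 23 == end >> 23) return ((1 << 6) - 1) / 7 + (beg >> 23);  // 8 Mbp bins
        if (beg >> 26 == end >> 26) return ((1 << 3) - 1) / 7 + (beg >> 26);  // 64 Mbp bins
        return 0; // the single 512 Mbp bin that covers everything
    }

    public static void main(String[] args) {
        System.out.println("bin for [0, 100) = " + reg2bin(0, 100));
    }
}
```

An alignment that crosses a 16 kbp boundary falls into a larger bin, which is why an incorrect end coordinate changes the bin that gets written.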
================
Jillion 4.2
================
New Features
------------
1. FastaFileParser and all Fasta Datastore implementations
now support non-redundant text fasta files like the ones described in
ftp://ftp.ncbi.nih.gov/blast/db/README.
If non-redundant records are encountered, then the visitXXX methods will be called
in a way such that it will appear as if they were redundantly listed. The non-redundant
defline will be split and each identical sequence will be visited separately with each
of the many ids for it. org.jcvi.jillion.fasta.FastaVisitorCallback.FastaVisitorMemento objects
are also non-redundant aware and will correctly visit only the subset of non-redundant records according to when
the memento was created.
2. AminoAcid - Added Pyrrolysine 'O' to AminoAcid class and AminoAcidSequence
as well as Blosum matrices.
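The defline splitting described in item 1 can be sketched as follows. NCBI non-redundant fasta files merge the deflines of identical sequences into one line, separated by the Control-A character (\u0001). The helper below is a hypothetical JDK-only illustration, not Jillion's parser; NonRedundantDefline and split() are assumed names. It returns one id and comment pair per merged entry.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, NOT the Jillion FastaFileParser.
final class NonRedundantDefline {

    /**
     * Splits a possibly non-redundant defline into {id, comment} pairs.
     * Entries are assumed to be separated by Control-A (\u0001).
     */
    static List<String[]> split(String defline) {
        String body = defline.startsWith(">") ? defline.substring(1) : defline;
        List<String[]> entries = new ArrayList<>();
        for (String entry : body.split("\u0001")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            // the id is the first whitespace delimited token, the rest is the comment
            String[] idAndComment = trimmed.split("\\s+", 2);
            String comment = idAndComment.length > 1 ? idAndComment[1] : "";
            entries.add(new String[] { idAndComment[0], comment });
        }
        return entries;
    }

    public static void main(String[] args) {
        for (String[] entry : split(">id1 first protein\u0001id2 second protein")) {
            System.out.println(entry[0] + " -> " + entry[1]);
        }
    }
}
```

A visitor built on this would then be called once per pair with the same sequence, so the file appears as if the records were listed redundantly.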
Bug Fixes
----------
1. AlnFileParser - added lowercase basecall support which is used in MAFFT output.
Jillion can now successfully parse .aln files produced by MAFFT.
2. XMLBlastParser - If accession in subject is "No definition line found" will
use subjectDefline instead.
3. PrimerDetector - Added guard clause if input sequence is empty,
then empty collection of hits is returned.
API Changes
-----------
1. Renamed *FastaRecordWriter* to *FastaWriter* to make smaller class names
2. Renamed *FastqRecordWriter* to *FastqWriter* to make smaller class names
3. Blast - BlastHit now has subjectDefline field instead of
subjectDeflineComment which only used to have the comment.
4. BtabWriter - now puts alignment length in previously skipped column
================
Jillion 4.1 - internal release only
================
New Features
------------
1. Added support for AminoAcid ambiguity codes in AminoAcid class and AminoAcidSequence
as well as Blosum matrices.
2. Added support for Amino Acid 'U' Selenocysteine.
API Changes
-----------
1. Removed getNucleotideSequenceBuilder from AssembledReadBuilder since we want
to control all sequence manipulations directly to keep read ranges in sync.
2. Added method CoverageRegion.getLength() to avoid having to chain region.asRange().getLength()
3. Moved Frame class to residue package.
4. Added new class NucleotideSequencePermuter
5. Improved Javadoc
Bug Fixes
---------
1. Changed alignment resource loading from loading file to loading inputStream.
Getting the resource as a file doesn't work once Jillion has been jarred up.
2. Modified AceContigBuilder and AceReadBuilder to call back to its parent
builder to update contig left and right if the read size changes.
3. Added code to (hopefully) stop validating the DTD which was
DDOS'ing NCBI when this code was used on the grid.
4. FastqFileParser - Bug fix for unindexed Casava 1.8 read ids.
5. AceContigBuilder - Bug fix for inserting bases in ace contig that extend beyond original consensus
================
Jillion 4.0 RC 5 - Added support for indexed BAM files and improved SAM/BAM API.
Bug fixes and performance improvements. Added more javadoc.
================
API Changes
-----------
1. Moved many previously public classes in SAM and BAM packages to be in "internal" packages,
which are not intended to be used by external clients. This greatly simplifies the public facing API.
2. Added BAM index support for reading and writing. This includes support for BAM "metadata" records
which are not in the SAM specification but are created and used by both samtools and Picard.
3. Added FastqQualityCodec.getOffset() method to get the integer offset for the encoding. This will
return 33 or 64 depending on the implementation.
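What the 33 or 64 returned by getOffset() means in practice can be sketched like this: each quality score is stored as a single ASCII character whose code is the score plus the offset (33 for Sanger style encoding, 64 for older Illumina encoding). The class below is a hypothetical JDK-only illustration; PhredOffsets and its methods are assumed names and are not the FastqQualityCodec API.

```java
// Hypothetical illustration, NOT the Jillion FastqQualityCodec.
final class PhredOffsets {
    static final int SANGER_OFFSET = 33;
    static final int ILLUMINA_OFFSET = 64;

    /** Decodes an encoded quality string into integer scores using the given offset. */
    static int[] decode(String encoded, int offset) {
        int[] scores = new int[encoded.length()];
        for (int i = 0; i < encoded.length(); i++) {
            int score = encoded.charAt(i) - offset;
            if (score < 0) {
                throw new IllegalArgumentException(
                        "character '" + encoded.charAt(i) + "' is below offset " + offset);
            }
            scores[i] = score;
        }
        return scores;
    }

    /** Encodes integer scores into a quality string using the given offset. */
    static String encode(int[] scores, int offset) {
        StringBuilder builder = new StringBuilder(scores.length);
        for (int score : scores) {
            builder.append((char) (score + offset));
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        int[] scores = decode("II5!", SANGER_OFFSET);
        System.out.println(java.util.Arrays.toString(scores));
    }
}
```

Re-encoding a file from one offset to the other is then a decode with the source offset followed by an encode with the target offset.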
Bug Fixes
----------
1. Fixed bug in BAM VirtualFileOffset computation so it matches Picard. The bug was
related to computing the offset at a BGZF block boundary. Picard sets the offset
to the beginning of the next block.
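The boundary case in this fix can be illustrated with a short sketch. In the SAM specification a BAM virtual file offset packs the compressed start of a BGZF block into the upper 48 bits and the position inside the uncompressed block into the lower 16 bits. The class below is a hypothetical JDK-only illustration, not Jillion's VirtualFileOffset class; all names are assumptions, and the boundary handling simply follows the behavior described in the bug fix above (a position at the very end of a block is reported as the start of the next block).

```java
// Hypothetical illustration, NOT the Jillion VirtualFileOffset class.
final class VirtualOffsetSketch {

    /** Packs a compressed block start and an offset within the uncompressed block. */
    static long pack(long compressedBlockStart, int uncompressedOffset) {
        if (compressedBlockStart < 0 || compressedBlockStart >= (1L << 48)) {
            throw new IllegalArgumentException("block start out of range");
        }
        if (uncompressedOffset < 0 || uncompressedOffset > 0xFFFF) {
            throw new IllegalArgumentException("offset within block out of range");
        }
        return (compressedBlockStart << 16) | uncompressedOffset;
    }

    static long blockStart(long virtualOffset) {
        return virtualOffset >>> 16;
    }

    static int offsetWithinBlock(long virtualOffset) {
        return (int) (virtualOffset & 0xFFFF);
    }

    /**
     * Virtual offset for a position inside a block. A position equal to the
     * uncompressed block size is moved to offset 0 of the next block.
     */
    static long at(long blockStart, int compressedSize, int uncompressedSize, int position) {
        if (position == uncompressedSize) {
            return pack(blockStart + compressedSize, 0);
        }
        return pack(blockStart, position);
    }

    public static void main(String[] args) {
        long offset = at(0L, 200, 65536, 65536);
        System.out.println(blockStart(offset) + ":" + offsetWithinBlock(offset));
    }
}
```

The example in main shows why the rule matters: a full 65536 byte block has an end position that cannot be stored in 16 bits, so it has to be expressed relative to the next block.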
Performance Improvements
------------------------
1. Improved BAM parsing and writing code to be 25% faster. Some of these improvements
might also improve cas file parsing.
================
Jillion 4.0 RC 4 - Added SAM/BAM and MAQ bfa and bfq support. Promoted pairwise alignment code out of experimental
and into production. Bug fixes and performance improvements.
================
API Changes
------------
1. Added new NucleotideSequenceBuilder constructor that takes a char[]
2. Added append/insert/prepend methods to NucleotideSequenceBuilder that take a char[]
3. Completely changed interfaces for parsing with Visitors.
Followed the AceHandler design but renamed methods.
XFileParser classes are now factories to create instances of XParser interface.
XParser has (usually) 3 methods: parse(XVisitor), parse(XVisitor, XMemento) and canParse().
parse() is the new name for accept(). canParse() returns a boolean to say
if a call to parse(...) will throw IllegalStateException.
This is because some implementations such as file parsers using an
inputStream can only call parse() once.
Any additional calls to parse(...) will fail since we can't always rewind the Stream.
This will decouple visiting from an actual file so we could visit off of other
types of objects (like Contig objects in tests).
4. Changed all class references from XFileParser to XParser objects.
5. Changed FastaFileBuilders to try to create the input File's parent directory if it does not exist;
this made the constructor throw IOException instead of FileNotFoundException.
6. Moved alignment packages out of experimental and into main. New package is org.jcvi.jillion.align.
Refactored pairwise alignment code to use more intent revealing PairwiseAlignmentBuilder
class with 2 static factory methods: createNucleotideAlignment( ..) and createProteinAlignment(..)
to handle all the messy generics. Builder has method to specify local vs global alignment
to hide algorithm name implementation details.
Added support for Protein Blast results as well as code to auto-detect
nucleotide or protein blast results to build the correct HSPs.
Renamed ScoringMatrix to SubstitutionMatrix. Created NucleotideSubstitutionMatrices utility classes.
Refactored matrix file parser classes into abstract class with 2 subclasses to handle nucleotide and protein matrices.
Moved more classes into align package.
7. Renamed AminoAcidSequence to ProteinSequence.
All other public classes with AminoAcidSequence in their name now use ProteinSequence instead.
For example: AminoAcidFastaRecord is now called ProteinFastaRecord.
8. Added new methods to FastaWriterBuilders: #lineSeparator(String) to change line separator from \n to
support windows and #allBasesOnOneLine() to force all data on one line instead of splitting the data across multiple lines.
9. Made CompactProteinSequenceCodec package private.
10. Changed cas parser to default to fasta format type for unknown file extension or file without extension.
Previously default was chromatogram. Chromatograms now must have either 'ab1', 'abi', 'scf' or 'ztr' file extensions.