Skip to content

Hybrid text compression pipeline using Burrows Wheelers Transform, Huffman encoding, Run-length encoding, Lempel-Ziv-Welch compression and Delta encoding

License

Notifications You must be signed in to change notification settings

IAMIQBAL/Hybrid-Text-Compression

Repository files navigation

Hybrid Text Compression


Introduction

Hybrid Text Compression (HTC) uses different techniques to achieve maximum text compression (lossless). It makes use of LZW, Huffman, Run length encoding and Burrows Wheeler Transform to compress different types of files.


Compression Algorithms


LZW

It is a dictionary based technique for compressing data created by Abraham Lempel, Jacob Ziv, and Terry Welch. It is the improved version of LZ78.


HUFFMAN Code

Huffman code is a type of prefix code which is used for lossless data compression. It was developed by David A. Huffman in 1952. It is an entropy/frequency encoding method.


Burrows Wheeler Transform

The BWT is a block-sorting compression algorithm. It rearranges a string into runs of similar character. It was invented by Michael Burrows and David Wheeler in 1994. It is used in Bio-informics. In Next Generation Sequencing, DNA is fragmented into small pieces of which first few bases are sequenced, yielding several millions of reads each 30 to 500 base pairs(“DNA Characters”) long.


Run Length Encoding

It is a form of lossless data compression in which runs of data are stored as a single data value and count, rather than as original run.


Tools and Dependencies

  1. C++
  2. Python 2.x/3.x
  3. Matplotlib C++ API
  4. Boost for reading/writing binary data (in actual/raw bits)

Compiling

git clone https://github.com/IAMIQBAL/Hybrid-Text-Compression
cd Hybrid-Text-Compression
pacman -S boost
g++ -o main HybridCompressor.cpp
./main

Tests

We have written a test class (tests.cpp) which can be used to check the compression ratio and time taken on a scatter plot. The class uses Matplotlib’s C++ Library to plot the scatter plot. The Tests are as follows:

Note: Red = .json | Green = .txt | Blue = .xml, .html

1. Test for LZW


LZW

2. Test for Huffman


LZW


3. Test for RLE + LZW


LZW

4. Test for BWT + RLE + LZW


LZW

5. All Tests


LZW



Note:

Text, json, html and other formats data have been used for testing purposes.

About

Hybrid text compression pipeline using Burrows Wheelers Transform, Huffman encoding, Run-length encoding, Lempel-Ziv-Welch compression and Delta encoding

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages