Project Type | Bachelor Project |
Project Name | Efficient code for (de)compression |
Department of Project | Chair for Algorithms and Data Structures |
Chair Professor | Prof. Dr. Hannah Bast |
Duration | 21.10.2015 - 23.1.2016 |
Supervisor | Björn Buchhold |
Participant | Zhiwei Zhang |
Department | Computer Science |
Univesity | University of Freiburg |
In this project I implemented the Elias-Gamma algorithm, the Elias-Delta algorithm, the Golomb algorithm, the Variable-Bytes algorithm and the mainly optimized decompression function of the Simple8b algorithm and at the same time compared them to find out the "best appropriate algorithm(s)" for the search engine Broccoli.
Our semantic search engine Broccoli and its successor (under development) is closely related to compressed lists a lot. Current implementations have high speeds but are still far from optimal and have never been compared with other available compression schemes. Goal of the project is to tune the current (C++) implementation for speed, to evaluate the literature for promising alternatives and to compare them.
Write ${\color{blue}\lfloor\log_{2}{x}\rfloor}$ zeros, then ${\color{blue}x}$ in binary
like:
$\\{\color{blue}1}\rightarrow{\color{green}1}\\{\color{blue}2}\rightarrow{\color{red}0}{\color{green}10}\\{\color{blue}3}\rightarrow{\color{red}0}{\color{green}11}\\{\color{blue}4}\rightarrow{\color{red}00}{\color{green}100}\\\vdots\\{\color{blue}10}\rightarrow{\color{red}000}{\color{green}1010}$
Code for ${\color{blue}x}$ has a length of exactly ${\color{blue}2\cdot\lfloor\log_{2}{x}\rfloor+1}$ bits.
Write ${\color{blue}\lfloor\log_{2}{x}\rfloor+1}$ in Elias-Gamma, followed by ${\color{blue}x}$ in binary but without the leading 1
like:
$\\{\color{blue}1}\rightarrow{\color{green}1}\\{\color{blue}2}\rightarrow{\color{red}010}{\color{green}0}\\{\color{blue}3}\rightarrow{\color{red}010}{\color{green}1}\\{\color{blue}4}\rightarrow{\color{red}011}{\color{green}00}\\\vdots\\{\color{blue}10}\rightarrow{\color{red}00100}{\color{green}010}$
Elias-Delta is also prefix-free and the length of the code length is ${\color{blue}\lfloor\log_{2}{x}\rfloor+2\log_{2}{\log_{2}{x}}+O(1)}$ bits.
Comes with an integer parameter ${\color{blue}M}$, called modulus.
Write ${\color{blue}x}$ as ${\color{blue}q\cdot M+r}$, where ${\color{blue}q=\frac{x}{M}}$ and ${\color{blue}r=x \mod M}$
The code for ${\color{blue}x} is then the concatenation of:
${\color{blue}q}$ written in unary with ${\color{blue}0}$s.
a single 1 (as a delimiter).
${\color{blue}r}$ written in binary.
$\\M=16\\{\color{blue}1}\rightarrow{\color{blue}\underline{1}}{\color{green}0001}\\{\color{blue}70}\rightarrow{\color{red}0000}{\color{blue}\underline{1}}{\color{green}0100}$
Idea: use whole bytes, in order to avoid the (expensive) bit fiddling needed for the previous schemes.
Use one bit of each byte to indicate whether this is the last byte in the current code or not.
$\\x=521={\color{red}4}\cdot128+{\color{red}9}\\\text{VB-code for x: }{\color{green}\underline{1}}{\color{red}0000100}\:{\color{green}\underline{0}}{\color{red}0001001}$
Selector value | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
Item width | 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 10 | 12 | 15 | 20 | 30 | 60 |
Group size | 240 | 120 | 60 | 30 | 20 | 15 | 12 | 10 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
Wasted bits | 60 | 60 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
$\\\text{data list: }888\,{\color{blue}56}\,{\color{green}1}\,{\color{red}0}\,{\color{cyan}0}\,{\color{magenta}0}\\\text{Simple8b-Code: }\\ 1101111000\,{\color{blue}0000111000}\,{\color{green}0000000001}\,{\color{red}0000000000}\,{\color{cyan}0000000000}\,{\color{magenta}0000000000}\,\underline{1010}$
There are several known compression algorithms such as Elias-Gamma, Elias-Delta, Golomb and Variable-Bytes. And there is also a new compression algorithm Simple8b. My supervisor has already implemented the Simple8b algorithm, and what I need to do was to find a way to optimize the implementation and to achieve a shorter compress time as well as shorter decompress time.
First of all I wrote three kinds of functions:
In our case the real data contain a lot of continuous zeros.
Then I implemented all the algorithms except Simple8b in two different ways in C++:
So with the comparison we could directly observe how much more efficient is the code using pointer than the code using vector.
Moreover I modified the compression function and decompression function.
For the decompress function I modified it following the ideas, suc as:
After the modifying I produced three versions of the decompresssion function.
For the compression function which provided by my supervisor I have changed it in several aspects and produced only one version of Simple8b algorithm.
Afterwards I run the tests of simulation on all the algorithms. There is a difference between Elias-Gamma, Elias-Delta and other algorithms. Without the possibility of compressing zero in Elias-Gamma and Elias-Delta I can only run the non-zero data using the two algorithms. Since the running time of Elias-Gamma and Elias-Delta is very long and our real data contain a lot of zeros, Elias-Gamma and Elias-Delta are not important to us. Due to the long running time, Golomb algorithm is also not important.
At last I tested only the Variable-Bytes and all the versions of Simple8b algorithms on the real data.
I tested all the algorithms in 3 aspects: compress time, decompress time, used bytes.
In the case of uniform distribution and the combination of uniform distribution and geometric distribution the compress time of Variable-Bytes algorithm is the shortest. In this case the Golomb algorithm has used least bytes. Significantly, the used bytes of Variable-Bytes algorithm are less than the Simple8b algorithm.
The decompress time of Simple8b algorithm is the shortest in all the cases.
In the case of geometric distribution the Simple8b algorithm has the same compress time with Variable-Bytes, which is the shortest in all the algorithm except Elias-Gamma and Elias-Delta. In this case the used bytes are the least.
I only tested the Variable-Bytes algorithm and all the versions of Simple8b algorithm in the same 4 aspects as in simulation.
There are six different data types: docids.ent, docides.noent, scores.ent, scores.noent, wordids.ent, wordids.noent.
In data types docids.ent, docides.noent, scores.ent, wordids.ent and wordids.noent the Variable-Bytes algorithm has the shortest compress time.
The Simple8b algorithm with the optimal decompress function has the shortest decompress time in data types docids.ent, docides.noent, wordids.ent and wordids.noent. And the Simple8b algorithm uses less bytes than Variable-Bytes algorithm in data types docids.ent, docides.noent, scores.ent, scores.noent and wordids.noent.
The optimal compression function in Simple8b algorithm (Simple8bCode.h) has the shortest compress time in the scores.noent data type. Significantly, the optimal Simple8bCode has always the shorter compression time and decomression time than the old Simple8bCode from Björn.
In the average case in all data types Variable-Bytes is more efficient than Simple8b in the aspect of compress time, and optimal Simple8b is more efficient than Variable-Bytes in the aspect of decompress time. Moreover the optimal Simple8bCode(Simple8bCode.h) has the shortest compression time and shortest decompression time among all the variants of Simple8b algorithm.
Description | Data |
---|---|
The Elias-Delta algorithm using pointer. | EliasDeltaEncodingPointer.h |
The Elias-Delta algorithm using vector. | EliasDeltaEncodingVector.h |
The Elias-Gamma algorithm using pointer. | EliasGammaEncodingPointer.h |
The Elias-Gamma algorithm using vector. | EliasGammaEncodingVector.h |
The Golomb algorithm using pointer. | GolombEncodingPointer.h |
The Golomb algorithm using vector. | GolombEncodingVector.h |
The one variant of the Simple8b algorithm using pointer. | Simple8bCode.h |
The one variant of the Simple8b algorithm using pointer and containing several object methods. | Simple8bCodeClass.h |
The other six variants of the Simple8b algorithm using pointer. | Simple8bCodeNew.h |
The version of the Simple8b algorithm using pointer from Björn Buchhold. | Simple8bCodeOld.h |
The Simple8b algorithm using vector. | Simple8bCodeVector.h |
The Variable-Bytes algorithm using pointer. | VariableByteEncodingPointer.h |
The Variable-Bytes algorithm using vectorß. | VariableByteEncodingVector.h |
All the files of the project. | code.zip |
Description | Data |
---|---|
All the test results of the project. | result.zip |
Article | Origin | Author |
---|---|---|
Index compression using 64-bit words | Softw. Pract. Exper. 2010; 40:131–147 | Vo Ngoc Anh and Alistair Moffat |
lecture-04.pdf of Information Retrieval 2015/2016 | 11-14 | Prof. Dr. Hannah Bast |