Bachelor Project

Efficient Code For (De)Compression

- Zhiwei Zhang

Information

Project Type	Bachelor Project
Project Name	Efficient code for (de)compression
Department of Project	Chair for Algorithms and Data Structures
Chair Professor	Prof. Dr. Hannah Bast
Duration	21.10.2015 - 23.1.2016
Supervisor	Björn Buchhold
Participant	Zhiwei Zhang
Department	Computer Science
Univesity	University of Freiburg

Briefly Description of the Project

In this project I implemented the Elias-Gamma algorithm, the Elias-Delta algorithm, the Golomb algorithm, the Variable-Bytes algorithm and the mainly optimized decompression function of the Simple8b algorithm and at the same time compared them to find out the "best appropriate algorithm(s)" for the search engine Broccoli.

Our semantic search engine Broccoli and its successor (under development) is closely related to compressed lists a lot. Current implementations have high speeds but are still far from optimal and have never been compared with other available compression schemes. Goal of the project is to tune the current (C++) implementation for speed, to evaluate the literature for promising alternatives and to compare them.

Elias-Gamma

Write ${\color{blue}\lfloor\log_{2}{x}\rfloor}$ zeros, then ${\color{blue}x}$ in binary
like:

$\\{\color{blue}1}\rightarrow{\color{green}1}\\{\color{blue}2}\rightarrow{\color{red}0}{\color{green}10}\\{\color{blue}3}\rightarrow{\color{red}0}{\color{green}11}\\{\color{blue}4}\rightarrow{\color{red}00}{\color{green}100}\\\vdots\\{\color{blue}10}\rightarrow{\color{red}000}{\color{green}1010}$

Prefix-free, because the number of initial zeros tells us exactly how many bits of the code come afterwards.
Code for ${\color{blue}x}$ has a length of exactly ${\color{blue}2\cdot\lfloor\log_{2}{x}\rfloor+1}$ bits.

Elias-Delta

Write ${\color{blue}\lfloor\log_{2}{x}\rfloor+1}$ in Elias-Gamma, followed by ${\color{blue}x}$ in binary but without the leading 1
like:

$\\{\color{blue}1}\rightarrow{\color{green}1}\\{\color{blue}2}\rightarrow{\color{red}010}{\color{green}0}\\{\color{blue}3}\rightarrow{\color{red}010}{\color{green}1}\\{\color{blue}4}\rightarrow{\color{red}011}{\color{green}00}\\\vdots\\{\color{blue}10}\rightarrow{\color{red}00100}{\color{green}010}$

Elias-Delta is also prefix-free and the length of the code length is ${\color{blue}\lfloor\log_{2}{x}\rfloor+2\log_{2}{\log_{2}{x}}+O(1)}$ bits.

Golomb

Comes with an integer parameter ${\color{blue}M}$, called modulus.
Write ${\color{blue}x}$ as ${\color{blue}q\cdot M+r}$, where ${\color{blue}q=\frac{x}{M}}$ and ${\color{blue}r=x \mod M}$
The code for ${\color{blue}x} is then the concatenation of:
- ${\color{blue}q}$ written in unary with ${\color{blue}0}$s.
- a single 1 (as a delimiter).
- ${\color{blue}r}$ written in binary.

like:

$\\M=16\\{\color{blue}1}\rightarrow{\color{blue}\underline{1}}{\color{green}0001}\\{\color{blue}70}\rightarrow{\color{red}0000}{\color{blue}\underline{1}}{\color{green}0100}$

Variable-Bytes

Idea: use whole bytes, in order to avoid the (expensive) bit fiddling needed for the previous schemes.
Use one bit of each byte to indicate whether this is the last byte in the current code or not.

like:

$\\x=521={\color{red}4}\cdot128+{\color{red}9}\\\text{VB-code for x: }{\color{green}\underline{1}}{\color{red}0000100}\:{\color{green}\underline{0}}{\color{red}0001001}$

Simple8b

Each code is unsigned 64 bits in binary form.
and contains two parts:
- 4 bits selector which explains how many words this code contains.
- and 60 bits content.

There are 16 different selectors.

Selector value	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
Item width	0	0	1	2	3	4	5	6	7	8	10	12	15	20	30	60
Group size	240	120	60	30	20	15	12	10	8	7	6	5	4	3	2	1
Wasted bits	60	60	0	0	0	0	0	0	4	4	0	0	0	0	0	0

We use the selector-0 express 240 continuous 0s and selector-0 express 120 continuous 0s.

like:

$\\\text{data list: }888\,{\color{blue}56}\,{\color{green}1}\,{\color{red}0}\,{\color{cyan}0}\,{\color{magenta}0}\\\text{Simple8b-Code: }\\ 1101111000\,{\color{blue}0000111000}\,{\color{green}0000000001}\,{\color{red}0000000000}\,{\color{cyan}0000000000}\,{\color{magenta}0000000000}\,\underline{1010}$

There are several known compression algorithms such as Elias-Gamma, Elias-Delta, Golomb and Variable-Bytes. And there is also a new compression algorithm Simple8b. My supervisor has already implemented the Simple8b algorithm, and what I need to do was to find a way to optimize the implementation and to achieve a shorter compress time as well as shorter decompress time.

First of all I wrote three kinds of functions:

the first kind of function could produce a list of given quantity of numbers between given minimize number and maximize number. In this list all the numbers conform to uniform distribution.
the second kind of function could also produce a list of given quantity of numbers. But in this list all the numbers conform to geometric distribution.
the third kind of function just combines the first two functions to simulate the real data.

In our case the real data contain a lot of continuous zeros.

Then I implemented all the algorithms except Simple8b in two different ways in C++:

one way is to use pointer as the list type.
and another way is to use vector as the list type.

So with the comparison we could directly observe how much more efficient is the code using pointer than the code using vector.

Moreover I modified the compression function and decompression function.
For the decompress function I modified it following the ideas, suc as:

Cut off the if-branches in for-loop.
Announce the same data-type variable together.
Avoid repeating of same variable annoucement.
Make the judgement in if-branches more efficient.

After the modifying I produced three versions of the decompresssion function.

Cut off all the if-branches, decompress the elements in encode list one by one using the same decompress form.

File: Simple8bCode.h, Class: Simple8bCode

Change the decompress form to reduce the unnecessary calculations.

File: Simple8bCodeNew.h, Class: Simple8bCodeNew1

Use 2 for-loop to realize the role of if-branches in the old decompression function.

File: Simple8bCodeNew.h, Class: Simple8bCodeNew2

For the compression function which provided by my supervisor I have changed it in several aspects and produced only one version of Simple8b algorithm.

Instead of "continue" after encoding 120 continuous 0s update the variable "itemLeft" and directly check the selectors from selector 2. "
Combine one if-branches with one while-loop and move the "asser"-check behind, so that the times of doing the "asser"-check are reduced.
Change the judgement of checking whether we have compressed as many as possible plaintext in a word of encode list to be more efficient.

File: Simple8bCode.h, Class: Simple8bCode

All the variants above are using pointer.

Adding an if-loop to make sure that the quantity of works in a code is exactly the same number which the selector of this code represents. Using the verified decompression function. This variant uses vector.

File: Simple8bCodeVector.h, Class: Simple8bCodeVector

Afterwards I run the tests of simulation on all the algorithms. There is a difference between Elias-Gamma, Elias-Delta and other algorithms. Without the possibility of compressing zero in Elias-Gamma and Elias-Delta I can only run the non-zero data using the two algorithms. Since the running time of Elias-Gamma and Elias-Delta is very long and our real data contain a lot of zeros, Elias-Gamma and Elias-Delta are not important to us. Due to the long running time, Golomb algorithm is also not important.

At last I tested only the Variable-Bytes and all the versions of Simple8b algorithms on the real data.

Simulation

I tested all the algorithms in 3 aspects: compress time, decompress time, used bytes.

In the case of uniform distribution and the combination of uniform distribution and geometric distribution the compress time of Variable-Bytes algorithm is the shortest. In this case the Golomb algorithm has used least bytes. Significantly, the used bytes of Variable-Bytes algorithm are less than the Simple8b algorithm.

The decompress time of Simple8b algorithm is the shortest in all the cases.

In the case of geometric distribution the Simple8b algorithm has the same compress time with Variable-Bytes, which is the shortest in all the algorithm except Elias-Gamma and Elias-Delta. In this case the used bytes are the least.

Test on the real data

I only tested the Variable-Bytes algorithm and all the versions of Simple8b algorithm in the same 4 aspects as in simulation.

There are six different data types: docids.ent, docides.noent, scores.ent, scores.noent, wordids.ent, wordids.noent.

In data types docids.ent, docides.noent, scores.ent, wordids.ent and wordids.noent the Variable-Bytes algorithm has the shortest compress time.

The Simple8b algorithm with the optimal decompress function has the shortest decompress time in data types docids.ent, docides.noent, wordids.ent and wordids.noent. And the Simple8b algorithm uses less bytes than Variable-Bytes algorithm in data types docids.ent, docides.noent, scores.ent, scores.noent and wordids.noent.

The optimal compression function in Simple8b algorithm (Simple8bCode.h) has the shortest compress time in the scores.noent data type. Significantly, the optimal Simple8bCode has always the shorter compression time and decomression time than the old Simple8bCode from Björn.

In the average case in all data types Variable-Bytes is more efficient than Simple8b in the aspect of compress time, and optimal Simple8b is more efficient than Variable-Bytes in the aspect of decompress time. Moreover the optimal Simple8bCode(Simple8bCode.h) has the shortest compression time and shortest decompression time among all the variants of Simple8b algorithm.

Plots

***Here are the partial plots of the results. For all the plots please go to the "Download" webpage.***

SVN-Link

Code

Description	Data
The Elias-Delta algorithm using pointer.	EliasDeltaEncodingPointer.h
The Elias-Delta algorithm using vector.	EliasDeltaEncodingVector.h
The Elias-Gamma algorithm using pointer.	EliasGammaEncodingPointer.h
The Elias-Gamma algorithm using vector.	EliasGammaEncodingVector.h
The Golomb algorithm using pointer.	GolombEncodingPointer.h
The Golomb algorithm using vector.	GolombEncodingVector.h
The one variant of the Simple8b algorithm using pointer.	Simple8bCode.h
The one variant of the Simple8b algorithm using pointer and containing several object methods.	Simple8bCodeClass.h
The other six variants of the Simple8b algorithm using pointer.	Simple8bCodeNew.h
The version of the Simple8b algorithm using pointer from Björn Buchhold.	Simple8bCodeOld.h
The Simple8b algorithm using vector.	Simple8bCodeVector.h
The Variable-Bytes algorithm using pointer.	VariableByteEncodingPointer.h
The Variable-Bytes algorithm using vectorß.	VariableByteEncodingVector.h
All the files of the project.	code.zip

Result

Description	Data
All the test results of the project.	result.zip

Article	Origin	Author
Index compression using 64-bit words	Softw. Pract. Exper. 2010; 40:131–147	Vo Ngoc Anh and Alistair Moffat
lecture-04.pdf of Information Retrieval 2015/2016	11-14	Prof. Dr. Hannah Bast

Efficient Code For (De)Compression

- Zhiwei Zhang

Information

Briefly Description of the Project

Elias-Gamma

Elias-Delta

Golomb

Variable-Bytes

Simple8b

Simulation

Test on the real data

Plots

Simulation

Simulation

Simulation

Simulation

Simulation

Simulation

Test on Real Data

Test on Real Data

Test on Real Data

SVN-Link

Code

Result