Efficient and Effective Search on Wikidata Using the QLever Engine

Bachelor project and thesis in summer semester 2018 by Johannes Kalmbach

Supervised by Prof. Dr. Hannah Bast

Overview

Wikidata is probably the most interesting general-purpose knowledge base at the moment. This project had two main goals: on the one hand, we improved the QLever SPARQL engine so that it can handle the huge Wikidata dataset; on the other hand, we explored methods to effectively create SPARQL queries for Wikidata.

Introduction

Wikidata is a Creative Commons knowledge base maintained by the Wikimedia project. QLever is a semantic search engine developed at the Chair of Algorithms and Data Structures at the University of Freiburg. QLever can efficiently retrieve information from knowledge bases like Wikidata using the SPARQL query language. The project and the thesis described here had two goals: The first goal was to make QLever work with the complete Wikidata. This was originally not possible because QLever's RAM usage was very high and did not scale well with the size of the knowledge base; it was not possible to run QLever with Wikidata even on machines with 200 GB of RAM. We have implemented several improvements to the QLever engine such that it now runs with Wikidata on less than 32 GB of RAM. The second goal was to create a user interface for QLever that also allows non-expert users to formulate SPARQL queries on this interesting dataset.

Effective Construction of Queries

In Wikidata, every entity is identified by an abstract ID. For example, the English writer Douglas Adams is identified by the ID Q42. This makes query construction hard, since you first have to find the correct IDs for the entities you want information about. For example, when we want to find German cities and their population, we could conceptually try the following SPARQL query:
SELECT ?city ?population WHERE {
  ?city <is-a> <city> .
  ?city <contained-in> <germany> .
  ?city <population-count> ?population .
}
On Wikidata, however, we have to formulate this query in the following way because of the abstract entity names:
PREFIX wd:<http://www.wikidata.org/entity/>
PREFIX wdt:<http://www.wikidata.org/prop/direct/>
SELECT ?city ?population WHERE {
  ?city wdt:P31 wd:Q515 .
  ?city wdt:P17 wd:Q183 .
  ?city wdt:P1082 ?population .
}
We have implemented a search engine called the EntityFinder that finds the IDs of Wikidata entities by their readable names. For example, when typing "city", the engine suggests "Q515", Wikidata's representation of the concept of a city. Based on the EntityFinder, we have implemented a simple drag-and-drop user interface that allows the creation of simple SPARQL queries without requiring knowledge of the exact SPARQL syntax or of the internal IDs in Wikidata.

Running the EntityFinder

The EntityFinder first requires input text files that state the names, aliases, and descriptions of all the entities that will be searched for. It is shipped with a Python script that automatically extracts this format from Wikidata's JSON dumps. After this input has been created, the EntityFinder can be compiled and run. It can also easily be run inside a Docker container. The source code of the EntityFinder is shipped with a detailed README.md that contains step-by-step instructions for both approaches.

Contributions to QLever

In this section I link to and briefly describe all the contributions I made to the QLever software. Complete descriptions of the implemented mechanisms can be found in my thesis.

Efficient Creation of QLever's Vocabulary

QLever internally maps every token from the knowledge base to an ID according to its lexicographical order. The set of all tokens is called the vocabulary. The original way of creating this mapping from tokens to IDs was not feasible for Wikidata because it consumed more than 200 GB of RAM. The main reason for this was that multiple copies of the complete vocabulary had to be kept in RAM at the same time. We implemented an improved version of this algorithm that first processes smaller parts of the vocabulary and afterwards merges them. This drastically reduces the RAM footprint of this step.
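The following C++ sketch only illustrates the general partition-and-merge idea (an external merge sort); all names are hypothetical and the code is not taken from QLever. Tokens are sorted in small batches, each batch is written to its own file, and the sorted files are then merged while consecutive IDs are assigned, so only one batch plus the merge frontier is held in RAM at any time.

#include <algorithm>
#include <fstream>
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Sort one batch of tokens and write it to a partial vocabulary file.
void writeSortedPartial(std::vector<std::string>& batch,
                        const std::string& filename) {
  std::sort(batch.begin(), batch.end());
  std::ofstream out(filename);
  for (const auto& token : batch) out << token << '\n';
}

// Merge the sorted partial files into one vocabulary file, assigning IDs
// in lexicographical order. Duplicates across partial files are skipped.
void mergePartials(const std::vector<std::string>& partialFiles,
                   const std::string& vocabFile) {
  struct Entry {
    std::string token;
    size_t fileIndex;
    bool operator>(const Entry& other) const { return token > other.token; }
  };
  std::vector<std::ifstream> inputs;
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> frontier;
  for (size_t i = 0; i < partialFiles.size(); ++i) {
    inputs.emplace_back(partialFiles[i]);
    std::string token;
    if (std::getline(inputs.back(), token)) frontier.push({token, i});
  }
  std::ofstream out(vocabFile);
  std::string previous;
  bool first = true;
  size_t nextId = 0;
  while (!frontier.empty()) {
    Entry smallest = frontier.top();
    frontier.pop();
    if (first || smallest.token != previous) {
      out << nextId++ << '\t' << smallest.token << '\n';
      previous = smallest.token;
      first = false;
    }
    std::string token;
    if (std::getline(inputs[smallest.fileIndex], token)) {
      frontier.push({token, smallest.fileIndex});
    }
  }
}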

Implementation of a Disk-Based Dynamic Array

We have implemented a dynamic array that is permanently stored in a file and mapped to virtual RAM addresses using the POSIX system call mmap. We call this data structure the MmapVector. In QLever, the MmapVector can for example be used to store the offsets or the metadata of the permutations, QLever's main data structure for processing a knowledge base. Originally, these offsets were stored in a hash table in RAM. Externalizing this information to disk using the MmapVector can save more than 100 GB of RAM without significantly increasing the time needed for query processing.
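A minimal sketch of this idea, assuming POSIX mmap and a fixed number of elements, could look as follows (the interface of QLever's actual MmapVector differs, e.g. it can grow dynamically):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <stdexcept>

// A file-backed array: the file is mapped into virtual memory, so the
// operating system pages the data in and out on demand instead of keeping
// everything in RAM.
template <typename T>
class SimpleMmapArray {
 public:
  SimpleMmapArray(const char* filename, size_t numElements)
      : size_(numElements) {
    fd_ = open(filename, O_RDWR | O_CREAT, 0644);
    if (fd_ < 0) throw std::runtime_error("could not open file");
    // Make the backing file large enough to hold all elements.
    if (ftruncate(fd_, size_ * sizeof(T)) != 0)
      throw std::runtime_error("could not resize file");
    void* addr = mmap(nullptr, size_ * sizeof(T), PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd_, 0);
    if (addr == MAP_FAILED) throw std::runtime_error("mmap failed");
    data_ = static_cast<T*>(addr);
  }
  ~SimpleMmapArray() {
    munmap(data_, size_ * sizeof(T));
    close(fd_);
  }
  // Access looks like an ordinary array, but reads and writes go through
  // the page cache and are backed by the file on disk.
  T& operator[](size_t i) { return data_[i]; }
  size_t size() const { return size_; }

 private:
  int fd_;
  size_t size_;
  T* data_;
};

Accesses to such an array behave like ordinary memory accesses; cold accesses cause page faults that read from disk, which is why query times are barely affected as long as the frequently used parts stay in the page cache.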

Prefix Compression of QLever's Vocabulary

In Wikidata, all entity names are URIs that often share a common beginning. For example, a lot of entities start with http://www.wikidata.org/entity/. We have implemented a fast greedy algorithm that automatically finds such common prefixes and uses them for compression. This can reduce the memory footprint of QLever's vocabulary by 45% on the Wikidata knowledge base.
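A minimal sketch of how such a prefix code can be applied is shown below; the greedy selection of the prefixes themselves is described in the thesis, and all names here are purely illustrative, not QLever's actual implementation. Up to 255 frequent prefixes get a one-byte code, and each vocabulary entry is then stored as (prefix code, remaining suffix).

#include <string>
#include <utility>
#include <vector>

// Replace the longest matching prefix of `word` by its one-byte code.
// `prefixes` is assumed to be sorted by descending length so that the
// longest match wins; code 0 means "no prefix matched".
std::pair<unsigned char, std::string> compressWord(
    const std::string& word, const std::vector<std::string>& prefixes) {
  for (size_t i = 0; i < prefixes.size(); ++i) {
    const std::string& p = prefixes[i];
    if (word.compare(0, p.size(), p) == 0) {
      return {static_cast<unsigned char>(i + 1), word.substr(p.size())};
    }
  }
  return {0, word};
}

// Restore the original word from its compressed representation.
std::string decompressWord(unsigned char code, const std::string& suffix,
                           const std::vector<std::string>& prefixes) {
  if (code == 0) return suffix;
  return prefixes[code - 1] + suffix;
}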

Efficient Language Filter Implementation

Wikidata contains literals in many languages. Often, one is only interested in literals in a certain language, for example when we want to find exactly one description of an entity in a language that we understand (e.g. English). This feature is provided by SPARQL language filters. This pull request implements efficient language filters for the QLever engine.
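For illustration, a standard SPARQL language filter that restricts city labels to English could look like this (a generic example, not taken from the pull request):
PREFIX wd:<http://www.wikidata.org/entity/>
PREFIX wdt:<http://www.wikidata.org/prop/direct/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
SELECT ?city ?label WHERE {
  ?city wdt:P31 wd:Q515 .
  ?city rdfs:label ?label .
  FILTER (LANG(?label) = "en")
}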

Efficient Creation of Permutations

This pull request uses the MmapVector described above during the creation of the permutations, QLever's main data structure. This way, data that is only stored on disk during the runtime of the QLever engine no longer has to be kept in RAM during preprocessing.

Future Work

J. Bürklin and D. Kemen have implemented a convenient user interface for QLever that supports context-sensitive autocompletion of SPARQL queries. Currently, this useful feature only works with the Freebase-Easy knowledge base, which directly uses human-readable entity names. In the future we want to combine this functionality with the EntityFinder software for Wikidata so that the autocompletion also works on this dataset. This would combine the advantages of my UI (being able to work with readable names) with the advantages of Bürklin and Kemen's UI (being able to formulate arbitrary SPARQL queries).