Features and Efficiency Issues for QLever

Florian Kramer
30.07.2018

SPARQL is a feature rich query language designed for querying knowledge bases. The goal of this project is to improve the amount of features of the SPAQRL language that the QLever query engine supports. A running demo of QLever s publicly available here. The code is freely available on github , together with a small sample dataset for experimentation. A larger dataset can be acquired from the wikidata project.

Features

What follows is a list of features implemented over the course of this project.

Optional

Certain types of queries require adding information to a set of rows if it is available, and omitting it otherwise. SPARQL supports these queries with the OPTIONAL keyword.

Conceptually the graph pattern within an optional block and the one without a computed first, then an outer join is performed on the two result tables, inserting a special NO_VALUE id, when no data is found that match the non optional part. As forcing this order of computation may make the query slower by orders of magnitude the implementation also support optimizing the outer join using the standard query optimizer employed by QLever. For types of queries commonly used this does not provide a speed improvement though, as the optional part doesn't reduce the size of a result table and the optimizer's performance is currently not linear in the number of joins, sometimes making the optimization of the optional join more expensive than the reduction of the query computation time.

Example Query

SELECT ?person ?advisor WHERE {
  ?person   .
  OPTIONAL {
    ?person  ?advisor .
  }
}

Pattern Trick

A highly useful feature for query engine front ends are as-you-type suggestions. To acquire suggestions for predicates a user could enter, given a query and a subject, a SPARQL query containing the input query and a triple with the subject and two variables can be used. Running this form of query without any optimizations can be very slow though, as there are potentially hundreds of thousands of entities for which the predicates and all possible objects are returned.

The broccoli search engine solves this problem, by recognizing patterns in the sets of relations entities have (e.g. most books having an author and a title), and uses these patterns to accelerate queries that ask for the predicates and their counts available to a given set of entities. QLever now uses the same trick to facilitate fast queries for available predicates.

During index creation patterns in the sets of predicates an entity has are detected and sorted by number of occurence. The n most common patterns are then selected. In a second pass two new relations are created, has-pattern and has-predicate. has-pattern contains all entities of which the predicate sets exactly matches one of the patterns, has-predicate stores the predicates for all other entities. This information is than stored in file belonging to the index. When a query asking for patterns and their counts needs to be answered during runtime, the two relations are used to create a count of all predicate of entities that did not match a pattern, as well as a count of all pattern matches. In a second step the pattern matches are then added to the predicate counts, by directly adding the count of the pattern to the count of all predicates contained within the pattern.

Example Query

SELECT ?p (COUNT(?p) as ?count) WHERE {
  ?person <is-a> <Scientist> .
  ?person ql:has-predicate ?p .
}
GROUP BY ?p
ORDER BY DESC(?count)

Language Filter

Wikidata's schema:name relation contains entries in many languages for most entities. To allow for filtering the output of a query for languages a naive language filter was added. When applying the filter the id of every entity in the filter column is resolved to the entities name, a postfix check for the correct language tag is run on the name and the id is added to the filter result or discarded depending on the check's result.

This implementation of the langauge filter is currently slow, due to having to look up every entities name. It also suffers when there are many languages bloating the schema:name (or other name relation) relation, as it does not prevent the full relation from being loaded and joined to the result. As name relation tend to be large, loading and joining the name relation with the result is a step that would optimally be avoided in a faster implementation.

Example Query

German cities with only their german name.


PREFIX we: <http://www.wikidata.org/entity/>
PREFIX wp: <http://www.wikidata.org/prop/direct/>
PREFIX sc: <http://schema.org/>
SELECT ?name WHERE {
  ?city wp:P31 we:Q515 .
  ?city wp:P17 we:Q183 .
  ?city sc:name ?name .
  FILTER langMatches(lang(?name), "de")
}
ORDER BY DESC(?population)

Group By

SPARQL allows for some built in processing of results. This includes a grouping mechanism, by which the results are grouped based upon equality on one or more columns. SPARQL then allows for retrieving the columns on which the result was grouped on as well as the result of a handful of functions on the group (such as COUNT or AVG which agglomerate the groups entries into a single result).

While many parts of the implementation of group by are straight forward (such as the current syntax parsing, or splitting the result into groups by sorting) there are several problems that needed to be solved in an efficient manner. One of the aggregates supported by SPARQL is GROUP_CONCAT which concatenates the string values of a groups column into a single string. As QLever stores all entities and predicates as ids internally, and only resolves these ids when creating the result json a new representation for strings not contained in the index's vocabulary. The solution used in QLever's current implementation is the addition of types to a results columns and the introduction of a result local vocabulary. When group concat is used the resulting strings are stored in the local vocabulary, and the result column is marked accordingly. The ids in the column are then interpreted as indices in the local vcabulary instead of the index's vocabulary.
Introducing result types also allowed for returning arbitrary floats from agglomerates such as AVG or SUM. The float is simply encoded into the first 4 bytes of the id (ids being an 8 byte type) using the systems default float encoding. Similarly unsigned integers can be represented using the unsigned Id type directly by marking the result tables column appropriately.
Another problem occurs when counting the size of a group using the naive way of taking the difference of the indices of the first and last element in the sorted result table, as the count is then not dependent upon the counted variable. As there are situations where that is the correct result QLever currently simply returns that difference as the count, unless the DISTINCT keyword is specified within the aggregate, which covers most of the other possible use cases of COUNT. This definition of the return value of count also seems to be employed by wikidata's query service ( http://query.wikidata.org/).

Example Query

Scientists grouped by profession with their average height.


SELECT ?profession (GROUP_CONCAT(?person) as ?names) (AVG(?height) as ?avg) WHERE {
  ?person <is-a> <Scientist> .
  ?person <Profession> ?profession .
  ?person <Height> ?height .
}
GROUP BY ?profession

Bachelor Project at the
University of Freiburg
Department of Computer Science
Chair of Algorithms and Data Structures

Supervised by: Prof. Hannah Bast
Ideas, help and support from: Prof. Hannah Bast, Björn Buchhold, Niklas Schnelle