Overview
This project aims to simplify the creation of SPARQL queries for the knowledge base Freebase. Instead of looking up the relevant Freebase types and relations by hand, the user specifies table columns in a simple table description format.
Introduction
Writing a SPARQL query for Freebase to extract some information can be quite challenging, especially when one does not know the relevant Freebase types and relations. To obtain a list of all cities with their corresponding countries and populations, one needs the types location.citytown and location.country and the relation location.statistical_region.population. Furthermore, one has to know that city and country are linked by the relation location.location.containedby and that the population is obtained via the mediator measurement_unit.dated_integer. The resulting query looks like this:
PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?city_name ?country_name ?population WHERE {
?city fb:type.object.type fb:location.citytown .
?city fb:type.object.name ?city_name .
?country fb:type.object.type fb:location.country .
?country fb:type.object.name ?country_name .
?city fb:location.location.containedby ?country .
?city fb:location.statistical_region.population ?dated_int .
?dated_int fb:measurement_unit.dated_integer.number ?population .
}
Wouldn’t it be much easier for the user to just specify the columns of the desired result? The following table specification should lead to the same SPARQL query as above:
<location.citytown> | <location.country> | <location.statistical_region.population>
This project goes a step further and also covers the case where the user does not know the exact Freebase types and relations but only fuzzy column names, for example:
City | Country | Population
While the user types a fuzzy definition, the program suggests relevant types, taking their number of occurrences in the database into account.
Table description format
In the table description format, columns are separated by "|". A column can be specified either exactly or fuzzily; fuzzy definitions are currently not supported for translation into queries. In addition, a filter and a sort order can be set for any column. A column can also be linked explicitly to another column when the query translator's automatic linking is not as intended.
Fuzzy column definitions
City | Country | Population
Exact column definitions
<location.citytown> | Country | <location.statistical_region.population>
Explicitly linking a column to another
Linking to another column defined by index (starting with 0)
City | Country | Population -> 0
Setting filters
Supported relational operators: ==, !=, <, <=, >, >=
City(== "Berlin"@en) | Country | Population(<= 50000)
Define an order
Single order:
City | Country | Population [DESC]
Multi-order (a rank in applying the orders has to be specified):
City [ASC, 1] | Country | Population [DESC, 0]
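The format described above could be parsed roughly as follows. This is only a minimal Python sketch: the `Column` class, the `parse_description` function and the regular expression are illustrative assumptions, not the project's actual parser. It further assumes that a filter, an order and a link appear in exactly this sequence, that filter values contain neither "|" nor ")", and that a single order without a rank gets rank 0.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical representation of one parsed column (not the project's own class).
@dataclass
class Column:
    name: str                        # Freebase id (exact) or free-text label (fuzzy)
    exact: bool                      # True if the column was written as <...>
    filter: Optional[tuple] = None   # (operator, raw value), e.g. ("<=", "50000")
    order: Optional[tuple] = None    # (direction, rank), e.g. ("DESC", 0)
    link: Optional[int] = None       # index of the column this one is linked to

COLUMN_RE = re.compile(r"""
    ^\s*
    (?P<name><[^>]+>|[^|()\[\]<>]+?)                             # exact id or fuzzy label
    \s*(?:\(\s*(?P<op>==|!=|<=|>=|<|>)\s*(?P<val>[^)]*?)\s*\))?  # optional filter
    \s*(?:\[\s*(?P<dir>ASC|DESC)\s*(?:,\s*(?P<rank>\d+)\s*)?\])? # optional order
    \s*(?:->\s*(?P<link>\d+))?                                   # optional link
    \s*$
""", re.VERBOSE)

def parse_description(description: str) -> List[Column]:
    """Split a table description on '|' and parse each column."""
    columns = []
    for raw in description.split("|"):
        m = COLUMN_RE.match(raw)
        if m is None or not m["name"].strip():
            raise ValueError(f"cannot parse column: {raw.strip()!r}")
        name = m["name"].strip()
        columns.append(Column(
            name=name.strip("<>"),
            exact=name.startswith("<"),
            filter=(m["op"], m["val"]) if m["op"] else None,
            order=(m["dir"], int(m["rank"] or 0)) if m["dir"] else None,
            link=int(m["link"]) if m["link"] is not None else None,
        ))
    return columns

for col in parse_description('City(== "Berlin"@en) | Country | Population [DESC] -> 0'):
    print(col)
```

Keeping the raw filter value as a string leaves the decision between literal kinds (language-tagged string, number) to the query generation step.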
Translation into SPARQL queries
To translate a given table description, the program first generates pairs of the parsed columns such that every column is paired with every other column. For each pair, the program checks how the two columns could be connected, using predefined rules and templates. To speed up the checks, the following data from the project Aqqu is used:
- List of mediator types
- List of mediator relations (relations leading to a mediator)
- List of relations' expected types
- List of relations' target types distribution
- List of relations with a reverse relation
This data was extracted from a Freebase dump.
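The pairing step can be illustrated with a short Python sketch. The function name `column_pairs` is hypothetical, and the sketch assumes unordered pairs; whether the program pairs columns in one or both directions is not stated here.

```python
from itertools import combinations

def column_pairs(columns):
    """Return the index pairs (i, j), i < j, of all distinct columns."""
    return list(combinations(range(len(columns)), 2))

columns = ["<location.citytown>", "<location.country>",
           "<location.statistical_region.population>"]
for i, j in column_pairs(columns):
    print(columns[i], "<->", columns[j])
```

With n columns this yields n(n-1)/2 pairs, so the number of template checks grows quadratically with the number of columns.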
Autocomplete fuzzily defined columns
While typing column definitions, the user gets type and relation suggestions. For this, the autocomplete function performs a fuzzy prefix search, allowing a maximum prefix edit distance of 1. The relevant results are ranked according to the following criteria (in descending priority):
- Types and relations that do not start with "base" or "user" are preferred
- BM25-score (k=1.75, b=0.75)
- Types are preferred over relations
- Count of the type/relation
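The search and ranking described above could look roughly like the following Python sketch. The `Suggestion` record, the function names and the sample data are made up for illustration. The BM25 score is assumed to be precomputed, and the sketch assumes that the query is matched against the words of an identifier (split on "." and "_"), which may differ from what the program actually indexes.

```python
import re
from dataclasses import dataclass

@dataclass
class Suggestion:          # hypothetical record for one type or relation
    name: str              # e.g. "location.citytown"
    is_type: bool          # True for types, False for relations
    bm25: float            # assumed to be precomputed (k=1.75, b=0.75)
    count: int             # occurrences in the database

def prefix_edit_distance(query, candidate):
    """Smallest Levenshtein distance between query and any prefix of candidate."""
    prev = list(range(len(candidate) + 1))   # empty query vs. each prefix
    for i, qc in enumerate(query, start=1):
        cur = [i]
        for j, cc in enumerate(candidate, start=1):
            cur.append(min(prev[j] + 1,                # drop a query character
                           cur[j - 1] + 1,             # add a candidate character
                           prev[j - 1] + (qc != cc)))  # match or substitute
        prev = cur
    return min(prev)   # best over all prefixes of candidate

def rank(query, suggestions, max_distance=1):
    """Keep fuzzy prefix matches and sort them by the four criteria."""
    hits = [s for s in suggestions
            if any(prefix_edit_distance(query.lower(), word) <= max_distance
                   for word in re.split(r"[._]", s.name))]
    return sorted(hits, key=lambda s: (
        s.name.startswith(("base", "user")),  # non-base/user first
        -s.bm25,                              # higher BM25 score first
        not s.is_type,                        # types before relations
        -s.count,                             # higher count first
    ))

# Made-up sample data, only to show the ordering.
sample = [
    Suggestion("base.cities.city", True, 9.0, 10),
    Suggestion("location.citytown", True, 5.0, 1000),
    Suggestion("location.location.containedby", False, 5.0, 5000),
]
print([s.name for s in rank("cty", sample)])
```

Encoding the criteria as one tuple keeps the priority order explicit: a later criterion only matters when all earlier ones tie.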
Templates
Type-Type-Template (TT)
The pair (<location.citytown>, <location.country>) can be matched with this template. Removing "DISTINCT" from the query returns one row per matching triple, so the number of rows per candidate relation gives its count. The relation with the highest count in this case is <location.location.containedby>.
PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?candidate_relation WHERE {
?col1 fb:type.object.type [TYPE_COL1] .
?col2 fb:type.object.type [TYPE_COL2] .
?col1 ?candidate_relation ?col2 .
}
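How the template above might be filled in and evaluated is sketched below in Python. The helper names `fill_tt_template` and `best_relation` are hypothetical, the query is assumed to be run without DISTINCT by some SPARQL client that is not shown, and the result rows in the example are invented.

```python
from collections import Counter

TT_TEMPLATE = """PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT ?candidate_relation WHERE {
  ?col1 fb:type.object.type [TYPE_COL1] .
  ?col2 fb:type.object.type [TYPE_COL2] .
  ?col1 ?candidate_relation ?col2 .
}"""

def fill_tt_template(type_col1, type_col2):
    """Substitute the two placeholders with prefixed Freebase type ids."""
    return (TT_TEMPLATE
            .replace("[TYPE_COL1]", f"fb:{type_col1}")
            .replace("[TYPE_COL2]", f"fb:{type_col2}"))

def best_relation(rows):
    """rows holds one candidate relation per result row (no DISTINCT)."""
    counts = Counter(rows)
    return counts.most_common(1)[0][0] if counts else None

print(fill_tt_template("location.citytown", "location.country"))

# Invented result rows, only to show the counting step.
rows = ["location.location.containedby"] * 3 + ["location.location.adjoin_s"]
print(best_relation(rows))
```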
Type-Mediator-Type-Template (TMT)
The pair (<film.film>, <film.actor>) can be matched with this template. The types are connected via the mediator <film.performance>. The best matching candidate relations are (<film.film.starring>, <film.performance.actor>).
PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?candidate_relation1 ?candidate_relation2 WHERE {
?col1 fb:type.object.type [TYPE_COL1] .
?col2 fb:type.object.type [TYPE_COL2] .
?col1 ?candidate_relation1 ?mediator .
?mediator ?candidate_relation2 ?col2 .
}
Type-Relation-Template (TR)
This template matches a type with a direct relation like (<olympics.olympic_games>, <time.event.start_date>). Only the total result size can be considered here, because no candidate relation has to be found.
PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT ?col2 WHERE {
?col1 fb:type.object.type [TYPE_COL1] .
?col1 [REL_COL2] ?col2 .
}
Type-Mediator-Relation-Template (TMR)
This template matches a type with an indirect relation like (<location.citytown>, <location.geocode.latitude>) via the mediator <location.geocode>. The best matching relation is <location.location.geolocation>.
PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?candidate_relation WHERE {
?col1 fb:type.object.type [TYPE_COL1] .
?col1 ?candidate_relation ?mediator .
?mediator [REL_COL2] ?col2 .
}
Type-Relation-Mediator-Template (TRM)
This template matches a type with a direct mediator relation like (<location.citytown>, <location.statistical_region.population>). The mediator is <measurement_unit.dated_integer>.
PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?candidate_relation WHERE {
?col1 fb:type.object.type [TYPE_COL1] .
?col1 [REL_COL2] ?mediator .
?mediator ?candidate_relation ?col2 .
}
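The five templates fall into two groups: TT and TMT connect a type with a type, while TR, TMR and TRM connect a type with a relation. One way the lists of mediator types and mediator relations could narrow down the template for a (type, relation) pair is sketched below in Python. The function name and the decision rule are assumptions, not the program's documented logic. The rule relies on Freebase relation ids having the form domain.type.property, so the owning type is the id without its last segment.

```python
def template_for_relation(relation, mediator_types, mediator_relations):
    """Pick a template for a (type, relation) pair; hypothetical rule."""
    if relation in mediator_relations:
        return "TRM"   # the relation itself leads to a mediator
    owner_type = relation.rsplit(".", 1)[0]
    if owner_type in mediator_types:
        return "TMR"   # the relation belongs to a mediator type
    return "TR"        # plain direct relation

# Small excerpts built from the examples in this section.
mediator_types = {"measurement_unit.dated_integer", "location.geocode",
                  "film.performance"}
mediator_relations = {"location.statistical_region.population",
                      "location.location.geolocation", "film.film.starring"}

for rel in ("location.statistical_region.population",
            "location.geocode.latitude", "time.event.start_date"):
    print(rel, "->", template_for_relation(rel, mediator_types, mediator_relations))
```

A lookup like this avoids running all three relation templates against the database for every pair.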
Results
The program simplifies the creation of queries for Freebase by helping the user explore types and relations and by automatically finding linking relations between columns. Furthermore, the user does not need to know whether a type has a name (each named type requires an additional triple in the query, like ?citytown fb:type.object.name ?citytown_name) or whether a mediator has to be used for linking.
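The extra triple for named types can be illustrated with a small Python sketch. The function `type_triples` is hypothetical, and the `named` flag stands in for whatever lookup the program uses to decide whether a type has a name. Deriving the variable from the last segment of the type id is an assumption based on the variable names that appear in the program's generated queries.

```python
def type_triples(type_id, named=True):
    """Return the triples for one type column (hypothetical helper)."""
    var = "?" + type_id.rsplit(".", 1)[-1]   # location.citytown -> ?citytown
    triples = [f"{var} fb:type.object.type fb:{type_id} ."]
    if named:
        # Named types need one more triple to select the readable name.
        triples.append(f"{var} fb:type.object.name {var}_name .")
    return triples

print("\n".join(type_triples("location.citytown")))
```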
Largest cities
Table description:
<location.citytown> | <location.country> | <location.statistical_region.population> [DESC] | <location.geocode.latitude> | <location.geocode.longitude>
Generated query:
PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?citytown_name ?country_name ?population_name ?latitude_name ?longitude_name WHERE {
?citytown fb:type.object.type fb:location.citytown .
?citytown fb:type.object.name ?citytown_name .
?country fb:type.object.type fb:location.country .
?country fb:type.object.name ?country_name .
?citytown fb:location.location.containedby ?country .
?citytown fb:location.statistical_region.population ?dated_integer .
?dated_integer fb:measurement_unit.dated_integer.number ?population_name .
?citytown fb:location.location.geolocation ?geocode .
?geocode fb:location.geocode.latitude ?latitude_name .
?geocode fb:location.geocode.longitude ?longitude_name .
}
ORDER BY DESC(?population_name)
The query delivers the intended results.
German films
Table description:
<film.film> | <film.film.initial_release_date> [DESC, 0] | <film.film_genre> [ASC, 1] | <film.film.country>(=="Germany"@en) | <film.director>
Generated query:
PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?film_name ?initial_release_date_name ?film_genre_name ?country_name ?director_name WHERE {
?film fb:type.object.type fb:film.film .
?film fb:type.object.name ?film_name .
?film fb:film.film.initial_release_date ?initial_release_date_name .
?film_genre fb:type.object.type fb:film.film_genre .
?film_genre fb:type.object.name ?film_genre_name .
?film fb:film.film.genre ?film_genre .
?film fb:film.film.country ?country .
?country fb:type.object.name ?country_name .
?director fb:type.object.type fb:film.director .
?director fb:type.object.name ?director_name .
?film fb:film.film.directed_by ?director .
FILTER(?country_name = "Germany"@en) .
}
ORDER BY DESC(?initial_release_date_name) ASC(?film_genre_name)
The query delivers the intended results.
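The way the order ranks of a table description map to the ORDER BY clause of the generated query can be sketched in Python as follows. The function `order_by_clause` and its input format are assumptions; the sketch only assumes that a lower rank means the order is applied earlier, as in the example above.

```python
def order_by_clause(orders):
    """orders: list of (variable, direction, rank) tuples; hypothetical format."""
    if not orders:
        return ""
    ranked = sorted(orders, key=lambda o: o[2])   # rank 0 comes first
    return "ORDER BY " + " ".join(f"{direction}(?{var})"
                                  for var, direction, _ in ranked)

print(order_by_clause([
    ("film_genre_name", "ASC", 1),
    ("initial_release_date_name", "DESC", 0),
]))
```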
Future work
It is planned to continue the project in a master's thesis. The following parts leave room for further research and improvement:
- Evaluation using Wikipedia tables
The basic idea is to find Wikipedia tables whose content also exists in Freebase (or Wikidata).
- Allowing fuzzily defined columns for query translation
At the moment, the program only allows exact table descriptions for query translation. Fuzzy columns would cause a much bigger set of exact definition pairs to be matched. A smarter ranking approach is needed here.
- Allowing the user to readjust matched relations
In some cases, the user may not want the matched relation with the most occurrences. Dropdown lists for readjusting each column pair would help the user to find the desired relations.
- Adapting project to Wikidata backend
Since Freebase was shut down in 2016 and its data is being moved to Wikidata, adapting the project to a Wikidata backend would keep it usable in the future.