Football Data Extraction for Broccoli

Master Project at the Department of Algorithms and Data Structures, Universität Freiburg
by Jonas Bischofberger
SS 2017 - SS 2018
Supervision: Prof. Dr. Hannah Bast

Link to repository



The Broccoli search engine answers queries about a broad range of entities, but lacks information in more specific domains. The task was to choose an appropriate domain, obtain relational and full-text data from it, and integrate that data into the current Broccoli version. For this project, data about association football players (e.g. height, birth date, current team) and teams (e.g. date of foundation) was chosen.


See the README file in the project repository for instructions on how to generate the data and how to set up a Broccoli instance that uses it.


Relational data was scraped from player and team profiles on the website Soccerway. Those profiles are comprehensive and have a simple layout, which makes it easy to extract the data.

Text data was extracted from the Wikipedia articles of the players and teams.

Rough procedure

The profiles are obtained via Common Crawl. URLs from the Soccerway domain are retrieved via the Common Crawl index. A simple rule-based classifier determines whether a URL points to a player profile, a team profile, or an uninteresting page. The profiles are then downloaded and stored with a timestamp and the ID that is present in the Soccerway URL. The extraction routine extracts the facts listed in each profile; only the latest version of each profile is used for this step.
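The rule-based URL classification might look like the following sketch. The URL patterns are assumptions based on Soccerway's typical link structure (".../players/<name>/<id>/" and ".../teams/<country>/<name>/<id>/"), not taken from the project code:

```python
import re

# Assumed Soccerway URL patterns; the real classifier may use different rules.
PLAYER_RE = re.compile(r"/players/[^/]+/(\d+)/?$")
TEAM_RE = re.compile(r"/teams/[^/]+/[^/]+/(\d+)/?$")

def classify_url(url: str) -> str:
    """Return 'player', 'team' or 'other' for a Soccerway URL."""
    if PLAYER_RE.search(url):
        return "player"
    if TEAM_RE.search(url):
        return "team"
    return "other"
```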

To obtain additional full-text data, the name of the entity (player or team) is used to query for a Wikipedia article. When a disambiguation page is retrieved, the first suggested article is used instead. An entity for which no Wikipedia article can be retrieved remains without text information. A simple rule-based procedure that looks for words like "football" and "soccer" ensures that the obtained Wikipedia articles are about footballers and football clubs rather than famous namesakes.
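A minimal sketch of this keyword check, assuming the keyword list mentioned in step 5 ("football", "stadium", "soccer"):

```python
# Keyword heuristic as described in the text; the exact keyword list
# is taken from step 5 of the detailed procedure.
KEYWORDS = ("football", "soccer", "stadium")

def is_football_related(summary: str) -> bool:
    """True if a Wikipedia summary mentions any football-related keyword."""
    text = summary.lower()
    return any(kw in text for kw in KEYWORDS)
```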

Detailed procedure

1. Query

The CC index is queried with the URL template "*".
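As an illustration, a query URL for the Common Crawl CDX index server can be built as follows. The crawl ID and the Soccerway URL pattern are placeholder assumptions:

```python
from urllib.parse import urlencode

def build_index_query(crawl_id: str, url_pattern: str) -> str:
    """Build a query URL for the Common Crawl CDX index server.

    crawl_id: e.g. "CC-MAIN-2018-05"; url_pattern: e.g.
    "int.soccerway.com/*" (both are illustrative values).
    """
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{params}"
```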

2. Filtering

The index responses are filtered according to the following rules:

3. Download

The profiles are downloaded from Common Crawl.
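Each index hit provides the archive filename together with a byte offset and length, so a single gzipped WARC record can be fetched with an HTTP Range request. A sketch, assuming the current data.commoncrawl.org host (the project may have used a different endpoint at the time):

```python
import gzip
import urllib.request

def byte_range(offset: int, length: int) -> str:
    """HTTP Range header value for one WARC record at (offset, length)."""
    return f"bytes={offset}-{offset + length - 1}"

def fetch_record(filename: str, offset: int, length: int) -> bytes:
    """Fetch and decompress a single gzipped WARC record from Common Crawl."""
    req = urllib.request.Request(
        f"https://data.commoncrawl.org/{filename}",
        headers={"Range": byte_range(offset, length)},
    )
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read())
```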

4. Extraction

KB information is extracted according to the following rules.

4.1 Team profiles

Example page:

Team name

The team name is simply the first heading in the HTML page, e.g. "FC Barcelona".

Write the triples (team_name, is-a, Sports team) and (team_name, is-a, Soccer team) to "soccer-ontology.txt".
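A stdlib-only sketch of the heading extraction; treating the first <h1> element as the team name is an assumption, and the actual project may use a different HTML library:

```python
from html.parser import HTMLParser

class FirstH1(HTMLParser):
    """Collect the text of the first <h1> element in an HTML document."""
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self.done:
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1" and self.in_h1:
            self.in_h1 = False
            self.done = True

    def handle_data(self, data):
        if self.in_h1:
            self.parts.append(data)

def team_name(page: str) -> str:
    """Return the text of the first heading of a profile page."""
    p = FirstH1()
    p.feed(page)
    return "".join(p.parts).strip()
```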


The city a team is located in is extracted by taking the bottom line from the "Address" field in the "Info" table.

Write the triple (team_name, Sport Team Location, city) to "soccer-ontology.txt".


The following entries are scraped from the "Info" talbe:

Corresponding KB entries are written to "soccer-ontology.txt"


Some information about a team's stadium, namely its name and capacity, can be scraped from the "Venue" table. Each stadium is treated as an entity, just like players and teams. The following triples are added, provided that the stadium's "Name" and "Capacity" entries are present.
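The conditional construction of these stadium triples could look as follows. The predicate names used here are illustrative assumptions, not the project's actual vocabulary:

```python
# Illustrative sketch: build triples from the "Venue" table, skipping
# absent fields. Predicate names ("Home stadium", "Capacity") are
# assumptions for demonstration only.
def stadium_triples(team_name: str, venue: dict) -> list:
    """Return KB triples for a stadium, as far as its fields are present."""
    triples = []
    name = venue.get("Name")
    capacity = venue.get("Capacity")
    if name:
        triples.append((name, "is-a", "Stadium"))
        triples.append((team_name, "Home stadium", name))
        if capacity:
            triples.append((name, "Capacity", capacity))
    return triples
```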

4.2 Player profiles

Example page:


The name under which a person is best known is composed as follows:


The following entries are scraped from the "Passport" table:

Corresponding KB entries are written to "soccer-ontology.txt"


The most recent team of a player is the uppermost team in the "Career" table. The team name in this table is a hyperlink. The ID in this link is used to look up whether the team has already been scraped.
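Extracting the numeric ID from such a team link might be done like this; the "/teams/<country>/<name>/<id>/" pattern is an assumption about Soccerway's URL scheme:

```python
import re

# Assumed team-link pattern on the "Career" table; the numeric ID at the
# end identifies the team and can be checked against already-scraped IDs.
def team_id(href: str):
    """Extract the numeric Soccerway team ID from a team link, or None."""
    m = re.search(r"/teams/[^/]+/[^/]+/(\d+)/?", href)
    return int(m.group(1)) if m else None
```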

4.3 Entity Scores

Players, teams and stadiums have associated scores for ranking. The scores are taken from a reference file of the standard Broccoli with the following exceptions:

Entity-score pairs are written to "soccer-ontology.entity-scores.noabs"

5. Obtaining text

For each entity (player/team/stadium) that has been obtained in the preceding step, the following procedure is run:

  1. Find an appropriate article by querying Wikipedia with the entity name.
  2. Get the summary of this article.
  3. Check whether the article is football-related, i.e. whether the summary contains any of the strings "football", "stadium" or "soccer".
  4. If it is, get the URL and plain text content of the Wikipedia page and store the URL, article name, entity name and content in a file.

A file that is already present is replaced only if it is older than a certain timespan (3 months).
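The age check can be sketched with file modification times; the 90-day constant is an approximation of the 3-month timespan:

```python
import os
import time

MAX_AGE_SECONDS = 90 * 24 * 60 * 60  # roughly 3 months

def is_stale(path: str, now: float = None) -> bool:
    """True if the file at `path` is missing or older than ~3 months."""
    if not os.path.exists(path):
        return True
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path) > MAX_AGE_SECONDS
```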

If no article can be found for an entity, no text is stored for that entity.

6. Parsing text

The plain text files have to be converted into a special format to be useful for Broccoli.

All text files are traversed one after another, sentence by sentence, and each sentence word by word. During this process, various pieces of information are extracted and written to different files.

For each sentence write a row:


For each word write a row:

In between, special entries are needed whenever entities are mentioned. As a simple rule, every sequence of capitalized words is treated as an entity mention. If such a sequence is encountered, add the following entry:


This file is filled with all words (and identified entity sequences in the ":e:" format) that appear in the text files, one word per line.
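The capitalized-word heuristic can be sketched as follows; the exact tokenization and the underscore joining are illustrative assumptions, and only the ":e:" prefix is taken from the text:

```python
import re

# Heuristic from the text: every maximal sequence of capitalized words is
# treated as one entity mention. Note that this also picks up ordinary
# capitalized words, e.g. at the start of a sentence.
ENTITY_RE = re.compile(r"\b[A-Z][a-zA-Z]*(?:\s+[A-Z][a-zA-Z]*)*\b")

def find_entities(sentence: str) -> list:
    """Return entity tokens in the ":e:" format for one sentence."""
    return [":e:" + m.group(0).replace(" ", "_")
            for m in ENTITY_RE.finditer(sentence)]
```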

7. Fixing scores

The Wikipedia articles are used to correct some additional bad scores caused by synonyms and namesakes:

  1. Create a list of all Wikipedia article names collected in step 5.
  2. Replace the score of every entity that does not appear in this list with 1.
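These two steps amount to a simple lookup. The sketch below assumes exact name matching between entities and article names; the real step may normalize names first:

```python
# Step 7 sketch: entities without a collected Wikipedia article get their
# score reset to 1. Names and scores here are made-up examples.
def fix_scores(scores: dict, article_names: set) -> dict:
    """Reset the score of entities that have no Wikipedia article to 1."""
    return {name: (score if name in article_names else 1)
            for name, score in scores.items()}
```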