The Broccoli search engine answers queries about a broad range of entities, but lacks information in more specific domains. The task was to choose an appropriate domain, obtain relational and full-text data from it, and integrate that data into the current Broccoli version. For this project, data about association football players (e.g. height, birth date, current team) and teams (e.g. date of foundation) was chosen.
See the README file in the project repository for instructions on how to generate the data and how to set up a Broccoli instance that uses it.
Relational data was scraped from profiles on the website soccerway.com. Those profiles are comprehensive and have a simple layout, which makes the data easy to extract.
Text data was extracted from Wikipedia articles of the players and teams.
The profiles are obtained via Common Crawl. URLs from the soccerway domain are retrieved through the Common Crawl index. A simple rule-based classifier decides whether a URL points to a player profile, a team profile, or an uninteresting page. The matching profiles are then downloaded and stored together with a timestamp and the ID that is present in the soccerway URL. The extraction routine extracts the facts listed in each profile; only the latest version of each profile is used for this step.
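A minimal sketch of such a classifier, assuming URL shapes like the example pages shown further below (the real rules may be stricter):

```python
import re

# Patterns inferred from the example profile URLs below, e.g.
# .../players/lionel-andres-messi/119/ and .../teams/spain/futbol-club-barcelona/2017/
PLAYER_RE = re.compile(r"soccerway\.com/players/[^/]+/(\d+)/?$")
TEAM_RE = re.compile(r"soccerway\.com/teams/[^/]+/[^/]+/(\d+)/?$")

def classify_url(url):
    """Return ('player', id), ('team', id) or ('other', None)."""
    m = PLAYER_RE.search(url)
    if m:
        return "player", m.group(1)
    m = TEAM_RE.search(url)
    if m:
        return "team", m.group(1)
    return "other", None
```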
To obtain additional full-text data, the name of the entity (player or team) is used to query for a Wikipedia article. When a disambiguation page is retrieved, the first suggested article is used instead. An entity for which no Wikipedia article can be retrieved remains without text information. A simple rule-based procedure that looks for words like "football" and "soccer" ensures that the obtained Wikipedia articles are about footballers and football clubs rather than famous namesakes.
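The lookup could be implemented roughly as follows. The third-party wikipedia package and the exact keyword list are assumptions for illustration, not necessarily what the project uses:

```python
import wikipedia

FOOTBALL_WORDS = ("football", "soccer", "footballer")  # assumed keyword list

def fetch_article(entity_name):
    """Fetch the Wikipedia article text for an entity, or None."""
    try:
        page = wikipedia.page(entity_name, auto_suggest=False)
    except wikipedia.DisambiguationError as e:
        try:
            # On a disambiguation page, fall back to the first suggestion.
            page = wikipedia.page(e.options[0], auto_suggest=False)
        except wikipedia.exceptions.WikipediaException:
            return None
    except wikipedia.exceptions.PageError:
        return None
    text = page.content
    # Rule-based sanity check against famous namesakes.
    if not any(word in text.lower() for word in FOOTBALL_WORDS):
        return None
    return text
```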
The CC index is queried with the URL template "soccerway.com/*".
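For illustration, this is one way to fetch the matching index records; the crawl ID is an arbitrary example, and results for large domains are paginated:

```python
import json
import requests

INDEX = "https://index.commoncrawl.org/CC-MAIN-2018-13-index"  # example crawl

def soccerway_records(page=0):
    resp = requests.get(INDEX, params={
        "url": "soccerway.com/*",  # the URL template from above
        "output": "json",
        "page": page,
    })
    resp.raise_for_status()
    # One JSON object per line: url, timestamp, filename, offset, length, ...
    return [json.loads(line) for line in resp.text.splitlines()]
```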
The index responses are filtered according to the following rules:
The profiles are downloaded from Common Crawl.
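Each index record carries filename, offset and length fields that locate the capture inside a WARC archive, so a single HTTP range request per profile suffices. A sketch (the download host is the one Common Crawl currently documents and is an assumption for this project's setup):

```python
import gzip
import io
import requests

def fetch_capture(record):
    """Download one capture located by an index record (see above)."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    resp = requests.get(
        "https://data.commoncrawl.org/" + record["filename"],
        headers={"Range": "bytes=%d-%d" % (start, end)},
    )
    resp.raise_for_status()
    # Each capture is an individually gzipped WARC record.
    return gzip.GzipFile(fileobj=io.BytesIO(resp.content)).read()
```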
KB information is extracted according to the following rules.
Example page: https://us.soccerway.com/teams/spain/futbol-club-barcelona/2017/
The team name is simply the first heading in the HTML page, e.g. "FC Barcelona".
Write triples (team_name, is-a, Sports team) and (team_name, is-a, Soccer team) to "soccer-ontology.txt"
The city a team is located in is extracted by taking the bottom line from the "Address" field in the "Info" table.
Write triple (team_name, Sport Team Location, city) to "soccer-ontology.txt"
The following entries are scraped from the "Info" table:
Corresponding KB entries are written to "soccer-ontology.txt"
Some information about a team's stadium can be scraped from the "Venue" table, namely its name and capacity. Each stadium is treated as an entity, just like players and teams. The following triples are added whenever the stadium's "Name" and "Capacity" fields are present.
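A rough sketch of the team extraction rules above. The HTML selectors and the triple serialization are assumptions and would have to be checked against the live pages:

```python
from bs4 import BeautifulSoup

def write_triple(out, subj, pred, obj):
    # Placeholder serialization; the real format is whatever
    # Broccoli expects in "soccer-ontology.txt".
    out.write("%s\t%s\t%s\n" % (subj, pred, obj))

def scrape_team_page(html, out):
    soup = BeautifulSoup(html, "html.parser")
    # The team name is the first heading on the page.
    team_name = soup.find(["h1", "h2"]).get_text(strip=True)
    write_triple(out, team_name, "is-a", "Sports team")
    write_triple(out, team_name, "is-a", "Soccer team")
    # The city is the bottom line of the "Address" field in the "Info"
    # table; the dt/dd structure is an assumption.
    address = soup.find("dt", string="Address")
    if address is not None:
        lines = address.find_next("dd").get_text("\n", strip=True).splitlines()
        write_triple(out, team_name, "Sport Team Location", lines[-1])
```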
Example page: https://us.soccerway.com/players/lionel-andres-messi/119/
The name under which a person is best known is composed as follows:
The following entries are scraped from the "Passport" table:
Corresponding KB entries are written to "soccer-ontology.txt"
The most recent team of a player is the uppermost team in the "Career" table. The team name in this table is a hyperlink; the ID in this link is used to look up whether the team has already been scraped.
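The ID can be pulled out of the href with a small regular expression, grounded in the URL shape of the example team page above:

```python
import re

TEAM_ID_RE = re.compile(r"/teams/[^/]+/[^/]+/(\d+)/")

def extract_team_id(href):
    """Pull the soccerway team ID out of a career-table hyperlink."""
    m = TEAM_ID_RE.search(href)
    return m.group(1) if m else None

# e.g. extract_team_id("/teams/spain/futbol-club-barcelona/2017/") == "2017"
```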
Players, teams and stadiums have associated scores for ranking. The scores are taken from a reference file of the standard Broccoli with the following exceptions:
Entity-score pairs are written to "soccer-ontology.entity-scores.noabs"
For each entity (player/team/stadium) that has been obtained in the preceding step, the following procedure is run:
A file that is already present is replaced only if it is older than a certain timespan (3 months); see the sketch after this list.
If no article can be found for an entity, no text is stored for it.
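A sketch of the freshness check from the first rule, with 90 days standing in for "3 months":

```python
import os
import time

MAX_AGE_SECONDS = 90 * 24 * 3600  # roughly 3 months

def needs_refresh(path):
    """True if the cached article is missing or older than the timespan."""
    if not os.path.exists(path):
        return True
    return time.time() - os.path.getmtime(path) > MAX_AGE_SECONDS
```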
The plain text files have to be converted into a special format to be useful for Broccoli.
All text files are traversed one after another, sentence by sentence, and within each sentence word by word. During this pass, various pieces of information are extracted and written to different files.
For each sentence write a row:
For each word write a row:
In between, special entries are needed wherever entities are mentioned. As a simple rule, every sequence of capitalized words is treated as an entity (see the sketch below). If such a sequence is encountered, add the following entry:
This file is filled with all words (and identified sequences in the ":e:" format) that appear in the text files, one word per line.
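A sketch of the capitalization rule. The ":e:" prefix follows the format mentioned above, but the exact token layout is an assumption:

```python
def tokenize_with_entities(words):
    """Merge every maximal run of capitalized words into one entity token."""
    out, run = [], []
    for w in words:
        if w[:1].isupper():
            run.append(w)
        else:
            if run:
                out.append(":e:" + "_".join(run))
                run = []
            out.append(w)
    if run:
        out.append(":e:" + "_".join(run))
    return out

# e.g. tokenize_with_entities("he joined Futbol Club Barcelona in 2004".split())
# -> ['he', 'joined', ':e:Futbol_Club_Barcelona', 'in', '2004']
```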
The Wikipedia articles are used to remove some additional bad scores caused by synonyms/namesakes: