The Broccoli search engine answers queries about a broad range of entities, but lacks information in more specific domains. The task was to choose an appropriate domain, obtain relational and full-text data from it, and integrate that data into the current Broccoli version. For this project, data about association football players (e.g. height, birth date, current team) and teams (e.g. date of foundation) was chosen.
See the README file in the project repository for instructions on how to generate the data and how to set up a Broccoli instance that uses it.
Relational data was scraped from profiles on the website soccerway.com. Those profiles are comprehensive and have a simple layout, which makes the data easy to extract.
Text data was extracted from Wikipedia articles of the players and teams.
The profiles are obtained via Common Crawl. URLs from the soccerway domain are retrieved through the Common Crawl index. A simple rule-based classifier decides whether a URL points to a player profile, a team profile, or an uninteresting page. The matching profiles are then downloaded and stored together with a timestamp and the ID that is present in the soccerway URL. The extraction routine extracts the facts listed in each profile; only the latest version of each profile is used for this step.
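A minimal sketch of such a classifier, assuming URL shapes like the example pages shown further below (the real rules may be stricter):

```python
import re

# Patterns inferred from the example profile URLs below, e.g.
# .../players/lionel-andres-messi/119/ and .../teams/spain/futbol-club-barcelona/2017/
PLAYER_RE = re.compile(r"soccerway\.com/players/[^/]+/(\d+)/?$")
TEAM_RE = re.compile(r"soccerway\.com/teams/[^/]+/[^/]+/(\d+)/?$")

def classify_url(url):
    """Return ('player', id), ('team', id) or ('other', None)."""
    m = PLAYER_RE.search(url)
    if m:
        return "player", m.group(1)
    m = TEAM_RE.search(url)
    if m:
        return "team", m.group(1)
    return "other", None
```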
To obtain additional full-text data, the name of the entity (player or team) is used to query for a Wikipedia article. When a disambiguation page is retrieved, the first suggested article is used instead. An entity for which no Wikipedia article can be retrieved remains without text information. A simple rule-based procedure that looks for words like "football" and "soccer" ensures that the obtained Wikipedia articles are about footballers and football clubs rather than famous namesakes.
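The lookup could be implemented roughly as follows. The third-party wikipedia package and the exact keyword list are assumptions for illustration, not necessarily what the project uses:

```python
import wikipedia

FOOTBALL_WORDS = ("football", "soccer", "footballer")  # assumed keyword list

def fetch_article(entity_name):
    """Fetch the Wikipedia article text for an entity, or None."""
    try:
        page = wikipedia.page(entity_name, auto_suggest=False)
    except wikipedia.DisambiguationError as e:
        try:
            # On a disambiguation page, fall back to the first suggestion.
            page = wikipedia.page(e.options[0], auto_suggest=False)
        except wikipedia.exceptions.WikipediaException:
            return None
    except wikipedia.exceptions.PageError:
        return None
    text = page.content
    # Rule-based sanity check against famous namesakes.
    if not any(word in text.lower() for word in FOOTBALL_WORDS):
        return None
    return text
```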
The CC index is queried with the URL template "soccerway.com/*".
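For illustration, this is one way to fetch the matching index records; the crawl ID is an arbitrary example, and results for large domains are paginated:

```python
import json
import requests

INDEX = "https://index.commoncrawl.org/CC-MAIN-2018-13-index"  # example crawl

def soccerway_records(page=0):
    resp = requests.get(INDEX, params={
        "url": "soccerway.com/*",  # the URL template from above
        "output": "json",
        "page": page,
    })
    resp.raise_for_status()
    # One JSON object per line: url, timestamp, filename, offset, length, ...
    return [json.loads(line) for line in resp.text.splitlines()]
```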
The index responses are filtered according to the following rules:
The profiles are downloaded from Common Crawl.
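Each index record carries filename, offset and length fields that locate the capture inside a WARC archive, so a single HTTP range request per profile suffices. A sketch (the download host is the one Common Crawl currently documents and is an assumption for this project's setup):

```python
import gzip
import io
import requests

def fetch_capture(record):
    """Download one capture located by an index record (see above)."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    resp = requests.get(
        "https://data.commoncrawl.org/" + record["filename"],
        headers={"Range": "bytes=%d-%d" % (start, end)},
    )
    resp.raise_for_status()
    # Each capture is an individually gzipped WARC record.
    return gzip.GzipFile(fileobj=io.BytesIO(resp.content)).read()
```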
KB information is extracted according to the following rules.
Example page: https://us.soccerway.com/teams/spain/futbol-club-barcelona/2017/
The team name is simply the first heading in the HTML page, e.g. "FC Barcelona".
Write triples (team_name, is-a, Sports team) and (team_name, is-a, Soccer team) to "soccer-ontology.txt"
The city a team is located in is extracted by taking the bottom line from the "Address" field in the "Info" table.
Write triple (team_name, Sport Team Location, city) to "soccer-ontology.txt"
The following entries are scraped from the "Info" table:
Corresponding KB entries are written to "soccer-ontology.txt"
Some information about a team's stadium can be scraped from the "Venue" table, namely its name and capacity. Each stadium is treated as an entity, just like players and teams. The following triples are added whenever the stadium's "Name" and "Capacity" fields are present.
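A rough sketch of the team extraction rules above. The HTML selectors and the triple serialization are assumptions and would have to be checked against the live pages:

```python
from bs4 import BeautifulSoup

def write_triple(out, subj, pred, obj):
    # Placeholder serialization; the real format is whatever
    # Broccoli expects in "soccer-ontology.txt".
    out.write("%s\t%s\t%s\n" % (subj, pred, obj))

def scrape_team_page(html, out):
    soup = BeautifulSoup(html, "html.parser")
    # The team name is the first heading on the page.
    team_name = soup.find(["h1", "h2"]).get_text(strip=True)
    write_triple(out, team_name, "is-a", "Sports team")
    write_triple(out, team_name, "is-a", "Soccer team")
    # The city is the bottom line of the "Address" field in the "Info"
    # table; the dt/dd structure is an assumption.
    address = soup.find("dt", string="Address")
    if address is not None:
        lines = address.find_next("dd").get_text("\n", strip=True).splitlines()
        write_triple(out, team_name, "Sport Team Location", lines[-1])
```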
Example page: https://us.soccerway.com/players/lionel-andres-messi/119/
The name under which a person is best known is composed as follows:
The following entries are scraped from the "Passport" table:
Corresponding KB entries are written to "soccer-ontology.txt"
The most recent team of a player is the uppermost team in the "Career" table. The team name in this table is a hyperlink; the ID in this link is used to look up whether the team has already been scraped.
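The ID can be pulled out of the href with a small regular expression, grounded in the URL shape of the example team page above:

```python
import re

TEAM_ID_RE = re.compile(r"/teams/[^/]+/[^/]+/(\d+)/")

def extract_team_id(href):
    """Pull the soccerway team ID out of a career-table hyperlink."""
    m = TEAM_ID_RE.search(href)
    return m.group(1) if m else None

# e.g. extract_team_id("/teams/spain/futbol-club-barcelona/2017/") == "2017"
```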
Players, teams and stadiums have associated scores for ranking. The scores are taken from a reference file of the standard Broccoli with the following exceptions:
Entity-score pairs are written to "soccer-ontology.entity-scores.noabs"
For each entity (player/team/stadium) that has been obtained in the preceding step, the following procedure is run:
A file that is already present is replaced only if it is older than a certain timespan (3 months); see the sketch after this list.
If no article can be found for an entity, no text is stored for it.
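A sketch of the freshness check from the first rule, with 90 days standing in for "3 months":

```python
import os
import time

MAX_AGE_SECONDS = 90 * 24 * 3600  # roughly 3 months

def needs_refresh(path):
    """True if the cached article is missing or older than the timespan."""
    if not os.path.exists(path):
        return True
    return time.time() - os.path.getmtime(path) > MAX_AGE_SECONDS
```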
The plain text files have to be converted into a special format to be useful for Broccoli.
All text files are traversed one after another, sentence by sentence, and within each sentence word by word. During this pass, various pieces of information are extracted and written to different files.
For each sentence write a row:
For each word write a row:
In between, special entries are needed wherever entities are mentioned. As a simple rule, every sequence of capitalized words is treated as an entity (see the sketch below). If such a sequence is encountered, add the following entry:
This file is filled with all words (and identified sequences in the ":e:" format) that appear in the text files, one word per line.
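A sketch of the capitalization rule. The ":e:" prefix follows the format mentioned above, but the exact token layout is an assumption:

```python
def tokenize_with_entities(words):
    """Merge every maximal run of capitalized words into one entity token."""
    out, run = [], []
    for w in words:
        if w[:1].isupper():
            run.append(w)
        else:
            if run:
                out.append(":e:" + "_".join(run))
                run = []
            out.append(w)
    if run:
        out.append(":e:" + "_".join(run))
    return out

# e.g. tokenize_with_entities("he joined Futbol Club Barcelona in 2004".split())
# -> ['he', 'joined', ':e:Futbol_Club_Barcelona', 'in', '2004']
```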
The Wikipedia articles are used to remove some additional bad scores caused by synonyms/namesakes: