# Wikification

Author:

Supervisor:

## 1. Introduction

[go back]

Wikification is the process of identifying the important phrases in a document and linking each of them to appropriate articles on Wikipedia based on their context of occurrences. The important phrases in a document are also called keyphrases, somewhat similar to the term keyword, but unlike a keyword , a keyphrase can consist of one or more words. A wikifier is a software that performs the wikification of a document. One such software has been developed and made available here. This page discusses how it was developed from the ground-up. A more detailed report is available here.

### 1.1. The two stages

Wikification can be broadly classified into the following two stages.

• Knowledge-Base creation:
In this stage, a knowledge-base is prepared from the Wikipedia database dump file which is later used by the engine, for the wikification of a document.
• Wikification:
In this stage, keyphrases in the input document are recognized and are linked to appropriate articles on Wikipedia based on their context of occurrences.

## 2. Knowledge-base creation

[go back]

### 2.1. Obtaining the Wikipedia database dump

The Wikipedia database dump file containing the latest pages and articles could be downloaded from here. It contains only the current revisions - exactly what is needed - and no talk or user pages. The Perl module MediaWiki::DumpFile::FastPages , available in CPAN, has been used to parse the uncompressed database dump. It enables simultaneous iteration over the title and content of each page in the database dump in a sequential manner.

### 2.2. Types of pages in the Wikipedia database dump

• Disambiguation pages: The content of one such page on Wikipedia contains one ore more links to existent articles or other disambiguation pages. Such a page could be identified by the presence of the phrase (disambiguation) in its title text. For example, if the phrase Austin (disambiguation) is the title text of a page, then its content would be a list of one or more links as mentioned above.
• Redirect pages: The content of one such page contains a single link that causes Wikipedia to redirect to a page (hereinafter referred to as the target page) referred by that link. It is possible that the target page in the dump file is also of type redirect page. Wikipedia uses the term double redirects to denote such pages. In order to prevent infinite loops, Wikipedia stops the chain after the first redirect. Hence, at the time of knowledge-base creation double redirects are ignored.
• Articles: A page of this type is an article written about the topic contained in its title-text. This is what an user gets to see when they search for an existent article on Wikipedia. For example, following the URL https://en.wikipedia.org/wiki/Chennai, takes to a page on Wikipedia, which is of type article.

### 2.3. Free Links in Wikipedia

In Wikipedia, free links are used to produce internal links between pages, which enable users to access information related to the article they are reading. They are created by enclosing a page-title in double square brackets. Thus, enclosing the word Chennai in double square brackets, like this, [[Chennai]], will cause the text to be anchored with a hyperlink to the respective Wikipedia article, i.e. Chennai, in this case.

Optionally a vertical bar | could be added to customize the anchor text. For example, typing [[Chennai | the capital city of Tamilnadu]] would result in the text the capital city of Tamilnadu being anchored to the respective Wikipedia article, which is, in this case, the aforementioned article.

### 2.4. Keyphrase extraction

As mentioned before, a keyphrase can be composed of one or more words, but for reasons mentioned later in this section further restrictions are imposed on keyphrases. The following set of rules restrict what a keyphrase can be composed of.

#### 2.4.1. Composition of a keyphrase

• A keyphrase can be composed of only alphanumeric characters. However, it is not just limited to alphanumeric characters in the ASCII representation.
• Every non-alphanumeric character found in the keyphrase should be replaced with a single whitespace character.
• It cannot contain punctuation of any form.
• All the letters should be case-folded to lowercase.
• Should a phrase contain multiple words, the words are to be separated by a single whitespace character.
• The number of words in a keyphrase is limited to a maximum of 10.

#### Benefits of these restrictions on keyphrase

These restrictions on the keyphrase offer the following benefits:

• The set of target entities of two keyphrases, which are essentially the same but are under different case foldings could then be combined. Consider two keyphrases austin and Austin each having their own set of referred entities. Let the referred entities of the keyphrase austin be Austin Island and Austin College, and let that of Austin be Austin College and Austin Motor Company.

Applying the rules, the sets of entities of the two keyphrases could then be merged with the case-folded austin acting as the keyphrase with referred entities being Austin Island, Austin College and Austin Motor Company.

• Eliminating the non-alphanumeric characters in the keyphrases provides an uniform way of recognizing the keyphrases in an input document at the time of wikification. Consider, for example, the following excerpt in an input document.

... Alaska’s residents grapple with changing climate ...

The rules for keyphrase extraction when applied to the input document case-folds it to lowercase and replaces each of the non-alphanumeric characters in it with a single whitespace , causing the above excerpt to become as follows.

... alaska s residents grapple with changing climate ...

This makes it possible to recognize the word alaska in the document if such a keyphrase existed in the keyphrase vocabulary. Imagine, without these rules, the entire word would be Alaska’s which may not contain a matching keyphrase in the keyphrase-vocabulary of the knowledge-base.

#### 2.4.2. Extracting the keyphrases from articles

• For free links without a vertical bar, i.e. without |, the enclosed text would act both as the keyphrase and the target entity. For example, for the free link [[Chennai ]], the extracted keyphrase and target entity would be the following.

chennaiChennai

• For free links with a vertical bar, the text before the bar, i.e. the entity text, acts as the target entity whereas the text that follows it acts as the keyphrase. In addition to that, the entity text is also used as a keyphrase, just like it was used in a free link without vertical bar.

For example, for the free link [[Chennai|the capital city of Tamilnadu]], the following keyphrase - target entity pairs are extracted.

the capital city of tamilnaduChennai

chennaiChennai

Doing so, helps improve the keyphrase vocabulary, which in turn could increase the possibility of important phrases in a document being recognized at the time of wikification.

#### 2.4.3. Extracting from page titles

Limiting the keyphrase extraction to free links in articles would severely limit the keyphrase vocabulary leaving several non-linked article titles orphan. In other words, if there was an article on Wikipedia, not linked from any other article, then it will be left unlinked if the keyphrase extraction procedure is limited to extracting keyphrases from the body of the articles, as done above.

#### 2.4.3.1. Extracting the keyphrases from article-titles

As the parser iterates the article titles, keyphrases are extracted from the titles and the titles are added to the set of target entities of those keyphrases. Any information enclosed in parenthesis is discarded. For example, if a page title is Casablanca (film), then the following keyphrase − target entity pair is extracted from it.

casablancaCasablanca (film),

If another page title, such as Casablanca (volcano), is encountered, then another keyphrase − target entity pair is generated with the same keyphrase, i.e. casablanca causing it to have two different target entities.

casablancaCasablanca (film), Casablanca (volcano)

It is necessary to understand why the text within parentheses is discarded. If it were not, then the generated keyphrase would, for example, be casablanca film. It is very unlikely that an input document has this exact word sequence. By discarding the text within parentheses, the chances of a phrase in an input document being found in the keyphrase vocabulary increases. Since there are algorithms to disambiguate entities in the later stages of wikification, this does not cause any problem. Even if the word sequence casablanca film appeared in an input document, the wikifier would still be able to identify the phrase casablanca in it.

#### 2.4.3.2. Extracting from redirect titles

The keyphrase obtained from the title of a redirect page could not use the title as a target entity, since the title redirects to another page on Wikipedia. Therefore, the keyphrase is assigned to the title in the body of the redirect page (hereinafter referred to as the redirect target). Additionally, a keyphrase is extracted from the redirect target and assigned to it. For example, the title Air Transport on Wikipedia redirects to the title Aviation. In this case, the following keyphrase − target entity pairs are extracted.

air transportAviation

aviationAviation.

However, in order to avoid chain-redirects, if a redirect target redirects to another page, no keyphrase − target entity pairs are obtained.

#### 2.4.3.3. Extraction from disambiguation titles

A keyphrase obtained from the title of a disambiguation page has, as its target, each of the articles available in the page. For example, the title Casablanca (disambiguation) is that of a disambiguation page on Wikipedia, containing the following entities.

[[Casablanca (volcano)]]
[[Casablanca Records]]
[[Casablanca (film)]]

As done before, the keyphrase casablanca is extracted from the disambiguation title, and each of the entities in the disambiguation page is assigned to be target of this keyphrase, as shown below.

#### 2.4.3.4. Nested disambiguations

It is possible for an entry in a Wikipedia disambiguation page to link to another disambiguation page. For example, the page Austin (disambiguation) on Wikipedia, contains the following entry in its content, which is the title of another disambiguation page.

[[Austin Station (disambiguation)]]

A disambiguation entity cannot be assigned as the target of a keyphrase. Therefore, the page of the nested disambiguation entity is fetched and all the non-disambiguation entities in its content are assigned to this keyphrase. For example, if the page Austin Station (disambiguation) contained the following content,

[[Austin (Amtrak station)]]
[[Austin (CTA Blue Line)]]
[[Austin (CTA Green Line)]]

all of these entities are assigned to be the target of the keyphrase austin, as shown below.

In addition to that, obviously, as done before, the keyphrase austin station also gets all these target entities, as shown below.

austin stationAustin (Amtrak station), Austin (CTA Blue Line), Austin (CTA Green Line)

#### 2.4.3.5. Chained disambiguations

It is possible for a nested disambiguation page to further contain links to disambiguation pages. Such entries are ignored to avoid having to go down a possibly infinite chain.

### 2.5. Additional extracted information

In addition to extracting the keyphrases, several other information are extracted from the database dump. A list of all articles titles and their contents are extracted. Similar to what was done during the keyphrase extraction, all non-alphanumeric characters are removed from the article contents at the time of extraction. The content of an article is used to compute its similarity with an input document, as discussed in the later chapters. Moreover, for each article, a set of incoming links is also maintained. The incoming links of an article is the set of all articles in Wikipedia that have free-links to this article. This information is used to compute the semantic relatedness between two articles on Wikipedia.

### 2.6. Storage and retrieval

The amount of data in the knowledge-base is summarized below.

• 7 million articles
• 10.5 million distinct keyphrases
• 33 million distinct words

Each of the 7mn articles have a title and content and the total size of the contents is approximately 27GB uncompressed. Each of the 10.5mn keyphrases have at least one target entity and every word in the vocabulary has its inverse document frequency score.

All this information are required at the time of wikification, and therefore, it is important to design a strategy for its efficient retrieval. The content of an article, as mentioned earlier, is necessary to compute its similarity with the input document. However, storing, for example, 27GB of data in the main memory, is avoided for the following reasons.

• Reading such huge amount of data from secondary storage into main memory during the start of wikification engine, is still a time-consuming operation.
• Only a very small proportion of the data that resides in the memory is required at any time by the wikification engine. Therefore, the rest of the data occupies precious space in the main memory.
• On a server where several applications run simultaneously such a wastage of main-memory adversely impacts the server’s scalability.

#### 2.7.1. Indexing the knowledge base

When a knowledge base is created, each distinct component of it is stored one per line. In other words, the keyphrase vocabulary is stored in a file with each line containing one keyphrase. Pairs of data are stored in separate files but in the same line number in both files. Consider, for example, the following keyphrase and its set of target entities.

austin stationAustin (Amtrak station), Austin (CTA Blue Line),

If the keyphrase austin station is stored at, say, line number 10 of the file keyphrases.txt, then its target entities Austin (Amtrak station) and Austin (CTA Blue Line) are stored in the same line number, i.e. line number 10 of the file, say, entities.txt.

Doing so, enables the entities of keyphrase to be accessed from the file based on the line number at which the keyphrase occurs. The same holds for article titles and their respective contents.

Now, the problem of accessing a random line in a text file needs to be addressed. Issuing an UNIX command such as sed -n <line>p <file> takes forever to fetch a line near the end of the file for very large files, as the complexity of this approach is O(s), where s is the size of the file in bytes (The command has to skip that many newline characters before the desired line number is reached).

#### 2.7.1.1. Indexing the lines in the file

The idea presented here, exploits both the the static nature of the knowledge-base and the ability of the file system to read the data at a random line-number from a file very efficiently.

For each line in the input file, an index is created, which contains the byte location at which the line starts. For example, the index for the first line in the file would be 0. In this fashion, for a file having N number of lines, N number of indices are created and stored in another file, namely, the index file.

Given the static nature of the information in the knowledge-base, the index seldom needs to be updated. Now any line in the file could be read efficiently with the help of the index. In Java, the class java.io.RandomAccessFile has a seek (long) method to randomly address a location in the file, specified in its argument.

The size of the index file of the article contents is ca. 80MB, which is only 0.28% of the size of the article contents file, and therefore, could be easily stored in the main memory, in an array, for example. On a machine running Core i7 QM-3720 with Sata 6 SSD as the secondary storage, reading 1000 articles randomly from the disk using this approach took about 0.5ms, with an average size of about 4KB per article.

## 3. Wikification

[go back]

### 3.1. Ranking keyphrases

The keyphraseness [3] of a phrase, measures the significance of the phrase in any document. This is the approach currently followed by the wikifier for ranking the keyphrases. The keyphrases found in the input document are ranked in decreasing order of their keyphraseness scores. The keyphraseness of a phrase $p$ is defined as follows.

$\text{keyphraseness}(p) = \displaystyle{\frac{|p_{\text{link}}|}{|\text{DF}(p)\vert}}$

Where, $|p_{\text{link}}|$ is the number of articles in Wikipedia in which the phrase $p$ appears as a link, and DF$(p)$ is the document frequency of p. It holds that $|p_{\text{link}} | \leq \text{DF}(p)$, and hence,

$0 \leq \text{keyphraseness}(p) \leq 1$

A phrase with keyphraseness of 1 would mean that it is linked wherever it appears, and hence, must be a significant phrase, whereas a phrase with lower keyphraseness would mean that it is seldom linked, and hence, is a keyphrase of less significance. Therefore, the phrases in the input document are sorted in descending order of their respective keyphraseness scores.

### 3.2. Ranking the entities

A keyphrase can refer to several target entities. The entities are ranked based on their local compatibility scores. The compatibility between a keyphrase $k$ and an entity $e$, denoted by $\text{CP}(k, e)$ is defined as follows.

$\text{CP}(k, e) = \displaystyle{\frac{\vec{m}\cdot\vec{e}}{|\vec{m}||\vec{e}|}}$

Where, $\vec{m}$ is a vector containing the $tf \times idf$ scores of the words in the local context of the keyphrase, $\vec{e}$ is a vector of $tf \times idf$ scores of the words in the entity, and $|\vec{m}|$ and $|\vec{e}|$ denote the normalization of the vectors $\vec{m}$ and $\vec{e}$, respectively.

### 3.3. Overview of the wikification procedure

1. The input document is case folded to all lowercase.
2. All non-alphanumeric characters in the input document are replaced with blank space.
3. The wikifier computes n-grams of up to 10 words in the modified document.
4. All the n-grams, not present in the set of keyphrases in the knowledge-base, are discarded.
5. The remaining n-grams (keyphrases) are ranked based on their keyphraseness scores, and only the top-ranking N number of keyphrases are retained.
6. The local context of each of the retained keyphrases is obtained. The local context of a keyphrase is the words surrounding it for a predetermined window size. The window size is set to 50 according to the results published in Pedersen et al. [3]
7. For each of the keyphrases, its set of target entities is fetched from the knowledge-base and the local compatibility between each of article’s content and the keyphrases’ local context is computed.
8. The entity that has the maximum cosine similarity with the local context of the keyphrase wins.
9. The keyphrases are then anchored to their corresponding winning articles on Wikipedia

## 4. Results

[go back]

### 4.1. Result metrics: Precision and Recall

Precision and recall are widespread measures for evaluating the quality of the result. Let $K_r$ be the number of keyphrases linked correctly, $K_i$ be the number of irrelevant identified phrases, and let $K_u$ be the number of significant but unidentified phrases in the result. Then the precision and recall are defined as follows.

Precision, $P = \displaystyle{\frac{K_r}{K_r + K_i}}$

and,

Recall, $R = \displaystyle{\frac{K_r}{K_r + K_u}}$

Both precision and recall are measured in percentage. The quality of a wikifier could be considered perfect if both $P$ and $R$ are 100%. In practice, however, attempt to increase either of the two may adversely affect the other.

### 4.2. Sample document and result

The result of wikifying a document is discussed below. Significant phrases that are identified and linked correctly are highlighted in green , and both insignificant phrases and significant but incorrectly linked phrases are highlighted in red . Significant phrases that are left unrecognized are highlighted in blue and the color yellow marks a keyphrase that was identified and linked correctly, but that was not necessary for the document. Note: Significant phrases in the document for which no relevant articles exist in Wikipedia at all, are not considered to be unrecognized, for obvious reasons.

#### 4.2.1. Result

Delhi Police has said it arrested Sreesanth, Ajit Chandila and Ankeet Chavan - all Rajasthan Royals bowlers - for the alleged fulfilling of promises made to bookmakers during this year’s IPL Neeraj Kumar, the commissioner of Delhi Police, provided a detailed explanation of its investigation and the charges against the players but said it had no evidence to suggest any other player, administrator or team owner was involved. It has registered cases under Indian Penal Code section 420 and 120B, which deal with fraud, cheating and criminal conspiracy. The police has identified the three matches where the alleged spot-fixing happened: against Pune Warriors on May 6, Kings XI Punjab on May 9 and Mumbai Indians on May 15. Kumar said the deal was for the bowlers to concede a specified minimum number of runs in a pre-decided over. He explained in detail how the deals were struck, how the players allegedly indicated to the bookmakers that the deal was on, and how they went on to concede those number of runs. He said the police has recordings of those tapped phone conversations.

#### 4.2.1.1. Precision and Recall for this result

For this result, $K_r$ = 16, $K_i$=5, $K_u$= 4;

Precision P = $\displaystyle{\frac{16}{16+5}}$ = $76.1$ %

and

Recall R = $\displaystyle{\frac{16}{16+4}}$ = $80$ %

Further discussion on the result is available in the detailed report.

## 5. Future work

[go back]

The knowledge base has a wealth of keyphrases for recognizing many, if not, all of the significant phrases in an input document. For example, in the above result, the wikifier did not recognize 4 significant phrases, in spite of the knowledge base containing all of them.

The reason for this is that these keyphrases were ranked too low. Han et. al.[1] propose a method called collective entity-linking by which this problem could be solved. Their paper use a method called evidence propagation to form a strong evidence based on the evidence obtained from other entities in the document. Using this, even a keyphrase that had a very low initial rank could win over another high-ranked keyphrase based on its new score obtained from the evidence. This would also possibly prevent unnecessary phrases from being identified, since such a phrase will have less evidence to be a part of the document.

## 6. About the code

### 6.1. Creating the knowledge-base

The code for creating the knowledge-base is found in the location http://ad-svn.informatik.uni-freiburg.de/student-projects/ragavan-natarajan/knowledge-base/. The file README.txt in this folder contains the information on how to run the code. It is not repeated here for brevity's sake.

### 6.2. Code for the web application

The files constituting the web application is present in this folder. http://ad-svn.informatik.uni-freiburg.de/student-projects/ragavan-natarajan/java/web-app/. Here too, please refer to the README.txt for instructions. Test cases could be run by typing ant test from within this folder.

## Bibliography

[go back]

[1] Xianpei Han, Le Sun, and Jun Zhao. Collective entity linking in web text: a graph-based method.
pages 765–774, 2011. doi: 10.1145/2009916.2010019.

[2] Rada Mihalcea and Andras Csomai. Wikify!: linking documents to encyclopedic knowledge.
pages 233–242, 2007. doi: 10.1145/1321440.1321475.

[3] Ted Pedersen, Amruta Purandare, and Anagha Kulkarni. Name discrimination by clustering similar contexts.
pages 226–237, 2005. doi: 10.1007/978-3-540-30586-6 24.