1. Introduction
2. Knowledge-base creation
3. Wikification
4. Results
5. Future work
6. About the code
Wikification is the process of identifying the important phrases in a document and linking each of them to an appropriate Wikipedia article based on its context of occurrence. The important phrases in a document are also called keyphrases, a term similar to keyword, but unlike a keyword, a keyphrase can consist of one or more words. A wikifier is software that performs the wikification of a document. One such wikifier has been developed and made available here. This page discusses how it was developed from the ground up. A more detailed report is available here.
Wikification can be broadly divided into two stages: the offline creation of a knowledge base from a Wikipedia database dump, and the wikification of an input document using that knowledge base.
The Wikipedia database dump file containing the latest pages and articles can be downloaded from here. It contains only the current revisions - exactly what is needed - and no talk or user pages. The Perl module MediaWiki::DumpFile::FastPages, available on CPAN, has been used to parse the uncompressed database dump. It iterates over the pages in the dump sequentially, yielding the title and the content of each page.
In Wikipedia, free links are used to produce internal links between pages, which enable users to access information related to the article they are reading. They are created by enclosing a page-title in double square brackets. Thus, enclosing the word Chennai in double square brackets, like this, [[Chennai]], will cause the text to be anchored with a hyperlink to the respective Wikipedia article, i.e. Chennai, in this case.
Optionally, a vertical bar | can be added to customize the anchor text. For example, typing [[Chennai | the capital city of Tamilnadu]] would result in the text the capital city of Tamilnadu being anchored to the respective Wikipedia article, which in this case is the aforementioned article.
As mentioned before, a keyphrase can be composed of one or more words, but for reasons explained later in this section, further restrictions are imposed on keyphrases: a keyphrase is case-folded to lowercase, and every non-alphanumeric character in it is replaced with a single whitespace.
These restrictions on the keyphrase offer the following benefits:
The sets of target entities of two keyphrases that are essentially the same but differ only in letter case can then be combined. Consider two keyphrases, austin and Austin, each having its own set of referred entities. Let the referred entities of the keyphrase austin be Austin Island and Austin College, and let those of Austin be Austin College and Austin Motor Company.
Applying the rules, the sets of entities of the two keyphrases can then be merged, with the case-folded austin acting as the keyphrase and its referred entities being Austin Island, Austin College and Austin Motor Company.
Eliminating the non-alphanumeric characters in the keyphrases provides a uniform way of recognizing keyphrases in an input document at the time of wikification. Consider, for example, the following excerpt from an input document.
... Alaska’s residents grapple with changing climate ...
When applied to the input document, the rules for keyphrase extraction case-fold it to lowercase and replace each of the non-alphanumeric characters in it with a single whitespace, causing the above excerpt to become the following.
... alaska s residents grapple with changing climate ...
This makes it possible to recognize the word alaska in the document, provided such a keyphrase exists in the keyphrase vocabulary. Without these rules, the whole word would be Alaska’s, which may not have a matching keyphrase in the keyphrase vocabulary of the knowledge base.
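The normalization step can be sketched as follows. This is a minimal illustration of the rules above, assuming a helper called normalize; the wikifier's actual implementation may differ in detail.

public class KeyphraseNormalizer {

    // Case-fold to lowercase and replace every non-alphanumeric character
    // with a single whitespace, as described by the rules above.
    public static String normalize(String text) {
        return text.toLowerCase().replaceAll("[^a-z0-9]", " ").trim();
    }

    public static void main(String[] args) {
        // "Alaska’s residents" becomes "alaska s residents", so the
        // keyphrase "alaska" can still be matched against the vocabulary.
        System.out.println(normalize("Alaska’s residents grapple with changing climate"));
    }
}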
For free links without a vertical bar, i.e. without |, the enclosed text acts both as the keyphrase and as the target entity. For example, for the free link [[Chennai]], the extracted keyphrase and target entity are the following.
chennai → Chennai
For free links with a vertical bar, the text before the bar, i.e. the entity text, acts as the target entity, whereas the text that follows it acts as the keyphrase. In addition, the entity text is also used as a keyphrase, just as it is in a free link without a vertical bar.
For example, for the free link [[Chennai|the capital city of Tamilnadu]], the following keyphrase - target entity pairs are extracted.
the capital city of tamilnadu → Chennai
chennai → Chennai
Doing so helps improve the keyphrase vocabulary, which in turn increases the possibility of important phrases in a document being recognized at the time of wikification.
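The extraction of keyphrase - target entity pairs from free links can be sketched as follows, reusing the hypothetical normalize helper from the previous sketch. The regular expression and the simplified one-entity-per-keyphrase map are illustrative assumptions.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FreeLinkExtractor {

    // Matches [[Target]] and [[Target|anchor text]]; group 1 is the target
    // entity, group 2 (if present) is the anchor text.
    private static final Pattern FREE_LINK =
            Pattern.compile("\\[\\[([^\\]|]+)(?:\\|([^\\]]+))?\\]\\]");

    public static Map<String, String> extractPairs(String wikiText) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher m = FREE_LINK.matcher(wikiText);
        while (m.find()) {
            String entity = m.group(1).trim();
            // The entity text always contributes a keyphrase of its own.
            pairs.put(KeyphraseNormalizer.normalize(entity), entity);
            // If a vertical bar is present, the anchor text becomes an
            // additional keyphrase for the same target entity.
            if (m.group(2) != null) {
                pairs.put(KeyphraseNormalizer.normalize(m.group(2)), entity);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Prints {chennai=Chennai, the capital city of tamilnadu=Chennai}
        System.out.println(extractPairs("[[Chennai|the capital city of Tamilnadu]]"));
    }
}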
As the parser iterates over the article titles, keyphrases are extracted from the titles, and the titles are added to the sets of target entities of those keyphrases. Any information enclosed in parentheses is discarded. For example, if a page title is Casablanca (film), then the following keyphrase − target entity pair is extracted from it.
casablanca → Casablanca (film)
If another page title, such as Casablanca (volcano), is encountered, then another keyphrase − target entity pair is generated with the same keyphrase, i.e. casablanca, causing it to have two different target entities.
casablanca → Casablanca (film), Casablanca (volcano)
It is necessary to understand why the text within parentheses is discarded. If it were not, the generated keyphrase would, for example, be casablanca film. It is very unlikely that an input document contains this exact word sequence. By discarding the text within parentheses, the chance of a phrase in an input document being found in the keyphrase vocabulary increases. Since entities are disambiguated in a later stage of wikification, this does not cause any problems. Even if the word sequence casablanca film appeared in an input document, the wikifier would still be able to identify the phrase casablanca in it.
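Discarding the parenthesised qualifier before normalizing a title can be sketched as follows; the class and method names are again illustrative assumptions.

public class TitleKeyphrases {

    // Discard any parenthesised qualifier from the title, then normalize it
    // with the rules described earlier (helper from the first sketch).
    static String keyphraseFromTitle(String title) {
        return KeyphraseNormalizer.normalize(title.replaceAll("\\s*\\([^)]*\\)", ""));
    }

    public static void main(String[] args) {
        // Both titles map to the same keyphrase "casablanca",
        // so the keyphrase ends up with two target entities.
        System.out.println(keyphraseFromTitle("Casablanca (film)"));
        System.out.println(keyphraseFromTitle("Casablanca (volcano)"));
    }
}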
The keyphrase obtained from the title of a redirect page cannot use the title as a target entity, since the title redirects to another page on Wikipedia. Instead, the keyphrase is assigned to the title found in the body of the redirect page (hereinafter referred to as the redirect target). Additionally, a keyphrase is extracted from the redirect target and assigned to it. For example, the title Air Transport on Wikipedia redirects to the title Aviation. In this case, the following keyphrase − target entity pairs are extracted.
air transport → Aviation
aviation → Aviation
However, in order to avoid chains of redirects, no keyphrase − target entity pairs are extracted if the redirect target itself redirects to yet another page.
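A sketch of this redirect handling is shown below. It relies on the MediaWiki convention that the body of a redirect page begins with #REDIRECT [[Target]]; the set of redirect titles, used to detect chain redirects, is assumed to have been collected in a first pass over the dump, and the map layout is illustrative.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RedirectExtractor {

    // MediaWiki redirect pages start with "#REDIRECT [[Target]]" (case-insensitive).
    private static final Pattern REDIRECT =
            Pattern.compile("(?i)^\\s*#REDIRECT\\s*\\[\\[([^\\]|#]+)");

    /**
     * Adds the keyphrase - target entity pairs for a redirect page, or does
     * nothing if the page is not a redirect or its target is itself a redirect
     * (to avoid chains of redirects).
     */
    static void handleRedirect(String title, String body,
                               Set<String> redirectTitles,
                               Map<String, Set<String>> keyphraseToEntities) {
        Matcher m = REDIRECT.matcher(body);
        if (!m.find()) {
            return;                      // not a redirect page
        }
        String target = m.group(1).trim();
        if (redirectTitles.contains(target)) {
            return;                      // chain redirect: ignore entirely
        }
        // e.g. "air transport" -> Aviation and "aviation" -> Aviation
        addPair(keyphraseToEntities, title, target);
        addPair(keyphraseToEntities, target, target);
    }

    private static void addPair(Map<String, Set<String>> map, String text, String entity) {
        map.computeIfAbsent(KeyphraseNormalizer.normalize(text), k -> new HashSet<>())
           .add(entity);
    }
}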
A keyphrase obtained from the title of a disambiguation page has, as its targets, each of the articles listed on that page. For example, the title Casablanca (disambiguation) is that of a disambiguation page on Wikipedia, containing the following entities.
[[Casablanca (volcano)]]
[[Casablanca Records]]
[[Casablanca (film)]]
As before, the keyphrase casablanca is extracted from the disambiguation title, and each of the entities on the disambiguation page is assigned as a target of this keyphrase, as shown below.
casablanca → Casablanca (volcano), Casablanca Records, Casablanca (film)
It is possible for an entry in a Wikipedia disambiguation page to link to another disambiguation page. For example, the page Austin (disambiguation) on Wikipedia contains the following entry, which is the title of another disambiguation page.
[[Austin Station (disambiguation)]]
A disambiguation entity cannot itself be assigned as the target of a keyphrase. Therefore, the page of the nested disambiguation entity is fetched, and all the non-disambiguation entities in its content are assigned to this keyphrase. For example, if the page Austin Station (disambiguation) contained the following content,
[[Austin (Amtrak station)]]
[[Austin (CTA Blue Line)]]
[[Austin (CTA Green Line)]]
then the keyphrase austin, obtained from the title Austin (disambiguation), would get the following target entities.

austin → Austin (Amtrak station), Austin (CTA Blue Line), Austin (CTA Green Line)
In addition, as with any page title, the keyphrase austin station also gets all of these target entities, as shown below.
austin station → Austin (Amtrak station), Austin (CTA Blue Line), Austin (CTA Green Line)
It is possible for a nested disambiguation page to contain further links to disambiguation pages. Such entries are ignored in order to avoid following a possibly endless chain of disambiguation pages.
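A sketch of the disambiguation handling is shown below. Recognizing disambiguation pages by the suffix (disambiguation) in the title and the PageLookup interface for fetching a page's content are illustrative assumptions; nested disambiguation entries are expanded exactly one level, and anything deeper is ignored.

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DisambiguationExtractor {

    private static final Pattern FREE_LINK = Pattern.compile("\\[\\[([^\\]|]+)");

    static boolean isDisambiguation(String title) {
        return title.endsWith("(disambiguation)");
    }

    /** Collects the target entities for the keyphrase of a disambiguation page. */
    static Set<String> collectTargets(String content, boolean expandNested,
                                      PageLookup lookup) {
        Set<String> targets = new LinkedHashSet<>();
        Matcher m = FREE_LINK.matcher(content);
        while (m.find()) {
            String entity = m.group(1).trim();
            if (!isDisambiguation(entity)) {
                targets.add(entity);                 // ordinary entry
            } else if (expandNested) {
                // Nested disambiguation entry: fetch its page and take only its
                // non-disambiguation entries; deeper nesting is ignored.
                targets.addAll(collectTargets(lookup.contentOf(entity), false, lookup));
            }
        }
        return targets;
    }

    /** Hypothetical interface for looking up the content of a page by its title. */
    interface PageLookup {
        String contentOf(String title);
    }
}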
In addition to the keyphrases, several other pieces of information are extracted from the database dump. A list of all article titles and their contents is extracted. As during keyphrase extraction, all non-alphanumeric characters are removed from the article contents at the time of extraction. The content of an article is used to compute its similarity to an input document, as discussed later. Moreover, for each article, a set of incoming links is maintained. The incoming links of an article are the set of all articles on Wikipedia that have free links to this article. This information is used to compute the semantic relatedness between two articles on Wikipedia.
The amount of data in the knowledge-base is summarized below.
Each of the 7 million articles has a title and content, and the total size of the contents is approximately 27 GB uncompressed. Each of the 10.5 million keyphrases has at least one target entity, and every word in the vocabulary has its inverse document frequency score.
All this information is required at the time of wikification, and it is therefore important to design a strategy for its efficient retrieval. The content of an article, as mentioned earlier, is needed to compute its similarity to the input document. However, storing such a large amount of data, for example the 27 GB of article contents, entirely in the main memory is avoided.
When the knowledge base is created, each distinct component of it is stored one item per line. For example, the keyphrase vocabulary is stored in a file with each line containing one keyphrase. Paired data are stored in separate files but at the same line number in both files. Consider, for example, the following keyphrase and its set of target entities.
austin station → Austin (Amtrak station), Austin (CTA Blue Line)
If the keyphrase austin station is stored at, say, line number 10 of the file keyphrases.txt, then its target entities Austin (Amtrak station) and Austin (CTA Blue Line) are stored in the same line number, i.e. line number 10 of the file, say, entities.txt.
Doing so enables the entities of a keyphrase to be accessed from the file based on the line number at which the keyphrase occurs. The same holds for article titles and their respective contents.
Now, the problem of accessing a random line in a text file needs to be addressed. Issuing a UNIX command such as sed -n <line>p <file> takes very long to fetch a line near the end of a very large file, because the complexity of this approach is O(s), where s is the size of the file in bytes (the command must scan the file from the beginning, counting newline characters, until the desired line number is reached).
The idea presented here exploits both the static nature of the knowledge base and the ability of the file system to read data at a random position in a file very efficiently.
For each line in the input file, an index entry is created, containing the byte offset at which the line starts. For example, the index entry for the first line of the file is 0. In this fashion, for a file with N lines, N index entries are created and stored in another file, namely the index file.
Given the static nature of the information in the knowledge base, the index seldom needs to be updated. Any line in the file can now be read efficiently with the help of the index. In Java, the class java.io.RandomAccessFile has a seek(long) method to move to an arbitrary byte position in the file, specified by its argument.
The index file for the article contents is about 80 MB, which is only 0.28% of the size of the article contents file, and it can therefore easily be kept in the main memory, in an array, for example. On a machine with a Core i7-3720QM CPU and a SATA 6 Gb/s SSD as secondary storage, reading 1000 articles at random from the disk using this approach took about 0.5 ms, with an average size of about 4 KB per article.
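A minimal sketch of this index-based retrieval is shown below, assuming the index fits into a long[] array in main memory (as the 80 MB figure above suggests). The class and method names, and the file layout of one item per line, are illustrative.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class LineIndex {

    /** Builds the index: the byte offset at which each line of the file starts. */
    static long[] buildIndex(String file) throws IOException {
        List<Long> offsets = new ArrayList<>();
        offsets.add(0L);                            // the first line starts at byte 0
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(file))) {
            long pos = 0;
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n') {
                    offsets.add(pos);               // the next line starts right after this newline
                }
            }
        }
        long[] index = new long[offsets.size()];
        for (int i = 0; i < index.length; i++) {
            index[i] = offsets.get(i);
        }
        return index;
    }

    /**
     * Reads the line with the given (0-based) line number. Because keyphrases.txt
     * and entities.txt are aligned by line number, the same lookup also retrieves
     * the entities belonging to a keyphrase.
     */
    static String readLine(RandomAccessFile file, long[] index, int lineNumber)
            throws IOException {
        file.seek(index[lineNumber]);               // jump directly to the start of the line
        return file.readLine();                     // note: readLine() assumes Latin-1 bytes
    }
}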
The keyphraseness [2] of a phrase measures the significance of the phrase in any document. This is the approach currently followed by the wikifier for ranking keyphrases: the keyphrases found in the input document are ranked in decreasing order of their keyphraseness scores. The keyphraseness of a phrase $p$ is defined as follows.
$\text{keyphraseness}(p) = \displaystyle{\frac{|p_{\text{link}}|}{\text{DF}(p)}}$
where $|p_{\text{link}}|$ is the number of articles on Wikipedia in which the phrase $p$ appears as a link, and $\text{DF}(p)$ is the document frequency of $p$, i.e. the number of articles in which $p$ appears at all. It holds that $|p_{\text{link}}| \leq \text{DF}(p)$, and hence,
$0 \leq \text{keyphraseness}(p) \leq 1$
A phrase with a keyphraseness of 1 is linked wherever it appears, and hence must be a significant phrase, whereas a phrase with a lower keyphraseness is seldom linked, and hence is a keyphrase of lesser significance. Therefore, the phrases in the input document are sorted in descending order of their respective keyphraseness scores.
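The ranking step can be sketched as follows; the maps holding the link counts and document frequencies are hypothetical stand-ins for lookups into the knowledge base.

import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class KeyphrasenessRanker {

    /** keyphraseness(p) = |p_link| / DF(p); both counts come from the knowledge base. */
    static double keyphraseness(long linkCount, long documentFrequency) {
        return documentFrequency == 0 ? 0.0 : (double) linkCount / documentFrequency;
    }

    /**
     * Sorts the candidate phrases found in the input document in descending
     * order of keyphraseness.
     */
    static void rank(List<String> candidates,
                     Map<String, Long> linkCounts,
                     Map<String, Long> documentFrequencies) {
        candidates.sort(Comparator.comparingDouble(
                (String p) -> keyphraseness(linkCounts.getOrDefault(p, 0L),
                                            documentFrequencies.getOrDefault(p, 0L)))
                .reversed());
    }
}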
A keyphrase can refer to several target entities. The entities are ranked based on their local compatibility scores. The compatibility between a keyphrase $k$ and an entity $e$, denoted by $\text{CP}(k, e)$ is defined as follows.
$\text{CP}(k, e) = \displaystyle{\frac{\vec{m}\cdot\vec{e}}{|\vec{m}||\vec{e}|}}$
where $\vec{m}$ is the vector of $tf \times idf$ scores of the words in the local context of the keyphrase in the input document, $\vec{e}$ is the vector of $tf \times idf$ scores of the words in the entity's article, and $|\vec{m}|$ and $|\vec{e}|$ denote the norms of the vectors $\vec{m}$ and $\vec{e}$, respectively.
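The compatibility computation is a plain cosine similarity between two sparse $tf \times idf$ vectors. The following is a minimal sketch, assuming the vectors are represented as maps from words to weights.

import java.util.Map;

public class Compatibility {

    /**
     * Cosine similarity between two sparse tf-idf vectors, each represented
     * as a map from a word to its tf x idf weight.
     */
    static double cosine(Map<String, Double> m, Map<String, Double> e) {
        // Iterate over the smaller vector for efficiency.
        Map<String, Double> small = m.size() <= e.size() ? m : e;
        Map<String, Double> large = small == m ? e : m;
        double dot = 0.0;
        for (Map.Entry<String, Double> entry : small.entrySet()) {
            Double other = large.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        double normM = norm(m), normE = norm(e);
        return (normM == 0.0 || normE == 0.0) ? 0.0 : dot / (normM * normE);
    }

    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }
}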
Precision and recall are widespread measures for evaluating the quality of the result. Let $K_r$ be the number of keyphrases linked correctly, $K_i$ the number of irrelevant phrases identified, and $K_u$ the number of significant but unidentified phrases in the result. Precision and recall are then defined as follows.

Precision, $P = \displaystyle{\frac{K_r}{K_r + K_i}}$
and,
Recall, $R = \displaystyle{\frac{K_r}{K_r + K_u}}$
Both precision and recall are expressed as percentages. The quality of a wikifier could be considered perfect if both $P$ and $R$ are 100%. In practice, however, attempts to increase one of the two may adversely affect the other.
The result of wikifying a document is discussed below. Significant phrases that were identified and linked correctly are highlighted in green; insignificant phrases and significant but incorrectly linked phrases are highlighted in red; significant phrases that were left unrecognized are highlighted in blue; and yellow marks a keyphrase that was identified and linked correctly but was not necessary for the document. Note: significant phrases in the document for which no relevant article exists on Wikipedia at all are not considered unrecognized, for obvious reasons.
Delhi Police has said it arrested Sreesanth, Ajit Chandila and Ankeet Chavan - all Rajasthan Royals bowlers - for the alleged fulfilling of promises made to bookmakers during this year’s IPL. Neeraj Kumar, the commissioner of Delhi Police, provided a detailed explanation of its investigation and the charges against the players but said it had no evidence to suggest any other player, administrator or team owner was involved. It has registered cases under Indian Penal Code section 420 and 120B, which deal with fraud, cheating and criminal conspiracy. The police has identified the three matches where the alleged spot-fixing happened: against Pune Warriors on May 6, Kings XI Punjab on May 9 and Mumbai Indians on May 15. Kumar said the deal was for the bowlers to concede a specified minimum number of runs in a pre-decided over. He explained in detail how the deals were struck, how the players allegedly indicated to the bookmakers that the deal was on, and how they went on to concede those number of runs. He said the police has recordings of those tapped phone conversations.
For this result, $K_r = 16$, $K_i = 5$, $K_u = 4$.
Precision P = $\displaystyle{\frac{16}{16+5}}$ = $76.1$ %
and
Recall R = $\displaystyle{\frac{16}{16+4}}$ = $80$ %
Further discussion on the result is available in the detailed report.
The knowledge base has a wealth of keyphrases for recognizing many, if not all, of the significant phrases in an input document. For example, in the above result, the wikifier did not recognize 4 significant phrases, in spite of the knowledge base containing all of them.
The reason for this is that these keyphrases were ranked too low. Han et al. [1] propose a method called collective entity linking by which this problem could be solved. Their paper uses a technique called evidence propagation to accumulate strong evidence for an entity from the evidence provided by the other entities in the document. Using this, even a keyphrase with a very low initial rank could win over a higher-ranked keyphrase based on its new, evidence-based score. This would also help prevent unnecessary phrases from being identified, since such a phrase would have little evidence of belonging to the document.
The code for creating the knowledge base can be found at http://ad-svn.informatik.uni-freiburg.de/student-projects/ragavan-natarajan/knowledge-base/. The file README.txt in that folder contains the information on how to run the code; it is not repeated here for brevity's sake.
[1] Xianpei Han, Le Sun, and Jun Zhao.
Collective entity linking in web text: a graph-based method.
pages 765–774, 2011. doi: 10.1145/2009916.2010019.
[2] Rada Mihalcea and Andras Csomai.
Wikify!: linking documents to encyclopedic knowledge.
pages 233–242, 2007. doi: 10.1145/1321440.1321475.
[3] Ted Pedersen, Amruta Purandare, and Anagha Kulkarni.
Name discrimination by clustering similar contexts.
pages 226–237, 2005. doi: 10.1007/978-3-540-30586-6_24.