Information Extraction

Information Extraction is a process which takes unseen texts as input and produces fixed-format, unambiguous data as output. This data may be used directly for display to users, or may be stored in a database or spreadsheet for later analysis, or may be used for indexing purposes in Information Retrieval (IR) applications.


Types of IE

There are five types of information extraction (or information extraction tasks) currently under R&D (as defined by the leading forum for this research, the Message Understanding Conferences [GS96]).

Named Entity recognition (NE)
Finds and classifies names, places, etc.

Coreference Resolution (CO)
Identifies identity relations between entities in texts.

Template Element construction (TE)
Adds descriptive information to NE results (using CO).

Template Relation construction (TR)
Finds relations between TE entities.

Scenario Template production (ST)
Fits TE and TR results into specified event scenarios.

From a user's point of view, NE, TE, TR and ST are the most relevant IE tasks (CO, as noted below, is necessary as an adjunct to the other tasks, but is of limited direct usefulness to the IE system user). NE, TE, TR and ST provide progressively higher-level information about texts. Each of the four types of IE has been the subject of rigorous performance evaluation, so it is possible to say quite precisely how well the current level of technology performs. Below we quote percentage figures quantifying performance levels; they should be interpreted as a combined measure of precision and recall (see the section on evaluation in [ARP95]).
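As an illustration of what a "combined measure of precision and recall" can look like, the sketch below computes the balanced F-measure, which is one common way of combining the two. This is an assumption about the kind of measure meant here, not a reproduction of the MUC scoring procedure; the function name, its parameters and the counts used are hypothetical.

```python
def f_measure(correct, proposed, expected, beta=1.0):
    """Combine precision and recall into a single score.

    correct  -- number of system answers that match the answer key
    proposed -- number of answers the system produced
    expected -- number of answers in the answer key
    beta     -- weight of recall relative to precision (1.0 = balanced)
    """
    if proposed == 0 or expected == 0:
        return 0.0
    precision = correct / proposed
    recall = correct / expected
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 50 entities proposed, 45 of them correct,
# 60 entities in the answer key (precision 0.90, recall 0.75).
score = f_measure(correct=45, proposed=50, expected=60)
print(f"F-measure: {score:.1%}")
```

A single figure of this kind rewards systems that balance the two quantities: a system with very high precision but poor recall (or the reverse) scores lower than one that is moderately good at both.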

The performance of each IE task, and the ease with which it may be developed, is to varying degrees dependent on:

Text type:
the kinds of texts we are working with, for example Wall Street Journal articles, or email messages, or HTML documents from the World Wide Web.
Domain:
the broad subject matter of those texts, e.g. financial news, or requests for technical support, or tourist information.
Scenario:
the particular event types that the IE user is interested in, for example mergers between companies, or problems experienced with a particular software package, or descriptions of how to locate parts of a city.

For example, a particular IE application might be configured to process financial news articles from a particular news provider and find information about mergers between companies and various other scenarios. The performance of the application would be predictable only for this conjunction of factors. If it were later required to extract facts from the love letters of Napoleon Bonaparte as published on wall posters in the 1871 Paris Commune, performance levels would no longer be predictable. Tailoring an IE system to new requirements is a task whose scale depends on the degree of variation in the three factors listed above.


Named Entity recognition

The simplest and most reliable IE technology is Named Entity recognition (NE). NE systems identify all the names of people, places, organisations, dates, and amounts of money.
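To make the task concrete, here is a minimal sketch of a rule-based recogniser that combines small name lists with regular expressions. It is a toy, not a description of how any MUC system works: the gazetteer contents, the patterns, the type labels and the function name `find_entities` are all assumptions made for illustration.

```python
import re

# Hypothetical mini-gazetteers; real systems use far larger lists
# together with contextual rules or statistical models.
GAZETTEERS = {
    "PERSON": ["Napoleon Bonaparte", "Yorick"],
    "LOCATION": ["Paris", "Lyon", "Denmark"],
    "ORGANISATION": ["IBM", "IBM Europe"],
}

# Hypothetical surface patterns for money amounts and month-year dates.
PATTERNS = {
    "MONEY": re.compile(r"\$\d[\d,]*(?:\.\d+)?(?:\s(?:million|billion))?"),
    "DATE": re.compile(
        r"\b(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December)\s+\d{4}\b"
    ),
}


def find_entities(text):
    """Return (start, end, type, string) tuples in order of appearance."""
    found = []
    for label, names in GAZETTEERS.items():
        for name in names:
            for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                found.append((m.start(), m.end(), label, m.group()))
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((m.start(), m.end(), label, m.group()))

    # Prefer longer matches, so "IBM Europe" wins over the nested "IBM".
    kept = []
    for span in sorted(found, key=lambda s: s[0] - s[1]):
        if all(span[1] <= k[0] or span[0] >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)


entities = find_entities(
    "IBM Europe paid $5 million for a Paris office in June 1996."
)
for start, end, label, string in entities:
    print(label, repr(string))
```

The sketch only finds names it has been told about in advance, which is one reason hand-built lists of this kind need rework when the subject matter of the texts changes.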

NE recognition as defined in MUC-7 can be performed at 96% accuracy. Given that human annotators do not perform to the 100% level (measured in MUC by inter-annotator comparisons), NE recognition can now be said to function at human performance levels, and applications of the technology are increasing rapidly as a result.

The process is weakly domain dependent, i.e. changing the subject matter of the texts being processed from financial news to other types of news would involve some changes to the system, and changing from news to scientific papers would involve quite large changes.


Coreference resolution

Coreference resolution (CO) involves identifying identity relations between entities in texts. These entities are both those identified by NE recognition and anaphoric references to those entities. For example, in

Alas, poor Yorick, I knew him well.

coreference resolution would tie "Yorick" with "him" (and "I" with Hamlet, if that information was present in the surrounding text).

This process is less relevant to users than other IE tasks (i.e. whereas the other tasks produce output that is of obvious utility for the application user, this task is more relevant to the needs of the application developer). CO technology can also be used to make links between documents. The main significance of this task, however, is as a building block for TE and ST. CO enables the association of descriptive information scattered across texts with the entities to which it refers. To continue the hackneyed Shakespeare example, coreference resolution might allow us to situate Yorick in Denmark.

CO resolution is an imprecise process, particularly when applied to anaphoric reference. Overall coreference scores are typically low, but this hides the difference between proper noun coreference identification (same object, different spelling or compounding, e.g. "IBM", "IBM Europe", "International Business Machines Ltd.", ...) and anaphora resolution; the former is a significantly easier problem, performed at about 95%.
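The easier half of the problem, linking variant forms of a proper noun, can be approximated with simple string heuristics. The sketch below is one such heuristic, assuming that two organisation names may corefer when they share a first word or when one begins with the acronym of the other. The suffix list and the function names are hypothetical, and a rule this crude will produce false matches on real text.

```python
import re

# Hypothetical list of company-form suffixes to ignore when comparing names.
CORPORATE_SUFFIXES = {"ltd", "inc", "corp", "plc", "co"}


def content_words(name):
    """Split a name into words, dropping suffixes such as 'Ltd.'."""
    words = re.findall(r"[A-Za-z]+", name)
    return [w for w in words if w.lower() not in CORPORATE_SUFFIXES]


def acronym(name):
    """Build an acronym from the initial letters of the content words."""
    return "".join(w[0].upper() for w in content_words(name))


def may_corefer(a, b):
    """Guess whether two organisation names denote the same entity."""
    wa, wb = content_words(a), content_words(b)
    if not wa or not wb:
        return False
    # Same leading word, e.g. "IBM" and "IBM Europe".
    if wa[0].lower() == wb[0].lower():
        return True
    # One name starts with the acronym of the other,
    # e.g. "IBM" and "International Business Machines Ltd.".
    return wa[0].upper() == acronym(b) or wb[0].upper() == acronym(a)


names = ["IBM", "IBM Europe", "International Business Machines Ltd."]
for other in names[1:]:
    print(names[0], "<->", other, may_corefer(names[0], other))
```

Nothing comparable works for anaphora: deciding what "him" refers to requires information about the surrounding text that surface string comparison cannot supply.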


Template Element production

The TE task builds on NE recognition and coreference resolution. In addition to locating and typing (i.e. classifying, or assigning to a type - personal name, date etc.) entities in documents, TE associates descriptive information with the entities.

A TE is essentially a database record, and could just as well be formatted for SQL store operations, or for reading into a spreadsheet, or (with some extra processing) for multilingual presentation.
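The database-record view can be sketched as follows, using an in-memory SQLite table. The field names (`name`, `entity_type`, `descriptor`), the table layout and the example values are assumptions chosen for illustration; they are not the slot names of the MUC template definition.

```python
import sqlite3
from dataclasses import dataclass, astuple


@dataclass
class TemplateElement:
    """A hypothetical, simplified template element."""
    name: str          # the entity's name as found by NE recognition
    entity_type: str   # e.g. PERSON, ORGANISATION
    descriptor: str    # descriptive information gathered via coreference


def store(elements):
    """Write template elements to an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE template_element "
        "(name TEXT, entity_type TEXT, descriptor TEXT)"
    )
    conn.executemany(
        "INSERT INTO template_element VALUES (?, ?, ?)",
        [astuple(e) for e in elements],
    )
    conn.commit()
    return conn


# Illustrative values only, continuing the Yorick example.
elements = [TemplateElement("Yorick", "PERSON", "situated in Denmark")]
conn = store(elements)
rows = conn.execute(
    "SELECT name, entity_type, descriptor FROM template_element"
).fetchall()
print(rows)
```

Because each element is a flat record, the same data could equally be written out as rows of a CSV file for use in a spreadsheet.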

As in NE recognition, the production of TEs is weakly domain dependent, i.e. changing the subject matter of the texts being processed from financial news to other types of news would involve some changes to the system, and changing from news to scientific papers would involve quite large changes.


Scenario Template extraction

Scenario templates (STs) are the prototypical outputs of IE systems. They tie together TE entities into event and relation descriptions. For example, TE may have identified Isabelle, Dominique and Francoise as people entities present in the Robert edition of Napoleon's love letters. ST might then identify facts such as that Isabelle moved to Paris in August 1802 from Lyon to be nearer to the little chap, that Dominique then burnt down Isabelle's apartment block and that Francoise ran off with one of Gerard Depardieu's ancestors.

ST is a difficult IE task. It is possible to increase precision at the expense of recall: we can develop ST systems that make few mistakes but miss quite a lot of occurrences of relevant scenarios. Alternatively, we can push up recall and miss fewer occurrences, but at the expense of making more mistakes.
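The trade-off can be illustrated with a small sketch. It assumes, purely for the sake of the example, that a system attaches a confidence score to each candidate scenario and reports only those above a threshold; the scores, the correctness flags and the function name are invented.

```python
# Hypothetical candidates: (confidence score, whether the scenario is correct).
CANDIDATES = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.60, False), (0.40, True), (0.30, False), (0.20, False),
]


def precision_recall(candidates, threshold):
    """Score the candidates that a system would report at this threshold."""
    accepted = [ok for score, ok in candidates if score >= threshold]
    total_relevant = sum(ok for _, ok in candidates)
    correct = sum(accepted)
    # Precision is undefined when nothing is reported; 0.0 is used here.
    precision = correct / len(accepted) if accepted else 0.0
    recall = correct / total_relevant if total_relevant else 0.0
    return precision, recall


strict = precision_recall(CANDIDATES, threshold=0.85)
lenient = precision_recall(CANDIDATES, threshold=0.35)
print("strict:  precision %.2f, recall %.2f" % strict)
print("lenient: precision %.2f, recall %.2f" % lenient)
```

With the strict threshold the system makes no mistakes but finds only half of the relevant scenarios; with the lenient one it finds all of them, but a third of what it reports is wrong.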

The ST task is both domain dependent and, by definition, tied to the scenarios of interest to the users. Note, however, that the results of NE and TE feed into ST.

[App99] Appelt, D.: An Introduction to Information Extraction. Artificial Intelligence Communications, 1999.
[CWG96] Cunningham, H. / Wilks, Y. / Gaizauskas, R.J.: New Methods, Current Trends and Software Infrastructure for NLP. In Proceedings of the Conference on New Methods in Natural Language Processing (NeMLaP-2), Bilkent University, Turkey, September 1996.
[GS96] Grishman, R. / Sundheim, B.: Message Understanding Conference - 6: A Brief History. In Proceedings of the 16th International Conference on Computational Linguistics, Copenhagen, June 1996.

For more information on this topic see also the relevant chapter in HLT-Survey.