Semantic Annotation

In the semantic web context, web documents are marked up with metadata: they are manually annotated with web-based knowledge representation languages such as RDF and DAML+OIL to describe the document's content. Ongoing projects along these lines include SHOE [1], COHSE [2], and OntoAnnotate [3], all of which aim to motivate people to richly annotate electronic documents, turning them into a machine-understandable format, and to develop and spread annotation-aware applications such as content-based information presentation and retrieval.

From a somewhat different angle, automatic semantic annotation has developed within language technology in recent years in connection with more integrated tasks like information extraction. Natural language applications, such as information extraction and machine translation, require a certain level of semantic analysis. An important part of this process is semantic tagging: the annotation of each content word with a semantic category. Semantic categories are assigned on the basis of a semantic lexicon such as WordNet for English [4] or EuroWordNet, which links words in several European languages through a common inter-lingua of concepts.

Especially from a domain-specific point of view, these separate developments now seem to converge on a common goal of relating textual units in documents to information organized in structured ways. In the following sections we take a closer look at some of the semantic resources that are available and their use in semantic tagging. Special emphasis is also given to the important subtask of sense disambiguation, which is needed whenever a word or term corresponds to more than one possible semantic class. Finally, semantic tagging of terms (and of relations between terms) is illustrated by an example from the MuchMore annotation.

Semantic Resources

Semantic knowledge is captured in resources like dictionaries, thesauri, and semantic networks, all of which express, either implicitly or explicitly, an ontology of the world in general or of more specific domains, such as medicine. They can be roughly distinguished into the following three groups:


Thesauri

Roget is a thesaurus of English words and phrases. It groups words into synonym categories, or concepts, and covers antonyms as well as synonyms. At the top level, words are organized into six classes:

Words Expressing Abstract Relations
Words Relating To Space
Words Relating To Matter
Words Relating To The Intellectual Faculties; Formation and Communication of Ideas
Words Relating To The Voluntary Powers; Individual And Intersocial Volition
Words Relating To The Sentiment and Moral Powers

A sample synonym group, for the concept Feeling, is:

warmth, glow, unction, vehemence;
fervor, fervency;
heartiness, cordiality;
earnestness, eagerness;
empressment, gush, ardor, zeal, passion, ...

MeSH (Medical Subject Headings) is a thesaurus for indexing articles and books in the medical domain, which may then be used for searching MeSH-indexed databases [5]. MeSH provides for each term a number of term variants that refer to the same concept. It currently includes a vocabulary of over 250,000 terms. The following is a sample entry for the term gene library (MH is the term itself, ENTRY are term variants):

MH = Gene Library
ENTRY = Bank, Gene
ENTRY = Banks, Gene
ENTRY = DNA Libraries
ENTRY = Gene Banks
ENTRY = Gene Libraries
ENTRY = Libraries, DNA
ENTRY = Libraries, Gene
ENTRY = Library, DNA
ENTRY = Library, Gene
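
Such an entry can be read into a simple variant-to-heading index for lookup. The sketch below parses the MH/ENTRY format exactly as shown in the sample above (the parsing is inferred from this sample, not from the official MeSH distribution format):

```python
def build_mesh_index(entry_text):
    """Map each term variant (and the main heading itself) to its MeSH heading."""
    index = {}
    heading = None
    for line in entry_text.strip().splitlines():
        field, _, value = line.partition(" = ")
        field, value = field.strip(), value.strip()
        if field == "MH":
            heading = value
            index[value.lower()] = heading
        elif field == "ENTRY" and heading is not None:
            index[value.lower()] = heading
    return index

sample = """\
MH = Gene Library
ENTRY = Bank, Gene
ENTRY = DNA Libraries
ENTRY = Library, Gene"""

index = build_mesh_index(sample)
print(index["dna libraries"])  # -> Gene Library
```

An index of this kind is all that is needed to normalize the term variants found in text to a single concept before searching a MeSH-indexed database.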

Semantic Lexicons

WordNet has primarily been designed as a computational account of the human capacity for linguistic categorization. It therefore covers a rather extensive set of semantic classes (called synsets), currently over 90,000. Synsets are collections of synonyms, grouping together lexical items according to meaning similarity. For instance, a board and a plank are similar lexical items and can thus be grouped together in the synset {board, plank}. At the same time, however, board also refers to a group of people, which may be represented by the synset {board, committee}. Synsets thus define lexical meaning implicitly, rather than through explicit definitions.

Synsets range from the very specific to the very general. Very specific synsets typically cover only a small number of lexical items, while very general ones tend to cover many. The following example for 'tree' illustrates the use of the hyponymy relation in WordNet. The word 'tree' has two meanings, roughly corresponding to the class of plants and that of diagrams, each with its own hierarchy of classes that are included in more general super-classes:

09396070 tree 0
  09395329 woody_plant 0 ligneous_plant 0
    09378483 vascular_plant 0 tracheophyte 0
      00008864 plant 0 flora 0 plant_life 0
        00002086 life_form 0 organism 0 being 0 living_thing 0
          00001740 entity 0 something 0

10025462 tree 0 tree_diagram 0
  09987563 plane_figure 0 two-dimensional_figure 0
    09987377 figure 0
      00015185 shape 0 form 0
        00018604 attribute 0
          00013018 abstraction 0
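
A chain like this can be modelled as a toy lookup structure. The sketch below abridges the plant sense shown above into a plain dictionary and walks from a synset up to the top of the hierarchy; with a real WordNet installation one would instead query the database itself, for instance via NLTK's wordnet interface:

```python
# Toy hypernym links, abridged from the WordNet chain for the plant
# sense of 'tree' shown above (None marks the top of the hierarchy).
hypernym = {
    "tree": "woody_plant",
    "woody_plant": "vascular_plant",
    "vascular_plant": "plant",
    "plant": "life_form",
    "life_form": "entity",
    "entity": None,
}

def hypernym_chain(synset):
    """Follow hypernym links from a synset up to the most general class."""
    chain = []
    while synset is not None:
        chain.append(synset)
        synset = hypernym[synset]
    return chain

print(" -> ".join(hypernym_chain("tree")))
# tree -> woody_plant -> vascular_plant -> plant -> life_form -> entity
```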

EuroWordNet is a multilingual semantic lexicon for several European languages, structured similarly to WordNet. Each language-specific (Euro)WordNet is linked to all others through the Inter-Lingual-Index (ILI), which is based on WordNet 1.5. Via this index the languages are interconnected, so that it is possible to move from a word in one language to similar words in any of the other languages in the EuroWordNet semantic lexicon.
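
The ILI pivot can be illustrated with a minimal sketch. The lexicon entries below are invented for illustration, and real EuroWordNet records link whole synsets rather than single words, but the lookup pattern is the same: map a word to its ILI concept, then collect the target-language words linked to that concept.

```python
# Each language maps words to ILI concept identifiers; the ILI acts as
# the pivot between languages. (Illustrative entries only.)
lexicons = {
    "en": {"tree": "ili-1", "board": "ili-2"},
    "nl": {"boom": "ili-1", "plank": "ili-2"},
    "de": {"Baum": "ili-1", "Brett": "ili-2"},
}

def translate(word, src, tgt):
    """Find target-language words linked to the same ILI concept."""
    ili = lexicons[src].get(word)
    if ili is None:
        return []
    return [w for w, concept in lexicons[tgt].items() if concept == ili]

print(translate("tree", "en", "de"))  # -> ['Baum']
```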

Semantic Networks

UMLS (the Unified Medical Language System) is one of the most extensive semantic resources available. It is based in part on the MeSH thesaurus and is specific to the medical domain. UMLS integrates linguistic, terminological and semantic information in three corresponding parts: the Specialist Lexicon, the Metathesaurus and the Semantic Network.

The Metathesaurus is a multilingual thesaurus that groups term variants together that correspond to the same concept, for instance the following term variants in several languages for the concept C0019682 (HIV):

C0019682 ENG HIV
C0019682 ENG Human Immunodeficiency Virus
C0019682 ENG Virus, Human Immunodeficiency
C0019682 GER HIV
C0019682 GER Humanes T-Zell-lymphotropes Virus Typ III

The Semantic Network organises all concepts in the Metathesaurus into 134 semantic types and 54 relations between semantic types. Relations between semantic types are represented in the form of triplets, with two semantic types linked by one or more relations:

Pharmacologic Substance affects Pathologic Function
Pharmacologic Substance causes Pathologic Function
Pharmacologic Substance complicates Pathologic Function
Pharmacologic Substance diagnoses Pathologic Function
Pharmacologic Substance prevents Pathologic Function
Pharmacologic Substance treats Pathologic Function
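
Such triplets lend themselves to straightforward querying. The following sketch stores the sample relations above as tuples and retrieves all relations holding between two given semantic types:

```python
# Semantic Network relations as (type, relation, type) triplets,
# taken from the sample above.
triplets = [
    ("Pharmacologic Substance", "affects", "Pathologic Function"),
    ("Pharmacologic Substance", "causes", "Pathologic Function"),
    ("Pharmacologic Substance", "complicates", "Pathologic Function"),
    ("Pharmacologic Substance", "diagnoses", "Pathologic Function"),
    ("Pharmacologic Substance", "prevents", "Pathologic Function"),
    ("Pharmacologic Substance", "treats", "Pathologic Function"),
]

def relations_between(type1, type2):
    """All relations linking two semantic types, in listed order."""
    return [rel for (a, rel, b) in triplets if a == type1 and b == type2]

print(relations_between("Pharmacologic Substance", "Pathologic Function"))
# -> ['affects', 'causes', 'complicates', 'diagnoses', 'prevents', 'treats']
```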

CYC is a semantic network of over 1,000,000 manually defined rules that cover a large part of common sense knowledge about the world [6]. For example, CYC knows that trees are usually outdoors, or that people who have died stop buying things. Each concept in this semantic network is defined as a constant, which can represent a collection (e.g. the set of all people), an individual object (e.g. a particular person), a word (e.g. the English word person), a quantifier (e.g. there exist), or a relation (e.g. predicate, function, slot, attribute). Consider for instance the entry for the predicate #$mother:

#$mother:
    (#$mother ANIM FEM)
    isa: #$FamilyRelationSlot #$BinaryPredicate
This says that the predicate #$mother takes two arguments, the first of which must be an element of the collection #$Animal, and the second of which must be an element of the collection #$FemaleAnimal.
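
Argument constraints of this kind can be checked mechanically. The sketch below is a minimal stand-in for CYC's machinery, not its actual API: the collections, their members, and the constraint table are invented illustrations of the #$mother entry above.

```python
# Illustrative collections standing in for CYC's #$Animal and #$FemaleAnimal.
collections = {
    "Animal": {"Rex", "Daisy", "Tom"},
    "FemaleAnimal": {"Daisy"},
}

# Per-predicate argument-type constraints, as in (#$mother ANIM FEM).
arg_constraints = {"mother": ("Animal", "FemaleAnimal")}

def well_formed(predicate, *args):
    """Check arity and that each argument belongs to its required collection."""
    constraints = arg_constraints[predicate]
    if len(args) != len(constraints):
        return False
    return all(arg in collections[coll] for arg, coll in zip(args, constraints))

print(well_formed("mother", "Rex", "Daisy"))  # -> True
print(well_formed("mother", "Rex", "Tom"))    # -> False: Tom is no FemaleAnimal
```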

Further semantic networks used in semantic tagging are: Mikrokosmos [7] and Sensus [8].

Sense Disambiguation

Most words have more than one interpretation, or sense. If natural language were completely unambiguous, there would be a one-to-one relationship between words and senses.

In fact, things are much more complicated, because for most words not even a fixed number of senses can be given. Therefore, only in certain circumstances, and depending on what exactly we mean by sense, can we give restricted solutions to the problem of Word Sense Disambiguation (WSD).

WSD involves two parts: a semantic lexicon that associates words with sets of possible semantic classes (i.e. senses), and a method of associating (annotating, tagging) occurrences of these words with one or more of their senses. The systems and algorithms developed for this cover the full spectrum of methods from natural language processing, artificial intelligence and, more recently, machine learning. For our purposes here, we may group them as follows: knowledge-based, hybrid, and empirical.

In knowledge-based approaches, the construction of the tag set (the senses used and their association with word types in the semantic lexicon) and the tagging (disambiguation between possible senses and association of the preferred sense with a given word token) are both supervised. These approaches use small but deep handcrafted lexicons to analyse a small number of examples in a non-robust way; that is, the systems can handle only certain input (see for instance [9] [10] [11] [12]). All of them rely on pre-coded, domain-specific knowledge, which is the heaviest cost factor in work on WSD, yet indispensable (the knowledge acquisition bottleneck). Typically, handcrafted rules are constructed based on given examples; there is no automatic training of the system.

In hybrid approaches, the construction of the tag set is supervised, but the training for the tagging can be either supervised, using manually annotated corpora, or unsupervised, using unannotated corpora. These approaches combine hand-crafted knowledge bases with empirical data derived from large corpora: they use large-scale, shallower, hand-crafted lexicons (WordNet, Roget, lexical database versions of standard English dictionaries such as LDOCE or OALD) to analyse text in a robust way, that is, the systems can handle free, naturally occurring text. Most of these systems became possible thanks to technological advances, using large-scale machine-readable dictionaries [13], thesauri such as Roget's [14], and computational dictionaries like WordNet [15], [16].
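
The dictionary-based idea behind [13] can be sketched as a much-simplified Lesk procedure: each sense of a word is scored by the word overlap between its dictionary gloss and the context of the ambiguous occurrence. The mini-glosses below are invented for illustration:

```python
def lesk(word, context, glosses):
    """Pick the sense whose gloss shares the most words with the context."""
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses[word].items():
        overlap = len(context_words & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Invented mini-glosses for two senses of 'board'.
glosses = {
    "board": {
        "plank": "a stout length of sawn timber wood",
        "committee": "a group of people who manage or direct an organization",
    }
}

print(lesk("board", "the board of directors is a group of people", glosses))
# -> committee
```

Real implementations use full dictionary definitions, stop-word filtering and stemming, but the overlap principle is the same.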

In empirical approaches, the construction of the tag set and the training for the tagging are both unsupervised. These approaches use no external knowledge base at all, but instead derive the tag set itself from the corpus. This might be called self-organizing WSD: it does without a pre-defined set of alternative senses, inferring them instead by working, as it were, in the opposite direction. A corpus is used to classify words based solely on patterns of occurrence, and the resulting clusters are presumed to represent senses. The process consists of two stages: (1) clustering the occurrences of a word into a number of categories and (2) assigning a sense to each category. This idea was first discussed in [17]. The sense-labeling step can be dispensed with if the results are only used machine-internally; to stress this subtle but important difference, the method is called Word Sense Discrimination in [18]. A notable problem with such methods, however, is the close dependence of the resulting classification on the training corpus and on the choice of clustering granularity.
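
Stage (1), clustering occurrences by their contexts, can be sketched as follows. This greedy overlap-based grouping is a deliberately minimal stand-in for the distributional clustering used in [17] and [18], and the contexts are invented examples for the word 'bank':

```python
def discriminate(occurrences, threshold=1):
    """Greedily cluster word occurrences by shared context words.

    Each occurrence is a set of context words; it joins the first cluster
    with which it shares more than `threshold` words, else starts a new one.
    """
    clusters = []
    for ctx in occurrences:
        for cluster in clusters:
            if len(ctx & set().union(*cluster)) > threshold:
                cluster.append(ctx)
                break
        else:
            clusters.append([ctx])
    return clusters

# Invented contexts of 'bank': two money uses, two river uses.
occurrences = [
    {"money", "deposit", "account"},
    {"river", "water", "shore"},
    {"loan", "money", "account", "interest"},
    {"water", "shore", "fishing"},
]

clusters = discriminate(occurrences)
print(len(clusters))  # -> 2
```

The two resulting clusters correspond to the financial and riverside uses; stage (2) would then attach a sense label to each cluster, which Word Sense Discrimination omits.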


A problem with the various methods proposed for WSD is the lack of a standardized evaluation metric. Publications often focus on only a few words: in a special issue of the journal Computational Linguistics on WSD, two out of four articles concentrate exclusively on the three words line, serve, and hard, while many other papers use some other small set of words [19]. In hand-tagged corpora, disagreement between human judges invariably introduces considerable noise [20] [21]. Moreover, it is difficult to draw comparisons across domains. These problems motivated the SENSEVAL [22] and SENSEVAL-2 [23] competitions, aimed at developing a standardized evaluation metric for WSD systems.

[1] Heflin, J. / Hendler, J. / Luke, S.: SHOE: A Knowledge Representation Language for Internet Applications. Technical Report CS-TR-4078. Department of Computer Science, University of Maryland. 1999.
[2] Bechhofer, S. / Goble, C.: Towards Annotation using DAML+OIL. Communications of the ACM. 2000.
[3] Staab, S. / Maedche, A. / Handschuh, S.: An Annotation Framework for the Semantic Web. In The First International Workshop on Multimedia Annotation, Tokyo, Japan. 2001.
[4] Miller, G.A.: WordNet: A Lexical Database for English. Communications of the ACM 38(11). 1995.

[7] Mahesh, Kavi / Nirenburg, Sergei: A situated ontology for practical NLP. In Proceedings of IJCAI-95 Workshop on Basic Ontological Issues in Knowledge Sharing 1995.


[8] Knight, K. / Luk, S.K.: Building a Large Knowledge Base for Machine Translation. In Proceedings of the American Association of Artificial Intelligence Conference AAAI-94, Seattle, WA. 1994.

[9] Small, S.L.: Word Expert Parsing: A Theory of Distributed Word-based Natural Language Understanding. Ph.D. thesis, The University of Maryland, Baltimore, MD.1980.
[10] Small, S.L.: Parsing as cooperative distributed inference. In King, M. (ed.): Parsing Natural Language. Academic Press, London. 1983.
[11] Hirst, G.: Semantic Interpretation and the Resolution of Ambiguity. Cambridge University Press. 1988.
[12] Adriaens, G. / Small, S.L.: Word expert revisited in a cognitive science perspective. In Perspectives from Psycholinguistics, Neuropsychology, and Artificial Intelligence. Morgan Kaufmann, San Mateo, CA, pages 13-43. 1988.

[13] Lesk, M.E.: Automated sense disambiguation using machine-readable dictionaries: How to tell a pine cone from an ice cone. In Proceedings of the SIGDOC Conference. 1986.

[14] Yarowsky, D.: Word-sense disambiguation using statistical models of Roget's categories. In Proceedings of COLING-92, Nantes, France. 1992.
[15] Resnik, P.: Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI). 1995.

[16] Ng, H.T. / Lee, H.B.: Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In Proceedings of ACL96. 1996.

[17] Schuetze, H.: Context space. In Goldman, R. / Norvig, P. / Charniak, E. / Gale, B. (eds.): Working Notes of the AAAI Fall Symposium on Probabilistic Approaches to Natural Language, AAAI Press, Menlo Park, CA, pages 113-120. 1992.

[18] Schuetze, H.: Automatic word sense discrimination. Computational Linguistics, 24(1):97-123. 1998.

[19] Ide, N. / Veronis, J.: Introduction to the special issue on word sense disambiguation: The state of the art. Computational Linguistics, 24(1):1-40. 1998.
[20] Kilgarriff, A.: Gold standard datasets for evaluating word sense disambiguation programs. Computer Speech and Language 12(4), Special Issue on Evaluation. 1998.
[21] Veronis, J.: A study of polysemy judgements and inter-annotator agreement. In Programme and advanced papers of the SENSEVAL workshop, Herstmonceux Castle, England, pages 2-4. 1998.
[22] Kilgarriff, A. / Palmer, M.: Introduction to the special issue on SENSEVAL. Computers and the Humanities 34(1/2):1-13. 2000.
For more information on this topic see also the relevant chapter in HLT-Survey.