An Efficient and Flexible Format for Linguistic and Semantic Annotation

Špela Vintar, Paul Buitelaar, Bärbel Ripplinger *,
Bogdan Sacaleanu, Diana Raileanu, Detlef Prescher
DFKI GmbH
Stuhlsatzenhausweg 3,
66123 Saarbrücken, Germany
{vintar, paulb, bogdan, raileanu, prescher}@dfki.de
* Eurospider Information Technology AG
Schaffhauserstrasse 18
CH-8006 Zürich, Switzerland
ripplinger@eurospider.ch

Abstract
The paper describes an XML annotation format and tool developed within the MUCHMORE project. The annotation scheme was
designed specifically for the purposes of Cross-Lingual Information Retrieval in the medical domain so as to allow both efficient and
flexible access to layers of information. We use a parallel English-German corpus of medical abstracts and annotate it with linguistic
information (tokenisation, part-of-speech tagging, lemmatisation and decomposition, phrase recognition, grammatical functions) as
well as semantic information from various sources. The annotation of medical terms/concepts, semantic types and semantic relations is
based on the Unified Medical Language System (UMLS). Additionally, we use EuroWordNet as a general-language resource for
annotating word senses and for comparing domain-specific and general language use. A major aim of the project is also to complement
existing ontological resources by extracting new terms and new semantic relations. We present the annotation scheme, which is
conceptually related to stand-off annotation, and describe our tool for automatic semantic annotation.
