Learning Dictionaries for Information Extraction by MultiLevel
Bootstrapping

Ellen Riloff
Department of Computer Science
University of Utah
Salt Lake City, UT 84112
riloff@cs.utah.edu
Rosie Jones
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
rosie@cs.cmu.edu

Abstract
Information extraction systems usually require two
dictionaries: a semantic lexicon and a dictionary of extraction 
patterns for the domain. We present a multi
level bootstrapping algorithm that generates both the
semantic lexicon and extraction patterns simultaneously. 
As input, our technique requires only unan
notated training texts and a handful of seed words
for a category. We use a mutual bootstrapping technique 
to alternately select the best extraction pattern
for the category and bootstrap its extractions into the
semantic lexicon, which is the basis for selecting the
next extraction pattern. To make this approach more
robust, we add a second level of bootstrapping (meta
bootstrapping) that retains only the most reliable lexicon 
entries produced by mutual bootstrapping and
then restarts the process. We evaluated this multilevel 
bootstrapping technique on a collection of corpo
rate web pages and a corpus of terrorism news articles.
The algorithm produced highquality dictionaries for
several semantic categories.



References
Califf, M. E. 1998. Relational Learning Techniques for
Natural Language Information Extraction. Ph.D. Dissertation, 
Tech. Rept. AI98276, Artificial Intelligence Labo
ratory, The University of Texas at Austin.
Craven, M.; DiPasquo, D.; Freitag, D.; McCallum, A.;
Mitchell, T.; Nigam, K.; and Slattery, S. 1998. Learning 
to Extract Symbolic Knowledge from the World Wide
Web. In Proceedings of the Fifteenth National Conference
on Artificial Intelligence.
Freitag, D. 1998. Toward GeneralPurpose Learning for
Information Extraction. In Proceedings of the 36th Annual
Meeting of the Association for Computational Linguistics.
Huffman, S. 1996. Learning information extraction patterns 
from examples. In Wermter, S.; Riloff, E.; and
Scheler, G., eds., Connectionist, Statistical, and Symbolic
Approaches to Learning for Natural Language Processing.
SpringerVerlag, Berlin. 246--260.
Kim, J., and Moldovan, D. 1993. Acquisition of Semantic 
Patterns for Information Extraction from Corpora. In
Proceedings of the Ninth IEEE Conference on Artificial
Intelligence for Applications, 171--176. Los Alamitos, CA:
IEEE Computer Society Press.
Miller, G. 1990. Wordnet: An Online Lexical Database.
International Journal of Lexicography 3(4).
MUC4 Proceedings. 1992. Proceedings of the Fourth
Message Understanding Conference (MUC4). San Mateo, 
CA: Morgan Kaufmann.
Riloff, E., and Shepherd, J. 1997. A CorpusBased Approach 
for Building Semantic Lexicons. In Proceedings of
the Second Conference on Empirical Methods in Natural
Language Processing, 117--124.
Riloff, E. 1993. Automatically Constructing a Dictionary
for Information Extraction Tasks. In Proceedings of the
Eleventh National Conference on Artificial Intelligence,
811--816. AAAI Press/The MIT Press.
Riloff, E. 1996a. An Empirical Study of Automated Dictionary 
Construction for Information Extraction in Three
Domains. Artificial Intelligence 85:101--134.
Riloff, E. 1996b. Automatically Generating Extraction 
Patterns from Untagged Text. In Proceedings of the
Thirteenth National Conference on Artificial Intelligence,
1044--1049. The AAAI Press/MIT Press.
Roark, B., and Charniak, E. 1998. Nounphrase Co-occurrence 
Statistics for Semiautomatic Semantic Lexicon 
Construction. In Proceedings of the 36th Annual
Meeting of the Association for Computational Linguistics,
1110--1116.
Soderland, S.; Fisher, D.; Aseltine, J.; and Lehnert, W.
1995. CRYSTAL: Inducing a conceptual dictionary. In
Proceedings of the Fourteenth International Joint Conference 
on Artificial Intelligence, 1314--1319.
Soderland, S. 1999. Learning Information Extraction
Rules for Semistructured and Free Text. Machine Learning. (to appear).

