EXTRACTIONBASED TEXT CATEGORIZATION: GENERATING
DOMAINSPECIFIC ROLE RELATIONSHIPS AUTOMATICALLY

ELLEN RILOFF AND JEFFREY LORENZEN
Department of Computer Science
University of Utah
Salt Lake City, UT 84112
friloff,jlorenzg@cs.utah.edu
In Natural Language Information Retrieval, forthcoming from Kluwer Academic Publishers.

Abstract. In previous work, we developed several algorithms that use information 
extraction techniques to achieve highprecision text categorization. 
The relevancy signatures algorithm classifies texts using extraction
patterns, and the augmented relevancy signatures algorithm classifies texts
using extraction patterns and semantic features associated with role fillers
(Riloff and Lehnert, 1994). These algorithms relied on handcoded training
data, including annotated texts and a semantic dictionary. In this chapter,
we describe two advances that significantly improve the practicality of our
approach. First, we explain how the extraction patterns can be generated
automatically using only preclassified texts as input. Second, we present
the wordaugmented relevancy signatures algorithm that uses lexical items
to represent domainspecific role relationships instead of semantic features.
Using these techniques, we can automatically build text categorization systems 
that benefit from domainspecific natural language processing.
Natural language processing and information retrieval evolved as separate 
research areas. To a large extent, the difference between them revolves
around the amount of knowledge that is used. Natural language processing
(NLP) techniques rely on syntactic and semantic knowledge that is often
manually encoded for a particular topic. Information retrieval (IR) techniques 
typically use general methods for processing large volumes of text
based on statistical word models. These two paradigms reflect a tugofwar
between the benefits of domain knowledge and the costs and constraints
associated with it.
Our goal has been to bridge this gap by developing methods to acquire
domain knowledge for NLP automatically. Our work focuses on a natural
language processing task called information extraction (IE), which aims
to extract domainspecific information from text. Recent advances in this
area have made it possible to build IE systems with minimal knowledge
engineering. For the last few years, we have been using information extraction 
techniques to build text categorization systems that benefit from
natural language processing while requiring little or no manual knowledge
engineering.
In previous work, we demonstrated that information extraction techniques 
could be used to achieve highprecision text categorization in some
domains. We used relevancy signatures to classify texts on the basis of
linguisticallymotivated extraction patterns, and augmented relevancy signatures 
to classify texts using extraction patterns and semantic features
associated with role fillers (Riloff and Lehnert, 1994). Both of these algorithms 
performed better than a comparable wordbased algorithm in
terrorism and joint venture domains (Riloff, 1994). Much of the domain
knowledge needed for these algorithms could be acquired automatically,
but there were still significant knowledgeengineering bottlenecks lurking
underneath. A manually annotated training corpus was required to generate 
the extraction patterns, and a dictionary of words tagged with semantic
features was needed for the augmented relevancy signatures.
In this article, we describe two advances that significantly improve the
practicality of our approach. First, we explain how the extraction patterns
can be generated automatically using a system called AutoSlogTS, without
the need for an annotated training corpus. Second, we present a variation of
the augmented relevancy signatures algorithm that does not rely on semantic 
features. The new version uses only lexical items to represent role fillers,
which are extracted automatically from the training corpus. Together with
AutoSlogTS, the wordaugmented relevancy signatures represent domain
specific role relationships that are generated automatically without any
manual knowledge engineering.
In the first section, we describe the extraction patterns used by our
natural language processing system and describe two previous text representations 
based on these extraction patterns, relevancy signatures and
augmented relevancy signatures. In the second section, we discuss our dictionary 
construction system, AutoSlogTS, that can generate extraction
patterns automatically. The only input to the system is a set of ``preclassified'' 
texts associated with the domain. In the third section, we present the
new wordaugmented relevancy signatures representation and algorithm.
Finally, we present experiments comparing the different text representations 
on four categorization tasks related to terrorism. In all cases, the
signature representations performed better than individual words and the
wordaugmented signatures performed comparably to the augmented relevancy 
signatures with semantic features.


References
Borko, H. and Bernick, M. 1963. Automatic Document Classification. J. ACM 10(2):151--
162.
Fuhr, N.; Hartmann, S.; Lustig, G.; Schwantner, M.; and Tzeras, Konstadinos 1991.
AIR/X  A RuleBased Multistage Indexing System for Large Subject Fields. In
Proceedings of RIAO 91. 606--623.
Goodman, M. 1991. Prism: A CaseBased Telex Classifier. In Proceedings of the Second
Annual Conference on Innovative Applications of Artificial Intelligence. AAAI Press.
25--37.
Hayes, Philip J. and Weinstein, Steven P. 1991. ConstrueTIS: A System for Content
Based Indexing of a Database of News Stories. In Proceedings of the Second Annual
Conference on Innovative Applications of Artificial Intelligence. AAAI Press. 49--64.
Hoyle, W. 1973. Automatic Indexing and Generation of Classification Systems by Algorithm. 
Information Storage and Retrieval 9(4):233--242.
Huffman, S. 1996. Learning information extraction patterns from examples. In Wermter,
Stefan; Riloff, Ellen; and Scheler, Gabriele, editors 1996, Connectionist, Statistical,
and Symbolic Approaches to Learning for Natural Language Processing. Springer
Verlag, Berlin. 246--260.
Kim, J. and Moldovan, D. 1993. Acquisition of Semantic Patterns for Information Extraction 
from Corpora. In Proceedings of the Ninth IEEE Conference on Artificial
Intelligence for Applications, Los Alamitos, CA. IEEE Computer Society Press. 171--
176.
Lehnert, W. 1991. Symbolic/Subsymbolic Sentence Analysis: Exploiting the Best of Two
Worlds. In Barnden, J. and Pollack, J., editors 1991, Advances in Connectionist and
Neural Computation Theory, Vol. 1. Ablex Publishers, Norwood, NJ. 135--164.
Maron, M. 1961. Automatic Indexing: An Experimental Inquiry. J. ACM 8:404--417.
MUC3 Proceedings, 1991. Proceedings of the Third Message Understanding Conference
(MUC3). Morgan Kaufmann, San Mateo, CA.
MUC4 Proceedings, 1992. Proceedings of the Fourth Message Understanding Conference
(MUC4). Morgan Kaufmann, San Mateo, CA.
MUC5 Proceedings, 1993. Proceedings of the Fifth Message Understanding Conference
(MUC5). Morgan Kaufmann, San Francisco, CA.
Riloff, E. and Lehnert, W. 1994. Information Extraction as a Basis for HighPrecision
Text Classification. ACM Transactions on Information Systems 12(3):296--333.
Riloff, E. 1994. Information Extraction as a Basis for Portable Text Classification Systems. 
Ph.D. Dissertation, CMPSCI Technical Report 9504, Department of Computer
Science, University of Massachusetts, Amherst, MA.
Riloff, E. 1995. Little Words Can Make a Big Difference for Text Classification. In
Proceedings of the 18th Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval. 130--136.
Riloff, E. 1996a. An Empirical Study of Automated Dictionary Construction for Information 
Extraction in Three Domains. Artificial Intelligence 85:101--134.
Riloff, E. 1996b. Automatically Generating Extraction Patterns from Untagged Text.
In Proceedings of the Thirteenth National Conference on Artificial Intelligence. The
AAAI Press/MIT Press. 1044--1049.
Salton, G., editor 1971. The SMART Retrieval System: Experiments in Automatic Document 
Processing. Prentice Hall, Englewood Cliffs, NJ.
Soderland, S.; Fisher, D.; Aseltine, J.; and Lehnert, W. 1995. CRYSTAL: Inducing a conceptual 
dictionary. In Proceedings of the Fourteenth International Joint Conference
on Artificial Intelligence. 1314--1319.
Stanfill, C. and Waltz, D. 1986. Toward MemoryBased Reasoning. Communications of
the ACM 29(12):1213--1228.
Turtle, Howard and Croft, W. Bruce 1991. Efficient Probabilistic Inference for Text
Retrieval. In Proceedings of RIAO 91. 644--661.