Improving Full-Text Precision on Short Queries
using Simple Constraints

Marti A. Hearst
Xerox PARC
3333 Coyote Hill Rd
Palo Alto, CA 94304
(415) 812-4742
hearst@parc.xerox.com

Abstract
We show that two simple constraints, when
applied to short user queries (on the order of
5--10 words), can yield precision scores comparable
to or better than those achieved using
long queries (50--85 words) at low document
cutoff levels. These constraints are meant to
detect documents that have subtopic passages
that include the most important components
of the query. The constraints are: (i) a simple 
Boolean constraint which requires the user
to specify the query as a list of topics; this
list is converted into a conjunct of disjuncts
by the system, and (ii) a subtopic-sized proximity
constraint imposed over the Boolean constraint. 
The vector space model is used to rank
the documents that satisfy both constraints. Experiments 
run over 45 TREC queries show significant, 
almost consistent improvements over
rankings that use no constraints. These results
have important ramifications for interactive systems 
intended for casual users, such as those
searching on the World Wide Web.
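
The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the system used in the experiments; it only shows one way the two constraints and the ranking step might fit together. The function names, the whitespace tokenizer, the window size in tokens, and the tf-idf cosine weighting are all assumptions made for illustration; the abstract specifies only a "subtopic-sized" window and "the vector space model".

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming or stopwords.
    return text.lower().split()


def covers_all_topics(terms, topics):
    # Conjunct of disjuncts: every topic needs at least one of its terms.
    return all(any(term in terms for term in topic) for topic in topics)


def satisfies_boolean(tokens, topics):
    """Constraint (i): the whole document covers every topic."""
    return covers_all_topics(set(tokens), topics)


def satisfies_proximity(tokens, topics, window):
    """Constraint (ii): some run of `window` tokens covers every topic."""
    last_start = max(1, len(tokens) - window + 1)
    return any(
        covers_all_topics(set(tokens[start:start + window]), topics)
        for start in range(last_start)
    )


def rank(docs, topics, window=100):
    """Return ids of documents passing both constraints, best first.

    `docs` maps a document id to its text. `window` is a hypothetical
    stand-in for a subtopic-sized passage length, measured in tokens.
    """
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter(t for toks in tokenized.values() for t in set(toks))
    # Assumed weighting: a smoothed idf; the source does not name a scheme.
    idf = {t: math.log(1 + n_docs / df) for t, df in doc_freq.items()}

    query_vec = {t: idf.get(t, 0.0) for topic in topics for t in topic}
    query_norm = math.sqrt(sum(w * w for w in query_vec.values()))

    scored = []
    for doc_id, toks in tokenized.items():
        if not satisfies_boolean(toks, topics):
            continue
        if not satisfies_proximity(toks, topics, window):
            continue
        doc_vec = {t: tf * idf[t] for t, tf in Counter(toks).items()}
        doc_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
        dot = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
        norm = query_norm * doc_norm
        scored.append((dot / norm if norm else 0.0, doc_id))

    # Highest cosine first; ties broken by document id for determinism.
    return [doc_id for _, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]
```

Here a query is a list of topics, each topic a list of alternative terms, for example `[["oil", "petroleum"], ["embargo", "sanctions"]]`. A document that mentions "oil" at the start and "embargo" only many paragraphs later would pass the Boolean check but fail the proximity check, and so would not be ranked.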

