An Information-Theoretic Measure for Document Similarity

Javed A. Aslam
Department of Computer Science
Dartmouth College
jaa@cs.dartmouth.edu
Meredith Frost
Department of Computer Science
Dartmouth College
Meredith.Frost@dartmouth.edu

ABSTRACT
Recent work has demonstrated that the assessment of pairwise
object similarity can be approached in an axiomatic manner
using information theory. We extend this concept specifically
to document similarity and test the effectiveness of an
information-theoretic measure for pairwise document similarity.
We adapt query retrieval to rate the quality of document
similarity measures and demonstrate that our proposed
information-theoretic measure for document similarity yields
statistically significant improvements over other popular
measures of similarity.


