Indexing by Latent Semantic Analysis

Scott Deerwester
Graduate Library School
University of Chicago
Chicago, IL 60637

Susan T. Dumais
George W. Furnas
Thomas K. Landauer
Bell Communications Research
435 South St.
Morristown, NJ 07960

Richard Harshman
University of Western Ontario
London, Ontario, Canada

ABSTRACT
A new method for automatic indexing and retrieval is described. The approach is to take
advantage of implicit higher-order structure in the association of terms with documents ("semantic
structure") in order to improve the detection of relevant documents on the basis of terms found in
queries. The particular technique used is singular-value decomposition, in which a large term by
document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original
matrix can be approximated by linear combination. Documents are represented by ca. 100-item
vectors of factor weights. Queries are represented as pseudo-document vectors formed from
weighted combinations of terms, and documents with supra-threshold cosine values are returned.
Initial tests find this completely automatic method for retrieval to be promising.
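
The pipeline summarized above can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy term by document matrix, the function names (lsi_fit, fold_in_query, cosine_to_docs, retrieve), the choice of k = 2 factors instead of ca. 100, and the 0.9 cosine threshold are all hypothetical. The query is folded in as q' T_k S_k^{-1}, and both query and document coordinates are scaled by the singular values before cosines are taken, which is one common convention.

```python
import numpy as np

# Hypothetical toy data: rows are terms, columns are documents (raw counts).
terms = ["human", "computer", "interface", "graph", "tree", "minors"]
X = np.array([
    [1, 1, 0, 0, 0],   # human
    [1, 0, 1, 0, 0],   # computer
    [0, 1, 1, 0, 0],   # interface
    [0, 0, 0, 1, 1],   # graph
    [0, 0, 0, 1, 0],   # tree
    [0, 0, 0, 0, 1],   # minors
], dtype=float)


def lsi_fit(X, k):
    """Truncated SVD: X is approximated by T_k @ diag(s_k) @ D_k.T."""
    T, s, Dt = np.linalg.svd(X, full_matrices=False)
    return T[:, :k], s[:k], Dt[:k, :].T


def fold_in_query(q_counts, T_k, s_k):
    """Pseudo-document vector for a query: q' T_k S_k^{-1}."""
    return (q_counts @ T_k) / s_k


def cosine_to_docs(q_vec, doc_vecs):
    """Cosine between one vector and each row of doc_vecs (0 if a norm is 0)."""
    num = doc_vecs @ q_vec
    den = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec)
    return num / np.where(den == 0, 1.0, den)


def retrieve(query_terms, T_k, s_k, D_k, threshold=0.9):
    """Return indices of documents whose cosine with the query >= threshold."""
    q = np.array([float(t in query_terms) for t in terms])
    q_hat = fold_in_query(q, T_k, s_k)
    # Assumption: compare in the space scaled by the singular values.
    sims = cosine_to_docs(q_hat * s_k, D_k * s_k)
    return [j for j, c in enumerate(sims) if c >= threshold], sims


T_k, s_k, D_k = lsi_fit(X, k=2)
hits, sims = retrieve({"human"}, T_k, s_k, D_k)
print(hits)
```

In this toy matrix the third document (index 2) contains "computer" and "interface" but not the query term "human"; it is still returned because it loads on the same factor as the documents that do contain it, which is the behavior the abstract attributes to the latent structure.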

