Static and Dynamic Information Organization
with Star Clusters

Javed Aslam Katya Pelekhov Daniela Rus
Department of Computer Science
Dartmouth College
Hanover, NH 03755

Abstract
In this paper we present a system for static and dy
namic information organization and show our evaluations
of this system on TREC data. We introduce the offline
and online star clustering algorithms for information or
ganization. Our evaluation experiments show that the offline 
star algorithm outperforms the single link and average
link clustering algorithms. Since the star algorithm is also
highly e#cient and simple to implement, we advocate its
use for tasks that require clustering, such as information
organization, browsing, filtering, routing, topic tracking,
and new topic detection.


References
[AB84] M. Aldenderfer and R. Blashfield, Cluster Analysis, Sage, Beverly Hills, 1984.
[All95] J. Allan. Automatic hypertext construction. PhD
thesis. Department of Computer Science, Cornell
University, January 1995.
[All98] J. Allan, Personal communication, March 1998.
[APR98] J. Aslam, K. Pelekhov, and D. Rus, Generating, 
visualizing, and evaluating highaccuracy clusters for 
information organization, in Principles of
Digital Document Processing, eds. C. Nicholas, Lecture 
Notes in Computer Science, Sprinter Verlag
1998 (to appear). Also available as Technical Report
PCSTR97319, Department of Computer Science,
Dartmouth, 1997.
[APR97] J. Aslam, K. Pelekhov, and D. Rus, Computing
Dense Clusters Online for Information Organization, 
Technical Report PCSTR97324, Department
of Computer Science, Dartmouth, 1997.
[Bol95] B. Bollobas, Random Graphs, Academic Press,
London, 1995.
[Bur95] R. Burgin, The retrieval e#ectiveness of five clustering 
algorithms as a function of indexing exhaustively, 
Journal of American Society for Information Science 46(8:562572, 1995.
[Can93] F. Can, Incremental clustering for dynamic information 
processing, in ACM Transactions on Information 
Systems, no. 11, pp143164, 1993.
[CCFM97] M. Charikar, C. Chekuri, T. Feder, and R.
Motwani, Incremental clustering and dynamic in
formation retrieval, in Proceedings of the 29 th Symposium 
on Theory of Computing, 1997.
[Cro80] W. B. Croft. A model of cluster searching based
on classification. Information Systems, 5:189195,
1980.
[Cro77] W. B. Croft. Clustering large files of documents
using the singlelink method. Journal of the American 
Society for Information Science, pp189195,
November 1977.
[CKP93] D. Cutting, D. Karger, and J. Pedersen. Constant 
interactiontime Scatter/Gather browsing of
very large document collections. In Proceedings of
the 16 th SIGIR, 1993.
[FG88] T. Feder and D. Greene, Optimal algorithms for
approximate clustering, in Proceedings of the 20 th
Symposium on Theory of Computing, pp 434444,
1988.
[HP96] M. Hearst and J. Pedersen. Reexamining the cluster 
hypothesis: Scatter/Gather on Retrieval Results. 
In Proceedings of the 19 th SIGIR, 1996.
[HS86] D. Hochbaum and D. Shmoys, A unified approach
to approximation algorithms for bottleneck problems, 
Journal of the ACM, no. 33, pp533550, 1986.
[JD88] A. Jain and R. Dubes. Algorithms for Clustering
Data, Prentice Hall 1988.
[JR71] N. Jardine and C.J. van Rijsbergen. The use of
hierarchical clustering in information retrieval, In
formation Storage and Retrieval, 7:217240, 1971.
[KP93] G. Kortsarz and D. Peleg. On choosing a dense
subgraph. In Proceedings of the 34th Annual Symposium 
on Foundations of Computer Science (FOCS),
1993.
[LLR95] N. Linial, E. London, and Y. Rabinovich. The
geometry of graphs and some of its algorithmic applications. 
Combinatorica 15(2):215245, 1995.
[LY94] C. Lund and M. Yannakakis. On the hardness of
approximating minimization problems. Journal of
the ACM 41, 960--981, 1994.
[Rij79] C.J. van Rijsbergen. Information Retrieval. But
terworths, London, 1979.
[Sal89] G. Salton. Automatic Text Processing: the transformation, 
analysis, and retrieval of information by
computer, AddisonWesley, 1989.
[Sal91] G. Salton. The Smart document retrieval project.
In Proceedings of the Fourteenth Annual International 
ACM/SIGIR Conference on Research and
Development in Information Retrieval, pp 356358.
[Sha86] W. Shaw, On the foundation of evaluation, Journal 
of the American Society for Information Science, vol 37, pp 346348, 1986.
[Sha93] W. Shaw, Controlled and uncontrolled subject de
scriptions in the CF database: a comparison of
optimal clusterbased retrieval results, Information
Processing and Management, vol. 29, pp 751763,
1993.
[SJJ70] K. Spark Jones and D. Jackson. The use of
automatically obtained keyword classifications for
information retrieval. Information Storage and Retrieval, 5:174201, 1970.
[Tur90] H. Turtle. Inference networks for document retrieval. 
PhD thesis. University of Massachusetts,
Amherst, 1990.
[vRC75] C.J. van Rijsbergen and B. Croft, Document clustering: 
an evaluation of some experiments with the
Cranfield 1400 collection, Information Processing
and Management, 11, 171182.
[Voo85] E. Voorhees. The effectiveness and efficiency of agglomerative
hierarchical clustering in document retrieval, PhD Thesis, 
Department of Computer Science, 
Cornell University 1985, available as TR 85
705.
[Voo85] E. Voorhees. The cluster hypothesis revisited. In
Proceedings of the 8 th SIGIR, pp 95104, 1985.
[Wil88] P. Willett. Recent trends in hierarchical document
clustering: A critical review. Information Processing and Management, 24:(5):577597, 1988.
[Wor71] S. Worona. Query clustering in a large document
space. In Ed. G. Salton, The SMART Retrieval System, pp 298310. PrenticeHall, 1971.
[Zuc93] D. Zuckerman. NPcomplete problems have a version 
that's hard to approximate. In Proceedings
of the Eight Annual Structure in Complexity Theory 
Conference, IEEE Computer Society, 305--312,
1993.
