Comparing the Performance of Database Selection Algorithms

James C. French 1 , Allison L. Powell 1 , Jamie Callan 2 , Charles L. Viles 3
Travis Emmitt 1 , Kevin J. Prey 1 , Yun Mou 2
1 Dept. of Computer Science \Lambda
University of Virginia
Charlottesville, VA
2 Computer Science Dept. y
Univ. of Massachusetts
Amherst, MA
3 School of Info. and Library Science z
Univ. of North Carolina--Chapel Hill
Chapel Hill, NC

Abstract We compare the performance of two database selection 
algorithms reported in the literature. Their performance 
is compared using a common testbed designed specifically 
for database selection techniques. The testbed is a decomposition 
of the TREC/TIPSTER data into 236 subcollections. 
The databases from our testbed were ranked using
both the gGlOSS and CORI techniques and compared to a
baseline derived from TREC relevance judgements. We examined 
the degree to which CORI and gGlOSS approximate
this baseline. Our results confirm our earlier observation
that the gGlOSS Ideal(l) ranks do not estimate relevance
based ranks well. We also find that CORI is a uniformly better 
estimator of relevancebased ranks than gGlOSS for the
test environment used in this study. Part of the advantage
of the CORI algorithm can be explained by a strong correlation 
between gGlOSS and a sizebased baseline (SBR). We
also find that CORI produces consistently accurate rankings 
on testbeds ranging from 100--921 sites. However for a
given level of recall, search effort appears to scale linearly
with the number of databases.

References
[1] J. Allan, J. P. Callan, W. B. Croft, L. Ballesteros, D. Byrd,
R. Swan, and J. Xu. INQUERY does battle with TREC6.
In Proc. TREC6.
[2] N. J. Belkin, P. Kantor, E. A. Fox, and J. A. Shaw. Combining 
the Evidence of Multiple Query Representations for
Information Retrieval. Information Processing & Management, 31(4):431--448, 1995.
[3] C. Buckley. SMART version 11.0, 1992.
ftp://ftp.cs.cornell.edu/pub/smart.
[4] C. Buckley, J. Allan, and G. Salton. Automatic routing and
adhoc retrieval using SMART : TREC 2. In Proc. TREC2,
pages 45--56, 1994.
[5] C. Buckley, G. Salton, and J. Allan. Automatic retrieval
with locality information using SMART. In Proc. TREC1,
pages 59--72, 1993.
[6] J. P. Callan, W. B. Croft, and J. Broglio. TREC and TIP
STER experiments with INQUERY. Information Processing
& Management, 31(3):327--343, 1995.
[7] J. P. Callan, Z. Lu, and W. B. Croft. Searching Distributed
Collections with Inference Networks. In Proc. SIGIR'95,
pages 21--29, 1995.
[8] E. A. Fox, M. P. Koushik, J. Shaw, R. M. odlin, and
D. Rao. Combining Evidence from Multiple Searches. In
Proc. TREC1, pages 319--328, 1992.
[9] J. C. French, A. L. Powell, C. L. Viles, T. Emmitt, and K. J.
Prey. Evaluating Database Selection Techniques: A Testbed
and Experiment. In Proc. SIGIR'98, pages 121--129, 1998.
[10] J. C. French and C. L. Viles. Ensuring Retrieval Effectiveness
in Distributed Digital Libraries. Journal of Visual Communication 
and Image Representation, 7(1):61--73, 1996.
[11] N. Fuhr. A DecisionTheoretic Approach to Database Selection 
in Networked IR. ACM Trans. on Information Systems.
to appear.
[12] J. D. Gibbons. Nonparametric Methods for Quantative
Analysis. Holt, Rinehart and Winston, 1976.
[13] L. Gravano and H. GarciaMolina. Generalizing GlOSS to
VectorSpace Databases and Broker Hierarchies. In Proc.
VLDB'95, 1995.
[14] L. Gravano, H. GarciaMolina, and A. Tomasic. GlOSS:
Textsource discovery over the internet. ACM Trans. on
Database Systems, To appear.
[15] L. Gravano, H. GarciaMolina, and A. Tomasic. The Effectiveness 
of GlOSS for the Text Database Discovery Problem.
In Proc. SIGMOD'94, pages 126--137, May 1994.
[16] D. Harman. Overview of the Fourth Text Retrieval Conference 
(TREC4). In Proc. TREC4, 1996.
[17] Z. Lu, J. P. Callan, and W. B. Croft. Measures in collection
ranking evaluation. Technical Report TR9639, Computer
Science Department, University of Massachusetts, 1996.
[18] A. Moffat and J. Zobel. Information Retrieval Systems for
Large Document Collections. In Proc. TREC3, pages 85--94,
1995.
[19] A. Singhal, C. Buckley, and M. Mitra. Pivoted document
length normalization. In Proc. SIGIR'96, pages 21--29, 1996.
[20] C. L. Viles and J. C. French. TREC4 Experiments Using
Drift. In Proc. TREC4, 1996.
[21] C. L. Viles and J. C. French. Dissemination of Collection
Wide Information in a Distributed Inform ation Retrieval
System. In Proc. SIGIR'95, pages 12--20, July 1995.
[22] E. Voorhees, N. Gupta, and B. JohnsonLaird. Learning
Collection Fusion Strategies. In Proc. SIGIR'95, pages 172--
179, 1995.
[23] E. Voorhees, N. Gupta, and B. JohnsonLaird. The Collection 
Fusion Problem. In Proc. TREC3, pages 95--104, 1995.
[24] J. Xu and J. Callan. Effective Retrieval with Distributed
Collections. In Proc. SIGIR'98, pages 112--120, 1998.
[25] B. Yuwono and D. Lee. Server Ranking for Distributed Text
Retrieval Systems on Internet. In Proc. of the Int. Conf. on
Database Systems for Adv. Applications, pages 41--49, 1997.