CLIR using a Probabilistic Translation Model based on Web
Documents

Jian-Yun Nie
Laboratoire RALI,
Département d'Informatique et Recherche opérationnelle,
Université de Montréal
C.P. 6128, succursale Centre-ville
Montréal, Québec, H3C 3J7 Canada
nie@iro.umontreal.ca

In this report, we describe the approach we used in the TREC-8 Cross-Language IR (CLIR)
track. The approach is based on probabilistic translation models estimated from two
parallel training corpora: one established manually, and the other built automatically
from documents mined from the Web. We describe the principles of model building,
the mining of parallel texts, and some preliminary evaluations.


