Learning Word Segmentation Rules for Tag
Prediction

Dimitar Kazakov1, Suresh Manandhar2, and Tomaz Erjavec3

University of York, Heslington, York YO10 5DD, UK,
{kazakov, suresh}@cs.york.ac.uk,
WWW home page: 1 http://www.cs.york.ac.uk/~kazakov/ and
2 http://www.cs.york.ac.uk/~suresh/

3 Department for Intelligent Systems,
Jozef Stefan Institute, Ljubljana, Slovenia,
Tomaz.Erjavec@ijs.si,
WWW home page: http://nl.ijs.si/tomaz/



Abstract. In our previous work we introduced a hybrid GA&ILP-based
approach to learning stem-suffix segmentation rules from an
unmarked list of words. Evaluation of the method was made difficult by
the lack of word corpora annotated with their morphological segmentation. 
Here the hybrid approach is evaluated indirectly, on the task of
tag prediction. A pair of stem-tag and suffix-tag lexicons is obtained by
the application of that approach to an annotated lexicon of word-tag
pairs. The two lexicons are then used to predict the tags of unseen words
in two ways: (1) by using only the stem and suffix generated by the
segmentation rules, and (2) by using all matching combinations of stem and
suffix present in the lexicons. The results show a high correlation between
the constituents generated by the segmentation rules and the tags of
the words in which they appear, thereby demonstrating the linguistic
relevance of the segmentations produced by the hybrid approach.
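
The two prediction modes described above can be illustrated with a minimal sketch. It assumes that each lexicon maps a string to a set of tags and that the candidate tags of a word are the intersection of the tag sets of its stem and its suffix; the abstract does not specify how the two lexicons are combined, so this combination rule, the function names, and the toy lexicon entries are all hypothetical.

```python
from typing import Dict, Set

Lexicon = Dict[str, Set[str]]

# Toy lexicons with made-up entries and tags (assumptions, not data from the paper).
STEM_TAGS: Lexicon = {"walk": {"VERB-PAST", "VERB-PRES"}, "wal": {"NOUN"}}
SUFFIX_TAGS: Lexicon = {"ed": {"VERB-PAST"}, "s": {"VERB-PRES", "NOUN"}, "ks": {"NOUN"}}


def predict_from_segmentation(stem: str, suffix: str,
                              stem_tags: Lexicon, suffix_tags: Lexicon) -> Set[str]:
    """Mode (1): use only the stem and suffix proposed by the segmentation rules.

    Assumed combination rule: intersect the tags seen with the stem
    and the tags seen with the suffix.
    """
    return stem_tags.get(stem, set()) & suffix_tags.get(suffix, set())


def predict_from_all_splits(word: str,
                            stem_tags: Lexicon, suffix_tags: Lexicon) -> Set[str]:
    """Mode (2): consider every split whose stem and suffix both occur in the lexicons."""
    tags: Set[str] = set()
    for i in range(1, len(word) + 1):
        stem, suffix = word[:i], word[i:]
        if stem in stem_tags and suffix in suffix_tags:
            # Collect the tags supported by this particular stem-suffix pair.
            tags |= stem_tags[stem] & suffix_tags[suffix]
    return tags
```

With these toy lexicons, mode (1) applied to the segmentation walk+s yields only the tag shared by "walk" and "s", whereas mode (2) applied to "walks" also picks up the tag supported by the alternative split wal+ks. Mode (2) therefore trades precision for coverage when the rule-generated segmentation is missing from the lexicons.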

References

1.	E. Brill. Some advances in transformation-based part of speech tagging. In Proceedings
of AAAI-94, pages 748-753. AAAI Press/MIT Press, 1994.
2.	Tomaz Erjavec. The MULTEXT-East Slovene Lexicon. In Proceedings of the 7th
Electrotechnical Conference ERK, Volume B, pages 189-192, Portoroz, Slovenia,
1998.
3.	David E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine
Learning. Addison-Wesley, 1989.
4.	Dimitar Kazakov. Unsupervised learning of naive morphology with genetic algorithms.
In W. Daelemans, A. van den Bosch, and A. Weijters, editors, Workshop
Notes of the ECML/MLnet Workshop on Empirical Learning of Natural Language
Processing Tasks, pages 105-112, Prague, April 1997.
5.	Dimitar Kazakov and Suresh Manandhar. A Hybrid Approach to Word Segmentation.
In D. Page, editor, Proc. of the 8th International Workshop on Inductive
Logic Programming (ILP-98), pages 125-134. Berlin, 1998. Springer-Verlag.
6.	Suresh Manandhar, Saso Dzeroski, and Tomaz Erjavec. Learning Multilingual
Morphology with CLOG. In The Eighth International Conference on Inductive
Logic Programming (ILP-98), Madison, Wisconsin, USA, 1998.
7.	Raymond J. Mooney and Mary Elaine Califf. Induction of first-order decision
lists: Results on learning the past tense of English verbs. Journal of Artificial
Intelligence Research, June 1995.
