Improving Low-Resource Languages in Pre-Trained Language Models
Viktor Hangya, Hossain Shaikh Saadi, Alexander Fraser
EMNLP 2022

Pre-trained multilingual language models are the foundation of many NLP approaches, including cross-lingual transfer solutions. However, languages with small available monolingual corpora are often not well supported by these models, leading to poor performance. We propose an unsupervised approach to improve the cross-lingual representations of low-resource languages by bootstrapping word translation pairs from monolingual corpora and using them to improve language alignment in pre-trained language models. We perform experiments on nine languages, using contextual word retrieval and zero-shot named entity recognition to measure both intrinsic cross-lingual word representation quality and downstream task performance, showing improvements on both. Our results show that it is possible to improve pre-trained multilingual language models by relying only on non-parallel resources.
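
The abstract does not spell out the training objective, but as a rough illustration of the general idea, the sketch below shows how bootstrapped word translation pairs could be used to pull a multilingual encoder's representations of translation equivalents closer together. The choice of mBERT, the toy English-German word pairs, and the simple mean-squared-error alignment loss are all assumptions made for illustration; they are not the paper's exact recipe, and the unsupervised bootstrapping of the pairs itself is not shown.

```python
# Minimal sketch (assumptions, not the authors' exact method) of alignment
# fine-tuning with bootstrapped word translation pairs.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-multilingual-cased"  # assumed multilingual encoder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Hypothetical (source, target) word translation pairs; in the paper these
# are bootstrapped from monolingual corpora without parallel data.
word_pairs = [("house", "Haus"), ("water", "Wasser"), ("school", "Schule")]

def embed(words):
    # Encode each word and mean-pool its (sub)token representations,
    # ignoring padding positions.
    batch = tokenizer(words, return_tensors="pt", padding=True)
    hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1)

model.train()
for _ in range(3):  # a few alignment steps over the small lexicon
    src_emb = embed([src for src, _ in word_pairs])
    tgt_emb = embed([tgt for _, tgt in word_pairs])
    # Pull the representations of each translation pair together.
    loss = torch.nn.functional.mse_loss(src_emb, tgt_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"alignment loss: {loss.item():.4f}")
```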