Do not neglect related languages: The case of low-resource Occitan cross-lingual word embeddings

Lisa Woller, Viktor Hangya Alexander Fraser

2021

Proceedings of the 1st Workshop on Multilingual Representation Learning, pages 41–50, November 11, 2021. 

Cross-lingual word embeddings (CLWEs) have proven indispensable for
various natural language processing tasks, e.g., bilingual lexicon
induction (BLI). However, the lack of data often impairs the quality
of representations.  Various approaches requiring only weak
cross-lingual supervision were proposed, but current methods still
fail to learn good CLWEs for languages with only a small monolingual
corpus.  We therefore claim that it is necessary to explore further
datasets to improve CLWEs in low-resource setups. In this paper we
propose to incorporate data of related high-resource languages. In
contrast to previous approaches which leverage independently
pre-trained embeddings of languages, we (i) train CLWEs for the
low-resource and a related language jointly and (ii) map them to the
target language to build the final multilingual space. In our
experiments we focus on Occitan, a low-resource Romance language which
is often neglected due to lack of resources. We leverage data from
French, Spanish and Catalan for training and evaluate on the
Occitan-English BLI task. By incorporating supporting languages our
method outperforms previous approaches by a large margin. Furthermore,
our analysis shows that the degree of relatedness between an
incorporated language and the low-resource language is critically
important.