Investigating Techniques for Improving NMT Systems for Low-Resource Languages
Neural Machine Translation (NMT) has become the standard approach to machine translation, yet NMT systems face many technical challenges when trained on low-resource language pairs. In this paper, we investigate how different subword and word representations, as well as different data augmentation techniques, can improve NMT performance on low-resource languages. For our baseline, we train an encoder-decoder sequence-to-sequence (seq2seq) NMT model on a small Nepali-English parallel dataset. We then compare different subword and word representations, such as Byte Pair Encoding (BPE) and a reduced vocabulary. Finally, we augment our training data with backtranslation of monolingual data, transfer learning from Hindi, and noisy data. In addition, we propose a new variant of backtranslation for low-resource NMT that outperforms traditional backtranslation methods. We find that BPE is the best-performing subword representation. For data augmentation, we find that transfer learning and noisy data give reliable improvements, whereas backtranslation requires careful management of noise levels. By combining our novel variant of backtranslation with BPE and the auxiliary data methods, we increase in-domain performance by +4.55 BLEU and out-of-domain performance by +3.93 BLEU over the baseline.
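As an illustrative sketch of the BPE subword step, a subword model could be trained on one side of the parallel corpus and used to segment sentences before NMT training. The paper does not specify tooling; sentencepiece, the corpus path, and the vocabulary size below are assumptions, not details taken from the paper.

    # Hypothetical sketch: train a BPE subword model and segment a sentence.
    # The corpus path "train.en" and vocab_size=8000 are illustrative assumptions.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input="train.en",        # English side of the parallel training data (assumed path)
        model_prefix="bpe_en",
        vocab_size=8000,         # small vocabulary suited to a low-resource setting (assumed)
        model_type="bpe",
    )

    sp = spm.SentencePieceProcessor(model_file="bpe_en.model")
    print(sp.encode("Translation systems struggle with rare words.", out_type=str))  # subword pieces

The same procedure would be applied to the Nepali side, so that rare and unseen words are broken into smaller units that the model has observed during training.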