Yup'ik Eskimo and Machine Translation for Low-Resource Polysynthetic Languages

Jan. 2018 - Jan. 2019, with Christopher W. Liu

Won 1st prize, awarded on the basis of the project reports, for the course project of CS224n: Natural Language Processing with Deep Learning.

No machine translation tools yet exist for Yup’ik Eskimo, an endangered language spoken by around 8,000 people, primarily in Southwest Alaska. We created a dataset of Yup’ik Eskimo / English parallel text (~100k sentences) and developed a translation pipeline for this language pair.

We wrote a rule-based morphological parser for the Yup’ik Eskimo language and compared it with unsupervised tokenization methods. We trained a bidirectional LSTM model with attention and reached a BLEU score of 13 using Byte-Pair Encoding (BPE), one of the unsupervised tokenization methods.
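To illustrate what Byte-Pair Encoding does, here is a minimal sketch of the standard merge-learning procedure: start from characters and repeatedly merge the most frequent adjacent pair of symbols. This is not the project's code. The function names (`learn_bpe`, `apply_bpe`), the `</w>` end-of-word marker, and the toy English word list are all assumptions chosen for illustration, and real systems typically use an existing BPE library rather than a hand-written loop.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of words."""
    # Each word becomes a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges


def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = next(iter(merge_pair(pair, {symbols: 1})))
    return list(symbols)


if __name__ == "__main__":
    # Toy corpus (hypothetical, not taken from the Yup'ik dataset).
    corpus = ["low", "low", "lower", "newest", "newest", "widest"]
    merges = learn_bpe(corpus, num_merges=10)
    print(merges)
    print(apply_bpe("lowest", merges))
```

Because the merges are learned purely from frequency counts, no linguistic knowledge is required, which is what makes the method a natural baseline against a hand-written morphological parser for a polysynthetic language with very long words.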

Yup’ik names of children (Marc Lester / ADN)

In October 2018 we launched Yugtun, an online language and dictionary tool for Yup’ik Eskimo that we developed to help revitalize the language.

arXiv / Final Report / Machine Translation Code / Yuarcuun API Code / Yuarcuun Web interface

Byte pair encodings as good as morphological analysis based on Morfessor for low resource machine translation on Yup'ik Eskimo to English. pic.twitter.com/hdauwgEQUG
— Richard (@RichardSocher) 22 March 2018