Building Language Models for Low-Resource Languages
As language models continue to advance, it is conceivable that a "one-size-fits-all" model will remain the dominant paradigm. For instance, given the vast number of languages worldwide, many of which are low-resource, the prevalent practice is to pretrain a single model on many languages. In the paper Sabiá: Portuguese Large Language Models, the researchers demonstrate that monolingual pretraining on the target language significantly improves models that have already been extensively trained on diverse corpora. Few-shot evaluations on Poeta, a suite of 14 Portuguese datasets, show that these models outperform English-centric and multilingual counterparts by a significant margin. The best model, Sabiá-65B, performs on par with GPT-3.5-turbo. The study highlights the benefits of domain-specific knowledge acquired through monolingual pretraining.
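The paper's exact training recipe isn't reproduced here, but the core idea, continued causal-LM pretraining of an existing checkpoint on a monolingual corpus, can be sketched with Hugging Face Transformers. The base checkpoint (`gpt2`), corpus file (`pt_corpus.txt`), and hyperparameters below are placeholders rather than the paper's settings, which start from much larger pretrained models:

```python
# Minimal sketch: continued (monolingual) pretraining of an existing causal LM
# on a target-language corpus. Model name, corpus path, and hyperparameters are
# placeholders, not the Sabiá setup.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "gpt2"              # placeholder; the paper continues from far larger checkpoints
corpus_path = "pt_corpus.txt"    # hypothetical file with one target-language document per line

tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

raw = load_dataset("text", data_files={"train": corpus_path})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="monolingual-continued-pretraining",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=5e-5,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```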
If you are interested in building language models for low-resource languages, here are some steps you can follow:
- Check arXiv for papers on foreign-language datasets. Publications under NLP-related tags regularly cover languages from Europe, Africa, and the Middle East.
- For "rare" languages, you can start with a multilingual language model (LLM) and fine-tune it. Facebook has released several 100+ language LLMs that can serve as a base point.
- If you are working with languages that have fewer than 1 million speakers globally, look for conferences and workshops on "low-resource languages." Such events are commonly co-located with major NLP conferences.
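As a concrete illustration of the second point, here is a minimal sketch of fine-tuning a multilingual encoder (XLM-RoBERTa is used as an example) on a small labeled dataset in the target language. The CSV file, label count, and hyperparameters are hypothetical stand-ins for whatever task data you have:

```python
# Minimal sketch: fine-tuning a multilingual encoder on a small labeled
# dataset in a "rare" target language. Dataset path and label count are
# placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

model_name = "xlm-roberta-base"          # multilingual model covering ~100 languages
data_files = {"train": "train.csv"}      # hypothetical CSV with "text" and "label" columns

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("csv", data_files=data_files)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="rare-language-classifier", num_train_epochs=3),
    train_dataset=tokenized,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```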
Aya is a multilingual LLM initiative by Cohere for AI, and they are currently looking for contributors from around the world. You can learn more and start contributing in your language here.
Getting more data is often the most straightforward way to improve a language model. Sourcing, organizing, and creating datasets has long been instrumental in advancing AI research; the creators of ImageNet, for example, arguably contributed more to progress in computer vision than any single architecture or method.
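In practice, much of the dataset work for a low-resource language is cleaning and filtering raw text. A minimal sketch of such a pass, with deduplication, a length filter, and language identification, might look like the following; the file names and thresholds are placeholders, and `langdetect` is just one of several language-ID options (fastText's lid.176 model is a common alternative):

```python
# Minimal sketch of a corpus-cleaning pass for a target language: deduplicate,
# drop very short lines, and keep only lines identified as the target language.
# File names, the language code, and the length threshold are placeholders.
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

TARGET_LANG = "pt"      # ISO 639-1 code of the target language
MIN_CHARS = 40          # arbitrary quality threshold

seen = set()
kept = []
with open("raw_scrape.txt", encoding="utf-8") as f:   # hypothetical raw text dump
    for line in f:
        text = line.strip()
        if len(text) < MIN_CHARS or text in seen:
            continue
        seen.add(text)
        try:
            if detect(text) == TARGET_LANG:
                kept.append(text)
        except LangDetectException:
            continue            # lines the detector cannot handle are skipped

with open("clean_corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(kept))
```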
Open-source multilingual models like this one are also worth exploring, although how well they cover a specific language needs to be verified.
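One quick, if rough, compatibility check is to measure how aggressively a model's tokenizer fragments text in the target language, since a very high tokens-per-word ratio usually signals weak coverage. A small sketch, with the model name and sample sentences as placeholders:

```python
# Minimal sketch: measure tokens-per-word for a multilingual model's tokenizer
# on a few target-language sentences. Model name and samples are placeholders.
from transformers import AutoTokenizer

model_name = "xlm-roberta-base"
samples = [
    "Exemplo de frase na língua alvo para medir a fragmentação do tokenizador.",
    "Outra frase curta apenas para ilustrar a verificação.",
]

tokenizer = AutoTokenizer.from_pretrained(model_name)
for text in samples:
    tokens = tokenizer.tokenize(text)
    ratio = len(tokens) / len(text.split())
    print(f"{ratio:.2f} tokens per word -> {tokens[:10]} ...")
```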
It's encouraging to see the demand for language models in low-resource languages. Researchers are actively working on finding solutions and improving LLMs for languages with fewer resources. Collaboration and efforts in this area will continue to be relevant and impactful.