Building Language Models for Low-Resource Languages

As the capabilities of language models continue to advance, it is conceivable that a "one-size-fits-all" model will remain the main paradigm. For instance, given the vast number of languages worldwide, many of which are low-resource, the prevalent practice is to pretrain a single model on multiple languages. The paper "Sabiá: Portuguese Large Language Models" challenges this practice: its authors demonstrate that monolingual pretraining on the target language significantly improves models that were already extensively trained on diverse corpora. Few-shot evaluations on Poeta, a suite of 14 Portuguese datasets, show that the resulting models outperform English-centric and multilingual counterparts by a significant margin. The best model, Sabiá-65B, performs on par with GPT-3.5-turbo. The study links these gains to the domain-specific knowledge acquired through monolingual pretraining.

If you are interested in building language models for low-resource languages, here are some steps you can follow:

  1. Check arXiv for papers on datasets in your target language. Publications in the NLP categories often cover languages from Europe, Africa, and the Middle East.
  2. For "rare" languages, you can start with a multilingual large language model (LLM) and fine-tune it. Facebook has released several LLMs covering 100+ languages that can serve as a starting point.
  3. Look for conferences and workshops on "low-resource languages" if you are working with languages that have fewer than 1 million speakers globally. Such workshops are often held alongside large NLP conferences.
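
As a rough illustration of the first step, the sketch below builds a search URL for the public arXiv API. The query is restricted to the cs.CL (Computation and Language) category and looks for papers that mention a given language together with the word "dataset". The function name `build_arxiv_query` and the example language are assumptions made for illustration, and the query wording is only one of many reasonable choices.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"  # public arXiv API endpoint


def build_arxiv_query(language: str, max_results: int = 20) -> str:
    """Return an arXiv API URL for cs.CL papers mentioning `language` and "dataset"."""
    # cat:cs.CL restricts results to the Computation and Language category.
    search = f'cat:cs.CL AND all:"{language}" AND all:dataset'
    params = {
        "search_query": search,
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",   # newest submissions first
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"


if __name__ == "__main__":
    # "Yoruba" is only an example; substitute your own target language.
    print(build_arxiv_query("Yoruba"))
```

Fetching the printed URL with any HTTP client returns an Atom feed of matching papers, which you can then skim for corpora and benchmarks in your language.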

Aya is a multilingual LLM initiative by Cohere for AI that is currently looking for contributors from around the world. You can learn more and start contributing in your language through the Aya project.

Getting more data is often the most straightforward way to improve a language model. Organizing, sourcing, and creating datasets have been instrumental in advancing AI research. The impact of dataset building can be large: the creators of ImageNet, for example, arguably contributed more to the advancement of computer vision than any individual architecture or methodology.
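
Collected text usually needs some cleaning before it is useful for training. The following is a minimal sketch, not a full pipeline: it normalizes Unicode and whitespace, drops very short lines, and removes exact duplicates (ignoring case). The function names and the `min_chars` threshold are assumptions chosen for illustration, and the Portuguese sentences in the usage example are made up.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def clean_corpus(lines, min_chars: int = 20):
    """Yield normalized lines, skipping short lines and exact duplicates."""
    seen = set()
    for raw in lines:
        line = normalize(raw)
        if len(line) < min_chars:
            continue  # too short to be a useful training sentence
        # Hash the case-folded line so duplicates that differ only in case are dropped.
        digest = hashlib.sha1(line.casefold().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield line


if __name__ == "__main__":
    sample = [
        "O  sabiá canta na   laranjeira.",
        "o sabiá canta na laranjeira.",  # duplicate apart from case and spacing
        "Oi",                            # too short
    ]
    for sentence in clean_corpus(sample):
        print(sentence)
```

Real corpora typically also need language identification and near-duplicate detection, which this sketch leaves out.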

Open-source multilingual models can also be explored, although their compatibility with your specific language needs to be verified first.
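
One rough way to check how well a model fits a language is to measure how many tokens its tokenizer produces per word on sample text; a high ratio suggests the vocabulary fragments the language heavily. The sketch below assumes you can pass in the model's tokenizer as a plain callable that maps a string to a list of tokens. The function `tokens_per_word` and the toy `char_tokenize` stand-in are hypothetical helpers for illustration, and splitting on whitespace is itself an assumption that does not suit languages written without spaces.

```python
from typing import Callable, Iterable, List


def tokens_per_word(
    sentences: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> float:
    """Average number of tokenizer tokens per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for sentence in sentences:
        total_tokens += len(tokenize(sentence))
        total_words += len(sentence.split())
    if total_words == 0:
        raise ValueError("no words in the sample")
    return total_tokens / total_words


def char_tokenize(text: str) -> List[str]:
    """Toy stand-in tokenizer: one token per non-space character (a worst case)."""
    return [ch for ch in text if not ch.isspace()]


if __name__ == "__main__":
    sample = ["ab cd"]
    print(tokens_per_word(sample, char_tokenize))  # character-level tokenizer
    print(tokens_per_word(sample, str.split))      # word-level tokenizer
```

Comparing the ratio for your language against the same tokenizer's ratio on English text gives a quick, if crude, signal before you invest in fine-tuning.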

It's encouraging to see the demand for language models in low-resource languages. Researchers are actively working on finding solutions and improving LLMs for languages with fewer resources. Collaboration and efforts in this area will continue to be relevant and impactful.



© 2023 ainews.nbshare.io. All rights reserved.