Re-Pre-Training Language Models for Low-Resource Languages

Language models are initially pre-trained on a huge corpus of mostly unfiltered text in the target languages, then turned into ChatLLMs by fine-tuning on a prompt dataset. Pre-training is by far the most expensive part, and if existing LLMs can't form basic sentences in your language, you have to start at that stage by finding, scraping, or building a huge dataset. Before investing in re-pre-training, one can exhaustively go through every available LLM and check its language abilities. There are surprisingly many of them; here is one such list

If one needs to re-pre-train an LLM, it is better to proceed progressively: keep most of the model frozen and start by training only the parts that need to change the most. For instance, one can begin with the token embedding layer, followed by the first and last layers. Once the model has adapted to the new data, one can progressively unfreeze the rest of the model or ramp up its learning rate. Proceeding progressively reduces the likelihood of catastrophic forgetting of what the model learned during its initial training.
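
The staged freezing described above can be sketched in PyTorch. The toy model below is a stand-in for a real LLM; the module names (`embed`, `blocks`, `head`) and the choice of which parts to unfreeze first are illustrative, not any particular checkpoint's layout:

```python
import torch.nn as nn

# A toy stand-in for a decoder-only LM; the names (embed, blocks, head)
# are illustrative and do not match any real checkpoint's module names.
class TinyLM(nn.Module):
    def __init__(self, vocab=1000, dim=64, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        self.head = nn.Linear(dim, vocab)

def freeze_for_stage_one(model):
    # Stage 1: freeze everything, then re-enable gradients only for the
    # embedding, the first and last blocks, and the output head.
    for p in model.parameters():
        p.requires_grad = False
    for module in (model.embed, model.blocks[0], model.blocks[-1], model.head):
        for p in module.parameters():
            p.requires_grad = True

model = TinyLM()
freeze_for_stage_one(model)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
```

Later stages would simply flip `requires_grad` back on for the remaining blocks (or give them a smaller learning rate via separate optimizer parameter groups) once the loss on the new language has stabilized.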

It is essential to build a language-specific tokenizer, especially if the language uses a non-Latin script: LLM performance suffers when each word must be represented by multiple tokens, and that will be the case if the tokenizer was built without considering the language. It is also essential to keep in mind that better baseline LLMs may be released during the course of the project. One should focus early efforts on parts that will be transferable, such as the tokenizer and the datasets, and worry less about things that are model-specific, such as hyperparameters.
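
Training such a tokenizer is straightforward with the Hugging Face `tokenizers` library. The sketch below trains a small BPE tokenizer on an in-memory corpus; the vocabulary size, special tokens, and corpus are placeholders for your language's real data:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# A tiny in-memory "corpus"; in practice this would be an iterator
# over text files in the target language.
corpus = [
    "a few sentences in the target language",
    "enough text for the tokenizer to learn frequent subwords",
]

# Byte-pair encoding with whitespace pre-tokenization; vocab_size and
# special_tokens are illustrative and should be tuned for the language.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=500, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("sentences in the target language")
```

A quick sanity check on the result is to compare the average tokens-per-word of the new tokenizer against the base model's tokenizer on held-out text: a good language-specific tokenizer should need noticeably fewer tokens.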

Regarding using Google Translate to interface between the language and the LLM: it may not be an ideal solution, as the translations can suffer from unnatural grammar and sentence structure. Instead, one can consider fine-tuning a model to translate between the two languages if a good enough model does not already exist. For example, Bloom can be used as a base model, depending on the desired use case, as its pre-training covers 46 languages, compared to LLaMA, which was pre-trained on around 20 languages.
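
The first step of such a fine-tune is turning a parallel corpus into supervised examples. A minimal sketch, assuming a simple prompt/completion format (the template, function name, and placeholder pair are all illustrative, not a fixed convention):

```python
def make_translation_examples(pairs, src_lang, tgt_lang):
    """Turn (source, target) sentence pairs into prompt/completion
    records for supervised fine-tuning. The prompt template is
    illustrative; match whatever format the base model expects."""
    return [
        {
            "prompt": f"Translate from {src_lang} to {tgt_lang}:\n{src}\n",
            "completion": tgt,
        }
        for src, tgt in pairs
    ]

# Placeholder parallel data; real training needs thousands of pairs.
pairs = [
    ("Good morning.", "target-language translation goes here"),
]
examples = make_translation_examples(pairs, "English", "Target")
```

Records in this shape can then be fed to a standard causal-LM fine-tuning loop on the chosen base model.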

Tags: LLMs, ChatLLMs, Google Translate, Bloom, LLaMA

Similar Posts

Building Language Models for Low-Resource Languages

As the capabilities of language models continue to advance, it is conceivable that the "one-size-fits-all" model will remain the main paradigm. For instance, given the vast number of languages worldwide, many of which are low-resource, the prevalent practice is to pretrain a single model on multiple languages. In this paper, the researchers introduce Sabiá, a family of Portuguese large language models, and demonstrate that monolingual pretraining on the target language significantly improves models already extensively trained on diverse corpora. Few-shot evaluations … click here to read

Local Language Models: A User Perspective

Many users are exploring Local Language Models (LLMs) not because they outperform ChatGPT/GPT4, but to learn about the technology, understand its workings, and personalize its capabilities and features. Users have been able to run several models, learn about tokenizers and embeddings, and experiment with vector databases. They value the freedom and control over the information they seek, without ideological or ethical restrictions imposed by Big Tech. … click here to read

Reimagining Language Models with Minimalist Approach

The recent surge in interest for smaller language models is a testament to the idea that size isn't everything when it comes to intelligence. Models today are often filled with a plethora of information, but what if we minimized this to create a model that only understands and writes in a single language, yet knows little about the world? This concept is the foundation of the new wave of "tiny" language models.

A novel … click here to read

Navigating Language Models: A Practical Overview of Recommendations and Community Insights

Language models play a pivotal role in various applications, and the recent advancements in models like Falcon-7B, Mistral-7B, and Zephyr-7B are transforming the landscape of natural language processing. In this guide, we'll delve into some noteworthy models and their applications.

Model Recommendations

When it comes to specific applications, the choice of a language model can make a significant difference. Here are … click here to read

Reflections on Pretraining and Fine-Tuning in Reinforcement Learning

The world of reinforcement learning (RL) is continuously advancing, and a recent study, titled "Reflections on Pretraining and Fine-Tuning in Reinforcement Learning" (source), further emphasizes the significance of pretraining. The authors surprisingly don't discuss open sourcing the weights, raising questions about their stance on knowledge sharing.

The study suggests that constructing a high-quality dataset for instruction fine-tuning could outshine larger, but less balanced datasets. This process could be optimized through a crowdsourcing approach … click here to read

Bringing Accelerated LLM to Consumer Hardware

MLC AI, a startup that specializes in creating advanced language models, has announced its latest breakthrough: a way to bring accelerated large language model (LLM) training to consumer hardware. This development will enable more accessible and affordable training of advanced LLMs for companies and organizations, paving the way for faster and more efficient natural language processing.

The MLC team has achieved this by optimizing its training process for consumer-grade hardware, which typically lacks the computational power of high-end data center infrastructure. This optimization … click here to read

Exploration of Language Learning Models (LLMs)

For advanced Language Learning Models, consider Flan-UL2. This model requires significant VRAM but provides excellent results with <2s inference speed. It's great for zero-shot tasks and is less prone to hallucinations.

Proper formatting and instruction tuning are key to maximizing your model's performance. You may find useful information on system, user, and special character formatting for messages. Tools like Langchain or Transformer Agents can help abstract this process.

Be … click here to read

Max Context and Memory Constraints in Bigger Models

One common question that arises when discussing bigger language models is whether there is a drop-off in maximum context due to memory constraints. In this blog post, we'll explore this topic and shed some light on it.

Bigger models, such as GPT-3.5, have been developed to handle a vast amount of information and generate coherent and contextually relevant responses. However, the size of these models does not necessarily dictate the maximum context they can handle.

The memory constraints … click here to read

© 2023 All rights reserved.