Re-Pre-Training Language Models for Low-Resource Languages
Language models are first pre-trained on a huge corpus of mostly unfiltered text covering the target languages, and are then turned into chat LLMs by fine-tuning on a prompt/instruction dataset. Pre-training is by far the most expensive part, and if no existing LLM can produce basic sentences in your language, you have to start there by finding, scraping, or building a large dataset. Before investing in re-pre-training, it is worth exhaustively going through the available LLMs and checking their abilities in your language - there are surprisingly many of them, and existing lists of open models are a good place to start.
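As a rough sketch of that check, one can measure how well a few candidate models predict a handful of natural sentences in the target language. The model ids and sample sentences below are placeholders, and perplexity on a few sentences is only a first-pass sanity check, not a full evaluation.

    # Compare candidate open models on the target language by measuring
    # perplexity over a small set of native sentences.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    CANDIDATES = ["bigscience/bloom-560m", "facebook/xglm-564M"]  # placeholder model ids
    SAMPLE_SENTENCES = [
        "Replace these with natural sentences in the target language.",
        "A few dozen sentences is enough for a first sanity check.",
    ]

    def mean_perplexity(model_id, sentences):
        tok = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(model_id)
        model.eval()
        losses = []
        with torch.no_grad():
            for s in sentences:
                ids = tok(s, return_tensors="pt").input_ids
                # Language-model loss with the input as its own target
                losses.append(model(ids, labels=ids).loss.item())
        return float(torch.exp(torch.tensor(losses).mean()))

    for model_id in CANDIDATES:
        print(model_id, mean_perplexity(model_id, SAMPLE_SENTENCES))

A model whose perplexity is wildly higher than the others on the same sentences is a strong hint that the language is effectively absent from its pre-training data.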
If one does need to re-pre-train an LLM, it's better to proceed progressively: keep most of the model frozen and start by training only the parts that need to change the most. For instance, start with the token embedding layer, followed by the first and last transformer layers. Once the model has adapted to the new data, progressively unfreeze the rest of it, or ramp up its learning rate. Doing this gradually reduces the risk that the model catastrophically forgets what it learned during its initial training.
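A minimal sketch of that staged unfreezing, assuming a decoder-only model loaded with transformers; the attribute paths (model.model.embed_tokens, model.model.layers, model.lm_head) follow LLaMA-style checkpoints, and other architectures name these modules differently:

    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder id

    # Stage 1: freeze everything, then re-enable only the parts that must adapt first.
    for p in model.parameters():
        p.requires_grad = False

    for module in (model.model.embed_tokens,   # token embeddings (new vocabulary)
                   model.model.layers[0],      # first transformer block
                   model.model.layers[-1],     # last transformer block
                   model.lm_head):             # output projection
        for p in module.parameters():
            p.requires_grad = True

    # Stage 2, later in training: unfreeze the remaining blocks, ideally with a
    # lower learning rate than the freshly trained embeddings.
    def unfreeze_all(model):
        for p in model.parameters():
            p.requires_grad = True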
It is essential to build a language-specific tokenizer, especially if the language uses a non-Latin script: LLM performance suffers when each word has to be split across many tokens, which is exactly what happens when the tokenizer was built without the language in mind. It is also worth keeping in mind that better baseline LLMs may be released during the course of the project. Early effort should therefore go into the parts that transfer, such as the tokenizer and the datasets, rather than into model-specific details such as hyperparameters.
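One way to check how much a new tokenizer helps is to train one on the target-language corpus and compare its "fertility" (tokens per whitespace-separated word) against the base model's tokenizer. This is a sketch assuming a byte-level BPE vocabulary; the file path, vocabulary size, and base tokenizer id are placeholders.

    from tokenizers import Tokenizer, models, trainers, pre_tokenizers
    from transformers import AutoTokenizer

    # Train a byte-level BPE tokenizer from scratch on raw target-language text.
    new_tok = Tokenizer(models.BPE())
    new_tok.pre_tokenizer = pre_tokenizers.ByteLevel()
    trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=["<s>", "</s>"])
    new_tok.train(files=["corpus_target_language.txt"], trainer=trainer)

    def fertility(encode, texts):
        # Average number of tokens per whitespace-separated word.
        tokens = sum(len(encode(t)) for t in texts)
        words = sum(len(t.split()) for t in texts)
        return tokens / words

    sample = open("corpus_target_language.txt", encoding="utf-8").read().splitlines()[:1000]
    old_tok = AutoTokenizer.from_pretrained("your-base-model")  # placeholder id
    print("base tokenizer fertility:", fertility(old_tok.encode, sample))
    print("new tokenizer fertility:", fertility(lambda t: new_tok.encode(t).ids, sample))

A fertility of well above 2-3 tokens per word with the base tokenizer is a strong sign that resizing the embedding matrix to a language-specific vocabulary is worth the extra work.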
Regarding using Google Translate to interface between the language and the LLM: it may not be an ideal solution, since the translations can suffer from unnatural grammar and sentence structure. Instead, consider fine-tuning a model to translate between the two languages if a good enough one doesn't already exist. Depending on the desired use case, BLOOM can serve as a base model, since its pre-training covers 46 natural languages, whereas LLaMA was pre-trained on around 20 languages.
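If one goes the fine-tuning route with a causal LM such as a small BLOOM checkpoint, the main preparation step is turning a parallel corpus into single-string training examples. The prompt template and field names below are illustrative assumptions, not a fixed format.

    from datasets import Dataset

    pairs = [
        {"src": "Example sentence in the source language.",
         "tgt": "Its translation in the target language."},
        # ... more aligned sentence pairs from the parallel corpus
    ]

    def to_training_text(example):
        # Single-string prompt/target format suitable for causal-LM fine-tuning.
        return {"text": f"Translate to TARGET_LANGUAGE:\n{example['src']}\n---\n{example['tgt']}"}

    dataset = Dataset.from_list(pairs).map(to_training_text)
    print(dataset[0]["text"])

At inference time, the same prompt up to the separator is given to the fine-tuned model and the continuation is taken as the translation.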