UltraLM-13B on the Leaderboard
UltraLM-13B has now been tested on this open leaderboard. Click here to view the leaderboard. It ranks 25th among the 13B models there. If that is an accurate assessment, does its high AlpacaEval score point to a problem with UltraLM's dataset, or is it an example of how unreliable AlpacaEval is, along with the whole idea of using LLMs to judge other LLMs? Edit: It does quite badly on this leaderboard too. Here is the leaderboard.
Just have a look at the training dataset. If all of it was used during training, the AlpacaEval result could be believable. That's 8 GB of data! Here is the dataset.
UltraChat contains 1.5 million high-quality multi-turn dialogues and covers a wide range of topics and instructions.
The paper argues that the most straightforward lever, namely the quality and diversity of the data used in training, plays a vital role in further improving the performance of chat language models. The paper: Click here to read the paper.
It's worth noting that some of the baselines seem underwhelming. Benchmarks don't tell the whole story, and the data generation strategy could be flawed. However, given the scale of the ChatGPT-generated data, it's hard not to gain an edge over regular fine-tunes, even well-engineered ones like Wizard. (BTW, people who are skeptical about it because it's Chinese should realize that all of the Wizard authors are Chinese as well; they just work in Hong Kong and at Microsoft.)
Always remember the rule of thumb: if there's no change in architecture, training, or fine-tuning methodology, then a big jump is most likely the result of gaming the dataset to score better on the leaderboard. Additional information is available on their GitHub. It appears that the model was developed at Tsinghua University in China. Judgment should be withheld until transparent, reproducible benchmarks from other reputable sources become available.
It's remarkable that a 13B model can be so performant. Perhaps it's fine-tuned specifically to pass those tests. Some models may prioritize "passing tests" rather than helping humans become better. WizardLM is preferred as it feels the smartest among all tested models, so there's some sense to that leaderboard.