Discussion on Parallel Transformer Layers and Model Performance
The recent discussion raises important concerns about missing citations to key papers, particularly regarding the parallel structure in Transformer layers. It is worth noting that this concept was first proposed in "MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning" (see Formula 2 of that paper). Further, the idea of merging the linear layers of the MLP and self-attention to improve time efficiency is discussed in its Section 3.5.
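To make the parallel structure concrete, here is a minimal NumPy sketch contrasting the standard sequential Transformer block with the parallel formulation. The `attn` and `mlp` callables, and the toy weight matrices in the usage example, are stand-ins invented for illustration, not part of any referenced implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension (no learned scale/bias, for brevity).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sequential_block(x, attn, mlp):
    # Standard Transformer: attention sublayer, then MLP sublayer,
    # each with its own residual connection and pre-norm.
    x = x + attn(layer_norm(x))
    return x + mlp(layer_norm(x))

def parallel_block(x, attn, mlp):
    # Parallel formulation: both sublayers read the SAME normalized input,
    # so their input projections can be fused into one matrix multiply.
    h = layer_norm(x)
    return x + attn(h) + mlp(h)

# Toy usage with stand-in sublayers (plain linear maps, hypothetical weights).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # (sequence length, model dim)
W_attn = rng.normal(size=(8, 8))
W_mlp = rng.normal(size=(8, 8))
attn = lambda h: h @ W_attn
mlp = lambda h: np.maximum(h @ W_mlp, 0.0)

out_parallel = parallel_block(x, attn, mlp)
out_sequential = sequential_block(x, attn, mlp)
```

The two variants are not numerically equivalent: in the sequential block, the MLP sees the attention output, while in the parallel block both sublayers see the same input, which is what enables fusing their input-side linear layers.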
One point in the discussion is the performance of models smaller than 1B parameters. While there is an argument that sub-1B models may struggle to generate coherent output, this is not always the case. The 7B models appear to perform well, while the performance of 3B models has yet to be explored thoroughly.
Another interesting question is how this model compares with MPT-7B. MPT-7B may appear to be the stronger architecture at first glance, especially given that it was pre-trained on 1 trillion tokens, but an empirical head-to-head comparison would be needed to draw an informed conclusion.
Comparing performance against the results reported in the original PaLM paper is another crucial point, and the community deserves a shoutout for spending resources on this. A related question is whether there is a pull request to Hugging Face to implement this model and PaLM RLHF with the HF API; it would indeed be great to have such a feature.
Finally, a big thank you for the detailed post covering the data, training procedure, and final model weights. The community thrives on this kind of active participation and insight sharing.