What has changed in Transformer architecture?

There have been surprisingly few changes to the original Transformer architecture. Different architectures are better at different tasks, and the training objective can also vary. One notable revision: the original "Attention Is All You Need" paper placed the layer norms after each sublayer (post-LN), whereas putting them before the sublayers (pre-LN) is now standard because it trains more stably. Running the attention and MLP sublayers in parallel makes the model noticeably faster at scale without much effect on quality. The original sinusoidal positional embeddings have fallen out of favor, and rotary positional embeddings (RoPE) are currently the mainstream choice. Application paradigms and fine-tuning have changed as well: domain-specific fine-tuning, few-shot prompting, multitask fine-tuning, and reinforcement learning from human feedback. Still, the original paper got most of it right, and the architecture itself has not changed much.
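To make the block-ordering differences concrete, here is a minimal sketch in plain Python. The `attn` and `mlp` arguments are stand-in functions, not trained layers, and `rope_rotate` is a simplified version of the rotary-embedding rotation, so treat this as an illustration of the wiring rather than a real implementation:

```python
import math

def layer_norm(x):
    """Normalize a vector to zero mean and (near) unit variance."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return [(v - mean) / math.sqrt(var + 1e-5) for v in x]

def post_ln_block(x, attn, mlp):
    # Original "Attention Is All You Need" ordering: norm AFTER each residual add.
    x = layer_norm([a + b for a, b in zip(x, attn(x))])
    return layer_norm([a + b for a, b in zip(x, mlp(x))])

def pre_ln_block(x, attn, mlp):
    # Pre-LN ordering: norm BEFORE each sublayer; the residual path stays clean,
    # which is why it trains more stably in deep stacks.
    x = [a + b for a, b in zip(x, attn(layer_norm(x)))]
    return [a + b for a, b in zip(x, mlp(layer_norm(x)))]

def parallel_block(x, attn, mlp):
    # Parallel formulation: attention and MLP read the same normalized input
    # and are added to the residual in a single step.
    h = layer_norm(x)
    return [a + b + c for a, b, c in zip(x, attn(h), mlp(h))]

def rope_rotate(x, pos, theta=10000.0):
    # Simplified rotary embedding: rotate consecutive feature pairs by a
    # position-dependent angle (x must have even length).
    out = []
    for i in range(0, len(x), 2):
        angle = pos * theta ** (-i / len(x))
        c, s = math.cos(angle), math.sin(angle)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out
```

Note that in the pre-LN and parallel blocks the input passes through unchanged when the sublayers output zero, while post-LN normalizes the residual stream itself; rotary embeddings, being pure rotations, preserve the norm of each feature pair.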

SE(3)-Transformers capture 3D structure without losing geometric information while being equivariant to rotations and translations and invariant to subunit permutation/ordering. They are conceptually beautiful, and they are built from the ground up for three-dimensional data. A paper published by DeepMind is quite helpful for understanding them, and the SE(3)-Transformer paper itself gives a thorough account of how they work.
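A toy illustration of the underlying idea (this is not the SE(3)-Transformer itself, just the geometric intuition): pairwise distances between 3D points are unchanged by a rigid rotation, so features built from relative geometry ignore the global frame while still describing the full structure.

```python
import math

def rotate_z(p, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def pairwise_distances(points):
    """All pairwise Euclidean distances, in a fixed order."""
    return [math.dist(p, q)
            for i, p in enumerate(points)
            for q in points[i + 1:]]

points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 1.0)]
rotated = [rotate_z(p, 1.234) for p in points]
# pairwise_distances(points) and pairwise_distances(rotated) agree
# to floating-point precision.
```

The SE(3)-Transformer goes further than this invariant sketch: it keeps directional (vector and higher-order) features that rotate along with the input, which is what equivariance means.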

Hugging Face's NLP Course and Jay Alammar's blog are both great resources for learning about Transformers. The Hugging Face course is concise yet covers most of the popular topics; Alammar's blog treats the same material more visually, and his other posts are worth reading too.

The training data seems to have a much larger effect on final performance than minor tweaks in architecture. LLaMA, for example, performs strongly because it was pretrained on far more data than comparably sized models. Instruction tuning is also a clear win, as models like Flan-T5 and Alpaca show.

Tags: Transformers, NLP, SE3-Transformers, Rotary Positional Embeddings, Fine-tuning, Training Data

Similar Posts

Discussion on Parallel Transformer Layers and Model Performance

The recent discussion raises important concerns about missing citations to key papers, particularly for the parallel structure in Transformer layers. This concept was first proposed in "MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning" (see its Formula 2), and the idea of merging the linear layers of the MLP and self-attention to improve time efficiency is discussed in its Section 3.5.
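A hypothetical sketch of that merging trick (assumed shapes and naive matmuls, not the MUSE implementation): because matrix multiplication distributes over column-wise concatenation, `X @ [W1 | W2 | W3] = [X@W1 | X@W2 | X@W3]`, so the Q/K/V projections and the MLP's first linear layer can share a single, larger matrix multiply.

```python
import random

def matmul(A, B):
    """Naive matrix multiply: A is m x k, B is k x n."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def concat_cols(*mats):
    """Stack weight matrices side by side along the output dimension."""
    return [sum(rows, []) for rows in zip(*mats)]

random.seed(0)
d = 4
X = [[random.random() for _ in range(d)] for _ in range(2)]        # 2 tokens
Wq, Wk, Wv = ([[random.random() for _ in range(d)] for _ in range(d)]
              for _ in range(3))
W_mlp = [[random.random() for _ in range(2 * d)] for _ in range(d)]

# One fused matmul instead of four separate ones.
fused = matmul(X, concat_cols(Wq, Wk, Wv, W_mlp))
q = [row[:d] for row in fused]          # slice the Q projection back out
```

Slicing the fused output recovers exactly what the separate matmuls would produce; the win is better hardware utilization from launching one larger multiply instead of several small ones.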

One of the points in the discussion is the … click here to read

Exploring the Potential: Diverse Applications of Transformer Models

Users have been employing transformer models for various purposes, from building interactive games to generating content. Here are some insights:

  • OpenAI's GPT is being used as a game master in an infinite adventure game, generating coherent scenarios based on user-provided keywords. This application demonstrates the model's ability to synthesize a vast range of pop culture knowledge into engaging narratives.
  • A Q&A bot is being developed for the Army, employing a combination of … click here to read

Engaging with AI: Harnessing the Power of GPT-4

As Artificial Intelligence (AI) becomes increasingly sophisticated, it’s fascinating to explore the potential that cutting-edge models such as GPT-4 offer. This version of OpenAI's Generative Pretrained Transformer surpasses its predecessor, GPT-3.5, in addressing complex problems and providing well-articulated solutions.

Consider a scenario where multiple experts - each possessing unique skills and insights - collaborate to solve a problem. Now imagine that these "experts" are facets of the same AI, working synchronously to tackle a hypothetical … click here to read

DeepFloyd IF: The Future of Text-to-Image Synthesis and Upcoming Release

DeepFloyd IF, a state-of-the-art open-source text-to-image model, has been gaining attention due to its photorealism and language understanding capabilities. The model is a modular composition of a frozen text encoder and three cascaded pixel diffusion modules, generating images in 64x64 px, 256x256 px, and 1024x1024 px resolutions. It utilizes a T5 transformer-based frozen text encoder to extract text embeddings, which are then fed into a UNet architecture enhanced with cross-attention and attention pooling. DeepFloyd IF has achieved a zero-shot FID … click here to read

Bringing Accelerated LLM to Consumer Hardware

MLC AI, a startup that specializes in creating advanced language models, has announced its latest breakthrough: a way to bring accelerated large language model (LLM) training to consumer hardware. This development will enable more accessible and affordable training of advanced LLMs for companies and organizations, paving the way for faster and more efficient natural language processing.

The MLC team has achieved this by optimizing its training process for consumer-grade hardware, which typically lacks the computational power of high-end data center infrastructure. This optimization … click here to read

© 2023 ainews.nbshare.io. All rights reserved.