AI Transformers and the Self-Attention Mechanism


By Adrian Zidaritz

Original: 5/20/21
Revised: 11/26/22

   Transformers have been the most interesting new development in neural networks, especially in language applications, where neural networks did not initially enjoy the same success as in computer vision. Moreover, large models built around language transformers have shown the ability to "understand" other types of data, not just text. A transformer is an ingenious neural network architecture based on the concept of attention. The architecture dovetails nicely with that most prized computer science concept, parallel processing, allowing transformers to be trained in reasonable time on massive datasets. This short note will eventually become a more detailed introduction to transformers; until then, this is a good introduction:



This note on transformers is taking much longer than I anticipated, but good presentations keep popping up online. So until I get my act together, here is one from Google Brain Zurich; it shows how transformers subsume many other neural network architectures, such as CNNs and LSTMs, and even parts of reinforcement learning.
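
In the meantime, to make the idea of attention a bit more concrete, here is a minimal sketch of single-head scaled dot-product self-attention, the basic building block of the transformer. The function and variable names, shapes, and the toy usage at the bottom are my own illustrative choices, not taken from any particular library or paper implementation:

```python
# A minimal sketch of single-head scaled dot-product self-attention.
# All names and shapes are illustrative, not a production implementation.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Attend over a sequence of token vectors X.

    X          : (seq_len, d_model) input embeddings
    Wq, Wk, Wv : (d_model, d_k) learned projection matrices
    Returns    : (seq_len, d_k) attended representations
    """
    Q = X @ Wq                       # queries
    K = X @ Wk                       # keys
    V = X @ Wv                       # values
    d_k = Q.shape[-1]
    # Similarity of every token to every other token, in one matrix multiply
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len)
    # Softmax over each row turns raw scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output vector is a weighted mix of all the value vectors
    return weights @ V

# Toy usage: 4 tokens, model dimension 8, head dimension 4
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 4)
```

Notice that every token attends to every other token through a single pair of matrix multiplications, which is exactly why the architecture meshes so well with parallel hardware and can be trained on massive datasets in reasonable time.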