The revolutionary neural network design that changed AI forever
In 2017, a team of Google researchers published a paper with the deceptively simple title "Attention Is All You Need." The paper introduced the Transformer architecture, which would go on to become the foundation of modern AI. Within five years, Transformer-based models powered search engines, translation systems, code autocompletion tools, and conversational AI assistants used by billions of people.
Understanding the Transformer architecture is essential for anyone working with modern AI systems. This article breaks down every component of the architecture, from the fundamental attention mechanism to the engineering decisions that make Transformers practical at scale.
The original "Attention Is All You Need" paper (Vaswani et al., 2017) was published at NIPS 2017 (the conference since renamed NeurIPS). The authors—Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin—proposed doing away with recurrent neural networks (RNNs) in favor of an architecture built entirely on attention mechanisms.
The motivation was clear: RNNs processed sequences token by token, making parallelization difficult and struggling with long-range dependencies. If you wanted to understand the relationship between the first and last words of a long sentence, an RNN had to maintain that information across dozens of processing steps, often suffering from vanishing or exploding gradients.
Transformers solved both problems. By processing entire sequences in parallel and allowing direct connections between any two positions, Transformers could both parallelize training efficiently and capture long-range dependencies more reliably.
A Transformer consists of an encoder and a decoder, though many modern LLMs use only the decoder portion (these are called "decoder-only" or "autoregressive" Transformers). The encoder processes the input sequence and creates representations; the decoder generates output one token at a time using those representations.
Input Sequence → [Encoder Layers] → Context Representations → [Decoder Layers] → Output Sequence
"The cat" → [6-12 layers] → [vectors for each token] → [6-12 layers] → "sat on mat"
Each encoder layer contains two main subcomponents: Multi-Head Self-Attention and a Feed-Forward Network. Each decoder layer adds a third component: Cross-Attention that connects to the encoder's output. Between these components are residual connections and layer normalization.
Self-attention is the mechanism that allows Transformers to relate positions in a sequence to each other. For each position, the model computes a weighted sum of values from all positions, where weights are determined by the similarity between queries and keys.
For each input token, the model creates three vectors: Query (Q), Key (K), and Value (V). These are created by multiplying the input embedding by three learned weight matrices: W_Q, W_K, and W_V.
Q = X · W_Q (what am I looking for?)
K = X · W_K (what do I contain?)
V = X · W_V (what information should I pass?)
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
The scaling factor √d_k (where d_k is the key dimension, typically 64) prevents the dot products from growing too large. Without this scaling, the softmax would saturate in regions of large dot products, producing extremely peaked distributions and poor gradients.
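The formula above can be made concrete with a minimal NumPy sketch (the random inputs and dimensions here are illustrative, not from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (seq_len, seq_len) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # weighted sum of values

# Toy example: 3 tokens, d_k = 4
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))                   # token embeddings
W_Q, W_K, W_V = (rng.standard_normal((4, 4)) for _ in range(3))
out = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)  # (3, 4)
```

Each output row is a mixture of all value vectors, with mixing weights given by that token's query against every token's key.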
The query-key-value separation might seem unnecessarily complex—why not just use the original embeddings directly? The insight is that learned linear projections allow the model to attend to different aspects of the input for different purposes.
In a translation task, for instance, one attention head might learn to align English words with their French equivalents, while another might attend to syntactic structure. By projecting into different Q, K, V spaces, the model can simultaneously track multiple types of relationships.
Instead of performing a single attention operation, the Transformer runs several in parallel. Each "head" has its own Q, K, V projections, allowing different attention patterns. The outputs of all heads are concatenated and projected again.
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W_O
where head_i = Attention(QW_Q^i, KW_K^i, VW_V^i)
GPT-3 uses 96 attention heads per layer, each with dimension 128 (for a total attention dimension of 12,288, matching the model dimension). Conceptually, each layer has 96 × 3 = 288 per-head projection matrices for Q, K, and V—though in practice these are fused into three large weight matrices.
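The split-attend-concatenate pattern can be sketched in NumPy; the tiny dimensions here (d_model = 16, 4 heads) are illustrative:

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    """Project, split into heads, attend per head, concatenate, project."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # Reshape so each head gets its own slice of the projected vectors
    Q = (X @ W_Q).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ W_K).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ W_V).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    heads = w @ V                                          # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O                                    # final output projection

rng = np.random.default_rng(1)
d_model, n_heads = 16, 4
X = rng.standard_normal((5, d_model))
W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads)
print(out.shape)  # (5, 16)
```

Note that the heads share the three large projection matrices; each head simply operates on its own d_head-wide slice of the result, which is why adding heads does not multiply the parameter count.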
Attention is inherently permutation-invariant—it processes all positions simultaneously without regard to their order. To inject positional information, Transformers add positional encodings to the input embeddings.
The original paper used sinusoidal encodings based on sine and cosine functions of different frequencies:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
This choice has a useful property: the encoding for any position can be represented as a linear function of encodings for earlier positions, making it easy for the model to learn relationships between positions.
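The two formulas translate directly into a short NumPy function:

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angle = pos / (10000 ** (2 * i / d_model))   # one frequency per pair of dims
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                  # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                  # odd dimensions: cosine
    return pe

pe = sinusoidal_encoding(128, 64)
print(pe.shape)     # (128, 64)
print(pe[0, :4])    # position 0: sin(0)=0, cos(0)=1 -> [0. 1. 0. 1.]
```

Low dimensions oscillate quickly with position while high dimensions oscillate slowly, giving each position a unique multi-frequency fingerprint.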
Modern models often use Rotary Position Embeddings (RoPE), introduced in the RoFormer paper (Su et al., 2021) and popularized by models such as LLaMA. RoPE encodes position information directly into the Q and K vectors through rotation, and has become the dominant approach in part because it tends to extrapolate to longer contexts better than sinusoidal encoding.
Each Transformer layer contains a feed-forward network (FFN) applied to each position independently. In the original paper this is a two-layer MLP with a ReLU activation (modern models typically substitute GELU or a gated variant like SwiGLU):
FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
In GPT-style models, the FFN is often the larger portion of both the computation and the parameter count: in a typical 7B-parameter model, the feed-forward layers hold roughly twice as many weights as the attention projections, so the majority of the model's capacity resides in these layers.
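A minimal sketch of the FFN, using the original paper's d_model = 512 and d_ff = 2048 (the customary 4× expansion); the random weights are illustrative:

```python
import numpy as np

def ffn_relu(x, W1, b1, W2, b2):
    """Original Transformer FFN: expand, apply ReLU, project back down."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048                      # 4x expansion, as in the 2017 paper
rng = np.random.default_rng(2)
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)

x = rng.standard_normal((10, d_model))         # 10 positions, processed independently
y = ffn_relu(x, W1, b1, W2, b2)
print(y.shape)  # (10, 512)
```

Note the parameter count: the two weight matrices contribute 2 × d_model × d_ff ≈ 2.1M parameters at this scale, which is why the FFN dominates the budget when d_ff is several times d_model.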
Around each subcomponent, Transformers employ residual connections (skip connections) and layer normalization. These stabilize training by ensuring that the gradient flow remains strong even through deep networks.
x = LayerNorm(x + Sublayer(x))
The residual connection allows gradients to flow directly through the network without passing through the sublayer, enabling training of much deeper models. Modern Transformer variants use pre-norm (normalizing before the sublayer rather than after), which improves training stability for very deep models.
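The difference between the two orderings is easiest to see side by side; here is a minimal sketch with an unparameterized layer norm and a stand-in sublayer:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's features to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    """Original Transformer: x = LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    """Modern variant: x = x + Sublayer(LayerNorm(x))."""
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(3)
x = rng.standard_normal((4, 8))
mlp = lambda h: np.tanh(h)      # stand-in for attention or FFN
a = post_norm_block(x, mlp)
b = pre_norm_block(x, mlp)
print(a.shape, b.shape)         # (4, 8) (4, 8)
```

In the pre-norm form, the residual path from input to output passes through no normalization at all, which is precisely why gradients flow more directly in very deep stacks.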
The original Transformer had separate encoder and decoder stacks. Understanding their differences illuminates modern model architectures.
| Model | Architecture | Layers | Parameters | Context Window |
|---|---|---|---|---|
| BERT-Base | Encoder-only | 12 | 110M | 512 |
| GPT-2 | Decoder-only | 48 | 1.5B | 1024 |
| GPT-3 | Decoder-only | 96 | 175B | 2048 |
| Llama 3 70B | Decoder-only + RoPE | 80 | 70B | 8192 |
Standard attention has O(n²) memory and compute complexity with sequence length n—doubling the context length quadruples the attention computation. This makes very long contexts expensive. Flash Attention (Dao et al., 2022) dramatically reduces memory usage by computing attention in tiles that fit in GPU SRAM, achieving 2-4x speedup and enabling longer context training.
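The tiling idea can be demonstrated in plain NumPy with an "online softmax": process the keys and values one block at a time, keeping running statistics, so the full n × n score matrix is never materialized. This is a sketch of the core algorithm only—the real speedups in Flash Attention come from GPU-specific memory management not shown here:

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Attention via online softmax over key/value tiles (Flash-Attention-style)."""
    n, d = Q.shape
    out = np.zeros((n, d))
    m = np.full((n, 1), -np.inf)   # running row-max of scores
    l = np.zeros((n, 1))           # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start+block], V[start:start+block]
        s = Q @ Kb.T / np.sqrt(d)                        # scores for this tile only
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        scale = np.exp(m - m_new)                        # rescale old accumulators
        p = np.exp(s - m_new)
        l = l * scale + p.sum(axis=-1, keepdims=True)
        out = out * scale + p @ Vb
        m = m_new
    return out / l

# Verify against the naive implementation that builds the full score matrix
rng = np.random.default_rng(4)
Q, K, V = (rng.standard_normal((200, 32)) for _ in range(3))
s = Q @ K.T / np.sqrt(32)
w = np.exp(s - s.max(axis=-1, keepdims=True))
ref = (w / w.sum(axis=-1, keepdims=True)) @ V
print(np.allclose(tiled_attention(Q, K, V), ref))  # True
```

The rescaling trick (multiplying old accumulators by exp(m − m_new)) is what lets each tile be folded into an exact softmax without a second pass over the scores.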
During autoregressive generation, the key and value vectors for all previous tokens would otherwise have to be recomputed for each new token. Modern inference systems instead cache these K/V vectors, significantly speeding up generation. This "KV cache" grows linearly with sequence length and is a major factor in memory requirements for serving.
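A stripped-down generation loop shows the caching pattern (random vectors stand in for token embeddings, and a single attention layer stands in for a full model):

```python
import numpy as np

def attend(q, K, V):
    """Single-query attention over all cached keys/values."""
    s = q @ K.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(5)
d = 16
W_Q, W_K, W_V = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

K_cache = np.empty((0, d))   # grows by one row per generated token
V_cache = np.empty((0, d))
for step in range(8):
    x = rng.standard_normal(d)           # stand-in for the new token's embedding
    # Only the NEW token's K and V are computed; earlier rows come from the cache
    K_cache = np.vstack([K_cache, x @ W_K])
    V_cache = np.vstack([V_cache, x @ W_V])
    out = attend(x @ W_Q, K_cache, V_cache)

print(K_cache.shape)  # (8, 16) -- one cached K row per token seen so far
```

Per layer, the cache holds one K row and one V row per token, which is why serving memory scales with both model depth and conversation length.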
The original Transformer has spawned numerous variants addressing its limitations—sparse and linear attention for long sequences, mixture-of-experts layers for scaling capacity, and grouped-query attention for cheaper inference, among others.
The Transformer architecture's success stems from its elegant combination of mechanisms: attention for relating positions, multi-head attention for capturing diverse relationships, positional encoding for injecting order, and feed-forward networks for computation. These components work together to create models that can process sequential data with unprecedented parallelism and scale.
Understanding these fundamentals helps practitioners make informed decisions about model selection, optimization, and debugging. As the architecture continues to evolve—with new attention variants, mixture-of-experts approaches, and improved positional encodings—the core principles remain essential knowledge for anyone working in modern AI.