Transformer Architecture Explained

The revolutionary neural network design that changed AI forever

Published: January 2026 | Reading Time: 14 minutes | Category: AI & Machine Learning

In 2017, a team of Google researchers published a paper with the deceptively simple title "Attention Is All You Need." The paper introduced the Transformer architecture, which would go on to become the foundation of modern AI. Within five years, Transformer-based models powered search engines, translation systems, code autocompletion tools, and conversational AI assistants used by billions of people.

Understanding the Transformer architecture is essential for anyone working with modern AI systems. This article breaks down every component of the architecture, from the fundamental attention mechanism to the engineering decisions that make Transformers practical at scale.

The Paper That Changed Everything

The original "Attention Is All You Need" paper (Vaswani et al., 2017) was published at the NeurIPS conference. The authors—Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin—proposed replacing recurrent neural networks (RNNs) entirely with an architecture based entirely on attention mechanisms.

The motivation was clear: RNNs processed sequences token by token, making parallelization difficult and struggling with long-range dependencies. If you wanted to understand the relationship between the first and last words of a long sentence, an RNN had to maintain that information across dozens of processing steps, often suffering from vanishing or exploding gradients.

Transformers solved both problems. By processing entire sequences in parallel and allowing direct connections between any two positions, Transformers could both parallelize training efficiently and capture long-range dependencies more reliably.

The Architecture Overview

A Transformer consists of an encoder and a decoder, though many modern LLMs use only the decoder portion (these are called "decoder-only" or "autoregressive" Transformers). The encoder processes the input sequence and creates representations; the decoder generates output one token at a time using those representations.

Input Sequence → [Encoder Layers] → Context Representations → [Decoder Layers] → Output Sequence
   "The cat"    → [6-12 layers]    → [vectors for each token] → [6-12 layers]   → "sat on mat"
    

Each encoder layer contains two main subcomponents: Multi-Head Self-Attention and a Feed-Forward Network. Each decoder layer adds a third component: Cross-Attention that connects to the encoder's output. Between these components are residual connections and layer normalization.

Self-Attention: The Core Innovation

Self-attention is the mechanism that allows Transformers to relate positions in a sequence to each other. For each position, the model computes a weighted sum of values from all positions, where weights are determined by the similarity between queries and keys.

The Q, K, V Computation

For each input token, the model creates three vectors: Query (Q), Key (K), and Value (V). These are created by multiplying the input embedding by three learned weight matrices: W_Q, W_K, and W_V.

Q = X · W_Q   (what am I looking for?)
K = X · W_K   (what do I contain?)
V = X · W_V   (what information should I pass?)

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
    

The scaling factor √d_k (where d_k is the key dimension, typically 64) prevents the dot products from growing too large. Without this scaling, the softmax would saturate in regions of large dot products, producing extremely peaked distributions and poor gradients.
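To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention for a single head. The dimensions and variable names are illustrative, not taken from any particular model:

import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) similarity matrix
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors

# Toy example: 4 tokens, d_k = d_v = 8. X stands in for the token
# embeddings; W_q, W_k, W_v play the role of the learned projections.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)  # (4, 8)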

Why Queries, Keys, and Values?

The query-key-value separation might seem unnecessarily complex—why not just use the original embeddings directly? The insight is that learned linear projections allow the model to attend to different aspects of the input for different purposes.

In a translation task, for instance, one attention head might learn to align English words with their French equivalents, while another might attend to syntactic structure. By projecting into different Q, K, V spaces, the model can simultaneously track multiple types of relationships.

Multi-Head Attention

Instead of performing a single attention operation, the Transformer runs several in parallel. Each "head" has its own Q, K, V projections, allowing different attention patterns. The outputs of all heads are concatenated and projected again.

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W_O

where head_i = Attention(QW_Q^i, KW_K^i, VW_V^i)
    

GPT-3 uses 96 attention heads of dimension 128 each, for a total attention dimension of 12,288 (equal to the model width). Since each head has its own Q, K, V projections, every layer holds 96 × 3 = 288 per-head projection matrices, typically stored as three large fused weight matrices.

Hyperparameter Choices: The original Transformer used 8 heads with dimension 64 each (512 total). Modern models vary widely: Llama 2 uses 32 heads of 128 dimensions; GPT-4 reportedly uses larger dimensions with mixture-of-experts layers. The key insight is that the total attention dimension typically matches the model width, so head count and head dimension trade off against each other.
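The following NumPy sketch shows the standard split-heads implementation, using the original paper's configuration (8 heads of dimension 64, d_model = 512). Storing the per-head projections as fused d_model × d_model matrices, as here, is the usual practice; the function names are illustrative:

import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Split d_model into n_heads, attend per head, concatenate, project."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # Project, then reshape to (n_heads, seq_len, d_head).
    def split(W):
        return (X @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(W_q), split(W_k), split(W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    scores -= scores.max(axis=-1, keepdims=True)          # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    heads = weights @ V                                   # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                   # final output projection

rng = np.random.default_rng(0)
d_model, n_heads = 512, 8          # the original paper's configuration
X = rng.normal(size=(10, d_model))
W = [rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4)]
print(multi_head_attention(X, *W, n_heads).shape)  # (10, 512)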

Positional Encoding

Attention is inherently permutation-invariant—it processes all positions simultaneously without regard to their order. To inject positional information, Transformers add positional encodings to the input embeddings.

The original paper used sinusoidal encodings based on sine and cosine functions of different frequencies:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    

This choice has a useful property: for any fixed offset k, the encoding at position pos + k is a linear function of the encoding at position pos, making it easy for the model to learn to attend by relative position.
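A direct NumPy translation of the two formulas above (the function name is illustrative):

import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2), the 2i values
    angles = pos / np.power(10000.0, i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even indices get sine
    pe[:, 1::2] = np.cos(angles)   # odd indices get cosine
    return pe

pe = sinusoidal_encoding(max_len=128, d_model=512)
print(pe.shape)   # (128, 512); added elementwise to the token embeddings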

Modern models often use Rotary Position Embeddings (RoPE), introduced in the RoFormer paper (Su et al., 2021) and popularized by the LLaMA family, which encodes position information directly into the Q and K vectors through rotation. RoPE has become the dominant approach in part because it encodes relative positions and extends to longer contexts more gracefully than sinusoidal encoding.
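For intuition, here is a minimal sketch of the core RoPE operation: consecutive pairs of dimensions are rotated by a position-dependent angle. This is a simplified illustration of the idea, not a production implementation:

import numpy as np

def rope(x, base=10000.0):
    """Rotate consecutive dimension pairs of x by position-dependent angles.
    x: (seq_len, d) with d even."""
    seq_len, d = x.shape
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    freqs = base ** (-np.arange(0, d, 2) / d)    # (d/2,) rotation frequencies
    theta = pos * freqs                          # (seq_len, d/2) angles
    x1, x2 = x[:, 0::2], x[:, 1::2]              # pair up dimensions
    out = np.empty_like(x)
    out[:, 0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[:, 1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out

# Applied to Q and K (not V) before the dot product, so that the score
# q_m · k_n depends only on the relative offset m - n.
print(rope(np.ones((4, 8))).shape)   # (4, 8)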

The Feed-Forward Network

Each Transformer layer contains a feed-forward network (FFN) applied to each position independently. This is a two-layer MLP; the original paper used a ReLU activation (shown below), while GPT-style models typically substitute GELU:

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
    

In most models the FFN is the larger portion of the computation. In a Llama-2-7B-style configuration, the attention projections account for roughly 2.1B parameters while the FFN weights account for roughly 4.3B, about two-thirds of the non-embedding parameters.
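A back-of-the-envelope calculation makes this concrete. The sketch below assumes a Llama-2-7B-like configuration (d_model = 4096, 32 layers, SwiGLU FFN with hidden size 11008); the specific sizes are for illustration:

# Rough parameter accounting, ignoring embeddings, norms, and biases.
d_model, n_layers, d_ff = 4096, 32, 11008

attn_per_layer = 4 * d_model * d_model   # W_Q, W_K, W_V, W_O
ffn_per_layer = 3 * d_model * d_ff       # SwiGLU uses three weight matrices

attn_total = n_layers * attn_per_layer   # ~2.1B
ffn_total = n_layers * ffn_per_layer     # ~4.3B
print(f"attention: {attn_total/1e9:.1f}B, FFN: {ffn_total/1e9:.1f}B")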

Layer Normalization and Residual Connections

Around each subcomponent, Transformers employ residual connections (skip connections) and layer normalization. These stabilize training by ensuring that the gradient flow remains strong even through deep networks.

x = LayerNorm(x + Sublayer(x))
    

The residual connection allows gradients to flow directly through the network without passing through the sublayer, enabling training of much deeper models. Modern Transformer variants use pre-norm (normalizing before the sublayer rather than after), which improves training stability for very deep models.
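The two orderings are easiest to compare side by side. A schematic sketch; real implementations add learned scale and bias parameters to the normalization:

import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    return layer_norm(x + sublayer(x))   # original Transformer ordering

def pre_norm_block(x, sublayer):
    return x + sublayer(layer_norm(x))   # GPT-2 and later: normalize first

x = np.ones((4, 8))
print(pre_norm_block(x, lambda h: h * 0.5).shape)   # (4, 8)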

Encoder vs. Decoder: What's the Difference?

The original Transformer had separate encoder and decoder stacks. Understanding their differences illuminates modern model architectures.

Encoder Stack

Encoder layers use bidirectional self-attention: every position can attend to every other position in the input. This makes encoders well suited to understanding tasks such as classification and retrieval; BERT is the canonical encoder-only model.

Decoder Stack

Decoder layers use causal (masked) self-attention: each position can attend only to itself and earlier positions, which is what permits autoregressive generation. In the original encoder-decoder design, each decoder layer also includes a cross-attention sublayer that attends to the encoder's output.
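The causal mask that distinguishes decoder self-attention is simple to construct: an upper-triangular matrix of -inf added to the attention scores before the softmax, so future positions receive zero weight. A small NumPy illustration:

import numpy as np

seq_len = 5
# Entries above the diagonal are -inf; softmax maps them to weight 0.
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
print(mask)
# scores = Q @ K.T / np.sqrt(d_k) + mask   # then softmax as usual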

Model         Architecture          Layers   Parameters   Context Window
BERT-Base     Encoder-only          12       110M         512
GPT-2         Decoder-only          48       1.5B         1024
GPT-3         Decoder-only          96       175B         2048
Llama 3 70B   Decoder-only + RoPE   80       70B          8192

Practical Engineering Decisions

Memory and Compute Complexity

Standard attention has O(n²) memory and compute complexity with sequence length n—doubling the context length quadruples the attention computation. This makes very long contexts expensive. Flash Attention (Dao et al., 2022) dramatically reduces memory usage by computing attention in tiles that fit in GPU SRAM, achieving 2-4x speedup and enabling longer context training.
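To see why naive attention memory becomes a problem, here is a quick illustrative calculation; fp16 scores and 32 heads are assumed numbers, not from any particular model:

# The (seq_len, seq_len) score matrix per head grows quadratically.
bytes_per_el, n_heads = 2, 32   # fp16
for seq_len in (2_048, 8_192, 32_768):
    mib = n_heads * seq_len**2 * bytes_per_el / 2**20
    print(f"{seq_len:>6} tokens -> {mib:,.0f} MiB of attention scores per layer")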

KV Cache for Inference

During autoregressive generation, the key and value vectors of all previous tokens would otherwise have to be recomputed for every new token. Modern inference systems instead cache these K/V vectors, significantly speeding up generation. This "KV cache" grows with conversation length and is a major factor in the memory requirements for serving.
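A minimal sketch of the idea for a single head and layer (the class and function names are illustrative): each decoding step projects only the newest token and appends its K/V vectors to the cache.

import numpy as np

class KVCache:
    """Store K/V for past tokens so each step only projects the newest one."""
    def __init__(self):
        self.K = None
        self.V = None

    def append(self, k_new, v_new):
        # k_new, v_new: (1, d_head) projections of the latest token only.
        self.K = k_new if self.K is None else np.vstack([self.K, k_new])
        self.V = v_new if self.V is None else np.vstack([self.V, v_new])
        return self.K, self.V

def decode_step(q_new, cache, k_new, v_new):
    K, V = cache.append(k_new, v_new)
    scores = q_new @ K.T / np.sqrt(K.shape[-1])   # (1, cached_len)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                                  # output for the new token

rng = np.random.default_rng(0)
cache = KVCache()
for _ in range(3):   # generate three tokens
    q, k, v = (rng.normal(size=(1, 64)) for _ in range(3))
    out = decode_step(q, cache, k, v)
print(out.shape, cache.K.shape)   # (1, 64) (3, 64)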

Variants and Optimizations

The original Transformer has spawned numerous variants addressing its limitations: sparse and linear attention mechanisms reduce the quadratic cost on long sequences, multi-query and grouped-query attention shrink the KV cache for faster inference, and mixture-of-experts layers add parameters without proportionally increasing the compute per token.

Conclusion

The Transformer architecture's success stems from its elegant combination of mechanisms: attention for relating positions, multi-head attention for capturing diverse relationships, positional encoding for injecting order, and feed-forward networks for computation. These components work together to create models that can process sequential data with unprecedented parallelism and scale.

Understanding these fundamentals helps practitioners make informed decisions about model selection, optimization, and debugging. As the architecture continues to evolve—with new attention variants, mixture-of-experts approaches, and improved positional encodings—the core principles remain essential knowledge for anyone working in modern AI.