10 March 2026
Understanding Transformers: The Architecture Behind Modern NLP
A deep dive into the self-attention mechanism and how transformers replaced recurrent networks to become the foundation of modern language models.
What is a Transformer?
The Transformer architecture, introduced in the 2017 paper “Attention Is All You Need” by Vaswani et al., replaced recurrent networks with a fully attention-based design. This shift enabled massive parallelisation during training and led directly to models like BERT, GPT, and modern LLMs.
The Problem with RNNs
Recurrent neural networks process sequences token by token. Each hidden state depends on the previous one, which means:
- Training is inherently sequential, so it cannot be parallelised across time steps
- Gradients vanish or explode over long sequences
- Long-range dependencies are hard to capture
LSTMs and GRUs alleviate these issues, but don’t eliminate them.
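The vanishing-gradient bullet above can be illustrated with a toy model. The sketch below assumes a hypothetical scalar RNN, h_t = tanh(w * h_{t-1} + x_t), which is not taken from any particular library, and tracks how strongly the state at step t still depends on the initial state.

```python
import math


def rnn_gradient_norms(w: float, inputs: list[float]) -> list[float]:
    """Return |d h_t / d h_0| for t = 1..T in a scalar tanh RNN.

    Hypothetical toy model: h_t = tanh(w * h_{t-1} + x_t).
    """
    h = 0.0
    grad = 1.0  # d h_0 / d h_0
    norms = []
    for x in inputs:
        # Step t cannot start until h from step t-1 is known.
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2).
        grad *= w * (1.0 - h * h)
        norms.append(abs(grad))
    return norms


norms = rnn_gradient_norms(w=0.5, inputs=[0.1] * 50)
print(norms[0], norms[9], norms[49])
```

With w = 0.5 every factor in the product is at most 0.5, so the gradient shrinks geometrically with sequence length. The loop also shows the sequential bottleneck: each iteration needs the previous hidden state.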
Self-Attention
The core innovation of the Transformer is self-attention: each token in the input can attend to every other token directly, regardless of position.
For an input sequence of embeddings X, we compute three projections:
$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$
The attention output is:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
The $1/\sqrt{d_k}$ scaling prevents the dot products from growing too large in high dimensions, which would otherwise push the softmax into regions with very small gradients.
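As a worked illustration of the attention formula, here is a minimal plain-Python sketch using nested lists instead of tensors. The helper names (`softmax`, `matmul`, `transpose`, `attention`) and the toy input are assumptions made for this example, and the three projection matrices are set to the identity so that Q = K = V = X.

```python
import math


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    # zip(*b) iterates over the columns of b
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def attention(q, k, v, scale=True):
    """Return (output, weights) for single-head dot-product attention."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))  # (n, n) pairwise dot products
    if scale:
        scores = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(weights, v), weights


# Toy input: 3 tokens with 4-dimensional embeddings (made-up values).
x = [[1, 0, 1, 0], [0, 2, 0, 2], [1, 1, 1, 1]]
# Assumption: identity projections, so Q = K = V = X.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
q, k, v = matmul(x, identity), matmul(x, identity), matmul(x, identity)

scaled_out, scaled_w = attention(q, k, v)
_, raw_w = attention(q, k, v, scale=False)
print(scaled_w[1][1], raw_w[1][1])
```

For the second token, the largest attention weight is roughly 0.87 with scaling and roughly 0.98 without it: the unscaled logits already produce a more peaked softmax at d_k = 4, and the effect grows with dimension.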
Multi-Head Attention
Rather than running a single attention function, the Transformer runs h attention heads in parallel, each learning a different representation subspace:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O$$
This allows the model to jointly attend to information from different positions and representation subspaces.
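The split-attend-concatenate pattern can be sketched in plain Python. This is a deliberately simplified illustration: it omits the learned per-head projections and the output projection $W^O$, so each head simply sees its own slice of the embedding. The function names and the toy input are assumptions for this example.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, k, v):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(k[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
        w = softmax(scores)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(len(v[0]))])
    return out


def split_heads(x, h):
    """Slice each d_model-vector into h chunks of size d_model // h."""
    d_k = len(x[0]) // h
    return [[row[i * d_k:(i + 1) * d_k] for row in x] for i in range(h)]


def multi_head(x, h):
    # Each head attends independently within its own subspace.
    heads = [attend(xh, xh, xh) for xh in split_heads(x, h)]
    # Concat(head_1, ..., head_h): join the per-head outputs token by token.
    return [[val for head in heads for val in head[t]] for t in range(len(x))]


# Toy input: 3 tokens, d_model = 4, split into h = 2 heads of size 2.
x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
out = multi_head(x, h=2)
print(len(out), len(out[0]))
```

The output keeps the input shape (3 tokens of width 4), and the first two components of each output row depend only on the first two components of the inputs, which is what lets different heads specialise.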
Positional Encoding
Because self-attention has no built-in notion of token order (it is permutation-equivariant: shuffling the input tokens just shuffles the outputs), we need to inject position information. The original paper uses fixed sinusoidal encodings:
$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
$$PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
Later models often take a different route: BERT learns its position embeddings, while RoPE-based transformers apply rotary position embeddings.
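The sinusoidal formulas translate almost directly into code. The following is a minimal sketch in plain Python; the function name `sinusoidal_encoding` and its parameters are chosen for this example.

```python
import math


def sinusoidal_encoding(n_positions: int, d_model: int) -> list[list[float]]:
    """Return an (n_positions, d_model) table of fixed sinusoidal encodings."""
    table = []
    for pos in range(n_positions):
        row = [0.0] * d_model
        for i in range(0, d_model, 2):
            # i plays the role of 2i in the formula, so the exponent is i / d_model.
            angle = pos / (10000 ** (i / d_model))
            row[i] = math.sin(angle)  # even dimensions use sine
            if i + 1 < d_model:
                row[i + 1] = math.cos(angle)  # odd dimensions use cosine
        table.append(row)
    return table


pe = sinusoidal_encoding(n_positions=4, d_model=8)
print(pe[0])
print(pe[1][:2])
```

Position 0 encodes as alternating 0 and 1, since sin(0) = 0 and cos(0) = 1. In a model, each row of this table would be added to the token embedding at the same position.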
The Full Architecture
The encoder stacks N identical layers, each containing:
- Multi-head self-attention
- Position-wise feed-forward network
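- A residual connection around each of the two sublayers, followed by layer normalisation
The decoder adds a third sublayer, cross-attention over the encoder output, and masks its self-attention so that each position attends only to itself and earlier positions.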
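The residual-plus-normalisation wrapper used around each sublayer can be sketched for a single token vector. This follows the original paper's arrangement, LayerNorm(x + Sublayer(x)). The `feed_forward` function here is a hypothetical stand-in with fixed element-wise weights rather than a real learned network, and the layer norm omits its learned gain and bias.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalise one token vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def feed_forward(x):
    """Hypothetical stand-in for the position-wise FFN: ReLU between two fixed maps."""
    hidden = [max(0.0, 2.0 * v - 1.0) for v in x]
    return [0.5 * h for h in hidden]


def sublayer_connection(x, sublayer):
    """Residual connection followed by layer norm: LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


token = [0.2, 1.5, -0.7, 3.0]  # made-up 4-dimensional token vector
out = sublayer_connection(token, feed_forward)
print(out)
```

The same wrapper would be applied around the attention sublayer; only the `sublayer` argument changes.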
A Minimal Example in PyTorch
import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        # d_model must split evenly across the heads
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # fused Q, K, V projection
        self.out = nn.Linear(d_model, d_model)  # output projection W^O

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape  # batch, sequence length, d_model
        qkv = self.qkv(x).reshape(B, T, 3, self.n_heads, self.d_k)
        q, k, v = qkv.unbind(dim=2)  # each (B, T, H, d_k)
        q, k, v = [t.transpose(1, 2) for t in (q, k, v)]  # (B, H, T, d_k)
        scale = self.d_k ** -0.5  # 1 / sqrt(d_k)
        attn = (q @ k.transpose(-2, -1)) * scale  # (B, H, T, T)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, C)  # merge heads
        return self.out(out)
Key Takeaways
- Transformers replace sequential recurrence with parallel attention
- Self-attention captures global dependencies at a cost of $O(n^2)$ time and memory in the sequence length $n$
- Multi-head attention learns diverse representation subspaces
- Positional encodings supply the token-order information that attention itself ignores
- The architecture scales remarkably well with data and compute