Attention Is All You Need - Transformer Architecture
Overview
The Transformer architecture, introduced in "Attention Is All You Need" (2017), fundamentally changed how we approach sequence-to-sequence tasks in deep learning[1].
Key Innovations
Self-Attention Mechanism
The core innovation is the self-attention mechanism, which allows the model to weigh the importance of different words in a sentence when encoding each word.
Key equation:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

Where:
- $Q$ = Query matrix
- $K$ = Key matrix
- $V$ = Value matrix
- $d_k$ = Dimension of key vectors

The scaling factor $\frac{1}{\sqrt{d_k}}$ prevents the dot products from growing too large in magnitude[1].
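To see why the scaling helps, here is a small numerical check. This is an illustrative sketch only (not from the paper's code), assuming query and key components drawn from a unit-variance distribution: the raw dot products then have variance roughly $d_k$, and dividing by $\sqrt{d_k}$ brings them back to unit scale so the softmax does not saturate.

```python
import torch

# Empirical check of the scaling argument (illustrative only; assumes
# query/key components drawn from a unit-variance distribution).
d_k = 64
q = torch.randn(100_000, d_k)
k = torch.randn(100_000, d_k)

raw_scores = (q * k).sum(dim=-1)            # unscaled dot products q . k
scaled_scores = raw_scores / d_k ** 0.5     # divide by sqrt(d_k)

print(f"variance of raw scores:    {raw_scores.var().item():.1f}")    # roughly d_k (about 64)
print(f"variance of scaled scores: {scaled_scores.var().item():.1f}") # roughly 1
```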
Multi-Head Attention
Instead of performing a single attention function, the Transformer uses multiple attention "heads" in parallel:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^{O}$$

where each head is:

$$\text{head}_i = \text{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$
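As a companion to the scaled dot-product code below, here is a minimal multi-head attention module. This is a simplified sketch, not the paper's reference implementation: it assumes `d_model` is divisible by `num_heads` and omits dropout and masking for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Simplified multi-head attention (no dropout, no masking)."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Learned projections W^Q, W^K, W^V and the output projection W^O
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value):
        batch_size = query.size(0)

        def split_heads(x):
            # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)
            return x.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        Q = split_heads(self.w_q(query))
        K = split_heads(self.w_k(key))
        V = split_heads(self.w_v(value))

        # Scaled dot-product attention within each head
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.d_k ** 0.5
        weights = F.softmax(scores, dim=-1)
        context = torch.matmul(weights, V)

        # Concatenate the heads and apply the output projection
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
        return self.w_o(context)

# Example: 8 heads over d_model = 512, as in the paper's base model
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)
print(mha(x, x, x).shape)  # torch.Size([2, 10, 512])
```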
Implementation Example
Here's a simplified PyTorch implementation of scaled dot-product attention:
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Compute scaled dot-product attention.

    Args:
        Q: Query matrix (batch_size, seq_len, d_k)
        K: Key matrix (batch_size, seq_len, d_k)
        V: Value matrix (batch_size, seq_len, d_v)
        mask: Optional mask (batch_size, seq_len, seq_len)

    Returns:
        attention_output, attention_weights
    """
    d_k = Q.size(-1)

    # Compute attention scores
    scores = torch.matmul(Q, K.transpose(-2, -1)) / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))

    # Apply mask if provided
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    # Apply softmax
    attention_weights = F.softmax(scores, dim=-1)

    # Compute weighted sum of values
    attention_output = torch.matmul(attention_weights, V)

    return attention_output, attention_weights
# Example usage
batch_size, seq_len, d_model = 2, 10, 512
Q = torch.randn(batch_size, seq_len, d_model)
K = torch.randn(batch_size, seq_len, d_model)
V = torch.randn(batch_size, seq_len, d_model)

output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")

Output:

Output shape: torch.Size([2, 10, 512])
Attention weights shape: torch.Size([2, 10, 10])
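The mask argument above is where a decoder's causal masking would plug in. As an illustrative follow-up (not part of the original example), a lower-triangular mask built with torch.tril keeps each position from attending to later positions:

```python
# Continues the example above (reuses Q, K, V, seq_len and the function defined earlier).
# Illustrative only: a lower-triangular causal mask so position i cannot attend to j > i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)  # (1, seq_len, seq_len), broadcasts over batch
masked_out, masked_w = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(masked_w[0, 3, 4:].sum())  # ~0: position 3 puts no weight on future positions
```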
Positional Encoding
Since Transformers have no recurrence, they inject positional information using sinusoidal functions:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

where $pos$ is the position and $i$ is the dimension.
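As a rough sketch of how these functions translate into code (simplified, assuming an even d_model, and not taken from any reference implementation), the following builds a sinusoidal encoding table that can be added to the token embeddings:

```python
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """Build a (max_len, d_model) table of sinusoidal positional encodings (assumes even d_model)."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)      # (max_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)                # even dimension indices 2i
    angles = position / torch.pow(torch.tensor(10000.0), two_i / d_model)   # pos / 10000^(2i/d_model)

    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = torch.cos(angles)  # odd dimensions use cosine
    return pe

# Example: encodings for 10 positions at d_model = 512, to be added to embeddings
pe = sinusoidal_positional_encoding(max_len=10, d_model=512)
print(pe.shape)  # torch.Size([10, 512])
```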
Why It Matters
- Parallelization: Unlike RNNs, Transformers can be fully parallelized
- Long-range dependencies: Better at capturing relationships across long sequences[2]
- Versatility: Foundation for GPT, BERT, and modern LLMs
Architecture Components
- Encoder stack: 6 identical layers with self-attention and feed-forward networks (a minimal sketch of one such layer follows this list)
- Decoder stack: 6 identical layers with masked self-attention
- Positional encoding: Injects sequence order information
- Feed-forward networks: Applied to each position separately
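To make the encoder-layer structure concrete, here is a minimal sketch of one layer, assuming PyTorch's built-in nn.MultiheadAttention in place of a hand-rolled multi-head module; dropout and padding masks are omitted for brevity, and the post-layer-norm arrangement of the original paper is used.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer: self-attention + position-wise feed-forward,
    each wrapped in a residual connection followed by layer normalization."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.feed_forward = nn.Sequential(   # applied to each position independently
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Sub-layer 1: multi-head self-attention with a residual connection
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Sub-layer 2: position-wise feed-forward with a residual connection
        x = self.norm2(x + self.feed_forward(x))
        return x

# A 6-layer encoder stack, mirroring the original architecture
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
x = torch.randn(2, 10, 512)
print(encoder(x).shape)  # torch.Size([2, 10, 512])
```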
Complexity Analysis
| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
|---|---|---|---|
| Self-Attention | $O(n^2 \cdot d)$ | $O(1)$ | $O(1)$ |
| Recurrent | $O(n \cdot d^2)$ | $O(n)$ | $O(n)$ |
| Convolutional | $O(k \cdot n \cdot d^2)$ | $O(1)$ | $O(\log_k n)$ |

Where $n$ is the sequence length, $d$ is the representation dimension, and $k$ is the kernel size.
Impact
This paper spawned an entire generation of models:
- BERT (2018) - Bidirectional encoder representations
- GPT series (2018+) - Generative pre-trained transformers
- T5 (2019) - Text-to-text transfer transformer
- Vision Transformers (2020) - Applied to computer vision
- And many more…
The architecture is now the foundation of modern NLP and beyond[3].
Key Takeaways
- Self-attention connects all positions with a constant number of sequential operations
- Multi-head attention provides multiple representation subspaces
- Positional encodings preserve sequence information without recurrence
- The architecture is highly parallelizable, leading to faster training
Check out the video above for a visual explanation of how Transformers work!
References
- [1] Vaswani et al. (2017). Attention Is All You Need.
- [2] Bahdanau et al. (2014). Neural Machine Translation.
- [3] Sutskever et al. (2014). Sequence to Sequence Learning with Neural Networks.
License: This post is licensed under CC BY 4.0 by Bradley Ho © 2025.
You're free to share and adapt this content with attribution.