Attention Is All You Need - Transformer Architecture

• Updated November 5, 2024 • 3 min read
• #transformers #deep-learning #nlp
Authors: Vaswani et al.

Overview

The Transformer architecture, introduced in "Attention Is All You Need" (2017), fundamentally changed how we approach sequence-to-sequence tasks in deep learning[1].

Key Innovations

Self-Attention Mechanism

The core innovation is the self-attention mechanism, which allows the model to weigh the importance of different words in a sentence when encoding each word.

Key equation:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Where:

  • Q = Query matrix
  • K = Key matrix
  • V = Value matrix
  • d_k = Dimension of key vectors

The scaling factor \frac{1}{\sqrt{d_k}} prevents the dot products from growing too large in magnitude[1].
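To see why this matters, here is a quick sanity check (a sketch using random vectors; the sample size and d_k are illustrative, and the printed numbers are approximate):

import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(10000, d_k)  # 10,000 random query vectors
k = torch.randn(10000, d_k)  # 10,000 random key vectors

# Unscaled dot products have standard deviation of roughly sqrt(d_k) ~ 22.6
dots = (q * k).sum(dim=-1)
print(dots.std())

# Dividing by sqrt(d_k) brings the standard deviation back to roughly 1,
# keeping the softmax out of its low-gradient saturation region
print((dots / d_k ** 0.5).std())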

Multi-Head Attention

Instead of performing a single attention function, the Transformer uses multiple attention "heads" in parallel:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O

where each head is:

\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)

Implementation Example

Here's a simplified PyTorch implementation of scaled dot-product attention:

import torch
import torch.nn.functional as F
 
def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Compute scaled dot-product attention.
    
    Args:
        Q: Query matrix (batch_size, seq_len, d_k)
        K: Key matrix (batch_size, seq_len, d_k)
        V: Value matrix (batch_size, seq_len, d_v)
        mask: Optional mask (batch_size, seq_len, seq_len)
    
    Returns:
        attention_output, attention_weights
    """
    d_k = Q.size(-1)
    
    # Compute attention scores
    scores = torch.matmul(Q, K.transpose(-2, -1)) / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))
    
    # Apply mask if provided
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    
    # Apply softmax
    attention_weights = F.softmax(scores, dim=-1)
    
    # Compute weighted sum of values
    attention_output = torch.matmul(attention_weights, V)
    
    return attention_output, attention_weights
 
# Example usage
batch_size, seq_len, d_model = 2, 10, 512
Q = torch.randn(batch_size, seq_len, d_model)
K = torch.randn(batch_size, seq_len, d_model)
V = torch.randn(batch_size, seq_len, d_model)
 
output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")

Output:

Output shape: torch.Size([2, 10, 512])
Attention weights shape: torch.Size([2, 10, 10])
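Because torch.matmul broadcasts over leading dimensions, the same scaled_dot_product_attention function also works on (batch, num_heads, seq_len, d_k) tensors, so the multi-head formulation from earlier can be layered on top of it. Below is a minimal sketch; the class name MultiHeadAttention and the fused per-role projections are my own simplifications, not the paper's reference code:

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Minimal multi-head attention built on scaled_dot_product_attention above."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Learned projections W^Q, W^K, W^V plus the output projection W^O
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)
        batch_size, seq_len, _ = x.shape
        return x.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, Q, K, V, mask=None):
        # mask, if given, should be broadcastable to (batch, num_heads, seq_len, seq_len)
        batch_size, seq_len, _ = Q.shape
        q = self.split_heads(self.W_q(Q))
        k = self.split_heads(self.W_k(K))
        v = self.split_heads(self.W_v(V))
        # Each head attends independently over its own d_k-dimensional subspace
        heads, _ = scaled_dot_product_attention(q, k, v, mask=mask)
        # Concatenate heads and apply the output projection W^O
        concat = heads.transpose(1, 2).contiguous().view(batch_size, seq_len, -1)
        return self.W_o(concat)

# Example usage
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)
print(mha(x, x, x).shape)  # torch.Size([2, 10, 512])

The paper uses h = 8 heads with d_k = d_v = 64, so the total cost is comparable to single-head attention over the full dimensionality.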

Positional Encoding

Since Transformers have no recurrence, they inject positional information using sinusoidal functions:

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \qquad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)

where pos is the position and i is the dimension index.
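A minimal sketch of these sinusoidal encodings (the function name and tensor layout are my own choices):

import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) tensor of sinusoidal positional encodings."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimension indices: 0, 2, 4, ...
    div_term = torch.pow(10000.0, two_i / d_model)                      # 10000^(2i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)  # even dimensions: sine
    pe[:, 1::2] = torch.cos(position / div_term)  # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=10, d_model=512)
print(pe.shape)  # torch.Size([10, 512])

These encodings are simply added to the token embeddings before the first layer.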

Why It Matters

  1. Parallelization: Unlike RNNs, Transformers process all sequence positions in parallel during training
  2. Long-range dependencies: Better at capturing relationships across long sequences[2]
  3. Versatility: Foundation for GPT, BERT, and modern LLMs

Architecture Components

  • Encoder stack: 6 identical layers with self-attention and feed-forward networks
  • Decoder stack: 6 identical layers with masked self-attention, encoder-decoder attention, and feed-forward networks (see the causal-mask sketch after this list)
  • Positional encoding: Injects sequence order information
  • Feed-forward networks: Applied to each position separately
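The decoder's masked self-attention can be expressed with the mask argument of scaled_dot_product_attention above: a lower-triangular (causal) mask blocks each position from attending to later positions. A minimal sketch, reusing the illustrative shapes from the earlier example:

import torch

seq_len = 10
# 1 where attention is allowed, 0 where it is blocked (future positions)
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

Q = torch.randn(2, seq_len, 512)
K = torch.randn(2, seq_len, 512)
V = torch.randn(2, seq_len, 512)

# Each position can only attend to itself and earlier positions
output, weights = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(weights[0, 0])  # first position attends only to itself: [1, 0, 0, ...]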

Complexity Analysis

Layer Type     | Complexity per Layer   | Sequential Operations | Maximum Path Length
Self-Attention | O(n^2 \cdot d)         | O(1)                  | O(1)
Recurrent      | O(n \cdot d^2)         | O(n)                  | O(n)
Convolutional  | O(k \cdot n \cdot d^2) | O(1)                  | O(\log_k(n))

Where n is the sequence length, d is the representation dimension, and k is the kernel size.

Impact

This paper spawned an entire generation of models:

  • BERT (2018) - Bidirectional encoder representations
  • GPT series (2018+) - Generative pre-trained transformers
  • T5 (2019) - Text-to-text transfer transformer
  • Vision Transformers (2020) - Applied to computer vision
  • And many more…

The architecture is now the foundation of modern NLP and beyond[3].

Key Takeaways

  1. Self-attention connects all positions with a constant number of sequential operations
  2. Multi-head attention provides multiple representation subspaces
  3. Positional encodings preserve sequence information without recurrence
  4. The architecture is highly parallelizable, leading to faster training


📚 References
  1. Vaswani et al. (2017). Attention Is All You Need.
  2. Bahdanau et al. (2014). Neural Machine Translation by Jointly Learning to Align and Translate.
  3. Sutskever et al. (2014). Sequence to Sequence Learning with Neural Networks.

License: This post is licensed under CC BY 4.0 by Bradley Ho © 2025.

You're free to share and adapt this content with attribution.