Attention Is All You Need - Transformer Architecture

• Updated November 5, 2024 • 3 min read
• #transformers #deep-learning #nlp
Authors: Vaswani et al.

Overview

The Transformer architecture, introduced in "Attention Is All You Need" (2017), fundamentally changed how we approach sequence-to-sequence tasks in deep learning[1].

Key Innovations

Self-Attention Mechanism

The core innovation is the self-attention mechanism, which allows the model to weigh the importance of different words in a sentence when encoding each word.

Key equation:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Where:

  • Q = Query matrix
  • K = Key matrix
  • V = Value matrix
  • d_k = Dimension of key vectors

The scaling factor \frac{1}{\sqrt{d_k}} prevents the dot products from growing too large in magnitude[1].
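To see why this matters, here is a quick sanity check (a sketch using random vectors; the sample size and d_k are illustrative, and the printed numbers are approximate):

import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(10000, d_k)  # 10,000 random query vectors
k = torch.randn(10000, d_k)  # 10,000 random key vectors

# Unscaled dot products have standard deviation of roughly sqrt(d_k) ~ 22.6
dots = (q * k).sum(dim=-1)
print(dots.std())

# Dividing by sqrt(d_k) brings the standard deviation back to roughly 1,
# keeping the softmax out of its low-gradient saturation region
print((dots / d_k ** 0.5).std())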

Multi-Head Attention

Instead of performing a single attention function, the Transformer uses multiple attention "heads" in parallel:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O

where each head is:

\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)

Implementation Example

Here's a simplified PyTorch implementation of scaled dot-product attention:

import torch
import torch.nn.functional as F
 
def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Compute scaled dot-product attention.
    
    Args:
        Q: Query matrix (batch_size, seq_len, d_k)
        K: Key matrix (batch_size, seq_len, d_k)
        V: Value matrix (batch_size, seq_len, d_v)
        mask: Optional mask (batch_size, seq_len, seq_len)
    
    Returns:
        attention_output, attention_weights
    """
    d_k = Q.size(-1)
    
    # Compute attention scores
    scores = torch.matmul(Q, K.transpose(-2, -1)) / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))
    
    # Apply mask if provided
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    
    # Apply softmax
    attention_weights = F.softmax(scores, dim=-1)
    
    # Compute weighted sum of values
    attention_output = torch.matmul(attention_weights, V)
    
    return attention_output, attention_weights
 
# Example usage
batch_size, seq_len, d_model = 2, 10, 512
Q = torch.randn(batch_size, seq_len, d_model)
K = torch.randn(batch_size, seq_len, d_model)
V = torch.randn(batch_size, seq_len, d_model)
 
output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")

Output:

Output shape: torch.Size([2, 10, 512])
Attention weights shape: torch.Size([2, 10, 10])
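Because torch.matmul broadcasts over leading dimensions, the same scaled_dot_product_attention function also works on (batch, num_heads, seq_len, d_k) tensors, so the multi-head formulation from earlier can be layered on top of it. Below is a minimal sketch; the class name MultiHeadAttention and the fused per-role projections are my own simplifications, not the paper's reference code:

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Minimal multi-head attention built on scaled_dot_product_attention above."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Learned projections W^Q, W^K, W^V plus the output projection W^O
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)
        batch_size, seq_len, _ = x.shape
        return x.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, Q, K, V, mask=None):
        # mask, if given, should be broadcastable to (batch, num_heads, seq_len, seq_len)
        batch_size, seq_len, _ = Q.shape
        q = self.split_heads(self.W_q(Q))
        k = self.split_heads(self.W_k(K))
        v = self.split_heads(self.W_v(V))
        # Each head attends independently over its own d_k-dimensional subspace
        heads, _ = scaled_dot_product_attention(q, k, v, mask=mask)
        # Concatenate heads and apply the output projection W^O
        concat = heads.transpose(1, 2).contiguous().view(batch_size, seq_len, -1)
        return self.W_o(concat)

# Example usage
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)
print(mha(x, x, x).shape)  # torch.Size([2, 10, 512])

The paper uses h = 8 heads with d_k = d_v = 64, so the total cost is comparable to single-head attention over the full dimensionality.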

Positional Encoding

Since Transformers have no recurrence, they inject positional information using sinusoidal functions:

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \qquad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)

where pos is the position and i is the dimension index.
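A minimal sketch of these sinusoidal encodings (the function name and tensor layout are my own choices):

import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) tensor of sinusoidal positional encodings."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimension indices: 0, 2, 4, ...
    div_term = torch.pow(10000.0, two_i / d_model)                      # 10000^(2i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)  # even dimensions: sine
    pe[:, 1::2] = torch.cos(position / div_term)  # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=10, d_model=512)
print(pe.shape)  # torch.Size([10, 512])

These encodings are simply added to the token embeddings before the first layer.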

Why It Matters

  1. Parallelization: Unlike RNNs, Transformers process all sequence positions in parallel during training
  2. Long-range dependencies: Better at capturing relationships across long sequences[2]
  3. Versatility: Foundation for GPT, BERT, and modern LLMs

Architecture Components

  • Encoder stack: 6 identical layers with self-attention and feed-forward networks
  • Decoder stack: 6 identical layers with masked self-attention, encoder-decoder attention, and feed-forward networks (see the causal-mask sketch after this list)
  • Positional encoding: Injects sequence order information
  • Feed-forward networks: Applied to each position separately
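The decoder's masked self-attention can be expressed with the mask argument of scaled_dot_product_attention above: a lower-triangular (causal) mask blocks each position from attending to later positions. A minimal sketch, reusing the illustrative shapes from the earlier example:

import torch

seq_len = 10
# 1 where attention is allowed, 0 where it is blocked (future positions)
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

Q = torch.randn(2, seq_len, 512)
K = torch.randn(2, seq_len, 512)
V = torch.randn(2, seq_len, 512)

# Each position can only attend to itself and earlier positions
output, weights = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(weights[0, 0])  # first position attends only to itself: [1, 0, 0, ...]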

Complexity Analysis

Layer Type     | Complexity per Layer   | Sequential Operations | Maximum Path Length
Self-Attention | O(n^2 \cdot d)         | O(1)                  | O(1)
Recurrent      | O(n \cdot d^2)         | O(n)                  | O(n)
Convolutional  | O(k \cdot n \cdot d^2) | O(1)                  | O(\log_k(n))

Where n is the sequence length, d is the representation dimension, and k is the kernel size.

Impact

This paper spawned an entire generation of models:

  • BERT (2018) - Bidirectional encoder representations
  • GPT series (2018+) - Generative pre-trained transformers
  • T5 (2019) - Text-to-text transfer transformer
  • Vision Transformers (2020) - Applied to computer vision
  • And many more…

The architecture is now the foundation of modern NLP and beyond[3].

Key Takeaways

  1. Self-attention connects all positions with a constant number of sequential operations
  2. Multi-head attention provides multiple representation subspaces
  3. Positional encodings preserve sequence information without recurrence
  4. The architecture is highly parallelizable, leading to faster training


📚 References
  1. Vaswani et al. (2017). Attention Is All You Need.
  2. Bahdanau et al. (2014). Neural Machine Translation by Jointly Learning to Align and Translate.
  3. Sutskever et al. (2014). Sequence to Sequence Learning with Neural Networks.

License: This post is licensed under CC BY 4.0 by Bradley Ho © 2025.

You're free to share and adapt this content with attribution.