The Transformer Architecture

Complete Transformer breakdown — encoder, decoder, positional encoding, multi-head attention, and feed-forward networks.

Advanced · 20 min read

The Architecture That Changed AI

The Transformer, introduced in the 2017 paper "Attention Is All You Need" (Vaswani et al.), replaced recurrence entirely with attention. Because it processes all tokens of a sequence in parallel during training, it trains far faster on GPUs than recurrent models, and it paved the way for GPT, BERT, Claude, and most other modern LLMs. The key insight: sequence modeling needs neither recurrence nor convolution; attention alone is enough.

High-Level Architecture

The original Transformer has two halves: an Encoder (understands input) and a Decoder (generates output). Modern models often use only one: BERT uses the encoder; GPT/Claude use the decoder.

Input Tokens — Tokenized text sequence

Embedding + Position — Token embedding + positional encoding

Multi-Head Attention — Multiple parallel attention heads

Add & Norm — Residual connection + layer norm

Feed-Forward — Two linear layers with ReLU

Add & Norm — Residual connection + layer norm

Output — Contextualized representations
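The flow listed above can be traced end to end as one encoder layer. The following is a minimal NumPy sketch under several assumptions: single-head attention for brevity, random untrained weights, no biases or learned layer-norm parameters, and the Add & Norm ordering shown in the list. The names `encoder_layer`, `layer_norm`, and `params` are illustrative, not taken from any library.

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    # Normalize each token's feature vector (learned scale/shift omitted)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def encoder_layer(x: np.ndarray, params: dict) -> np.ndarray:
    # 1. Self-attention (one head only, to keep the sketch short)
    q, k, v = x @ params["Wq"], x @ params["Wk"], x @ params["Wv"]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores) @ v @ params["Wo"]
    # 2. Add & Norm
    x = layer_norm(x + attn)
    # 3. Feed-forward: two linear layers with ReLU in between
    ffn = np.maximum(0, x @ params["W1"]) @ params["W2"]
    # 4. Add & Norm
    return layer_norm(x + ffn)

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 16, 64
shapes = {
    "Wq": (d_model, d_model), "Wk": (d_model, d_model),
    "Wv": (d_model, d_model), "Wo": (d_model, d_model),
    "W1": (d_model, d_ff), "W2": (d_ff, d_model),
}
params = {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}

# Random stand-in for "token embedding + positional encoding"
x = rng.normal(size=(seq_len, d_model))
out = encoder_layer(x, params)
print(out.shape)  # (5, 16)
```

The output has the same shape as the input, which is what allows identical layers to be stacked.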

Positional Encoding

Since self-attention processes all tokens simultaneously and has no built-in notion of order, the model needs another way to know where each token sits in the sequence. Positional encodings are vectors added to the token embeddings that carry this position information; the original Transformer builds them from sine and cosine waves of different frequencies.

Analogy: Positional encoding is like adding a "seat number" to each guest at a dinner table. Without it, the model would see a bag of words with no order — "dog bites man" and "man bites dog" would look identical.

import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Generate sinusoidal positional encodings."""
    pos = np.arange(seq_len)[:, np.newaxis]       # (seq_len, 1)
    dim = np.arange(d_model)[np.newaxis, :]        # (1, d_model)

    # Frequency decreases with dimension
    angle = pos / (10000 ** (2 * (dim // 2) / d_model))

    # Even indices: sine, Odd indices: cosine
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe

# Example: 10 positions, 8 dimensions
pe = positional_encoding(10, 8)
print("Shape:", pe.shape)  # (10, 8)
print("Position 0:", np.round(pe[0], 3))
print("Position 5:", np.round(pe[5], 3))
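To make the "seat number" analogy concrete, the sketch below compares the vector for "dog" in "dog bites man" (position 0) and in "man bites dog" (position 2). It assumes a tiny made-up vocabulary and a random stand-in embedding table; `sinusoidal_pe` restates the same formula as the function above so the snippet runs on its own.

```python
import numpy as np

def sinusoidal_pe(seq_len: int, d_model: int) -> np.ndarray:
    # Same sinusoidal formula as above, restated for self-containment
    pos = np.arange(seq_len)[:, None]
    dim = np.arange(d_model)[None, :]
    angle = pos / (10000 ** (2 * (dim // 2) / d_model))
    return np.where(dim % 2 == 0, np.sin(angle), np.cos(angle))

rng = np.random.default_rng(0)
vocab = {"dog": 0, "bites": 1, "man": 2}      # hypothetical toy vocabulary
embed = rng.normal(size=(len(vocab), 8))      # random stand-in embeddings

def encode(words: list, use_pe: bool) -> np.ndarray:
    x = embed[[vocab[w] for w in words]]
    if use_pe:
        x = x + sinusoidal_pe(len(words), embed.shape[1])
    return x

a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]

# "dog" sits at index 0 in sentence a and index 2 in sentence b
same_without_pe = np.allclose(encode(a, False)[0], encode(b, False)[2])
same_with_pe = np.allclose(encode(a, True)[0], encode(b, True)[2])

print(same_without_pe)  # True: the token looks the same wherever it appears
print(same_with_pe)     # False: position now changes the vector
```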

Multi-Head Attention

Instead of performing one attention operation, the Transformer runs multiple attention heads in parallel. Each head learns to focus on different relationships — one might attend to syntax, another to semantics, another to coreference. Their outputs are concatenated and projected.

Project the input into Q, K, V and split each into h heads (e.g., 8 or 12)
Each head performs scaled dot-product attention independently
Concatenate all head outputs
Project through a final linear layer to get the output

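With 8 heads and d_model=512, each head operates on d_k = 512 / 8 = 64 dimensions, so the total cost is similar to that of a single 512-dimensional head. The multi-head version tends to be more expressive because each head can learn its own attention pattern and so capture a different type of relationship.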
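The four steps can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: it assumes one unbatched sequence, random untrained projection matrices, no biases, and no masking. The function name `multi_head_attention` and its argument names are illustrative.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    seq_len, d_model = x.shape
    d_k = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_k)
        return t.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    # Step 1: project, then split Q, K, V into heads
    q, k, v = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)

    # Step 2: scaled dot-product attention, independently per head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # (heads, seq, seq)
    weights = softmax(scores)
    heads = weights @ v                               # (heads, seq, d_k)

    # Step 3: concatenate head outputs back to (seq_len, d_model)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)

    # Step 4: final linear projection
    return concat @ Wo, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 10, 512, 8
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (
    rng.normal(scale=d_model ** -0.5, size=(d_model, d_model)) for _ in range(4)
)

out, weights = multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads)
print(out.shape)      # (10, 512)
print(weights.shape)  # (8, 10, 10): one attention map per head
```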

Feed-Forward Network

After attention, each position passes through a feed-forward network (FFN): two linear layers with a ReLU activation between them. The same weights are applied to every position independently, which adds non-linearity without mixing information across positions. The inner dimension is typically 4× the model dimension (e.g., 2048 for d_model=512).
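A minimal NumPy sketch of this block, assuming random untrained weights and zero biases, is shown below. The last lines check the position-independence property: running the FFN on a single row gives the same result as running it on the whole sequence. The name `feed_forward` is illustrative.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    hidden = np.maximum(0, x @ W1 + b1)  # first linear layer + ReLU
    return hidden @ W2 + b2              # second linear layer

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 512, 2048   # inner dimension is 4x d_model

W1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))
out = feed_forward(x, W1, b1, W2, b2)
print(out.shape)  # (4, 512)

# Each position is processed on its own: row 0 does not depend on rows 1..3
row0_alone = feed_forward(x[:1], W1, b1, W2, b2)
print(np.allclose(out[0], row0_alone[0]))  # True
```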

Residual Connections & Layer Normalization

Two critical techniques make deep Transformers trainable:

Residual Connections: output = layer(x) + x. The identity path lets gradients flow directly through the network, which mitigates vanishing gradients
Layer Normalization: Normalizes each token's activations across the feature dimension to zero mean and unit variance, then applies a learned scale and shift, stabilizing training
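Both techniques combine into the "Add & Norm" step. The sketch below is one possible NumPy rendering, assuming the scale `gamma` is initialized to ones and the shift `beta` to zeros, and using a random linear map as a stand-in sublayer; `add_and_norm` is an illustrative name.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics are computed per token, across the feature dimension
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer, gamma, beta):
    # Residual connection first, then layer normalization
    return layer_norm(x + sublayer(x), gamma, beta)

rng = np.random.default_rng(0)
d_model = 8
# Deliberately shifted and scaled input to make the normalization visible
x = rng.normal(loc=3.0, scale=5.0, size=(4, d_model))
gamma, beta = np.ones(d_model), np.zeros(d_model)

W = rng.normal(scale=0.1, size=(d_model, d_model))  # stand-in sublayer weights
y = add_and_norm(x, lambda t: t @ W, gamma, beta)

print(np.round(y.mean(axis=-1), 6))  # close to 0 for every token
print(np.round(y.std(axis=-1), 3))   # close to 1 for every token
```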

Encoder vs Decoder

Encoder (BERT-style)

Bidirectional — sees full input
Self-attention attends to all positions
Used for understanding (classification, NER)
Examples: BERT, RoBERTa, DeBERTa

Decoder (GPT-style)

Autoregressive: each position sees only itself and earlier tokens
Causal mask prevents looking ahead
Used for generation (text, code)
Examples: GPT-4, Claude, LLaMA
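The difference between the two styles comes down to one mask on the attention scores. The following sketch, assuming a single head and random queries and keys, computes attention weights with and without a causal mask; `attention_weights` is an illustrative name.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_weights(q, k, causal):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        # True strictly above the diagonal marks future positions
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)  # exp(-inf) = 0 after softmax
    return softmax(scores)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(4, 8))

encoder_style = attention_weights(q, k, causal=False)  # every token sees every token
decoder_style = attention_weights(q, k, causal=True)   # token i sees tokens 0..i

print(np.round(decoder_style, 2))  # lower-triangular: zeros above the diagonal
```

In the causal case each row still sums to 1, but the weight is spread only over the current and earlier positions.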

Transformer vs RNN

Aspect	Transformer	RNN/LSTM
Parallelism	Fully parallel (all tokens at once)	Sequential (one token at a time)
Long-range deps	Direct attention to any position	Decays with distance
Training speed	Much faster (GPU-friendly)	Slower (sequential bottleneck)
Memory	O(n²) in sequence length n (attention matrix)	O(n), more memory-efficient
Inductive bias	Weak: order comes only from positional encodings, the rest is learned from data	Sequential bias built in
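To give the O(n²) memory row a sense of scale, the short calculation below estimates the size of the attention score matrices alone. It assumes a naive implementation that materializes the full matrix, one layer, one sequence, 8 heads, and float32 values; `attention_scores_bytes` is a hypothetical helper.

```python
def attention_scores_bytes(seq_len: int, num_heads: int = 8, bytes_per_value: int = 4) -> int:
    # One (seq_len x seq_len) score matrix per head
    return num_heads * seq_len * seq_len * bytes_per_value

for n in (1024, 2048, 4096):
    mib = attention_scores_bytes(n) / 2**20
    print(f"n={n}: {mib:.0f} MiB")
# n=1024: 32 MiB
# n=2048: 128 MiB
# n=4096: 512 MiB
```

Doubling the sequence length quadruples the memory for the scores.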

Key Takeaways

Transformers replace recurrence with self-attention — processing all tokens in parallel
Positional encodings inject sequence order information into the model
Multi-head attention lets the model focus on different relationship types simultaneously
Residual connections + layer norm make deep stacking possible
Encoder = bidirectional understanding; Decoder = autoregressive generation

Part of the Transformers & Large Language Models series on Tekivex.