What Are Large Language Models?
Large Language Models (LLMs) such as GPT-4 and Claude are Transformer-based models, generally understood to be decoder-only, trained on massive text corpora. Their core training objective is surprisingly simple: predict the next token. Yet from this objective they develop reasoning, coding, translation, and creative-writing abilities.
An LLM doesn't "understand" language the way humans do. It has learned statistical patterns from trillions of words. But these patterns are so rich and deep that the model can generate coherent text, answer questions, write code, and even reason about novel problems.
The LLM Pipeline
Input Text — "The capital of France is"
Tokenizer — Split into tokens: ["The", " capital", " of", " France", " is"]
Embeddings — Convert tokens to dense vectors
N Transformer Blocks — Self-attention + feed-forward network (FFN), stacked N times (for example, N = 96 in GPT-3 175B)
Logits — Score for every token in vocabulary
Sampling — Select next token ("Paris")
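The stages above can be sketched end to end in a few lines. This is a toy illustration, not how a production model is built: the seven-entry vocabulary, the random embedding table, and the `transformer_blocks` placeholder (a plain average instead of attention and FFN layers) are all assumptions made for the example. Because the weights are untrained, the selected token is arbitrary; the point is the data flow.

```python
import random

# Hypothetical toy vocabulary; real models use tens of thousands of subword tokens.
VOCAB = ["The", " capital", " of", " France", " is", " Paris", " London"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
DIM = 4  # embedding width (toy value)

rng = random.Random(0)
# Randomly initialised embedding table: one DIM-sized vector per vocabulary entry.
EMBEDDINGS = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in VOCAB]


def tokenize(text):
    """Greedy longest-match split against VOCAB (stand-in for a real tokenizer).

    Raises ValueError if the text contains something outside the toy vocabulary.
    """
    tokens = []
    while text:
        match = max((t for t in VOCAB if text.startswith(t)), key=len)
        tokens.append(match)
        text = text[len(match):]
    return tokens


def embed(token_ids):
    """Look up one dense vector per token id."""
    return [EMBEDDINGS[i] for i in token_ids]


def transformer_blocks(vectors):
    """Placeholder for the N Transformer blocks: here, just the mean of the inputs."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(DIM)]


def logits_from_hidden(hidden):
    """Score every vocabulary entry by dot product with its embedding."""
    return [sum(h * e for h, e in zip(hidden, emb)) for emb in EMBEDDINGS]


def next_token(text):
    tokens = tokenize(text)
    ids = [TOKEN_TO_ID[t] for t in tokens]
    hidden = transformer_blocks(embed(ids))
    logits = logits_from_hidden(hidden)
    best_id = max(range(len(VOCAB)), key=lambda i: logits[i])  # greedy selection
    return tokens, logits, VOCAB[best_id]


tokens, logits, choice = next_token("The capital of France is")
print(tokens)
print(len(logits), "logits; picked", repr(choice))
```

In a trained model the embedding table and block weights are learned, so the highest logit would land on " Paris" rather than on an arbitrary entry.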
Tokenization
LLMs don't work with raw characters or whole words. They use subword tokenization, such as Byte Pair Encoding (BPE), which splits text into frequently occurring chunks. Common words are usually single tokens; rare words are broken into several subwords.
| Text | Tokens | Token Count |
|---|---|---|
| "Hello world" | ["Hello", " world"] | 2 |
| "unbelievable" | ["un", "believ", "able"] | 3 |
| "GPT-4 is great!" | ["G", "PT", "-", "4", " is", " great", "!"] | 7 |
| "こんにちは" | ["こん", "にち", "は"] | 3 |
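The splits above are illustrative; exact token boundaries depend on the tokenizer. Vocabulary sizes typically range from tens of thousands to a few hundred thousand tokens: GPT-4's tokenizer (cl100k_base) has roughly 100,000. In English, a token averages roughly 3-4 characters. This balance lets the model represent any text while keeping sequence lengths manageable.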
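To make the idea of BPE concrete, here is a minimal sketch of the merge-learning step: start from single characters and repeatedly merge the most frequent adjacent pair. The word counts in `corpus` are invented for illustration, and the function names are hypothetical. Production tokenizers also operate on bytes, mark word boundaries, and use far larger corpora, none of which is shown here.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus_counts, num_merges):
    """Learn up to `num_merges` merges; returns (merges, segmented words)."""
    words = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # ties go to the first pair seen
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


# Hypothetical word frequencies, chosen only for illustration.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = train_bpe(corpus, num_merges=4)
print(merges)
print(list(segmented))
```

With these counts the first two merges are `e`+`s` and `es`+`t`, so the frequent suffix "est" becomes one symbol, and after four merges "low" is a single token while "lower" is split into "low", "e", "r". This mirrors the pattern in the table: common strings collapse into single tokens, rarer ones stay in pieces.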
Next-Token Prediction
The training objective is deceptively simple: given all previous tokens, predict the next one. The model outputs a probability distribution over its entire vocabulary, and the loss function (cross-entropy) pushes the model to assign high probability to the correct next token.
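As a small worked example of that loss, the sketch below turns logits into probabilities with softmax and computes cross-entropy as the negative log-probability of the correct token. The three-token vocabulary and the logit values are made up for illustration. Real training averages this quantity over every position in a batch.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_id):
    """Negative log-probability assigned to the correct next token."""
    return -math.log(softmax(logits)[target_id])


# Hypothetical scores for the token following "The capital of France is".
vocab = [" Paris", " London", " Rome"]
logits = [2.0, 1.0, 0.1]
target = vocab.index(" Paris")

probs = softmax(logits)
loss = cross_entropy(logits, target)
print([round(p, 3) for p in probs])  # about [0.659, 0.242, 0.099]
print(round(loss, 3))                # about 0.417
```

The loss shrinks toward 0 as the probability of the correct token approaches 1, and grows when the model puts its probability mass elsewhere. That is the pressure that shapes the weights during training.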
During inference, the model generates text one token at a time in an autoregressive loop: predict token → append to input → predict next token → repeat.
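The loop can be sketched as follows. `TOY_MODEL` is a hypothetical stand-in for a trained network: a lookup table keyed only on the last token, whereas a real LLM conditions on the entire sequence. The `<eos>` marker and the `max_new_tokens` cap are likewise assumptions for the example.

```python
# Hypothetical stand-in for a trained model: last token -> next-token scores.
TOY_MODEL = {
    "The": {" capital": 0.9, " city": 0.1},
    " capital": {" of": 1.0},
    " of": {" France": 0.7, " Spain": 0.3},
    " France": {" is": 1.0},
    " is": {" Paris": 0.8, " big": 0.2},
    " Paris": {"<eos>": 1.0},
}


def predict_next(tokens):
    """Return the highest-scoring next token given the sequence so far.

    A real LLM uses all previous tokens; this toy only looks at the last one.
    """
    scores = TOY_MODEL.get(tokens[-1], {"<eos>": 1.0})
    return max(scores, key=scores.get)


def generate(prompt_tokens, max_new_tokens=10):
    """Autoregressive loop: predict, append, repeat until <eos> or the cap."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = predict_next(tokens)
        if nxt == "<eos>":
            break
        tokens.append(nxt)  # the new token becomes part of the next input
    return tokens


print("".join(generate(["The"])))  # The capital of France is Paris
```

Each iteration requires a full model call, which is why generation cost grows with the number of output tokens.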
Sampling Strategies
When the model outputs logits (scores for each token), we need to choose which token to actually generate. Different sampling strategies control the creativity and randomness of the output:
| Strategy | How It Works | Effect |
|---|---|---|
| Greedy | Always pick the highest-probability token | Deterministic; prone to repetitive output |
| Temperature | Divide logits by T before softmax | T < 1: sharper (more focused); T > 1: flatter (more random) |
| Top-k | Sample only from the k most probable tokens | k=10: choose from the 10 best candidates |
| Top-p (Nucleus) | Keep the smallest set of most probable tokens whose cumulative probability ≥ p | p=0.9: dynamic number of candidates covering 90% of the probability mass |
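Rule of thumb: use temperature 0.0-0.3 for factual or code tasks (focused) and 0.7-1.0 for creative writing (varied). Since logits cannot be divided by zero, implementations typically treat a temperature of exactly 0 as greedy decoding. A top-p of 0.9-0.95 is a reasonable default for many use cases.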
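The strategies in the table can be sketched in plain Python. The function names and the example logits are assumptions for illustration, and inference libraries implement these steps on tensors with their own parameter conventions. Note that `softmax` here requires a temperature greater than 0; use `greedy` for the T = 0 case.

```python
import math
import random


def greedy(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def softmax(logits, temperature=1.0):
    """Divide logits by the temperature (must be > 0), then normalise."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def sample(candidates, rng=random):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(candidates)
    weights = [candidates[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


# Hypothetical logits for a five-token vocabulary.
logits = [4.0, 3.0, 2.0, 1.0, 0.0]
probs = softmax(logits)

print(greedy(logits))                         # 0
print(round(softmax(logits, 0.5)[0], 2))      # about 0.86: sharper
print(round(softmax(logits, 2.0)[0], 2))      # about 0.43: flatter
print(sorted(top_k_filter(probs, 2)))         # [0, 1]
print(sorted(top_p_filter(probs, 0.9)))       # [0, 1, 2]
print(sample(top_p_filter(probs, 0.9), random.Random(0)))
```

Top-k always keeps a fixed number of candidates, while top-p keeps more candidates when the distribution is flat and fewer when one token dominates.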
Scale and Emergent Abilities
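LLMs are reported to exhibit emergent abilities: capabilities that are largely absent in smaller models and appear, sometimes abruptly, as scale increases. For example, a model with around 1B parameters may score near chance on multi-digit arithmetic benchmarks, while models with 100B+ parameters do markedly better. Abilities often described as emergent include: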
- In-context learning: Learning from examples in the prompt without weight updates
- Chain-of-thought reasoning: Working through a problem in explicit intermediate steps
- Code generation: Writing and debugging functional code
- Translation: Translating between languages not explicitly paired in training
- Instruction following: Understanding and executing complex natural language instructions
Key Takeaways
- LLMs are typically decoder-only Transformers trained on next-token prediction
- Tokenization converts text to subword tokens (for example, with BPE), averaging roughly 3-4 characters each in English
- Generation is autoregressive: predict one token, append, repeat
- Temperature, top-k, and top-p control the randomness of generated text
- Abilities such as in-context learning, reasoning, and coding are reported to emerge at large scales