RNNs & Sequence Models

Recurrent neural networks, vanishing gradients, LSTM cells, and comparison of sequence model architectures.

Intermediate · 16 min read

Why Sequences Need Special Networks

Standard feedforward neural networks process each input independently and keep no memory of earlier inputs. Many real-world problems, however, involve sequences where order matters: text (word order), time series (temporal patterns), audio (sound over time), and video (frames over time). Recurrent Neural Networks (RNNs) address this by maintaining a hidden state that acts as memory.

TIP: Analogy: Reading a sentence word by word. When you reach the word "bank," you need context from earlier words to know if it means a river bank or a financial bank. RNNs keep a running summary of everything they've seen so far.

How RNNs Work

At each time step, the RNN takes two inputs: the current data point x(t) and the previous hidden state h(t-1). It produces a new hidden state h(t) and optionally an output. The same cell, with the same weights, is applied at every step, and the hidden state carries information forward through the sequence.

Flow:

x(t) — Current input (e.g., word embedding)
h(t-1) — Previous hidden state (memory)
RNN Cell — h(t) = tanh(W_h · h(t-1) + W_x · x(t) + b)
h(t) — New hidden state → next step
Output — Optional: y(t) = W_y · h(t)
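The flow above can be sketched in plain Python. This is a minimal, dependency-free illustration rather than a production implementation (real code would normally use a numerical library). The helper names `matvec`, `rnn_step`, `rnn_forward`, and `rand_matrix`, along with the sizes and the random initialization, are assumptions made for the example.

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One step of the cell: h(t) = tanh(W_h * h(t-1) + W_x * x(t) + b)."""
    pre = [a + c + d for a, c, d in zip(matvec(W_h, h_prev), matvec(W_x, x_t), b)]
    return [math.tanh(p) for p in pre]

def rnn_forward(xs, h0, W_x, W_h, b):
    """Apply the same cell (same weights) at every time step."""
    h, states = h0, []
    for x_t in xs:
        h = rnn_step(x_t, h, W_x, W_h, b)  # h(t) becomes h(t-1) for the next step
        states.append(h)
    return states

def rand_matrix(rows, cols, rng):
    """Small random weights; the range is an arbitrary choice for this sketch."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
input_size, hidden_size, seq_len = 3, 4, 5  # illustrative sizes
W_x = rand_matrix(hidden_size, input_size, rng)
W_h = rand_matrix(hidden_size, hidden_size, rng)
b = [0.0] * hidden_size
xs = [[rng.uniform(-1, 1) for _ in range(input_size)] for _ in range(seq_len)]

states = rnn_forward(xs, [0.0] * hidden_size, W_x, W_h, b)
print(len(states), len(states[-1]))  # one hidden state per step, each of size hidden_size
```

Because every hidden state depends on the previous one, the loop in `rnn_forward` cannot be reordered; this is the sequential dependency that makes RNNs hard to parallelize.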

The Vanishing Gradient Problem

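When sequences are long, gradients must flow backward through many time steps during backpropagation (known as backpropagation through time). At each step the gradient is multiplied by the recurrent weight matrix W_h and by the derivative of tanh, which is at most 1. If these factors are consistently smaller than 1, the gradient shrinks exponentially toward zero (vanishes), and the network can't learn long-range dependencies. If they are consistently larger than 1, the gradient can instead grow without bound (explode).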

CAUTION: As a rough rule of thumb, a vanilla RNN struggles to connect information more than about 10-20 time steps apart. If a sentence starts with "The cat, which ate the fish that was caught by the fisherman who lived near the...", then by the time the verb arrives, the RNN may have effectively forgotten the subject.
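A small numerical experiment can make the shrinking gradient concrete. The sketch below assumes a scalar RNN, h(t) = tanh(w * h(t-1) + x(t)), so the weight matrix is a single number `w`; the function name `gradient_through_time`, the weight 0.9, and the constant input 0.5 are all illustrative choices. By the chain rule, the derivative of h(T) with respect to h(0) is the product of the per-step factors w * (1 - h(t)^2).

```python
import math

def gradient_through_time(w, xs, h0=0.0):
    """Track |d h(t) / d h(0)| for a scalar RNN h(t) = tanh(w * h(t-1) + x(t))."""
    h, grad, history = h0, 1.0, []
    for x_t in xs:
        h = math.tanh(w * h + x_t)
        grad *= w * (1.0 - h * h)  # local derivative of this step (chain rule)
        history.append(abs(grad))
    return history

xs = [0.5] * 50  # 50 time steps with a constant input
history = gradient_through_time(w=0.9, xs=xs)
for step in (1, 10, 20, 50):
    print(f"step {step:2d}: |dh(t)/dh(0)| = {history[step - 1]:.3e}")
```

With |w| < 1 every factor is smaller than 1 in magnitude, so the printed values fall rapidly toward zero: the earliest input has almost no influence on the gradient at the end of the sequence.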

LSTM — Long Short-Term Memory

LSTMs mitigate the vanishing gradient problem with a gated architecture: they add a cell state (a long-term memory highway) and three gates that control what to forget, what to store, and what to output.

Forget Gate: Decides what to remove from cell state ("forget this old info")
Input Gate: Decides what new info to store ("remember this")
Output Gate: Decides what part of cell state to output ("share this")
Cell State: Highway that carries info across many time steps with minimal modification
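The gates and cell state listed above can be sketched as a single LSTM step in plain Python. This follows the commonly used LSTM formulation, which also includes a tanh "candidate" block that proposes new values for the input gate to admit; that block is not named in the list above. The function names, the `params` dictionary layout, and the sizes are assumptions for illustration.

```python
import math
import random

def sigmoid(z):
    """Logistic function: squashes z into (0, 1), used for the gates."""
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W * v + b for a matrix W (list of rows) and vectors b, v."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params maps a block name to its (W, b) pair."""
    z = h_prev + x_t  # list concatenation: [h(t-1), x(t)]
    f = [sigmoid(a) for a in affine(*params["forget"], z)]       # what to forget
    i = [sigmoid(a) for a in affine(*params["input"], z)]        # what to store
    o = [sigmoid(a) for a in affine(*params["output"], z)]       # what to output
    g = [math.tanh(a) for a in affine(*params["candidate"], z)]  # proposed new values
    # Cell state update is additive: keep part of the old state, add gated new info.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

def init_params(input_size, hidden_size, seed=0):
    """Random weights and zero biases for the four blocks (illustrative init)."""
    rng = random.Random(seed)
    def block():
        W = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size + input_size)]
             for _ in range(hidden_size)]
        return W, [0.0] * hidden_size
    return {name: block() for name in ("forget", "input", "output", "candidate")}

params = init_params(input_size=3, hidden_size=4)
h, c = [0.0] * 4, [0.0] * 4
for x_t in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    h, c = lstm_step(x_t, h, c, params)
print(h)
print(c)
```

When the forget gate is close to 1 and the input gate is close to 0, the line computing `c` copies the old cell state almost unchanged, which is why the cell state is described as a highway.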

RNN vs LSTM vs GRU

Feature	Vanilla RNN	LSTM	GRU
Memory mechanism	Single hidden state	Cell state + hidden state	Combined hidden state
Gates	None	3 (forget, input, output)	2 (reset, update)
Long-range dependencies	Poor (vanishing gradient)	Good (cell state highway)	Good (gated hidden state)
Parameters	Fewest	Most (about 4× a vanilla RNN: 3 gates + candidate)	Fewer than LSTM (about 3× a vanilla RNN)
Training speed	Fast but unstable	Slower but stable	Middle ground
Use when	Short sequences only	Default choice, long sequences	Want LSTM-like perf, fewer params
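The parameter row of the table can be checked with a quick count. The sketch assumes each weight block consists of an input matrix, a recurrent matrix, and one bias vector, and that a vanilla RNN has 1 such block, a GRU 3, and an LSTM 4 (three gates plus the candidate). The function name and the sizes 128 and 256 are illustrative; some frameworks keep two bias vectors per block, so their reported totals are slightly higher.

```python
def recurrent_param_count(input_size, hidden_size, n_blocks):
    """Count weights for n_blocks blocks of (W_x, W_h, bias)."""
    per_block = (hidden_size * input_size     # W_x: input -> hidden
                 + hidden_size * hidden_size  # W_h: hidden -> hidden
                 + hidden_size)               # bias
    return n_blocks * per_block

BLOCKS = {"Vanilla RNN": 1, "GRU": 3, "LSTM": 4}  # assumed block counts per cell type

counts = {name: recurrent_param_count(128, 256, n) for name, n in BLOCKS.items()}
for name, total in counts.items():
    print(f"{name:12s} {total:>8,d}")
# Vanilla RNN    98,560
# GRU           295,680
# LSTM          394,240
```

At equal input and hidden sizes, a GRU therefore carries about three quarters of an LSTM's recurrent parameters.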

RNN / LSTM / GRU	Transformers
Process sequences one step at a time	Process entire sequence at once (parallel)
Inherently sequential — hard to parallelize	Highly parallelizable — faster training
Good for short-to-medium sequences	Attention links distant positions directly (cost grows quadratically with sequence length)
Established, well-understood	State of the art for NLP, vision, audio
Largely replaced by Transformers for NLP	Require more data and compute
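To illustrate the "entire sequence at once" column, here is a heavily simplified sketch of scaled dot-product self-attention in plain Python. It is not a Transformer: it assumes the queries, keys, and values are the raw inputs themselves, and it omits the learned projections, multiple heads, and positional information that real models use. The function names are illustrative.

```python
import math

def softmax(scores):
    """Turn scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs):
    """Simplified self-attention: queries, keys, and values are all xs."""
    d = len(xs[0])
    outputs = []
    for q in xs:  # each position is computed independently of the other outputs
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        weights = softmax(scores)
        # Weighted average of all positions, however far apart they are.
        outputs.append([sum(w * v[j] for w, v in zip(weights, xs)) for j in range(d)])
    return outputs

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(sequence):
    print([round(v, 3) for v in row])
```

Unlike the recurrent loop, no output here depends on another output, so all positions could be computed in parallel. The price is that every position is compared with every other position, which is where the quadratic cost comes from.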

When to Use Sequence Models

Time series forecasting — Stock prices, weather, sensor data
Text generation — Character or word-level language models
Speech recognition — Converting audio waveforms to text
Machine translation — Sequence-to-sequence (now mostly Transformers)
Music generation — Creating melodies note by note

NOTE: While Transformers have largely replaced RNNs for NLP tasks, LSTMs and GRUs remain useful for time-series data, on-device ML (smaller models), and situations where you need to process data one step at a time.

Key Takeaways

RNNs process sequences by maintaining a hidden state (memory) across time steps
Vanilla RNNs suffer from vanishing gradients — they forget long-range context
LSTMs add cell state + gates to carry information across long sequences
GRUs are a simpler, often equally effective alternative to LSTMs
Transformers have largely replaced RNNs for most modern NLP tasks

Part of the AI & Machine Learning series on Tekivex.