Training Deep Neural Networks

Forward pass, backpropagation, loss functions, optimizers, and the full training process explained visually.

Intermediate · 20 min read

How Deep Networks Learn

Training a deep network is a four-step dance repeated thousands of times: push data forward through the network, measure the error, propagate the error backward to compute gradients, and update the weights. This loop (forward pass → loss → backward pass → update) is the foundation of all deep learning.

Flow:

Input — Feed batch of training data
Forward Pass — Compute predictions layer by layer
Loss Function — Measure prediction error
Backward Pass — Compute gradients via chain rule
Update Weights — Optimizer adjusts parameters
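
The four steps above can be sketched on the smallest possible model. The example below is only an illustration: the toy target y = 3x + 1, the learning rate, and the step count are assumptions picked for the demo, and the "network" is a single linear unit.

```python
import numpy as np

# Hypothetical toy problem: recover y = 3x + 1 with a single linear unit.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(64, 1))
y = 3.0 * X + 1.0

w = np.zeros((1, 1))
b = np.zeros(1)
lr = 0.1  # learning rate chosen for illustration only

for step in range(500):
    pred = X @ w + b                       # 1. forward pass
    err = pred - y
    loss = np.mean(err ** 2)               # 2. loss (MSE)
    grad_w = 2.0 * X.T @ err / len(X)      # 3. backward pass: dLoss/dw
    grad_b = 2.0 * err.mean(axis=0)        #                   dLoss/db
    w -= lr * grad_w                       # 4. update
    b -= lr * grad_b

print(f"w = {w[0, 0]:.3f}, b = {b[0]:.3f}, loss = {loss:.2e}")
```

Deeper networks change steps 1 and 3 (more layers, more chain-rule factors), but the shape of the loop stays the same.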

The Forward Pass

Data flows from input to output, layer by layer. Each layer applies: output = activation(weights × input + bias). The final layer produces the prediction.

NOTE: During the forward pass, we save intermediate values (activations) at each layer. These saved values are needed during backpropagation to compute gradients efficiently.
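
A minimal sketch of a forward pass that keeps those intermediate values might look like this. The `forward` helper, the `cache` layout, and the layer sizes are hypothetical, and ReLU is applied to every layer (including the last) purely to keep the example short.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, layers):
    """Run x through a list of (W, b) pairs, caching values for backprop."""
    cache = []
    a = x
    for W, b in layers:
        z = a @ W + b          # weights × input + bias
        cache.append((a, z))   # layer input and pre-activation, saved for the backward pass
        a = relu(z)            # activation
    return a, cache

rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(4, 8)), np.zeros(8)),   # 4 inputs → 8 hidden
    (rng.normal(size=(8, 2)), np.zeros(2)),   # 8 hidden → 2 outputs
]
x = rng.normal(size=(5, 4))                   # batch of 5 samples
out, cache = forward(x, layers)
print(out.shape)     # (5, 2)
print(len(cache))    # 2, one entry per layer
```

Storing both the layer input and the pre-activation is one possible choice: the input is needed for the weight gradient, and the pre-activation for the activation's derivative.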

Loss Functions

The loss function measures how wrong the predictions are. Different tasks use different loss functions:

Loss Function	Formula	Task	Behavior
MSE	(1/n) Σ(ŷ - y)²	Regression	Penalizes large errors quadratically
MAE	(1/n) Σ|ŷ - y|	Regression	Penalizes errors linearly; less sensitive to outliers than MSE
Binary Cross-Entropy	-[y·log(ŷ) + (1-y)·log(1-ŷ)]	Binary classification	Heavy penalty for confident wrong predictions
Categorical Cross-Entropy	-Σ yᵢ·log(ŷᵢ)	Multi-class classification	Works with softmax output layer
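
The losses in the table can be written in a few lines of NumPy. This is a sketch rather than a production implementation: the function names and the `EPS` clipping constant (used to keep `log` finite) are assumptions, and each loss is averaged over the batch.

```python
import numpy as np

EPS = 1e-12  # assumed clipping constant so log() never sees exactly 0

def mse(y_hat, y):
    return np.mean((y_hat - y) ** 2)

def mae(y_hat, y):
    return np.mean(np.abs(y_hat - y))

def binary_cross_entropy(y_hat, y):
    y_hat = np.clip(y_hat, EPS, 1.0 - EPS)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

def categorical_cross_entropy(y_hat, y_onehot):
    # y_hat: (batch, classes) probabilities, e.g. from softmax
    y_hat = np.clip(y_hat, EPS, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(y_hat), axis=1))

y = np.array([1.0, 0.0])
print(mse(np.array([0.5, 0.5]), y))                    # 0.25
print(mae(np.array([0.5, 0.5]), y))                    # 0.5
print(binary_cross_entropy(np.array([0.9, 0.1]), y))   # about 0.105 (confident and right)
print(binary_cross_entropy(np.array([0.1, 0.9]), y))   # about 2.303 (confident and wrong)
```

The last two lines show the behavior noted in the table: the same degree of confidence costs roughly 20 times more when the prediction is wrong.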

Backpropagation

Backpropagation is how the network figures out which weights to blame for the error. It applies the chain rule of calculus to flow gradients backward from the loss through each layer, computing how much each weight contributed to the error.

TIP: Analogy: Imagine a factory assembly line where the final product has a defect. Backpropagation traces the defect backward through each station to figure out which workers (weights) made mistakes and how much to correct them.
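
To make the chain rule concrete, here is a worked derivation for one small case. The setup is an assumption chosen for illustration: a two-layer network with a ReLU hidden layer, a sigmoid output, and MSE loss over n samples, where ⊙ denotes element-wise multiplication and the sums over i run over the batch.

```latex
\begin{aligned}
z_1 &= X W_1 + b_1, \quad a_1 = \mathrm{ReLU}(z_1), \\
z_2 &= a_1 W_2 + b_2, \quad a_2 = \sigma(z_2), \\
L &= \frac{1}{n} \sum_{i=1}^{n} (a_{2,i} - y_i)^2 \\[6pt]
\delta_2 &= \frac{\partial L}{\partial z_2} = \frac{2}{n}\,(a_2 - y) \odot a_2 \odot (1 - a_2), \\
\frac{\partial L}{\partial W_2} &= a_1^{\top} \delta_2, \qquad \frac{\partial L}{\partial b_2} = \sum_{i=1}^{n} \delta_{2,i}, \\
\delta_1 &= \frac{\partial L}{\partial z_1} = \left(\delta_2 W_2^{\top}\right) \odot \mathbf{1}[z_1 > 0], \\
\frac{\partial L}{\partial W_1} &= X^{\top} \delta_1, \qquad \frac{\partial L}{\partial b_1} = \sum_{i=1}^{n} \delta_{1,i}.
\end{aligned}
```

Each δ reuses the one from the layer after it, which is why gradients are computed from the output backward and why the forward-pass values (a₁, z₁, a₂) have to be kept around.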

import numpy as np

# Simple 2-layer network demonstrating backprop
np.random.seed(42)

# Network: 2 inputs → 3 hidden → 1 output
W1 = np.random.randn(2, 3) * 0.5   # input → hidden
b1 = np.zeros(3)
W2 = np.random.randn(3, 1) * 0.5   # hidden → output
b2 = np.zeros(1)

def relu(z):       return np.maximum(0, z)
def relu_grad(z):  return (z > 0).astype(float)
def sigmoid(z):    return 1 / (1 + np.exp(-z))

# Training data (XOR problem)
X = np.array([[0,0], [0,1], [1,0], [1,1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

lr = 0.5
for epoch in range(1000):
    # === FORWARD PASS ===
    z1 = X @ W1 + b1            # hidden pre-activation
    a1 = relu(z1)               # hidden activation
    z2 = a1 @ W2 + b2           # output pre-activation
    a2 = sigmoid(z2)            # output (prediction)

    # === LOSS ===
    loss = np.mean((a2 - y) ** 2)

    # === BACKWARD PASS (chain rule) ===
    dL_da2 = 2 * (a2 - y) / len(X)          # dLoss/dOutput
    da2_dz2 = a2 * (1 - a2)                  # sigmoid derivative
    dz2 = dL_da2 * da2_dz2                    # gradient at output

    dW2 = a1.T @ dz2                          # gradient for W2
    db2 = np.sum(dz2, axis=0)                 # gradient for b2

    da1 = dz2 @ W2.T                          # propagate to hidden
    dz1 = da1 * relu_grad(z1)                 # gradient at hidden

    dW1 = X.T @ dz1                           # gradient for W1
    db1 = np.sum(dz1, axis=0)                 # gradient for b1

    # === UPDATE WEIGHTS ===
    W2 -= lr * dW2;  b2 -= lr * db2
    W1 -= lr * dW1;  b1 -= lr * db1

    if epoch % 200 == 0:
        print(f"Epoch {epoch}: loss = {loss:.4f}")

print(f"\nPredictions: {a2.flatten().round(2)}")
print(f"Expected:    {y.flatten()}")
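
A common way to sanity-check hand-written backprop is to compare it against finite differences. The sketch below does this for the output-layer weights of a network with the same shape as the one above. The helper names, the random seed, and the step size `eps` are assumptions, and only `W2` is checked because the loss is smooth in `W2` (perturbing `W1` can cross the ReLU kink).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(W1, b1, W2, b2, X, y):
    a1 = np.maximum(0.0, X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    return np.mean((a2 - y) ** 2)

def analytic_grad_W2(W1, b1, W2, b2, X, y):
    # Same chain-rule steps as the training loop: dL/da2 * da2/dz2, then a1.T @ dz2
    a1 = np.maximum(0.0, X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    dz2 = 2.0 * (a2 - y) / len(X) * a2 * (1.0 - a2)
    return a1.T @ dz2

def numeric_grad_W2(W1, b1, W2, b2, X, y, eps=1e-6):
    # Central differences: nudge one weight at a time and watch the loss
    grad = np.zeros_like(W2)
    for idx in np.ndindex(*W2.shape):
        W_plus, W_minus = W2.copy(), W2.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (loss_fn(W1, b1, W_plus, b2, X, y)
                     - loss_fn(W1, b1, W_minus, b2, X, y)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 3)) * 0.5, np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)) * 0.5, np.zeros(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

max_diff = np.max(np.abs(analytic_grad_W2(W1, b1, W2, b2, X, y)
                         - numeric_grad_W2(W1, b1, W2, b2, X, y)))
print(f"max |analytic - numeric| = {max_diff:.2e}")
```

If the two gradients disagree by more than a tiny amount, the backward pass most likely has a bug such as a missing factor or a transposed matrix.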

Optimizers

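Optimizers decide how to turn gradients into weight updates. Vanilla SGD applies one global learning rate to every parameter, so each step is simply the gradient scaled by that rate; adaptive optimizers such as RMSProp and Adam rescale the step for each parameter using running statistics of its past gradients.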

Optimizer	Key Idea	Pros	Cons
SGD	Fixed learning rate for all params	Simple, well-understood	Slow convergence, sensitive to lr
SGD + Momentum	Accumulates past gradients	Faster, smoother convergence	Extra hyperparameter (momentum)
RMSProp	Adapts lr per parameter	Good for non-stationary problems	Can diverge in some cases
Adam	Momentum + RMSProp combined	Fast, works well out-of-the-box	May generalize worse than SGD

TIP: Rule of thumb: Start with Adam (lr=0.001). If you need the best possible performance and have time to tune, switch to SGD with momentum and a learning rate schedule.
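
The update rules behind the table can be sketched in a few lines each. These are simplified textbook forms, not any library's implementation: the function signatures and the `state` dictionary are hypothetical, and the default hyperparameters other than Adam's lr=0.001 are commonly used values assumed here for illustration.

```python
import numpy as np

def sgd(w, g, state, lr=0.1):
    return w - lr * g

def momentum(w, g, state, lr=0.1, beta=0.9):
    # Velocity is a decaying sum of past gradients
    state["v"] = beta * state.get("v", 0.0) + g
    return w - lr * state["v"]

def rmsprop(w, g, state, lr=0.01, rho=0.9, eps=1e-8):
    # Running average of squared gradients sets a per-parameter step scale
    state["s"] = rho * state.get("s", 0.0) + (1 - rho) * g ** 2
    return w - lr * g / (np.sqrt(state["s"]) + eps)

def adam(w, g, state, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    state["t"] = state.get("t", 0) + 1
    t = state["t"]
    state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * g        # momentum term
    state["v"] = b2 * state.get("v", 0.0) + (1 - b2) * g ** 2   # RMSProp term
    m_hat = state["m"] / (1 - b1 ** t)                          # bias correction
    v_hat = state["v"] / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

def grad(w):
    return 2.0 * w   # gradient of the toy objective f(w) = sum(w**2)

for name, step in [("sgd", sgd), ("momentum", momentum),
                   ("rmsprop", rmsprop), ("adam", adam)]:
    w, state = np.array([1.0, -2.0]), {}
    for _ in range(100):
        w = step(w, grad(w), state)
    print(f"{name:9s} f(w) after 100 steps = {np.sum(w ** 2):.6f}")
```

Note that on its first step Adam moves every parameter by almost exactly `lr`, regardless of gradient magnitude; this per-parameter normalization is what "adapts the step size" means in practice.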

Epochs, Batches, and Iterations

Term	Definition	Example (1000 samples, batch=100)
Epoch	One full pass through the training data	1 epoch = seeing all 1000 samples
Batch	A subset of data processed at once	100 samples per batch
Iteration	One forward + backward pass and weight update on one batch	10 iterations per epoch
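
The numbers in the table can be reproduced with a small batching helper. `iterate_minibatches` is a hypothetical name, the data is random filler, and reshuffling once per epoch is a common convention assumed here rather than a requirement.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs covering the data once (one epoch)."""
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))   # 1000 samples, as in the table
y = rng.normal(size=(1000, 1))

for epoch in range(3):
    iterations = 0
    for X_batch, y_batch in iterate_minibatches(X, y, batch_size=100, rng=rng):
        # forward pass, loss, backward pass, and update would go here
        iterations += 1
    print(f"epoch {epoch}: {iterations} iterations")   # 10 iterations per epoch
```

If the dataset size is not a multiple of the batch size, this version simply yields a smaller final batch.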

Key Takeaways

Training = forward pass → loss → backward pass → update, repeated many times
Backpropagation uses the chain rule to compute gradients layer by layer
MSE for regression, cross-entropy for classification
Adam optimizer is a great default; SGD+momentum for maximum performance
One epoch = one full pass through the dataset; batch size trades memory use against gradient noise (larger batches need more memory but give smoother gradient estimates)

Part of the AI & Machine Learning series on Tekivex.