Hyperparameter Tuning

Learning rate, batch size, and architecture choices — grid search, random search, and Bayesian optimization.

Intermediate · 14 min read

What Are Hyperparameters?

Hyperparameters are settings you choose before training — they control how the model learns, not what it learns. Unlike model parameters (weights), hyperparameters aren't learned from data; they're set by the engineer.

Hyperparameter	What It Controls	Typical Range	Impact
Learning Rate	Step size for weight updates	1e-5 to 1e-1	Most important — too high: diverge, too low: stuck
Batch Size	Samples per gradient update	16 to 512	Large: stable but may generalize worse
Epochs	Passes through entire dataset	10 to 1000	Too few: underfit, too many: overfit
Hidden Layers	Network depth	1 to 100+	Deeper = more capacity, harder to train
Neurons per Layer	Layer width	32 to 4096	Wider = more capacity, more memory
Dropout Rate	Fraction of neurons to randomly disable	0.1 to 0.5	Regularization — prevents overfitting
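To make the parameter/hyperparameter distinction concrete, here is a minimal sketch in plain Python that fits y = w * x by gradient descent. The function name `fit_slope` and the toy data are assumptions made for illustration: `learning_rate` and `epochs` are hyperparameters chosen by the caller, while `w` is the parameter the loop learns from the data.

```python
def fit_slope(xs, ys, learning_rate=0.05, epochs=200):
    """Fit y = w * x by gradient descent on mean squared error.

    learning_rate and epochs are hyperparameters (set before training);
    w is a model parameter (learned from the data).
    """
    w = 0.0  # parameter: arbitrary starting value, updated by training
    n = len(xs)
    for _ in range(epochs):
        # Gradient of mean((w * x - y) ** 2) with respect to w
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w

# Toy data generated from y = 2x (illustrative only)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
learned_w = fit_slope(xs, ys)
print(f"learned w = {learned_w:.4f}")
```

Changing `learning_rate` or `epochs` changes how training proceeds, but neither value is ever updated by the training loop itself.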

The Learning Rate — Most Critical Hyperparameter

The learning rate controls how much weights change each step. Getting it right is the single most important tuning decision.

Analogy: Learning rate is like the stride length when walking down a hill. Too big — you overshoot the valley and bounce around. Too small — you take forever to reach the bottom. Just right — smooth, efficient descent.
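The stride analogy can be reproduced on a one-dimensional "hill". The sketch below runs gradient descent on f(x) = x**2, whose minimum is at x = 0. The function `descend` and the three learning rates are assumptions chosen for this toy function only; the values at which a real model diverges will differ.

```python
def descend(lr, steps=50, start=1.0):
    """Run gradient descent on f(x) = x**2 and return the final x."""
    x = start
    for _ in range(steps):
        grad = 2.0 * x   # derivative of x**2
        x -= lr * grad   # one step of size lr against the gradient
    return x

too_high = descend(lr=1.1)    # overshoots: each step multiplies x by -1.2
too_low = descend(lr=0.001)   # crawls: x shrinks by only 0.2% per step
good = descend(lr=0.3)        # converges: x shrinks by 60% per step

print(f"lr=1.1   -> x = {too_high:.3e}")
print(f"lr=0.001 -> x = {too_low:.3f}")
print(f"lr=0.3   -> x = {good:.3e}")
```

After 50 steps the largest stride has moved far away from the minimum, the smallest has barely left the starting point, and the middle one has effectively reached the bottom.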

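Too high (often 0.1 or above): Loss oscillates wildly or explodes to NaN
Too low (around 1e-6 or below): Training is extremely slow and may stall on plateaus or in poor local minima
Sweet spot (often 1e-4 to 1e-3 with Adam-style optimizers): Smooth convergence on many problems; the exact range depends on the optimizer, model, and batch size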
Learning rate schedule: Ramp up from a small value (warmup), then reduce over time (decay)
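A warmup-plus-decay schedule can be written as a plain function of the training step. The sketch below assumes linear warmup followed by cosine decay, which is one common choice among several. The name `lr_at_step` and the default values for `peak_lr`, `warmup_steps`, and `min_lr` are illustrative assumptions, not recommendations from this article.

```python
import math

def lr_at_step(step, total_steps, peak_lr=1e-3, warmup_steps=100, min_lr=1e-5):
    """Learning rate for a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Warmup: rise linearly from peak_lr / warmup_steps up to peak_lr
        return peak_lr * (step + 1) / warmup_steps
    # Decay: progress runs from 0 (end of warmup) to 1 (end of training)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

total = 1000
for step in (0, 50, 99, 500, 1000):
    print(f"step {step:4d}: lr = {lr_at_step(step, total):.6f}")
```

In a training loop, the returned value would be assigned to the optimizer's learning rate before each update.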

Search Strategies

Grid Search

Try every combination of preset values
Exhaustive but exponentially expensive
Good for 2-3 hyperparameters
Wastes time on unimportant dimensions
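The exponential cost is easy to see by counting combinations. The sketch below uses hypothetical value lists (the `dropout` and `layers` entries are added purely for illustration) and shows how the trial count grows as hyperparameters are added.

```python
import math
from itertools import product

# Three values for each of three hyperparameters
grid = {
    "lr": [1e-4, 1e-3, 1e-2],
    "batch_size": [32, 64, 128],
    "hidden": [64, 128, 256],
}

def grid_size(grid):
    """Number of trials a full grid search needs."""
    return math.prod(len(values) for values in grid.values())

combos = list(product(*grid.values()))  # every combination, as tuples
small = grid_size(grid)                 # 3 * 3 * 3 = 27

# Adding two more hyperparameters with three values each
bigger = dict(grid, dropout=[0.1, 0.3, 0.5], layers=[1, 2, 4])
large = grid_size(bigger)               # 3 ** 5 = 243

print(f"{len(grid)} hyperparameters -> {small} trials")
print(f"{len(bigger)} hyperparameters -> {large} trials")
```

Each added hyperparameter multiplies the number of full training runs by the number of values tried for it.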

Random Search

Sample random combinations
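More efficient: every trial tests a new value of each hyperparameter
Better for high-dimensional spaces
Often finds good solutions in fewer trials than grid search, especially when only a few hyperparameters matter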
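One way to see the difference is to count how many distinct learning rates each strategy actually tests under the same budget. The sketch below assumes a budget of 9 trials over two hypothetical hyperparameters (learning rate and dropout); the ranges and the fixed seed are illustrative choices.

```python
import random
from itertools import product

rng = random.Random(0)  # fixed seed so the sketch is reproducible
budget = 9

# Grid search: 3 x 3 grid uses the whole budget
lr_grid = [1e-4, 1e-3, 1e-2]
dropout_grid = [0.1, 0.3, 0.5]
grid_trials = list(product(lr_grid, dropout_grid))

# Random search: same budget, lr sampled log-uniformly, dropout uniformly
random_trials = [
    (10 ** rng.uniform(-4, -2), rng.uniform(0.1, 0.5))
    for _ in range(budget)
]

grid_lrs = {lr for lr, _ in grid_trials}
random_lrs = {lr for lr, _ in random_trials}

print(f"grid:   {len(grid_trials)} trials, {len(grid_lrs)} distinct learning rates")
print(f"random: {len(random_trials)} trials, {len(random_lrs)} distinct learning rates")
```

If the learning rate matters far more than dropout, the grid has effectively tested only three candidates, while random search has tested nine.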

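Bayesian optimization is more sample-efficient than either: it fits a surrogate model of the objective (validation score as a function of the hyperparameters) to the trials run so far, then uses that model to pick the next configuration, balancing promising regions against unexplored ones. Because each choice is informed by previous trials, it typically reaches good values in fewer trials, at the cost of running trials mostly sequentially.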
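The loop behind Bayesian optimization can be sketched in a few dozen lines of NumPy. This is a simplified illustration under stated assumptions, not how any particular library implements it: the surrogate is a zero-mean Gaussian process with an RBF kernel, the next point is chosen by an upper-confidence-bound rule, the search is over a single hyperparameter (log10 of the learning rate), and `objective` is a hypothetical noiseless stand-in for a real training run.

```python
import numpy as np

def objective(log_lr: float) -> float:
    """Hypothetical validation score, peaked at lr = 1e-3 (log_lr = -3)."""
    return 95.0 - 10.0 * (log_lr + 3.0) ** 2

def rbf_kernel(a: np.ndarray, b: np.ndarray, length_scale: float = 0.7) -> np.ndarray:
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_new, jitter=1e-6):
    """Posterior mean and standard deviation of the surrogate at x_new."""
    K = rbf_kernel(x_obs, x_obs) + jitter * np.eye(len(x_obs))
    K_s = rbf_kernel(x_obs, x_new)
    mean = K_s.T @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))

def bayes_opt(n_trials: int = 10, kappa: float = 2.0):
    """Return (best log_lr, best score) after n_trials evaluations."""
    candidates = np.linspace(-5.0, -1.0, 81)   # lr from 1e-5 to 1e-1
    x_obs = np.array([-5.0, -1.0])             # two initial trials at the range ends
    y_obs = np.array([objective(x) for x in x_obs])
    for _ in range(n_trials - len(x_obs)):
        # Standardize scores so the zero-mean, unit-variance prior is sensible
        y_scaled = (y_obs - y_obs.mean()) / (y_obs.std() + 1e-9)
        mean, std = gp_posterior(x_obs, y_scaled, candidates)
        # Upper confidence bound: high predicted score or high uncertainty
        x_next = candidates[np.argmax(mean + kappa * std)]
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, objective(x_next))
    best = int(np.argmax(y_obs))
    return float(x_obs[best]), float(y_obs[best])

best_log_lr, best_score = bayes_opt()
print(f"best lr ~= 1e{best_log_lr:.2f}, score = {best_score:.2f}")
```

The `kappa` parameter controls the trade-off: larger values favor unexplored regions, smaller values favor regions the surrogate already predicts to be good.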

Hyperparameter Search Implementation

import numpy as np
from itertools import product

np.random.seed(0)  # fix the seed so the simulated results are reproducible

def train_and_evaluate(lr: float, batch_size: int, hidden: int) -> float:
    """Simulate training — returns validation accuracy."""
    # In real code, this trains a model and returns val accuracy
    # Here we simulate with a function that has an optimum
    score = (
        -10 * (np.log10(lr) + 3.0) ** 2    # optimum lr ~= 1e-3
        - 0.001 * (batch_size - 64) ** 2     # optimum batch ~= 64
        - 0.0001 * (hidden - 128) ** 2       # optimum hidden ~= 128
        + 95.0                                # max accuracy
        + np.random.randn() * 0.5            # noise
    )
    return float(np.clip(score, 0.0, 100.0))  # clamp to a valid percentage

# --- Grid Search ---
print("=== Grid Search ===")
best_score, best_params = float("-inf"), {}  # any real score beats -inf
lr_options = [1e-4, 1e-3, 1e-2]
batch_options = [32, 64, 128]
hidden_options = [64, 128, 256]

trials = 0
for lr, bs, hid in product(lr_options, batch_options, hidden_options):
    score = train_and_evaluate(lr, bs, hid)
    trials += 1
    if score > best_score:
        best_score = score
        best_params = {'lr': lr, 'batch': bs, 'hidden': hid}

print(f"Grid: {trials} trials, best={best_score:.2f}%, params={best_params}")

# --- Random Search ---
print("\n=== Random Search ===")
best_score, best_params = float("-inf"), {}
n_trials = 27  # same budget as the grid search
for _ in range(n_trials):
    lr  = 10 ** np.random.uniform(-4, -1)        # log-uniform, 1e-4 to 1e-1
    bs  = int(2 ** np.random.uniform(4, 8))      # 16 to 256
    hid = int(2 ** np.random.uniform(5, 9))      # 32 to 512
    score = train_and_evaluate(lr, bs, hid)
    if score > best_score:
        best_score = score
        best_params = {'lr': f'{lr:.5f}', 'batch': bs, 'hidden': hid}

print(f"Random: {n_trials} trials, best={best_score:.2f}%, params={best_params}")

Practical Tuning Tips

Start with published defaults (Adam lr=1e-3, batch=32, dropout=0.1)
Tune learning rate first — it has the biggest impact
Use random search over grid search — more efficient for 3+ hyperparameters
Use a learning rate finder: increase lr exponentially over a short run, plot loss against lr, and pick a value where the loss is falling fastest, below the point where it starts to diverge
Monitor validation loss for early stopping — stop training when it starts rising
Use cross-validation for robust estimates, especially with small datasets
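The early-stopping tip is usually implemented with a patience window, so that a single noisy epoch does not end training. The sketch below is one such rule under assumed names (`early_stopping_epoch`, `patience`) and a made-up list of validation losses; it stops once the loss has failed to improve for `patience` consecutive epochs.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for per-epoch validation losses.

    Training stops at the first epoch where the loss has not improved
    for `patience` epochs in a row; epochs are counted from 0.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # new best: reset the window
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch              # patience exhausted
    return len(val_losses) - 1, best_epoch        # never triggered

# Illustrative losses: improvement stalls after epoch 3
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.56, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(f"stop at epoch {stop_epoch}, restore weights from epoch {best_epoch}")
```

In practice the model weights from `best_epoch` are the ones to keep, since later epochs have started to overfit.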

For production systems, consider automated tools like Optuna (Bayesian optimization), Ray Tune (distributed tuning), or W&B Sweeps (experiment tracking + tuning combined).

Key Takeaways

Hyperparameters control how the model learns — learning rate is the most critical
Grid search is exhaustive but scales poorly; random search is more efficient
Bayesian optimization learns from past trials and typically needs fewer of them to find good values
Start with established defaults, then tune the most impactful hyperparameters first

Part of the Training, Optimization & Deployment series on Tekivex.