The Model Training Pipeline

Full production ML pipeline from data collection through deployment and monitoring — every step explained.

Intermediate · 16 min read

The Full ML Pipeline

Building a production ML system is much more than training a model. Data collection, cleaning, feature engineering, validation, deployment, and monitoring each require careful attention. In production, the model code is often only a small fraction of the total system; the "less than 5%" figure is a common rule of thumb, not a measured constant.

Data Collection — Gather raw data from APIs, DBs, logs

Data Cleaning — Fix missing values, remove duplicates

Augmentation — Expand dataset with synthetic variations

Feature Engineering — Create informative input features

Train — Fit model on training set

Validate — Tune hyperparameters on val set

Test — Final evaluation on held-out data

Deploy — Serve model in production

Monitor — Track drift, errors, performance
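The Train, Validate, and Test steps only work if the three splits never overlap. A minimal sketch of that split, assuming a 70/15/15 ratio and a fixed seed (both illustrative choices); `split_dataset` is a hypothetical helper, not part of any library:

```python
import random

def split_dataset(records, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle once with a fixed seed, then cut into disjoint train/val/test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible

    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)

    test = shuffled[:n_test]                 # touched only for the final evaluation
    val = shuffled[n_test:n_test + n_val]    # used for hyperparameter tuning
    train = shuffled[n_test + n_val:]        # used to fit the model
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

A random shuffle is only one option. For time-ordered data, splitting by date usually gives a more honest estimate, because the model is then evaluated on records newer than anything it was trained on.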

Data Collection

The quality of your data determines the ceiling of your model's performance. No amount of training can fix bad data.

APIs & Databases: Structured data from internal systems
Web Scraping: Collecting public data (respecting robots.txt)
User-Generated: Labels from annotations, feedback, interactions
Synthetic Data: Generated data to fill gaps or handle rare cases
Third-Party Datasets: Pre-existing datasets (Kaggle, HuggingFace, etc.)
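For the web scraping case, "respecting robots.txt" can be checked in code before any page is requested. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the `my-crawler` user agent, and the example.com URLs are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def build_robot_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can answer can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

rules = build_robot_rules(EXAMPLE_ROBOTS_TXT)
allowed = rules.can_fetch("my-crawler", "https://example.com/listings/1")
blocked = rules.can_fetch("my-crawler", "https://example.com/private/admin")
print(allowed, blocked)
```

Against a live site, `set_url()` followed by `read()` on the same parser fetches the real file instead of parsing a string.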

Data Cleaning & Preprocessing

Problem	Solution	Tool/Method
Missing values	Impute with mean/median/mode or drop	pandas fillna(), SimpleImputer
Duplicates	Identify and remove duplicate records	pandas drop_duplicates()
Outliers	Cap at percentiles or remove	IQR method, Z-score filtering
Inconsistent formats	Standardize dates, categories, units	Custom parsing, regex
Class imbalance	Oversample minority, undersample majority (on the training split only)	SMOTE, random oversampling
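The first three rows of the table can be sketched without any dependencies. This is an illustrative, standard-library version of what `drop_duplicates()`, median imputation, and IQR capping do; `clean_rows` and the sample records are hypothetical:

```python
import statistics

def clean_rows(rows, column):
    """Drop exact duplicates, impute missing values with the median, cap outliers by IQR."""
    # 1. Remove exact duplicate records while preserving order.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))

    # 2. Impute missing values with the median of the observed values.
    observed = [r[column] for r in unique if r[column] is not None]
    median = statistics.median(observed)
    for r in unique:
        if r[column] is None:
            r[column] = median

    # 3. Cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    q1, _, q3 = statistics.quantiles(observed, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    for r in unique:
        r[column] = min(max(r[column], low), high)
    return unique

sqft_values = [1000, 1100, 1200, 1300, None, 1400, 1500, 1600, 50000]
rows = [{'id': i, 'sqft': v} for i, v in enumerate(sqft_values)]
rows.append(dict(rows[0]))  # exact duplicate of the first record

cleaned = clean_rows(rows, 'sqft')
print(len(rows), '->', len(cleaned), 'rows')
print('max sqft after capping:', max(r['sqft'] for r in cleaned))
```

In a real pipeline the median and the IQR bounds would be computed on the training split and then reused for validation, test, and production data, so that no statistics leak from held-out data into training.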

Feature Engineering

Feature engineering transforms raw data into informative inputs that help the model learn. Good features can make a simple model outperform a complex one with raw features.

# Example: feature engineering for house price prediction
def engineer_features(data: dict, current_year: int = 2026) -> dict:
    """Transform raw data into ML-ready features."""
    features = {}

    # Numeric: scale by assumed maxima (5000 sqft, 10 bedrooms);
    # values stay in [0, 1] only while inputs stay below those maxima
    features['sqft_norm'] = data['sqft'] / 5000.0
    features['bedrooms_norm'] = data['bedrooms'] / 10.0

    # Derived: create new informative features.
    # 'price' is the prediction target, so it is deliberately not used here:
    # a price_per_sqft input would leak the label into the features.
    features['sqft_per_bedroom'] = data['sqft'] / max(data['bedrooms'], 1)
    features['room_ratio'] = data['bedrooms'] / max(data['bathrooms'], 1)
    features['age'] = current_year - data['year_built']

    # Binning: convert continuous to categorical
    features['age_bucket'] = (
        'new' if features['age'] < 10
        else 'mid' if features['age'] < 30
        else 'old'
    )

    # Interaction: combine features
    features['size_x_bedrooms'] = features['sqft_norm'] * features['bedrooms_norm']

    return features

# Raw data
house = {'sqft': 2000, 'bedrooms': 3, 'bathrooms': 2,
         'price': 450000, 'year_built': 2005}

features = engineer_features(house)
for k, v in features.items():
    print(f"  {k}: {v}")
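The fixed divisors in the example (5000 and 10) are hand-picked constants. A common alternative is to learn the scaling range from the training split and reuse it unchanged at inference time. A minimal sketch, assuming min-max scaling with clipping; `fit_min_max` and the sample values are hypothetical:

```python
def fit_min_max(train_values):
    """Learn min and max from training data and return a scaling function."""
    lo, hi = min(train_values), max(train_values)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant column

    def transform(x):
        # Clip so unseen values outside the training range stay in [0, 1].
        return min(max((x - lo) / span, 0.0), 1.0)

    return transform

train_sqft = [800, 1500, 2000, 3200]
scale_sqft = fit_min_max(train_sqft)

print(scale_sqft(2000))  # 0.5
print(scale_sqft(4000))  # 1.0 (clipped: larger than anything seen in training)
```

Fitting the range on training data only keeps validation and test statistics out of the features, which is the same discipline that keeps the target out of the inputs.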

Data Versioning

Just like code has Git, ML datasets need versioning. When you retrain a model and get different results, you need to know whether the data changed. Tools like DVC (Data Version Control) track dataset versions alongside code.

Reproducibility risk: Without data versioning, you cannot reliably reproduce results. If training data changes silently (new records, removed outliers), the same code can produce a different model, and nothing in the code history will explain why. Always version your data.
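The core idea behind data versioning is identifying a dataset by its content, so that any silent change becomes visible. The sketch below illustrates that idea with a SHA-256 fingerprint. It is not how DVC is used in practice; `dataset_fingerprint` and the sample records are hypothetical:

```python
import hashlib
import json

def dataset_fingerprint(records) -> str:
    """Return a SHA-256 hex digest that changes whenever any record changes."""
    digest = hashlib.sha256()
    for record in records:
        # sort_keys makes the serialization independent of dict key order.
        line = json.dumps(record, sort_keys=True, separators=(",", ":"))
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()

v1 = [{"sqft": 2000, "price": 450000}, {"sqft": 1500, "price": 320000}]
v2 = v1 + [{"sqft": 900, "price": 210000}]  # one record added "silently"

fp_v1 = dataset_fingerprint(v1)
fp_v2 = dataset_fingerprint(v2)

print("same data, same fingerprint:", fp_v1 == dataset_fingerprint(v1))
print("changed data, same fingerprint:", fp_v1 == fp_v2)
```

Storing the fingerprint next to each trained model (for example in the training log) makes it possible to tell later whether two runs used the same data. Note that this fingerprint is order-sensitive: reordering the records also changes the digest.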

Deployment & Monitoring

Deploying a model is just the beginning. Models degrade over time as real-world data drifts from training data.

Model Serving: REST API, batch prediction, or edge deployment
Data Drift: Monitor whether the input distribution changes (for example, new user behavior)
Concept Drift: Monitor whether the relationship between inputs and outputs changes
Performance Metrics: Track accuracy, latency, and throughput in production
Retraining Triggers: Retrain automatically when performance drops below a threshold
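Data drift monitoring and retraining triggers can be combined into one small check. The sketch below uses the Population Stability Index (PSI), one of several possible drift measures, on a single numeric feature. The 0.25 PSI limit is a widely used rule of thumb and the 0.90 accuracy floor is an arbitrary example; both function names and all sample data are hypothetical:

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference (training) sample and a live sample of one feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature

    def bin_fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def should_retrain(psi, live_accuracy, psi_limit=0.25, min_accuracy=0.90):
    """Trigger retraining on strong input drift or on degraded accuracy."""
    return psi > psi_limit or live_accuracy < min_accuracy

train_sqft = [1000 + 10 * i for i in range(200)]
live_similar = [1005 + 10 * i for i in range(200)]
live_shifted = [v + 1500 for v in train_sqft]

psi_similar = population_stability_index(train_sqft, live_similar)
psi_shifted = population_stability_index(train_sqft, live_shifted)

print(f"similar: PSI={psi_similar:.4f}, retrain={should_retrain(psi_similar, 0.95)}")
print(f"shifted: PSI={psi_shifted:.4f}, retrain={should_retrain(psi_shifted, 0.95)}")
```

PSI only looks at inputs, so it detects data drift but not concept drift. Catching concept drift requires ground-truth labels from production, which is why the accuracy check is kept as a separate condition.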

Key Takeaways

Production ML is mostly data engineering and infrastructure; model code is often only a small fraction of the system (roughly 5% as a rule of thumb)
Feature engineering can be more impactful than choosing a fancier model
Always version your data alongside your code — reproducibility is critical
Models degrade over time — monitor for data drift and concept drift
The pipeline is a cycle: deploy → monitor → retrain → redeploy

Part of the Training, Optimization & Deployment series on Tekivex.