The Full ML Pipeline
Building a production ML system is much more than training a model. Data collection, cleaning, feature engineering, validation, deployment, and monitoring each require careful attention. In production, the model code is often a small fraction of the total system; a commonly quoted rule of thumb is less than 5%.
Data Collection — Gather raw data from APIs, DBs, logs
Data Cleaning — Fix missing values, remove duplicates
Augmentation — Expand dataset with synthetic variations
Feature Engineering — Create informative input features
Train — Fit model on training set
Validate — Tune hyperparameters on val set
Test — Final evaluation on held-out data
Deploy — Serve model in production
Monitor — Track drift, errors, performance
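The Train, Validate, and Test stages rely on three disjoint subsets of the data. A minimal sketch of such a split follows, assuming a simple shuffled split is acceptable (time-ordered or grouped data usually needs a different strategy). The function name `split_dataset` and the 70/15/15 proportions are illustrative choices, not something the pipeline above prescribes.

```python
import random


def split_dataset(rows, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle rows once and cut them into train / validation / test lists."""
    if not 0 <= val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be in [0, 1)")
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]                 # touched only for the final evaluation
    val = shuffled[n_test:n_test + n_val]    # used for hyperparameter tuning
    train = shuffled[n_test + n_val:]        # used to fit the model
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

Keeping the test subset untouched until the very end is what makes the final evaluation an honest estimate.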
Data Collection
The quality of your data determines the ceiling of your model's performance. No amount of training can fix bad data.
- APIs & Databases: Structured data from internal systems
- Web Scraping: Collecting public data (respecting robots.txt)
- User-Generated: Labels from annotations, feedback, interactions
- Synthetic Data: Generated data to fill gaps or handle rare cases
- Third-Party Datasets: Pre-existing datasets (Kaggle, HuggingFace, etc.)
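For the web scraping source, respecting robots.txt can be checked in code before any page is requested. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the `example-bot` user agent, and the example.com URLs are made up for illustration, and a real scraper would download the site's actual robots.txt first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
# Ask whether a given user agent may fetch a given URL.
allowed = parser.can_fetch("example-bot", "https://example.com/listings/1")
blocked = parser.can_fetch("example-bot", "https://example.com/private/report")
print(allowed, blocked)
```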
Data Cleaning & Preprocessing
| Problem | Solution | Tool/Method |
|---|---|---|
| Missing values | Impute with mean/median/mode or drop | pandas fillna(), scikit-learn SimpleImputer |
| Duplicates | Identify and remove duplicate records | pandas drop_duplicates() |
| Outliers | Cap at percentiles or remove | IQR method, Z-score filtering |
| Inconsistent formats | Standardize dates, categories, units | Custom parsing, regex |
| Class imbalance | Oversample minority, undersample majority | SMOTE, random oversampling |
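To make three of the rows above concrete, here is a small sketch of deduplication, median imputation, and IQR-based outlier capping. It is written with the standard library only so the logic is visible; in practice the pandas and scikit-learn tools in the table would do the same work. The function name `clean_records`, the `sqft` column, and the sample rows are assumptions made for this example.

```python
import statistics


def clean_records(records, key="sqft"):
    """Deduplicate rows, impute missing `key` with the median, cap outliers by IQR."""
    # 1. Duplicates: keep the first occurrence of each identical row.
    seen, unique = set(), []
    for row in records:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(dict(row))

    # 2. Missing values: fill with the median of the observed values.
    observed = [row[key] for row in unique if row.get(key) is not None]
    median = statistics.median(observed)
    for row in unique:
        if row.get(key) is None:
            row[key] = median

    # 3. Outliers: cap at the IQR fences, Q1 - 1.5*IQR and Q3 + 1.5*IQR.
    q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    for row in unique:
        row[key] = min(max(row[key], low), high)
    return unique


raw = [
    {"id": 1, "sqft": 1000},
    {"id": 2, "sqft": 1200},
    {"id": 2, "sqft": 1200},   # exact duplicate
    {"id": 3, "sqft": None},   # missing value
    {"id": 4, "sqft": 1100},
    {"id": 5, "sqft": 1300},
    {"id": 6, "sqft": 50000},  # outlier
]
cleaned = clean_records(raw)
for row in cleaned:
    print(row)
```

Capping keeps the outlier row in the dataset with a bounded value; dropping the row instead is the other option the table lists, and which one is appropriate depends on whether the extreme value is an error or a genuine rare case.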
Feature Engineering
Feature engineering transforms raw data into informative inputs that help the model learn. Good features can make a simple model outperform a complex one with raw features.
# Example: feature engineering for house price prediction
CURRENT_YEAR = 2026  # reference year used to compute the house's age

def engineer_features(data: dict) -> dict:
    """Transform raw data into ML-ready features."""
    features = {}
    # Numeric: scale by an assumed maximum (5000 sqft, 10 bedrooms);
    # the result stays in [0, 1] only if the raw value does not exceed it
    features['sqft_norm'] = data['sqft'] / 5000.0
    features['bedrooms_norm'] = data['bedrooms'] / 10.0
    # Derived: create new informative features.
    # 'price' is the prediction target, so it is not used as an input
    # (a price-based feature would leak the answer into the model)
    features['sqft_per_bedroom'] = data['sqft'] / max(data['bedrooms'], 1)
    features['room_ratio'] = data['bedrooms'] / max(data['bathrooms'], 1)
    features['age'] = CURRENT_YEAR - data['year_built']
    # Binning: convert continuous to categorical
    features['age_bucket'] = (
        'new' if features['age'] < 10
        else 'mid' if features['age'] < 30
        else 'old'
    )
    # Interaction: combine features
    features['size_x_bedrooms'] = features['sqft_norm'] * features['bedrooms_norm']
    return features

# Raw data ('price' is the label the model will learn to predict)
house = {'sqft': 2000, 'bedrooms': 3, 'bathrooms': 2,
         'price': 450000, 'year_built': 2005}
features = engineer_features(house)
for k, v in features.items():
    print(f"  {k}: {v}")
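The `age_bucket` feature above is a string, and most models expect numeric inputs. One common way to bridge that gap is one-hot encoding, sketched below. The helper `one_hot` and the constant `AGE_BUCKETS` are hypothetical names introduced for this example; libraries such as scikit-learn provide equivalent encoders.

```python
AGE_BUCKETS = ("new", "mid", "old")  # the categories produced by the binning step


def one_hot(value, categories, prefix):
    """Return one 0/1 indicator feature per category."""
    if value not in categories:
        # Failing loudly is safer than silently emitting an all-zero vector.
        raise ValueError(f"unknown category: {value!r}")
    return {f"{prefix}_{c}": 1.0 if c == value else 0.0 for c in categories}


encoded = one_hot("mid", AGE_BUCKETS, prefix="age_bucket")
print(encoded)
```

Fixing the category list up front, rather than inferring it from each batch, keeps the feature columns identical between training and serving.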
Data Versioning
Just like code has Git, ML datasets need versioning. When you retrain a model and get different results, you need to know whether the data changed. Tools like DVC (Data Version Control) track dataset versions alongside code.
Reproducibility crisis: Without data versioning, you can't reproduce results. If training data changes silently (new records, removed outliers), the same code will produce a different model. Always version your data.
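The core idea, detecting that data changed even when the code did not, can be illustrated with a content hash. The sketch below is not DVC's API; it is a standalone illustration with a hypothetical helper, `dataset_fingerprint`, that hashes records in a canonical JSON form so that any silent change produces a different fingerprint.

```python
import hashlib
import json


def dataset_fingerprint(records):
    """Return a SHA-256 hex digest over a canonical serialization of the records."""
    digest = hashlib.sha256()
    for row in records:
        # sort_keys makes the hash independent of dict key order.
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


v1 = [{"sqft": 2000, "price": 450000}, {"sqft": 1500, "price": 320000}]
v2 = [{"sqft": 2000, "price": 450000}]  # one record silently removed

print(dataset_fingerprint(v1))
print(dataset_fingerprint(v1) == dataset_fingerprint(v2))
```

Storing the fingerprint next to each trained model (for example in the training log) makes it possible to tell later whether two runs saw the same data. The hash is order-sensitive by design, since row order can affect training.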
Deployment & Monitoring
Deploying a model is just the beginning. Models degrade over time as real-world data drifts from training data.
- Model Serving: REST API, batch prediction, or edge deployment
- Data Drift: Monitor whether the input distribution changes (e.g., new user behavior)
- Concept Drift: Monitor whether the relationship between inputs and outputs changes
- Performance Metrics: Track accuracy, latency, throughput in production
- Retraining Triggers: Automatic retraining when performance drops below threshold
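As one possible way to put the data drift and retraining bullets into practice, the sketch below computes the Population Stability Index (PSI) between a training sample and a live sample of one numeric feature, then combines it with an accuracy check. PSI is only one of several drift measures, and the names `population_stability_index` and `needs_retraining`, the bin count, and both thresholds are assumptions chosen for illustration.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """PSI between two samples, using equal-width bins over the expected range."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def bin_fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the training range land in the first or last bin.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def needs_retraining(psi, accuracy, psi_limit=0.25, min_accuracy=0.9):
    """Hypothetical trigger: retrain on strong drift or on degraded accuracy."""
    return psi > psi_limit or accuracy < min_accuracy


training_sqft = list(range(100))
live_sqft = [v + 50 for v in training_sqft]  # the live distribution has shifted

psi = population_stability_index(training_sqft, live_sqft)
print(f"PSI = {psi:.2f}, retrain = {needs_retraining(psi, accuracy=0.93)}")
```

Input drift can be measured as soon as live data arrives, while accuracy needs ground-truth labels that often arrive late, which is why the trigger checks both.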
Key Takeaways
- Production ML is mostly data engineering and infrastructure; model code is often only about 5% of the system
- Feature engineering can be more impactful than choosing a fancier model
- Always version your data alongside your code — reproducibility is critical
- Models degrade over time — monitor for data drift and concept drift
- The pipeline is a cycle: deploy → monitor → retrain → redeploy