Introduction to System Reliability

SLAs, SLOs, SLIs, error budgets, and the Site Reliability Engineering (SRE) philosophy — how to define and measure reliability for production systems.

Intermediate · 17 min read

SLI, SLO, and SLA

Term	Meaning	Example
SLI (Service Level Indicator)	A measured metric of service behaviour	Request success rate = successful_requests / total_requests
SLO (Service Level Objective)	Internal target for an SLI	Success rate ≥ 99.9% over 30 days
SLA (Service Level Agreement)	External contract with penalties for breach	If uptime < 99.9%, customer gets 25% credit

The Nines

Availability	Monthly Downtime	Yearly Downtime	Achievable With
99% (2 nines)	7.3 hours	3.65 days	Single server, manual deploys
99.9% (3 nines)	43.8 min	8.76 hours	Basic HA with load balancer
99.95%	21.9 min	4.38 hours	Multi-AZ, auto-failover
99.99% (4 nines)	4.4 min	52.6 min	Active-active multi-region
99.999% (5 nines)	26 sec	5.26 min	Extremely complex — Google-scale

Error Budgets

An error budget is the amount of unreliability you are allowed before breaching your SLO. If your SLO is 99.9%, you have a 0.1% error budget (43.8 min/month). Teams can spend error budget on risky releases; when it runs out, feature deployments pause until reliability improves.

TIP: Error budgets align engineering and product: reliability isn't just "ops' problem" — if product ships too fast and burns the budget, new features pause. This incentivises reliability at the team level.

Common SLIs to Measure

Availability — % of time the service returns non-5xx responses
Latency — % of requests served in < N ms (e.g. 95% under 200ms)
Error rate — % of requests returning 5xx
Saturation — % CPU / memory / queue depth utilised
Freshness — for data pipelines, how old is the latest processed record?

Part of the System Design series on Tekivex. Browse all tutorials or explore our open-source products.