Availability & Reliability

Understand SLAs, SLOs, and SLIs, the difference between high availability and disaster recovery, and how redundancy and failover keep systems running.

Intermediate · 14 min read

Defining Availability

Availability measures the percentage of time a system is operational and accessible. It is usually expressed in "nines": 99.9% (three nines) allows roughly 8.76 hours of downtime per year.

Nines	Availability	Downtime / Year	Downtime / Month
Two nines	99%	3.65 days	7.3 hours
Three nines	99.9%	8.76 hours	43.8 minutes
Four nines	99.99%	52.6 minutes	4.38 minutes
Five nines	99.999%	5.26 minutes	26.3 seconds

NOTE: Each additional nine cuts the downtime budget by a factor of ten, and each one is typically much harder (and more expensive) to achieve than the last. Going from 99.9% to 99.99% often requires re-architecting your system.
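
The downtime figures in the table above follow directly from the availability percentage. A minimal sketch of that arithmetic, assuming a 365-day year and a month defined as one twelfth of a year (the function name is illustrative, not from any library):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def allowed_downtime_seconds(availability_pct: float,
                             period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Return the downtime budget in seconds for an availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_seconds * (1 - availability_pct / 100)


for pct in (99.0, 99.9, 99.99, 99.999):
    per_year = allowed_downtime_seconds(pct)
    per_month = per_year / 12  # assumption: month = year / 12
    print(f"{pct}%: {per_year / 3600:.2f} h/year, {per_month / 60:.2f} min/month")
```

Passing a different period_seconds gives the budget for any other window, such as a 30-day billing cycle.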

SLA, SLO, and SLI

SLI (Service Level Indicator) — a measured metric, e.g. request latency p99
SLO (Service Level Objective) — the target value for an SLI, e.g. p99 latency < 200ms
SLA (Service Level Agreement) — a contract with consequences if SLOs are missed
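
To make the SLI/SLO distinction concrete, the sketch below computes a p99 latency (the SLI) from a list of measurements and compares it with the 200 ms target (the SLO). The nearest-rank percentile method and the sample data are assumptions chosen for illustration; real monitoring systems may compute percentiles differently. The SLA is a contractual matter and is not modeled here.

```python
import math

SLO_P99_LATENCY_MS = 200.0  # target taken from the SLO example above


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_slo(latencies_ms):
    """Return (sli, ok): the measured p99 latency and whether it is under the SLO."""
    sli = percentile(latencies_ms, 99)
    return sli, sli < SLO_P99_LATENCY_MS


# Hypothetical measurements: 990 fast requests and 10 slow ones.
latencies = [50.0] * 990 + [450.0] * 10
sli, ok = meets_slo(latencies)
print(f"p99 = {sli} ms, SLO met: {ok}")
```

With one more slow request in the sample (989 fast, 11 slow), the p99 lands on a slow request and the same check fails, which shows how sensitive tail-latency SLOs are to a small fraction of outliers.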

Redundancy & Failover

Redundancy means having backup components ready to take over when a primary fails. Failover is the process of switching to the backup. Think of a hospital generator: when the main power grid goes down, the generator kicks in automatically.
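
A rough way to see why redundancy helps: if each component is up with probability a, and failures are independent, the chance that all n redundant copies are down at once is (1 - a)^n. The sketch below applies that formula. The independence assumption and instantaneous, perfect failover are simplifications; correlated failures (shared power, shared bugs) make real numbers worse.

```python
def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components, assuming independent failures
    and instantaneous failover (both are simplifying assumptions)."""
    if not 0 <= component_availability <= 1:
        raise ValueError("component_availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - component_availability) ** copies


for n in (1, 2, 3):
    print(f"{n} copies of a 99% component: {redundant_availability(0.99, n):.6f}")
```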

Flow:

Primary Server — Handles all production traffic
Heartbeat Check — Monitors primary health every few seconds
Failure Detected — Primary stops responding
Standby Promoted — Hot standby becomes new primary
Traffic Rerouted — DNS/LB points to new primary
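
The flow above can be sketched as a small in-memory simulation. Everything here is hypothetical: the Node and FailoverMonitor classes, the node names, and the threshold of three missed heartbeats are assumptions for illustration, and a real system would probe nodes over the network and update DNS or a load balancer instead of swapping two fields.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True


@dataclass
class FailoverMonitor:
    primary: Node
    standby: Node
    max_missed: int = 3  # assumed threshold before declaring the primary failed
    missed: int = 0
    events: list = field(default_factory=list)

    def heartbeat(self) -> Node:
        """Run one heartbeat check and return the node currently receiving traffic."""
        if self.primary.healthy:
            self.missed = 0
            return self.primary
        self.missed += 1
        if self.missed >= self.max_missed and self.standby.healthy:
            self.events.append(f"promoted {self.standby.name}")
            # Swap roles: the standby becomes the new primary.
            self.primary, self.standby = self.standby, self.primary
            self.missed = 0
        # Until promotion happens, traffic still targets the failed primary.
        return self.primary


monitor = FailoverMonitor(Node("db-a"), Node("db-b"))
monitor.primary.healthy = False  # simulate the primary going down
for _ in range(3):
    target = monitor.heartbeat()
print(target.name, monitor.events)
```

Requiring several consecutive missed heartbeats trades slower detection for fewer false failovers caused by a single dropped check.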

Active-Passive vs Active-Active

Active-Passive (Hot Standby)	Active-Active
One server handles traffic	All servers handle traffic
Standby idles until failover	No idle resources
Simpler to implement	More complex (data sync)
Some downtime during switch	Near-zero downtime failover
Standby cost with no throughput benefit	Better resource utilization
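
The difference in traffic distribution can be illustrated with a toy router. This is a sketch under simple assumptions: the route function and server names are hypothetical, health is a plain boolean, and active-active uses round-robin, which is only one of several possible balancing strategies.

```python
def route(requests, servers, mode):
    """Return a dict mapping server name to the number of requests it handled.

    servers: list of (name, healthy) tuples, with the primary listed first.
    mode: "active-passive" or "active-active".
    """
    healthy = [name for name, is_healthy in servers if is_healthy]
    if not healthy:
        raise RuntimeError("no healthy servers available")
    counts = {name: 0 for name, _ in servers}
    if mode == "active-passive":
        # Only the first healthy server (the primary, or the promoted standby) takes traffic.
        counts[healthy[0]] = requests
    elif mode == "active-active":
        # Round-robin across every healthy server.
        for i in range(requests):
            counts[healthy[i % len(healthy)]] += 1
    else:
        raise ValueError(f"unknown mode: {mode}")
    return counts


pool = [("server-1", True), ("server-2", True)]
print(route(100, pool, "active-passive"))
print(route(100, pool, "active-active"))
```

In active-active mode the surviving servers must absorb a failed peer's share, so each one needs spare capacity for that case.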

High Availability vs Disaster Recovery

High Availability (HA)	Disaster Recovery (DR)
Prevents downtime within a region	Recovers from region-wide failures
Automatic failover in seconds	Can take minutes to hours
Redundant components in the same data center or region	Backups in a different geographic region
Goal: minimize unplanned downtime	Goal: recover from catastrophic events
Example: database replicas with auto-failover	Example: cross-region S3 replication

Measuring Reliability

Reliability is about the system producing correct results consistently. A system can be available (it responds) but unreliable (it returns wrong data). Key metrics include:

Metric	What It Measures
MTBF (Mean Time Between Failures)	Average time the system runs without failing
MTTR (Mean Time To Repair)	Average time to fix a failure
Error rate	Percentage of requests that return errors
Data durability	Probability that stored data is not lost (e.g., 99.999999999%)

TIP: Availability = MTBF / (MTBF + MTTR). To improve availability you can either increase MTBF (better hardware, fewer bugs) or decrease MTTR (faster detection, automated recovery).
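
A short worked example of the formula in the tip, using made-up numbers. It compares a baseline against the two improvement levers: repairing faster and failing less often. The function name and the hour values are illustrative assumptions.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), returned as a fraction between 0 and 1."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


baseline = availability(mtbf_hours=999, mttr_hours=1)        # fails about every 999 h, 1 h to fix
faster_repair = availability(mtbf_hours=999, mttr_hours=0.1)  # same failure rate, 6 min to fix
fewer_failures = availability(mtbf_hours=9999, mttr_hours=1)  # ten times rarer failures

for label, value in [("baseline", baseline),
                     ("faster repair", faster_repair),
                     ("fewer failures", fewer_failures)]:
    print(f"{label}: {value * 100:.4f}%")
```

With these numbers the baseline works out to 99.9%, and cutting MTTR by a factor of ten gains roughly as much as making failures ten times rarer, which is why automated detection and recovery are often the cheaper lever.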

Key Takeaways

Define your SLOs before designing the system.
Redundancy removes single points of failure.
Active-active gives better utilization but adds complexity.
HA protects within a region; DR protects across regions.

Part of the System Design series on Tekivex.