Achieving 99.95% Uptime: Building Self-Healing Infrastructure for 200+ Microservices
A complete technical guide to architecting, deploying, and operating 200+ microservices with 99.95% uptime (4.4 hours downtime per year). Covers reliability engineering principles, multi-region architecture, observability at scale, self-healing automation, chaos engineering, and incident response. Includes detailed code examples, diagrams, and a proven roadmap from 98.2% to 99.95% uptime.
Tech Stack:
Introduction: The Cost of Downtime and the Path to 99.95% Uptime
In modern digital commerce and software-as-a-service (SaaS) businesses, downtime is not merely an operational inconvenience—it is a direct financial loss with cascading consequences. Every minute of unavailability translates into lost transactions, damaged customer relationships, brand reputation impact, and regulatory non-compliance.
The Financial Impact of Downtime
Consider a typical enterprise SaaS application serving thousands of customers:
- Enterprise customers generating $50K–$500K in annual recurring revenue (ARR)
- Average downtime impact: $5,600 per minute across the customer base
- Broken down: 1–2 minutes of downtime = $5,600–$11,200 in lost revenue, customer churn risk, support costs, and brand damage
For a global platform with operations across multiple regions:
- 5 minutes of unplanned downtime = $28,000 in immediate impact
- 1 hour of downtime = $336,000 in direct and indirect costs
- 8 hours of downtime (worst case) = $2.7 million
Beyond financial loss, downtime causes:
- Customer churn: 23% of customers will switch to competitors after a single hour of downtime
- Support load: Spike in support tickets, angry customers, brand reputation damage
- Employee morale: Engineering teams stress, blame culture, burnout
- Regulatory implications: SLA breaches, penalty clauses, potential legal liability
This financial reality makes reliability engineering not just a technical practice, but a core business imperative.
Baseline State: 98.2% Uptime and the Cost of Drift
The organization at the center of this case study began with 98.2% uptime—a respectable number on the surface but revealing when examined in detail:
98.2% uptime translates to:
- ~158 hours of downtime per year (approximately 6.6 days)
- ~13 hours per month of unplanned outages
- ~3 hours per week of service disruption on average
For a company operating 200+ interdependent microservices, this level of reliability was insufficient for:
- Enterprise SLA commitments (customers demanding 99.9%+ uptime)
- Competitive positioning (competitors achieving 99.99%)
- Board-level confidence (C-suite viewing reliability as competitive disadvantage)
- Employee satisfaction (on-call teams exhausted by frequent incidents)
Target: 99.95% Uptime and the Acceptable Downtime Budget
The organization set a bold but achievable target: 99.95% uptime, commonly referred to as "three and a half nines."
99.95% uptime translates to:
- ~4.4 hours of downtime per year (263 minutes)
- ~22 minutes per month of allowed downtime
- ~5 minutes per week of acceptable downtime
This is a critical distinction from 98.2%: the organization effectively cut its acceptable downtime from roughly 158 hours to under 4.5 hours per year, a reduction of about 97%.
For context:
| Uptime % | Minutes/Year | Hours/Year | Days/Year | Days/Month |
|---|---|---|---|---|
| 99.0% | 5,256 | 87.6 | 3.65 | 0.30 |
| 99.5% | 2,628 | 43.8 | 1.83 | 0.15 |
| 99.9% | 526 | 8.76 | 0.37 | 0.03 |
| 99.95% | 263 | 4.4 | 0.18 | 0.015 |
| 99.99% | 52.6 | 0.88 | 0.04 | 0.003 |
Reaching 99.95% required architectural changes, operational discipline, and engineering rigor across every layer of the infrastructure.
The Challenge: Complexity at Scale
The organization operated 200+ microservices across multiple environments:
- Production microservices: 180+ services
- Infrastructure services: 35+ (databases, message queues, caches, monitoring)
- Deployment frequency: 400–600 deployments per week
- Peak traffic: 50,000 requests per second
- Geographic distribution: 4 regions, 12 availability zones
- Engineering teams: 150+ engineers across 20+ teams
Key challenges in this environment:
- Distributed complexity: 200+ services = 200+ potential points of failure
- Dependency chains: Critical requests touch 8–15 services; failure at any point breaks the chain
- Unequal reliability: Some services are battle-tested; others relatively new and fragile
- Organizational silos: Teams own services independently without visibility into how their reliability affects others
- Operational burden: Growing on-call load, alert fatigue, manual remediation consuming thousands of engineering hours annually
Why Traditional Monitoring Isn't Enough
Before addressing the solution, it's important to understand why traditional monitoring and alerting, while necessary, is insufficient for achieving 99.95% uptime:
Problem 1: Reactive vs Proactive
Traditional monitoring detects problems after they occur. A service fails, Prometheus alert fires, on-call engineer wakes up, investigates, and fixes.
Timeline: Failure → Detection (2–5 min) → Diagnosis (5–15 min) → Remediation (10–30 min) = 17–50 minute MTTR (Mean Time To Recovery)
At 99.95% uptime, you have only 5 minutes of acceptable downtime per week. A single incident often exceeds this budget.
Problem 2: Alert Fatigue
Monitoring 200+ services at traditional thresholds generates hundreds of alerts per day. Teams ignore alerts (alert fatigue), miss critical ones, and spend time managing noise instead of fixing root causes.
Problem 3: Invisible Dependencies
A service's degradation may not trigger its own alerts but affects downstream services. Traditional monitoring doesn't show the dependency graph or explain why a customer is experiencing poor performance.
Problem 4: Manual Remediation
Most outages are resolved through manual intervention—restarts, failovers, config updates. This process is:
- Slow (10–60 minute MTTR)
- Error-prone (risk of wrong commands, incomplete fixes)
- Inconsistent (different engineers follow different procedures)
The Solution: Self-Healing, Autonomous Infrastructure
To achieve 99.95% uptime with 200+ microservices, the organization implemented an autonomous, self-healing infrastructure with these characteristics:
- Automatic failure detection and remediation (eliminating manual steps)
- Redundancy at every layer (no single point of failure)
- Graceful degradation (system prioritizes critical services, sheds load gracefully)
- Observability at scale (visibility into every service, request, and failure)
- Predictive health management (fixing problems before they cause outages)
- Rapid rollback (automated or one-click recovery from bad deployments)
The rest of this article details the architecture, systems, practices, and tools that enable this level of reliability.
1: Reliability Engineering Fundamentals
Achieving 99.95% uptime requires a foundation of reliability engineering principles and practices. This section covers the conceptual framework that guides all subsequent architectural and operational decisions.
SRE Principles Applied to a 200+ Microservices Organization
Site Reliability Engineering (SRE), pioneered at Google, is a discipline that treats operations as an engineering problem. Instead of viewing operations as distinct from development, SRE integrates reliability into every stage of system design and operation.
Core SRE principles applicable to this case:
Principle 1: Embrace Risk
Perfect reliability is impossible and economically unwise. A system with 100% uptime SLA requires:
- Redundancy on every component (10x cost multiplier)
- Extreme over-provisioning to handle any failure scenario
- Inability to deploy or change (frozen code path)
Instead, SRE explicitly embraces acceptable risk through error budgets: "If we have a 99.95% uptime SLA, we can afford about 4.4 hours of downtime per year. How do we use that budget wisely?"
Principle 2: Balance Velocity with the Error Budget
When a team is operating within their error budget, they can deploy rapidly and innovate. When they've used their error budget (exceeded downtime quota for the month), they shift focus to reliability: more testing, less feature development, careful deployments.
This creates a natural incentive structure where teams optimize for reliability because it enables velocity.
Principle 3: Automate Toil
SRE caps toil (repetitive, manual work such as incident response, runbook execution, deployment orchestration, and alert triage) at no more than 50% of each engineer's time.
The remaining time is spent eliminating that toil through:
- Automation of common remediation tasks
- Better tools and dashboards
- Architectural improvements that reduce failure modes
The goal: reduce operational burden and free humans for higher-value work (architecture, capacity planning, incident prevention).
Principle 4: Measure Everything
"What gets measured gets managed." SRE emphasizes quantitative metrics for reliability:
- SLOs and SLIs (Service Level Objectives and Indicators)
- MTBF and MTTR (Mean Time Between Failures and Mean Time To Recovery)
- Error budget consumption
- Alert response time and resolution time
This creates a data-driven culture where reliability decisions are based on evidence, not gut feel.
Error Budgets and SLO/SLI Definitions
An error budget is the complement of your availability target. If you commit to 99.95% uptime, you have a 0.05% error budget: about 4.4 hours (263 minutes) of "allowed" downtime per year.
This budget is not spent haphazardly. Instead, it guides decision-making (a small policy-gate sketch follows this list):
- Within budget: Teams can deploy frequently, experiment, take calculated risks
- Approaching budget: Deployment rate slows, focus shifts to testing and stability
- Exhausted budget: Deployment freeze, all effort on stability improvements
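A minimal sketch of such a policy gate (the thresholds, request counts, and function names are illustrative assumptions, not a prescribed implementation):

```python
# Illustrative error-budget gate; thresholds and the failure counts are assumptions.
SLO = 0.9995  # 99.95% availability target

def error_budget_burned(total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget consumed in the current window (1.0 = fully spent)."""
    if total_requests == 0:
        return 0.0
    allowed_failures = total_requests * (1 - SLO)
    return failed_requests / allowed_failures

def deployment_policy(burn_fraction: float) -> str:
    """Map budget consumption to the three policy states described above."""
    if burn_fraction < 0.75:
        return "deploy freely"            # within budget
    if burn_fraction < 1.0:
        return "slow down, add testing"   # approaching budget
    return "freeze deployments"           # budget exhausted

# Example: 10M requests this month, 4,000 of them failed
burn = error_budget_burned(10_000_000, 4_000)
print(f"budget burned: {burn:.0%} -> {deployment_policy(burn)}")  # 80% -> slow down
```

In practice this gate is evaluated per service over a rolling window (commonly 30 days), using the same request and error counters the monitoring stack in Section 3 already collects.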
Defining SLIs and SLOs
SLI (Service Level Indicator): A measurable metric indicating whether your service is functioning as expected.
Examples:
- Availability SLI: Percentage of requests that succeeded (non-5xx responses)
- Latency SLI: Percentage of requests completed within 100ms
- Error rate SLI: Percentage of requests without errors or timeouts
SLO (Service Level Objective): A target for your SLI, defining the acceptable level of service.
Examples:
- Availability SLO: 99.95% of requests succeed
- Latency SLO: 99.5% of requests complete within 100ms
- Error rate SLO: <0.1% error rate
Hierarchy:
SLA (Service Level Agreement)
↓
Commits to customer (legal, contractual)
Typically: 99.5%–99.99% depending on service tier
SLO (Service Level Objective)
↓
Internal target (what we aim for)
Typically: 99.95%–99.99% (higher than SLA for buffer)
SLI (Service Level Indicator)
↓
Measurement (how we know we're meeting SLO)
Measured continuously via monitoring
Calculating Acceptable Downtime
For the organization's target of 99.95% uptime, the acceptable downtime is calculated as shown below (and in the short helper that follows):
Total minutes/year = 365 days × 24 hours × 60 min = 525,600 minutes
Acceptable downtime = (1 - 99.95%) × 525,600 = 0.0005 × 525,600 = 262.8 minutes
In hours: 262.8 / 60 = 4.38 hours (~4 hours 23 minutes)
Per month: 4.38 / 12 = 0.365 hours (~22 minutes)
Per week: 4.38 / 52 = 0.084 hours (~5 minutes)
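The same arithmetic as a small helper, handy for sanity-checking any candidate SLO (illustrative only):

```python
def downtime_budget(slo: float) -> dict:
    """Allowed downtime for a given availability SLO, in minutes per year/month/week."""
    minutes_per_year = 365 * 24 * 60          # 525,600
    yearly = (1 - slo) * minutes_per_year
    return {
        "minutes_per_year": round(yearly, 1),
        "minutes_per_month": round(yearly / 12, 1),
        "minutes_per_week": round(yearly / 52, 1),
    }

print(downtime_budget(0.9995))
# {'minutes_per_year': 262.8, 'minutes_per_month': 21.9, 'minutes_per_week': 5.1}
```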
This calculation is crucial because it creates discipline around incident management:
- If you have 5 minutes of downtime budget per week and experience a 7-minute outage, you've already exceeded your budget
- This forces critical analysis: did you need to deploy that change? Could you have tested better? Was failover implemented correctly?
Service Dependency Mapping
For 200+ microservices, understanding dependencies is critical. A failure in an upstream service can cascade through the system, causing apparent failures in unrelated downstream services.
The challenge: Capturing and maintaining accurate dependency maps with so many services.
The solution: Automated dependency mapping through distributed tracing and service mesh observability.
Dependency Graph Structure
User Requests
↓
API Gateway
├→ Authentication Service
│ └→ Identity Provider
├→ Product Service
│ ├→ Inventory Service
│ │ └→ PostgreSQL (primary DB)
│ │ └→ Redis Cache
│ └→ Pricing Service
│ └→ Configuration Service
├→ Order Service
│ ├→ Product Service (see above)
│ ├→ Payment Service
│ │ └→ Payment Provider API
│ ├→ Notification Service
│ │ └→ Message Queue (Kafka)
│ └→ Order Database
│ └→ PostgreSQL (primary DB)
└→ Recommendation Service
└→ ML Model Service
└→ Model Store (S3)
This dependency graph reveals:
- Critical path services: Failures in API Gateway, Auth, or Order Service cascade to many downstream services
- Single points of failure: If PostgreSQL primary goes down, multiple services are affected
- Circular dependencies: Sometimes exist (e.g., Service A calls B, and B calls A indirectly), which can cause cascading failures; a small cycle-check sketch follows this list
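Indirect cycles are hard to spot by eye at this scale. A small depth-first check over the discovered dependency map flags them; the `deps` dictionary below is a hypothetical excerpt, not the real production graph:

```python
# Detect circular dependencies in a service dependency map (illustrative data).
deps = {
    "api-gateway": ["auth-service", "order-service"],
    "order-service": ["payment-service", "notification-service"],
    "notification-service": ["order-service"],  # indirect cycle back to order-service
    "payment-service": [],
    "auth-service": [],
}

def find_cycle(graph: dict) -> list | None:
    """Return one dependency cycle as a list of services, or None if the graph is acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}
    stack = []

    def dfs(node):
        color[node] = GRAY
        stack.append(node)
        for dep in graph.get(node, []):
            if color.get(dep, WHITE) == GRAY:        # back edge -> cycle found
                return stack[stack.index(dep):] + [dep]
            if color.get(dep, WHITE) == WHITE:
                cycle = dfs(dep)
                if cycle:
                    return cycle
        stack.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = dfs(node)
            if cycle:
                return cycle
    return None

print(find_cycle(deps))  # ['order-service', 'notification-service', 'order-service']
```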
Automated Dependency Discovery
Using a service mesh (Istio) with distributed tracing, dependencies are automatically discovered:
# Query Jaeger or Kiali (Istio dashboard) to get the dependency graph
# This shows all services that call each other in production
# (illustrative pseudocode: jaeger_client is a hypothetical wrapper, not a published client API)
# Example: get all services that call "order-service" (its upstream callers)
upstream_services = jaeger_client.query_services(calls="order-service")
# Returns: [api-gateway, recommendation-service, notification-service]
# Get all services that order-service itself calls (its downstream dependencies)
downstream_services = jaeger_client.query_services(called_by="order-service")
# Returns: [payment-service, product-service, inventory-service, order-db]
Single Points of Failure Identification
SPOF (Single Point of Failure) analysis is the process of identifying components whose failure would cause system-wide outage.
Common SPOFs in microservices architectures:
- Primary database (single master, no replica failover)
- API gateway (single instance handling all traffic)
- Message queue (single cluster, no replication)
- Cache layer (single Redis instance, no sentinel)
- Configuration service (centralized, no fallback)
- DNS (misconfigured or single provider)
- Load balancer (single NLB with no redundancy)
Identification methodology:
For each critical service:
- Trace the dependency path from user to service and back
- Identify each component in the path
- Ask: If this component fails, will the entire system fail?
- If yes: It's a SPOF and needs redundancy (a reachability sketch of this check follows the list)
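This check can be automated against the discovered dependency graph: remove one component at a time and test whether the critical services are still reachable from the entry point. A minimal reachability sketch (the graph data and service names are illustrative):

```python
# SPOF check: which single component, if removed, cuts the user off from a critical service?
deps = {
    "user": ["api-gateway"],
    "api-gateway": ["auth-service", "order-service"],
    "order-service": ["payment-service", "order-db"],
    "payment-service": ["payment-provider"],
    "auth-service": ["identity-provider"],
}
critical = {"order-service", "payment-service"}

def reachable(graph, start, removed):
    """All nodes reachable from start once 'removed' is taken out of the graph."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node == removed or node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return seen

all_nodes = set(deps) | {d for ds in deps.values() for d in ds}
spofs = [
    node for node in all_nodes - {"user"}
    if not critical <= reachable(deps, "user", removed=node)
]
print(sorted(spofs))  # e.g. ['api-gateway', 'order-service', 'payment-service']
```

Note that this only captures reachability-style SPOFs; shared infrastructure such as a single database primary, DNS provider, or load balancer still has to be reviewed against the remediation table below.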
Remediation for identified SPOFs:
| SPOF | Solution | Implementation |
|---|---|---|
| Primary Database | Multi-master replication or hot standby | PostgreSQL HA with Patroni; automated failover |
| API Gateway | Multiple instances behind load balancer | Terraform + ALB/NLB across AZs |
| Message Queue | Cluster with replication | Kafka with 3+ brokers, replication factor 3 |
| Cache (Redis) | Redis Sentinel or Redis Cluster | Sentinel for failover; Cluster for sharding |
| Config Service | Multiple replicas with eventual consistency | Consul or etcd with 3+ nodes |
| DNS | Multiple DNS providers | Route53 + Cloudflare or similar |
| Load Balancer | Multi-region failover | AWS ALB/NLB across regions |
Service Dependency and Reliability Table
| Service | Role | Dependencies | Failure Impact | Criticality | SPOF Status |
|---|---|---|---|---|---|
| API Gateway | Entry point for all traffic | Load balancer, Auth service | All user requests fail | CRITICAL | YES - requires redundancy |
| Auth Service | User authentication | Identity Provider, Token cache | Auth failures cascade | CRITICAL | Partial (depends on ID provider) |
| Order Service | Process orders | Payment Service, Inventory, DB | Orders cannot be placed | CRITICAL | Depends on DB failover |
| Product Service | Retrieve product data | Inventory Service, Cache | Browse/search broken | HIGH | Cache failover needed |
| Inventory Service | Track stock levels | Product DB | Stock visibility broken | HIGH | DB failover critical |
| Notification Service | Send emails, SMS | Message Queue, SMTP provider | Delayed notifications | MEDIUM | Depends on queue clustering |
| Recommendation Service | ML-based recommendations | ML Model Service, Cache | Recommendations unavailable | LOW | Can degrade gracefully |
| Payment Service | Process payments | Payment Provider API, Payment DB | Transactions fail | CRITICAL | Retry logic + DB failover |
2: Architecture for Resilience
With reliability principles established, we turn to architecture design that enables 99.95% uptime across 200+ microservices.
2.1: High Availability Design Patterns
High availability (HA) means designing systems to be continuously operational, even when individual components fail.
Multi-AZ Deployment Architecture
The first line of defense against downtime is geographic redundancy within a region.
AWS provides Availability Zones (AZs): isolated datacenters within a region with independent power, networking, and cooling. When properly architected, failure of one AZ should not affect your application.
Multi-AZ architecture principles:
Region (us-east-1)
├─ AZ 1 (us-east-1a)
│ ├─ Kubernetes Node 1 (m5.2xlarge)
│ ├─ Kubernetes Node 2 (m5.2xlarge)
│ ├─ RDS Primary Database
│ ├─ ElastiCache Redis Primary
│ └─ NAT Gateway
├─ AZ 2 (us-east-1b)
│ ├─ Kubernetes Node 3 (m5.2xlarge)
│ ├─ Kubernetes Node 4 (m5.2xlarge)
│ ├─ RDS Standby Database (synchronous replication)
│ ├─ ElastiCache Redis Replica
│ └─ NAT Gateway
└─ AZ 3 (us-east-1c)
├─ Kubernetes Node 5 (m5.2xlarge)
├─ Kubernetes Node 6 (m5.2xlarge)
└─ (Services spread across AZs)
Key requirements:
- No single point of failure per AZ (each service has replicas across AZs)
- Synchronous replication for data (writes must be replicated before acknowledged)
- Load balancer spans all AZs (distributes traffic evenly)
- Stateless services where possible (any server can handle any request)
Cost of multi-AZ: ~2–3x infrastructure cost (due to redundancy), but worth it for critical applications.
Database Replication and Failover Strategies
Databases are often the most critical infrastructure component and the most complex to keep highly available.
Replication topologies:
1. Primary-Standby (Master-Slave) Replication
Primary DB (us-east-1a) [writes + reads]
↓ (async replication)
Standby DB (us-east-1b) [reads only]
↓ (promotion on failure)
Promoted Primary (us-east-1b) [new writes]
Characteristics:
- RPO (Recovery Point Objective): ~1–5 seconds (async replication lag)
- RTO (Recovery Time Objective): ~30–60 seconds (failover + DNS propagation)
- Asymmetric: Primary handles writes; standby handles reads
- Failover must be triggered (manual or automated)
2. Multi-Master (Active-Active) Replication
Primary DB 1 (us-east-1a) [writes + reads]
↕ (bidirectional replication)
Primary DB 2 (us-east-1b) [writes + reads]
Characteristics:
- RPO: ~1–10 seconds (asynchronous or quorum-based)
- RTO: ~0 seconds (automatic failover, no switchover needed)
- Both nodes can accept writes (risk of conflicts)
- Higher complexity (conflict resolution needed)
3. Quorum-Based Replication (PostgreSQL with Patroni)
PostgreSQL Primary (us-east-1a)
↓
PostgreSQL Standby 1 (us-east-1b)
↓
PostgreSQL Standby 2 (us-east-1c)
Patroni (distributed consensus)
├─ Monitors all replicas
├─ Detects primary failure
└─ Promotes best standby automatically
For the organization, PostgreSQL HA with Patroni was chosen:
- RPO: ~0 with synchronous quorum replication (a few seconds if replicas are configured asynchronously)
- RTO: ~10–30 seconds (automated failover)
- Automatic failover (no manual intervention needed)
- Consistent read-after-write semantics
Load Balancing and Health Check Design
Load balancers distribute traffic across multiple servers and remove unhealthy servers from the pool.
Critical for HA: Health checks must be accurate and fast.
Bad health check: Checks only if TCP port is open (server could be hung but port is open)
Good health check: Sends actual HTTP request, expects specific response
// Good health check implementation
func healthHandler(w http.ResponseWriter, r *http.Request) {
	// Check critical dependencies
	checks := map[string]error{
		"database": healthCheckDB(),
		"cache":    healthCheckRedis(),
		"queue":    healthCheckKafka(),
	}
	allHealthy := true
	statusCode := http.StatusOK
	results := make(map[string]string, len(checks)) // JSON-friendly summary per dependency
	for name, err := range checks {
		if err != nil {
			allHealthy = false
			statusCode = http.StatusServiceUnavailable
			results[name] = err.Error()
			log.Printf("Health check failed for %s: %v", name, err)
		} else {
			results[name] = "ok"
		}
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(statusCode)
	json.NewEncoder(w).Encode(map[string]interface{}{
		"healthy": allHealthy,
		"checks":  results,
	})
}
Load balancer configuration (ALB):
# Terraform for ALB health checks
resource "aws_lb_target_group" "api_servers" {
name = "api-servers-tg"
port = 8080
protocol = "HTTP"
vpc_id = aws_vpc.main.id
health_check {
healthy_threshold = 2
unhealthy_threshold = 3
timeout = 5 # seconds
interval = 10 # seconds, how often to check
path = "/health"
matcher = "200" # expect 200 OK
}
deregistration_delay = 30 # graceful shutdown period
}
Circuit Breaker Implementation with Istio
A circuit breaker is a pattern that stops sending traffic to a failing service, allowing it to recover without being overwhelmed.
States (a minimal code sketch follows this list):
- Closed: Normal operation, requests flow through
- Open: Too many errors, requests fail immediately (fast failure)
- Half-open: Test if service has recovered, allow some requests through
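Before the Istio configuration below, here is a minimal in-process sketch of those three states (illustrative; the `CircuitBreakerOpen` exception mirrors the client-side fallback example later in this section):

```python
import time

class CircuitBreakerOpen(Exception):
    """Raised when the breaker is open and calls fail fast."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.recovery_timeout:
                raise CircuitBreakerOpen("failing fast")   # open: reject immediately
            # recovery timeout elapsed: half-open, allow a trial request through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()               # trip to open
            raise
        self.failures = 0
        self.opened_at = None                              # success: back to closed
        return result
```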
Istio implementation:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: order-service-circuit-breaker
namespace: production
spec:
host: order-service.production.svc.cluster.local
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100 # max concurrent connections
http:
http1MaxPendingRequests: 50
http2MaxRequests: 100
maxRequestsPerConnection: 2
outlierDetection:
consecutive5xxErrors: 5 # open circuit after 5 consecutive 5xx
interval: 30s # check interval
baseEjectionTime: 30s # how long to keep service out
maxEjectionPercent: 50 # max % of hosts that can be ejected
When the circuit breaker opens, clients get fast failures (immediate, no waiting on timeouts) and can fall back to alternate behavior:
# Client-side fallback
def get_order_details(order_id):
try:
return fetch_from_order_service(order_id, timeout=5)
except CircuitBreakerOpen:
# Service is failing, return cached data or degraded response
return get_cached_order_details(order_id) or \
{"id": order_id, "cached": True, "details_incomplete": True}
Bulkhead Pattern for Resource Isolation
The bulkhead pattern isolates critical resources to prevent a failure in one service from affecting others.
Example: Separate thread pools for different request types
// Without bulkheads: If order service has slow database queries,
// its thread pool fills up and blocks unrelated requests
// With bulkheads: Separate pools for different request types
ThreadPool criticalPool = new ThreadPool(50); // Orders, payments
ThreadPool backgroundPool = new ThreadPool(20); // Analytics, recommendations
ThreadPool internalPool = new ThreadPool(10); // Internal admin requests
// Route requests to appropriate pool
if (request.isCritical()) {
return criticalPool.execute(request);
} else if (request.isBackground()) {
return backgroundPool.execute(request);
} else {
return internalPool.execute(request);
}
In Kubernetes, bulkheads are implemented via:
- Namespaces: Separate pods by function or team
- Resource quotas: Limit CPU/memory per namespace
- Pod priority: Critical pods get resources first
- Network policies: Restrict traffic between services
2.2: Kubernetes Reliability Features
For the organization running 200+ microservices on Kubernetes, leveraging Kubernetes's built-in reliability features was essential.
Pod Disruption Budgets (PDB)
Kubernetes frequently needs to voluntarily disrupt pods for:
- Node maintenance (OS updates, security patches)
- Node scaling (removing underutilized nodes)
- Pod evictions (enforcing resource quotas)
Pod Disruption Budget defines how many pod replicas can be disrupted simultaneously without violating availability guarantees.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: order-service-pdb
namespace: production
spec:
minAvailable: 2 # at least 2 order-service pods must be running
selector:
matchLabels:
app: order-service
unhealthyPodEvictionPolicy: AlwaysAllow
# Alternative: specify maxUnavailable
# maxUnavailable: 1 # at most 1 replica can be disrupted at a time
How PDBs work:
- Before evicting a pod, Kubernetes checks the associated PDB
- If evicting would violate minAvailable, Kubernetes postpones the eviction
- If evicting is allowed, pod is gracefully shut down (terminationGracePeriod window)
Example scenario:
- Order service has 3 replicas, PDB minAvailable=2
- Node needs maintenance
- Kubernetes can evict only 1 pod (leaving 2 available)
- Other 2 pods stay running, serving requests
- Service remains available with reduced capacity
Liveness and Readiness Probes: Tuning for Reliability
Kubernetes uses health checks to know when a pod is healthy and ready for traffic.
Two probe types:
Liveness Probe: Is the pod alive? If not, restart it.
apiVersion: v1
kind: Pod
metadata:
name: order-service-pod
spec:
containers:
- name: order-service
image: order-service:v1.2.3
# Liveness: restart if unhealthy
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 30 # wait 30s before first check
periodSeconds: 10 # check every 10 seconds
timeoutSeconds: 5 # wait 5s for response
failureThreshold: 3 # restart after 3 consecutive failures
Readiness Probe: Is the pod ready to accept traffic?
# Readiness: remove from load balancer if not ready
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 5 # fast readiness check
periodSeconds: 5 # check frequently
timeoutSeconds: 3
failureThreshold: 2 # remove from load balancer after 2 failures
Critical difference:
- Liveness failure → pod is killed and restarted
- Readiness failure → pod remains running but traffic is removed
Tuning considerations for 99.95% uptime:
- initialDelaySeconds: Too short causes false failures on slow startups; too long delays detection. Set to 50–100% of typical startup time.
- periodSeconds: Too long delays failure detection; too short creates probe traffic overhead. Typically 5–10 seconds.
- failureThreshold: Lower threshold (2–3) reacts faster to failures; higher threshold (5+) tolerates transient hiccups. For 99.95%, recommend 2–3.
Bad example (causes false positives):
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 1 # WAY too aggressive, creates probe traffic storm
failureThreshold: 1 # WAY too aggressive, restarts on single false positive
Resource Requests and Limits Optimization
Kubernetes schedules pods based on resource requests (guaranteed allocation) and limits (maximum).
Requests must be accurate for proper scheduling.
apiVersion: apps/v1
kind: Deployment
metadata:
name: order-service
spec:
replicas: 10
template:
spec:
containers:
- name: order-service
image: order-service:v1.2.3
resources:
requests:
cpu: "500m" # 0.5 CPU cores guaranteed
memory: "512Mi" # 512 MB guaranteed
limits:
cpu: "1000m" # burst up to 1 CPU
memory: "1Gi" # burst up to 1 GB
Consequences of incorrect requests:
If requests are too low:
- Kubernetes thinks pod needs little resources
- Schedules too many pods on single node
- Node becomes CPU/memory constrained
- All pods on that node slow down (noisy neighbor problem)
- Service becomes unreliable due to starvation
If requests are too high:
- Kubernetes thinks pod needs lots of resources
- Schedules fewer pods per node
- Many nodes underutilized
- Higher infrastructure cost
- Less fault tolerance (fewer replicas fit in cluster)
Right-sizing methodology (a small query sketch follows this list):
- Deploy service and let it run for 1 week
- Collect CPU/memory metrics (p95 utilization)
- Set request = p95 actual usage + 20% buffer
- Monitor for OOMKilled events (memory too low) or throttling (CPU too low)
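A sketch of steps 2–3 against the Prometheus HTTP API (the Prometheus address, pod selector, and 7-day window are assumptions to adapt to your environment):

```python
import requests  # assumes network access to the in-cluster Prometheus

PROM = "http://prometheus.monitoring:9090"  # hypothetical Prometheus service address

def p95_last_week(promql: str) -> float:
    """Run an instant query and return the single resulting value (0.0 if empty)."""
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

# Step 2: p95 of memory working set and CPU usage for the service over 7 days
mem_p95 = p95_last_week(
    'max(quantile_over_time(0.95, container_memory_working_set_bytes{pod=~"order-service.*"}[7d]))'
)
cpu_p95 = p95_last_week(
    'max(quantile_over_time(0.95, rate(container_cpu_usage_seconds_total{pod=~"order-service.*"}[5m])[7d:5m]))'
)

# Step 3: request = p95 actual usage + 20% buffer
print(f"suggested memory request: {mem_p95 * 1.2 / 2**20:.0f}Mi")
print(f"suggested cpu request: {cpu_p95 * 1.2 * 1000:.0f}m")
```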
Anti-Affinity Rules for Pod Distribution
Anti-affinity rules ensure pods are spread across nodes and AZs, preventing all replicas from being on the same failure domain.
apiVersion: apps/v1
kind: Deployment
metadata:
name: order-service
spec:
replicas: 10
template:
spec:
affinity:
podAntiAffinity:
# Try hard to spread pods across nodes (preferred)
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- order-service
topologyKey: kubernetes.io/hostname
# MUST spread pods across AZs (required)
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- order-service
topologyKey: topology.kubernetes.io/zone
containers:
- name: order-service
image: order-service:v1.2.3
# ... rest of config
Topology keys explained:
- kubernetes.io/hostname: Spread across nodes (host-level failure)
- topology.kubernetes.io/zone: Spread across AZs (AZ-level failure)
- topology.kubernetes.io/region: Spread across regions (region-level failure)
Comprehensive Kubernetes Deployment with All Reliability Features
apiVersion: apps/v1
kind: Deployment
metadata:
name: order-service
namespace: production
labels:
app: order-service
team: platform
spec:
replicas: 10
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 2 # max 2 extra pods during update
maxUnavailable: 0 # zero pods unavailable during update
selector:
matchLabels:
app: order-service
template:
metadata:
labels:
app: order-service
version: v1.2.3
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
serviceAccountName: order-service
securityContext:
runAsNonRoot: true
# Pod anti-affinity for high availability
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- order-service
topologyKey: topology.kubernetes.io/zone
# Init container for migrations (before main container starts)
initContainers:
- name: db-migration
image: order-service-migrate:v1.2.3
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: order-service-secrets
key: database-url
# Main application container
containers:
- name: order-service
image: order-service:v1.2.3
imagePullPolicy: IfNotPresent
ports:
- name: http
containerPort: 8080
protocol: TCP
- name: metrics
containerPort: 9090
protocol: TCP
# Environment configuration
env:
- name: LOG_LEVEL
value: "info"
- name: DATABASE_POOL_SIZE
value: "20"
- name: CACHE_TTL
value: "3600"
# Secrets
envFrom:
- secretRef:
name: order-service-secrets
- configMapRef:
name: order-service-config
# Resource requests (for scheduling)
resources:
requests:
cpu: "500m"
memory: "512Mi"
limits:
cpu: "1000m"
memory: "1Gi"
# Startup probe (for slow-starting apps)
startupProbe:
httpGet:
path: /health/startup
port: http
initialDelaySeconds: 0
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 30 # fail after 150 seconds
# Liveness probe (restart if unhealthy)
livenessProbe:
httpGet:
path: /health/live
port: http
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
# Readiness probe (remove from load balancer if not ready)
readinessProbe:
httpGet:
path: /health/ready
port: http
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
# Graceful shutdown
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 15"] # Allow in-flight requests to complete
# Volume mounts
volumeMounts:
- name: config
mountPath: /etc/order-service/config
readOnly: true
- name: tmp
mountPath: /tmp
# Security
securityContext:
runAsNonRoot: true
runAsUser: 1000
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
# Volumes
volumes:
- name: config
configMap:
name: order-service-config
- name: tmp
emptyDir: {}
# Graceful termination period
terminationGracePeriodSeconds: 30
# DNS policy
dnsPolicy: ClusterFirst
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: order-service-pdb
namespace: production
spec:
minAvailable: 2
selector:
matchLabels:
app: order-service
---
apiVersion: v1
kind: Service
metadata:
name: order-service
namespace: production
labels:
app: order-service
spec:
type: ClusterIP
selector:
app: order-service
ports:
- name: http
port: 80
targetPort: http
protocol: TCP
- name: metrics
port: 9090
targetPort: metrics
protocol: TCP
sessionAffinity: None
2.3: Stateful Service Reliability
While Kubernetes excels at managing stateless services, stateful services (databases, caches, message queues) require special attention.
PostgreSQL HA with Patroni
PostgreSQL is the organization's primary relational database. Patroni provides automatic failover and replicas management for high availability.
Architecture:
PostgreSQL Primary (postgres-0)
├─ Synchronous replica 1 (postgres-1)
├─ Synchronous replica 2 (postgres-2)
└─ Patroni cluster manager
├─ Distributed consensus (etcd)
├─ Automatic failover
└─ Replica management
How Patroni works (a status-check sketch follows these steps):
- Primary writes to replicas synchronously
- Patroni monitors primary health via distributed consensus (etcd)
- If primary becomes unresponsive (no heartbeat), Patroni automatically promotes best replica
- Newly promoted primary starts accepting writes
- Old primary (if it recovers) rejoins as replica
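Each Patroni node exposes a REST API (port 8008 in the StatefulSet below); a small sketch that asks the cluster who the current leader is and how far replicas lag. The hostname and lag threshold are assumptions, and the response shape follows Patroni's `/cluster` endpoint:

```python
import requests

PATRONI = "http://postgres-0.postgres.databases:8008"  # any member works; hostname assumed
LAG_ALERT_BYTES = 10 * 2**20  # example threshold: 10 MiB behind the primary

def cluster_status():
    """Return (leader_name, {replica_name: lag}) from Patroni's /cluster endpoint."""
    members = requests.get(f"{PATRONI}/cluster", timeout=5).json()["members"]
    leader = next(m["name"] for m in members if m["role"] == "leader")
    lag = {m["name"]: m.get("lag", 0) for m in members if m["role"] != "leader"}
    return leader, lag

leader, lag = cluster_status()
print(f"current primary: {leader}")
for replica, bytes_behind in lag.items():
    if isinstance(bytes_behind, int) and bytes_behind > LAG_ALERT_BYTES:
        print(f"WARNING: {replica} is {bytes_behind} bytes behind the primary")
```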
StatefulSet configuration:
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgres
namespace: databases
spec:
serviceName: postgres
replicas: 3
selector:
matchLabels:
app: postgres
template:
metadata:
labels:
app: postgres
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- postgres
topologyKey: topology.kubernetes.io/zone
serviceAccountName: postgres
containers:
- name: postgres
image: postgres:15
ports:
- name: postgresql
containerPort: 5432
env:
- name: POSTGRES_DB
value: production
- name: PGDATA
value: /var/lib/postgresql/data/pgdata
resources:
requests:
cpu: "2"
memory: "4Gi"
limits:
cpu: "4"
memory: "8Gi"
livenessProbe:
exec:
command:
- /bin/sh
- -c
- pg_isready -U postgres
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
readinessProbe:
exec:
command:
- /bin/sh
- -c
- pg_isready -U postgres
initialDelaySeconds: 5
periodSeconds: 5
volumeMounts:
- name: postgresql-storage
mountPath: /var/lib/postgresql/data
- name: patroni
image: patroni:latest
env:
- name: ETCD_HOSTS
value: "etcd-0.etcd.databases:2379,etcd-1.etcd.databases:2379,etcd-2.etcd.databases:2379"
- name: PATRONI_POSTGRESQL_PARAMETERS
value: "synchronous_commit=on" # Synchronous replication
ports:
- name: patroni
containerPort: 8008
volumeClaimTemplates:
- metadata:
name: postgresql-storage
spec:
accessModes:
- ReadWriteOnce
storageClassName: fast-ssd
resources:
requests:
storage: 100Gi
Key parameters for 99.95% uptime:
| Parameter | Value | Rationale |
|---|---|---|
| Replicas | 3 | Can tolerate 1 failure; quorum requires 2/3 |
| Synchronous replicas | 2 | Writes wait for 2 replicas; RPO = 0 |
| Failover timeout | 10–15 seconds | Auto-promotes best replica |
| Health check interval | 5 seconds | Detects failure within 5 seconds |
| Replication lag monitoring | <1 second | Triggers alert if replicas fall behind |
Redis Sentinel for Cache High Availability
Redis caches frequently accessed data. While Redis failure doesn't corrupt data, it causes:
- Increased database load (no cache)
- Degraded performance
- Potential cascading failures
Redis Sentinel provides automatic failover for Redis clusters.
Architecture:
Redis Master (redis-master)
├─ Replica 1 (redis-replica-1)
├─ Replica 2 (redis-replica-2)
└─ Sentinel cluster (3 sentinels)
├─ Monitors master health
├─ Detects failure
└─ Promotes best replica automatically
Helm deployment:
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
name: redis-sentinel
namespace: databases
spec:
repo: https://charts.bitnami.com/bitnami
chart: redis
version: "17.x"
values:
architecture: "replication"
replica:
replicaCount: 2
sentinel:
enabled: true
quorum: 2
downAfterMilliseconds: 5000
failoverTimeout: 10000
persistence:
enabled: true
size: "50Gi"
resources:
requests:
cpu: "500m"
memory: "1Gi"
limits:
cpu: "1000m"
memory: "2Gi"
Client connection (automatic failover):
from redis.sentinel import Sentinel
# Sentinel automatically routes to current master
sentinel = Sentinel([('redis-sentinel-0', 26379),
('redis-sentinel-1', 26379),
('redis-sentinel-2', 26379)])
# Automatically fails over if master goes down
redis_master = sentinel.master_for('mymaster', socket_timeout=0.1)
# Use like normal Redis
redis_master.set('key', 'value')
value = redis_master.get('key')
Message Queue Clustering (Kafka/RabbitMQ)
Message queues must be highly available to avoid breaking asynchronous workflows.
Kafka cluster (recommended for this architecture):
apiVersion: v1
kind: ConfigMap
metadata:
name: kafka-config
namespace: queues
data:
server.properties: |
broker.id=0
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://kafka-0.kafka.queues.svc.cluster.local:9092
log.dirs=/var/lib/kafka-logs
num.network.threads=8
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
# Replication
default.replication.factor=3
min.insync.replicas=2 # Wait for 2 replicas before acking write
# Retention
log.retention.hours=168
log.segment.bytes=1073741824
# ZooKeeper
zookeeper.connect=zookeeper-0.zookeeper.queues.svc.cluster.local:2181,zookeeper-1.zookeeper.queues.svc.cluster.local:2181,zookeeper-2.zookeeper.queues.svc.cluster.local:2181
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: kafka
namespace: queues
spec:
serviceName: kafka
replicas: 3
selector:
matchLabels:
app: kafka
template:
metadata:
labels:
app: kafka
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- kafka
topologyKey: topology.kubernetes.io/zone
containers:
- name: kafka
image: confluentinc/cp-kafka:7.4.0
ports:
- name: plaintext
containerPort: 9092
env:
- name: KAFKA_BROKER_ID
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: KAFKA_ADVERTISED_LISTENERS
value: "PLAINTEXT://$(HOSTNAME).kafka.queues.svc.cluster.local:9092"
resources:
requests:
cpu: "2"
memory: "4Gi"
limits:
cpu: "4"
memory: "8Gi"
volumeMounts:
- name: datadir
mountPath: /var/lib/kafka-logs
volumeClaimTemplates:
- metadata:
name: datadir
spec:
accessModes:
- ReadWriteOnce
storageClassName: fast-ssd
resources:
requests:
storage: 100Gi
Kafka producer (at-least-once delivery):
from kafka import KafkaProducer
import json
producer = KafkaProducer(
    bootstrap_servers=['kafka-0.kafka.queues:9092',
                       'kafka-1.kafka.queues:9092',
                       'kafka-2.kafka.queues:9092'],
    acks='all',  # Wait for all in-sync replicas
    retries=3,
    compression_type='snappy',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')  # send() below passes a dict
)
# Publish with callback
def on_send_success(record_metadata):
print(f"Sent to {record_metadata.topic} partition {record_metadata.partition}")
def on_send_error(exc):
print(f"Error: {exc}")
future = producer.send('orders', {'order_id': 123, 'amount': 99.99})
future.add_callback(on_send_success)
future.add_errback(on_send_error)
producer.flush()
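The consuming side follows the same at-least-once discipline: disable auto-commit and commit offsets only after a message has been fully processed. A minimal sketch with the same kafka-python client (the consumer group name and `handle_order` are placeholders):

```python
from kafka import KafkaConsumer
import json

def handle_order(order: dict) -> None:
    """Placeholder for the real order-processing logic."""
    print(f"processing order {order.get('order_id')}")

consumer = KafkaConsumer(
    'orders',
    bootstrap_servers=['kafka-0.kafka.queues:9092',
                       'kafka-1.kafka.queues:9092',
                       'kafka-2.kafka.queues:9092'],
    group_id='order-processors',            # example consumer group
    enable_auto_commit=False,               # commit only after successful processing
    auto_offset_reset='earliest',
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

for message in consumer:
    try:
        handle_order(message.value)
        consumer.commit()                   # at-least-once: commit after processing succeeds
    except Exception as exc:
        # No commit: the message will be redelivered and retried
        print(f"processing failed, will be retried: {exc}")
```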
3: Observability Stack – Seeing Everything at Scale
Achieving 99.95% uptime requires complete visibility into the system. You cannot fix what you cannot see. This section covers the observability stack that enables visibility across 200+ microservices.
3.1: Metrics Collection & Analysis with Prometheus
Prometheus is a time-series database and monitoring system designed for Kubernetes and microservices.
Prometheus Architecture for 200+ Services
Scrape Targets (200+ services)
↓
Prometheus Server (time-series DB)
├─ Scrapes metrics every 15-30 seconds
├─ Stores in local TSDB
├─ Retains 15-30 days of data
└─ Evaluates alerting rules
↓
Alertmanager → PagerDuty/Slack (routing, grouping, firing alerts)
Remote Storage (optional)
↓ (long-term retention)
S3/GCS (Thanos, Cortex, etc.)
Scale considerations:
- 200 services × 100 metrics per service = 20,000 metric time series
- At a 15-second scrape interval: roughly 1,300 samples per second from those series (multiply by the number of replicas actually scraped)
- Storage: ~50-100 GB per 15 days of retention
For reliability:
- Multiple Prometheus replicas (3+) scraping same targets
- Remote storage for long-term retention
- Thanos for multi-replica deduplication and long-term querying
Prometheus federation (hierarchical scraping for large scale):
# Leaf Prometheus instances (one per team/product)
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-leaf-config
data:
prometheus.yml: |
global:
scrape_interval: 15s
external_labels:
cluster: us-east-1
team: platform
scrape_configs:
# Scrape services owned by platform team
- job_name: 'platform-services'
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- platform
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_scrape]
action: keep
regex: 'true'
---
# Central Prometheus (scrapes from leaf instances)
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-central-config
data:
prometheus.yml: |
global:
scrape_interval: 30s
external_labels:
cluster: global
scrape_configs:
# Scrape from leaf Prometheus instances (federation)
- job_name: 'leaf-prometheus'
  metrics_path: /federate
  honor_labels: true
  params:
    'match[]':
      - '{__name__=~".+"}'  # in practice, federate only aggregated or recording-rule series
  static_configs:
- targets:
- 'prometheus-platform.us-east-1:9090'
- 'prometheus-products.us-east-1:9090'
- 'prometheus-data.us-east-1:9090'
- 'prometheus-infrastructure.us-east-1:9090'
Service-Level Metrics: RED Method
RED method defines three metrics for every service:
- Rate: Requests per second
- Errors: Error rate (failed requests)
- Duration: Request latency
These three metrics cover most reliability scenarios.
Instrumentation example in Python:
from prometheus_client import Counter, Histogram, Gauge
import time
# RED metrics
request_count = Counter(
'http_requests_total',
'Total HTTP requests',
['method', 'endpoint', 'status']
)
request_latency = Histogram(
'http_request_duration_seconds',
'HTTP request latency',
['method', 'endpoint'],
buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]
)
# Helper decorator
def track_metrics(func):
def wrapper(*args, **kwargs):
start = time.time()
try:
result = func(*args, **kwargs)
status = 200
return result
except Exception as e:
status = 500
raise
finally:
duration = time.time() - start
endpoint = func.__name__
request_latency.labels(method='POST', endpoint=endpoint).observe(duration)
request_count.labels(method='POST', endpoint=endpoint, status=status).inc()
return wrapper
@track_metrics
def create_order(order_data):
# Business logic
return process_order(order_data)
Prometheus queries (PromQL) for RED metrics:
# Rate (requests per second)
rate(http_requests_total{job="order-service"}[5m])
# Error rate (errors as % of total)
rate(http_requests_total{job="order-service",status=~"5.."}[5m]) /
rate(http_requests_total{job="order-service"}[5m])
# Latency (p95)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Combined SLO check: availability (error rate < 0.1%)
(
1 - (
rate(http_requests_total{status=~"5.."}[5m]) /
rate(http_requests_total[5m])
)
) > 0.999
Infrastructure Metrics: USE Method
USE method tracks:
- Utilization: % of time resource is in use (0-100%)
- Saturation: Queue depth/tasks waiting
- Errors: Errors encountered
Key infrastructure metrics:
# CPU utilization
container_cpu_usage_seconds_total
container_cpu_cfs_throttled_seconds_total # CPU throttling (limit too low or node oversubscribed)
# Memory utilization
container_memory_usage_bytes
container_memory_max_usage_bytes
# Disk usage
node_filesystem_avail_bytes # Available disk space
container_fs_usage_bytes # Container disk usage
# Network utilization
container_network_receive_bytes_total
container_network_transmit_bytes_total
# Disk I/O
node_disk_reads_completed_total
node_disk_writes_completed_total
node_disk_io_time_seconds_total
# Error metrics
node_filesystem_device_error # filesystem/device errors reported by node_exporter
node_network_receive_errs_total
PromQL for USE method monitoring:
# CPU utilization as % of the container's CPU limit (keep below ~70% for safety)
rate(container_cpu_usage_seconds_total[5m]) / (container_spec_cpu_quota / container_spec_cpu_period) * 100
# Memory utilization (should be <80%)
container_memory_usage_bytes / container_spec_memory_limit_bytes * 100
# Disk utilization (alert at >85%)
(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100
# Network saturation (dropped packets indicate saturation)
rate(node_network_receive_drop_total[5m])
Custom Application Metrics
Beyond RED and USE, applications expose domain-specific metrics.
from prometheus_client import Counter, Gauge, Histogram
import time
# Domain-specific metrics for order service
orders_in_progress = Gauge(
'orders_in_progress',
'Number of orders currently being processed'
)
orders_completed_total = Counter(
'orders_completed_total',
'Total orders completed',
['status'] # 'success', 'cancelled'
)
payment_processing_time = Histogram(
'payment_processing_seconds',
'Time spent processing payments'
)
inventory_stock_level = Gauge(
'inventory_stock_level',
'Current stock level',
['product_id']
)
class OrderService:
def process_order(self, order_id):
orders_in_progress.inc()
try:
# Process payment
start = time.time()
payment_result = self.process_payment(order_id)
payment_processing_time.observe(time.time() - start)
# Update inventory
self.update_inventory(order_id)
orders_completed_total.labels(status='success').inc()
except Exception as e:
orders_completed_total.labels(status='failed').inc()
raise
finally:
orders_in_progress.dec()
Metric Cardinality Management at Scale
A high-cardinality metric has many unique label value combinations, consuming excessive memory and disk space.
Bad example:
# ANTI-PATTERN: Using user ID as metric label
user_latency = Histogram(
'request_latency_seconds',
'Request latency per user',
['user_id'] # Could be millions of unique values!
)
# This creates millions of time series (one per user)
# Memory and disk explode
Good practice:
# Use bounded labels (few unique values)
request_latency = Histogram(
'request_latency_seconds',
'Request latency',
['service', 'endpoint', 'status_code'], # Limited cardinality
buckets=[...]
)
# For per-user metrics, use a separate index/database
# Query user metrics separately if needed
Cardinality monitoring:
# Alert when metric cardinality is too high
- alert: PrometheusHighMetricCardinality
expr: count(count by (__name__)({__name__=~".+"}) > 10000) > 50
for: 5m
annotations:
summary: "Prometheus instance has {{ $value }} high-cardinality metrics"
3.2: Centralized Logging with ELK Stack
While metrics answer "WHAT is happening?", logs answer "WHY?"
A single error might show up as a failed request metric, but logs reveal the root cause.
ELK Stack Architecture
Application Logs
↓
Filebeat (log shipper)
↓
Logstash (log processor/enricher)
↓
Elasticsearch (full-text search database)
↓
Kibana (visualization/query UI)
For 200+ services:
- Each service writes logs to stdout (12-factor app)
- Kubelet captures container logs
- Filebeat reads logs and ships to Logstash
- Logstash parses, enriches, filters logs
- Elasticsearch stores for search
- Kibana provides UI for searching and analyzing
Log volume at scale:
- 200 services × 100-1000 log lines per second = 20K-200K logs/sec
- At 500 bytes per log = 10-100 MB/sec ingestion rate
- Per day: ~900 GB to 9 TB
Elasticsearch cluster sizing:
- 10-30 data nodes (depending on retention)
- Replication factor 2 (redundancy)
- Retention: 7-30 days (older logs archival to S3)
Structured Logging Standards
Unstructured logs ("Something went wrong") are hard to search and analyze. Structured logs (JSON with fields) are queryable.
Example: Before (unstructured)
2025-12-10T12:34:56Z order-service: Error processing order 12345 for user john@example.com: database connection timeout
Hard to query: "How many order failures per user?"
Example: After (structured)
{
"timestamp": "2025-12-10T12:34:56Z",
"service": "order-service",
"level": "ERROR",
"request_id": "req-abc123def456",
"order_id": 12345,
"user_id": "user-789",
"error_type": "database_timeout",
"error_message": "database connection timeout after 5s",
"duration_ms": 5234,
"metadata": {
"region": "us-east-1",
"pod": "order-service-pod-5",
"node": "k8s-node-12"
}
}
Now easily queryable.
Structured logging in Python:
import json
import logging
import sys
from pythonjsonlogger import jsonlogger
# Configure JSON logging
logHandler = logging.StreamHandler(sys.stdout)
formatter = jsonlogger.JsonFormatter(
'%(asctime)s %(levelname)s %(name)s %(message)s'  # fields passed via `extra=` are added automatically
)
logHandler.setFormatter(formatter)
logger = logging.getLogger()
logger.addHandler(logHandler)
logger.setLevel(logging.INFO)
# Log with structured fields
logger.info(
"Order processed",
extra={
'timestamp': '2025-12-10T12:34:56Z',
'request_id': 'req-abc123',
'order_id': 12345,
'user_id': 'user-789',
'duration_ms': 245,
'status': 'success'
}
)
Log Retention and Cost Optimization
Storage is the largest cost component of ELK. Optimize:
1. Log sampling: For high-volume services, log only a percentage of requests.
import hashlib
import os

env = os.getenv('APP_ENV', 'production')  # environment flag used below (variable name assumed)
def should_log(request_id):
"""Log 10% of requests in production, 100% in dev."""
if env == 'dev':
return True
# Hash-based sampling for consistent trace context
hash_val = int(hashlib.md5(request_id.encode()).hexdigest(), 16)
return (hash_val % 100) < 10 # 10% sample
if should_log(request_id):
logger.info("Request details", extra=log_data)
2. Log retention tiers:
# Recent logs (7 days): Hot storage (fast, expensive)
# Warm logs (8-30 days): Warm storage (slower, cheaper)
# Archive (31+ days): S3 (very cheap)
elasticsearch.yml:
index.lifecycle.name: order-service-logs
index.lifecycle.rollover_alias: order-service-logs-write
ilm_policy:
phases:
hot:
min_age: 0d
actions:
rollover:
max_primary_shard_size: 50GB
warm:
min_age: 7d
actions:
set_priority:
priority: 50
forcemerge:
max_num_segments: 1
cold:
min_age: 30d
actions:
searchable_snapshot:
snapshot_repository: s3-repository
delete:
min_age: 90d
actions:
delete: {}
3. Field filtering: Don't log everything.
# ANTI-PATTERN: Logs credit card numbers
logger.info(f"Payment: {credit_card_number}")
# GOOD: Log hashed or masked values
logger.info(f"Payment: {credit_card_last_4}")
Search and Analysis Patterns
Common queries on logs for reliability:
Query 1: Errors by service (last hour)
GET logs-*/_search
{
"query": {
"bool": {
"must": [
{"match": {"level": "ERROR"}},
{"range": {"timestamp": {"gte": "now-1h"}}}
]
}
},
"aggs": {
"by_service": {
"terms": {"field": "service", "size": 50}
}
}
}
Query 2: Slow requests (p99 latency, from metrics rather than logs)
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
Query 3: Trace all requests for a user during an incident
GET logs-*/_search
{
"query": {
"match": {"user_id": "user-123"}
},
"sort": [{"timestamp": {"order": "asc"}}]
}
3.3: Distributed Tracing with Jaeger
With 200+ microservices, a single user request touches many services. Distributed tracing shows the entire path through all services.
Example trace:
User Request (5000ms total)
├─ API Gateway (100ms)
├─ Auth Service (150ms)
├─ Order Service (800ms)
│ ├─ Validate Order (100ms)
│ ├─ Product Service (250ms) ← slow!
│ │ └─ Database Query (200ms)
│ ├─ Inventory Service (150ms)
│ └─ Payment Service (300ms)
│ └─ External Payment API (280ms) ← very slow!
└─ Notification Service (200ms)
Distributed tracing shows exactly where latency comes from.
Jaeger Implementation
from jaeger_client import Config
import logging
def init_tracer(service_name):
config = Config(
config={
'sampler': {
'type': 'const', # Sample 100% (can reduce in prod)
'param': 1,
},
'logging': True,
'local_agent': {
'reporting_host': 'jaeger-agent.monitoring',
'reporting_port': 6831,
},
},
service_name=service_name,
validate=True,
)
return config.initialize_tracer()
tracer = init_tracer('order-service')
# Instrument a function
def process_order(order_id):
with tracer.start_active_span('process_order') as scope:
span = scope.span
span.set_tag('order_id', order_id)
# Nested spans
with tracer.start_active_span('validate_order') as nested:
validate_order(order_id)
with tracer.start_active_span('call_payment_service') as nested:
payment_result = call_payment_service(order_id)
nested.span.set_tag('payment_status', payment_result['status'])
return {'success': True}
Trace sampling strategy:
# Sample traces intelligently (is_slow_request is a placeholder for your own latency check)
import random

class ProbabilisticSampler:
def __init__(self, initial_rate=0.001):
self.rate = initial_rate
def should_sample(self, trace_id, error_occurred=False):
if error_occurred:
return True # Always sample errors
if is_slow_request(trace_id):
return True # Always sample slow requests
# Sample normal requests at low rate
return random.random() < self.rate
def adjust_rate(self, error_rate):
# If error rate is high, increase sampling
if error_rate > 0.01:
self.rate = min(0.1, self.rate * 2)
else:
self.rate = max(0.001, self.rate / 2)
Finding Performance Bottlenecks
Jaeger dashboards show (a small critical-path sketch follows this list):
- Critical path: Longest span in trace (the bottleneck)
- Span dependency: Which services call which
- Latency heatmap: Distribution of request latencies
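The critical-path idea can also be sketched offline: given the spans of one trace (the list below is hypothetical data shaped like the example trace above, durations in microseconds), rank operations by time spent to see where the request actually went slow.

```python
# Rank span operations by total time spent (illustrative data shaped like a Jaeger trace)
spans = [
    {"operationName": "process_order", "duration": 800_000},
    {"operationName": "call_payment_service", "duration": 300_000},
    {"operationName": "external_payment_api", "duration": 280_000},
    {"operationName": "product_service.get_product", "duration": 250_000},
    {"operationName": "validate_order", "duration": 100_000},
]

time_by_operation: dict[str, int] = {}
for span in spans:
    op = span["operationName"]
    time_by_operation[op] = time_by_operation.get(op, 0) + span["duration"]

for op, micros in sorted(time_by_operation.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(f"{op}: {micros / 1000:.0f} ms")
# The payment call chain dominates, matching the example trace above
```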
3.4: Alerting Strategy – Right Sizing Alerts
With proper monitoring, the next step is alerting. But bad alerting creates alert fatigue.
Alert Fatigue Prevention
Alert fatigue occurs when:
- Too many alerts fire → engineers ignore them → critical alerts missed
- Alerts are flaky (fire on transient issues) → engineers lose trust
- Alerts require manual investigation (no context) → slow MTTR
Consequences of alert fatigue:
- 60% of alerts are acknowledged but not acted on
- 70% of ignored alerts were not actionable
- Alert fatigue correlates with worse incident response
Tiered Alerting System (P0–P4)
Not all alerts are equal. Severity tiers ensure critical issues get attention while non-critical issues don't create noise.
| Severity | Examples | Action | Escalation |
|---|---|---|---|
| P0 (Critical) | Service completely unavailable, production data loss, SLA breach | Page on-call immediately (wake up) | Escalate to on-call manager after 10 min |
| P1 (Urgent) | Service degraded >50%, SLO breached, high error rate | Page on-call, respond within 15 min | Escalate if not acknowledged in 10 min |
| P2 (High) | Service degraded 10-50%, elevated error rate, elevated latency | Page on-call, respond within 30 min | Create ticket if not acknowledged in 20 min |
| P3 (Medium) | Elevated but within SLO, resource warning, non-critical path affected | Send Slack notification, investigate when available | Create ticket |
| P4 (Low) | Informational, suggestions for improvement, FYI | Log to dashboards, review during office hours | No immediate action required |
Example alert configuration:
groups:
- name: order-service-alerts
rules:
# P0: Complete outage
- alert: OrderServiceDown
expr: up{job="order-service"} == 0
for: 1m # Alert after 1 minute (quick response)
labels:
  severity: P0
annotations:
summary: "Order service is completely down"
description: "No order-service instances are responding"
runbook: "https://wiki.company.com/runbooks/order-service-down"
# P1: High error rate
- alert: OrderServiceHighErrorRate
expr: |
rate(http_requests_total{job="order-service",status=~"5.."}[5m]) /
rate(http_requests_total{job="order-service"}[5m]) > 0.05
for: 3m # Allow transient spikes
labels:
  severity: P1
annotations:
summary: "Order service error rate >5%"
description: "Error rate: {{ $value | humanizePercentage }}"
# P2: Elevated latency
- alert: OrderServiceHighLatency
expr: |
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="order-service"}[5m])) > 1.0
for: 5m
labels:
  severity: P2
annotations:
summary: "Order service p95 latency >1s"
# P3: Database connection pool exhausted
- alert: OrderServiceDBPoolExhausted
expr: |
pg_stat_activity_count{database="orders",state="active"} /
pg_settings_max_connections > 0.8
for: 10m
labels:
  severity: P3
annotations:
summary: "Order service database connection pool 80% exhausted"
# P4: Informational
- alert: OrderServiceHighMemory
expr: |
container_memory_usage_bytes{pod=~"order-service.*"} /
container_spec_memory_limit_bytes > 0.7
for: 15m
labels:
  severity: P4
annotations:
summary: "Order service memory usage >70%"
On-Call Rotation and Escalation
Proper on-call management:
- Rotation: Fair distribution of on-call duties
- Escalation: If primary on-call doesn't acknowledge in X minutes, escalate to manager
- Runbooks: Clear procedures for common alerts
- Blameless culture: Focus on fixing problems, not blaming on-call engineer
Example on-call schedule:
Week 1: Alice (primary), Bob (backup)
Week 2: Bob (primary), Charlie (backup)
Week 3: Charlie (primary), Dave (backup)
Week 4: Dave (primary), Alice (backup)
Escalation (sketched in code after these steps):
- Alert fires
- Page primary on-call (Alice)
- Alice has 5 minutes to acknowledge
- After 5 min, page manager
- Manager can escalate further to VP of Engineering
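The same escalation chain expressed as data and walked by a small loop (the paging function and acknowledgement windows are illustrative; in practice this policy lives in the paging tool itself):

```python
import time

# Escalation chain: who to page, and how long to wait for an acknowledgement
ESCALATION_CHAIN = [
    ("primary on-call", 5 * 60),      # e.g. Alice this week
    ("on-call manager", 10 * 60),
    ("vp-engineering", None),         # last resort, no further escalation
]

def page(target: str, alert: str) -> bool:
    """Send a page and return True if acknowledged in time (stubbed out here)."""
    print(f"PAGING {target}: {alert}")
    return False  # pretend nobody acknowledged, to walk the full chain

def escalate(alert: str):
    for target, ack_window in ESCALATION_CHAIN:
        if page(target, alert):
            return target
        if ack_window is not None:
            time.sleep(ack_window)    # wait for acknowledgement before escalating
    return None

escalate("P0: order-service down")
```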
Actionable vs Informational Alerts
Bad alert (not actionable):
- alert: HighMemory
expr: node_memory_MemAvailable_bytes < 1000000000
annotations:
summary: "High memory usage detected"
# Engineer sees this alert: Now what?
Good alert (actionable):
- alert: HighMemoryPressure
  expr: node_memory_MemAvailable_bytes < 1000000000
  for: 10m
  labels:
    severity: P2
  annotations:
    summary: "Node {{ $labels.node }} memory available < 1 GB"
    description: "This usually indicates a memory leak or misconfiguration"
    runbook: "https://wiki/memory-pressure-runbook"
    dashboard: "https://grafana/d/memory-details/{{ $labels.node }}"
    # Annotation values must be plain strings; point responders to the context
    # they need (container count, top memory consumers) via the dashboard and runbook.
Actionable runbook:
# Memory Pressure Runbook
## What's happening?
Node {{ $labels.node }} has less than 1 GB available memory, indicating:
- Pods are using more memory than expected
- Possible memory leak in a container
- Insufficient node capacity for workload
## Quick investigation
1. SSH to node: `ssh {{ $labels.node }}`
2. Check memory usage: `free -h`
3. Check top memory consumers: `docker stats --no-stream`
4. Check for memory leaks: `docker logs <container> | grep OOM`
## Actions to take
- **Short term**: Kill largest container or evict least critical pod
- **Medium term**: Add more nodes or resize existing nodes
- **Long term**: Right-size pod resource requests based on actual usage
## Escalate if
- Multiple nodes under memory pressure
- Unable to free memory
4: Self-Healing Automation – Reducing MTTR
Traditional monitoring detects problems; self-healing fixes them automatically without human intervention.
4.1: Automated Remediation
The goal: Reduce MTTR (Mean Time To Recovery) from 30+ minutes to <5 minutes through automation.
Kubernetes Self-Healing via Restart Policies
Kubernetes natively handles many failure scenarios:
apiVersion: apps/v1
kind: Deployment
metadata:
name: order-service
spec:
replicas: 3
template:
spec:
# Restart policy for automatic recovery
restartPolicy: Always # Always restart failed pods
containers:
- name: order-service
image: order-service:v1.2.3
# Readiness probe: remove from load balancer if unhealthy
readinessProbe:
httpGet:
path: /health/ready
port: 8080
failureThreshold: 2
periodSeconds: 5
# Liveness probe: restart if unhealthy
livenessProbe:
httpGet:
path: /health/live
port: 8080
failureThreshold: 3
periodSeconds: 10
How this self-heals:
- Pod becomes unhealthy (readiness probe fails)
- Kubernetes immediately removes pod from service load balancer (users not affected)
- If pod still unhealthy after failureThreshold × periodSeconds:
- Liveness probe kills pod
- RestartPolicy causes immediate restart
- Fresh container starts
- If new container becomes ready → back in load balancer
- User impact: <5 seconds outage for that pod
Automatic recovery of common failures:
| Failure | Self-Healing Mechanism | MTTR |
|---|---|---|
| Pod crash | RestartPolicy: Always | <10 seconds |
| OOM kill | Container killed by the kubelet, restartPolicy restarts it | <30 seconds |
| Deadlock in app | Liveness probe detects stuck pod, restart | <20 seconds |
| Database connection timeout | Readiness probe detects, traffic removed | <5 seconds |
| Memory leak (gradual) | Liveness probe based on memory threshold, restart | Depends on threshold |
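The probes above assume the service exposes separate liveness and readiness endpoints. A minimal sketch of what those handlers might look like, written here with Flask; the framework choice and the `database_reachable` check are illustrative, not the service's actual implementation.
```python
from flask import Flask, jsonify

app = Flask(__name__)

def database_reachable() -> bool:
    # Stand-in for a real dependency check (e.g. a SELECT 1 against the pool)
    return True

@app.route("/health/live")
def live():
    # Liveness answers only "can this process still serve HTTP at all?"
    # Never check dependencies here, or a database blip restarts every pod.
    return jsonify(status="alive"), 200

@app.route("/health/ready")
def ready():
    # Readiness answers "can this instance usefully take traffic right now?"
    if not database_reachable():
        return jsonify(status="not ready", reason="database unreachable"), 503
    return jsonify(status="ready"), 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```
Keeping dependency checks out of the liveness probe is what prevents a downstream outage from turning into a cluster-wide restart storm.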
Custom Operators for Complex Recovery
For scenarios beyond Kubernetes defaults, Kubernetes operators (custom controllers) automate recovery.
Example: Database failover operator
# Custom operator: automatically promotes replica if primary fails
import logging

import kopf

log = logging.getLogger(__name__)

# Helper functions (is_primary_healthy, get_replicas, select_best_replica,
# promote_replica, update_route, demote_to_replica, get_replication_lag,
# trigger_alert) are assumed to be implemented elsewhere in the operator.

@kopf.timer(
    'postgresql.io', 'v1', 'postgresqlclusters',
    interval=30,  # re-check the primary every 30 seconds
    labels={'ha-enabled': 'true'},
)
def ensure_primary_alive(spec, name, namespace, **kwargs):
"""Check if primary is alive; promote replica if not."""
primary_pod = f"{name}-primary"
primary_namespace = namespace
# Check if primary is healthy
if not is_primary_healthy(primary_pod, primary_namespace):
log.info(f"Primary {primary_pod} is unhealthy, promoting replica")
# Find best replica
replicas = get_replicas(name, namespace)
best_replica = select_best_replica(replicas)
# Promote replica
promote_replica(best_replica, primary_namespace)
# Update DNS/load balancer to point to new primary
update_route(f"{name}-primary", best_replica)
# Old primary becomes replica when it recovers
demote_to_replica(primary_pod, primary_namespace)
log.info(f"Promoted {best_replica} to primary")
@kopf.timer(
    'postgresql.io', 'v1', 'postgresqlclusters',
    interval=60,  # check replication lag every minute
    labels={'ha-enabled': 'true'},
)
def monitor_replication_lag(spec, name, namespace, **kwargs):
"""Alert if replication lag gets too high."""
replicas = get_replicas(name, namespace)
for replica in replicas:
lag = get_replication_lag(replica)
if lag > 10000: # 10 seconds
log.warning(f"High replication lag on {replica}: {lag}ms")
trigger_alert('high_replication_lag', {
'replica': replica,
'lag_ms': lag
})
Automated Rollback on Health Check Failure
When a new deployment causes errors, automatically rollback to previous version:
apiVersion: apps/v1
kind: Deployment
metadata:
name: order-service
annotations:
# Custom annotation for rollback operator
auto-rollback: "true"
health-check-threshold: "0.05" # Rollback if error rate >5%
spec:
replicas: 10
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 2
maxUnavailable: 0 # Zero unavailability
template:
metadata:
labels:
app: order-service
spec:
containers:
- name: order-service
image: order-service:v1.2.4
livenessProbe:
httpGet:
path: /health/live
port: 8080
failureThreshold: 3
Operator that monitors deployment health:
@kopf.on.field(
    'apps', 'v1', 'deployments',
    annotations={'auto-rollback': 'true'},  # matches the deployment annotation above
    field='status.conditions',
)
def rollback_on_failure(new, annotations, name, namespace, **kwargs):
    """Roll the deployment back if it stops progressing and errors spike."""
    # Only react when the Progressing condition flips to False (rollout stuck)
    conditions = {c.get('type'): c.get('status') for c in (new or [])}
    if conditions.get('Progressing') != 'False':
        return
    # Get current error rate
    error_rate = get_current_error_rate(name)
    threshold = float(annotations.get('health-check-threshold', '0.05'))
if error_rate > threshold:
log.warning(f"Deployment {name} has high error rate {error_rate:.2%}")
# Get previous revision
prev_revision = get_previous_revision(name, namespace)
if prev_revision:
log.info(f"Rolling back to revision {prev_revision}")
# Rollback
patch_deployment(name, namespace, {
'spec': {
'template': {
'metadata': {
'labels': {
'revision': prev_revision
}
}
}
}
})
trigger_alert('deployment_rollback', {
'deployment': name,
'error_rate': error_rate,
'prev_revision': prev_revision
})
Pod Autoscaling with Custom Metrics
Horizontal Pod Autoscaler (HPA) automatically scales pods based on metrics.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: order-service-hpa
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: order-service
minReplicas: 5 # Minimum instances
maxReplicas: 50 # Maximum instances
metrics:
# Scale based on CPU utilization
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70 # Scale up at 70% CPU
# Scale based on memory
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80 # Scale up at 80% memory
# Scale based on custom metric (queue depth)
- type: Pods
pods:
metric:
name: queue_depth
target:
type: AverageValue
averageValue: "50" # Scale up when avg queue > 50 items
# Behavior (scaling speed)
behavior:
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 min before scaling down
policies:
- type: Percent
value: 50 # Scale down by max 50% at a time
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 0 # Scale up immediately
policies:
- type: Percent
value: 100 # Scale up by max 100% at a time
periodSeconds: 30
Custom metric collection (Prometheus → HPA):
# Application exposes queue depth metric
from prometheus_client import Gauge
order_queue_depth = Gauge(
'order_queue_depth',
'Number of orders waiting in queue'
)
# A metrics adapter (e.g. prometheus-adapter or KEDA) exposes this metric through the
# custom metrics API; the HPA then adds pods when average queue_depth exceeds 50.
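To make the metric concrete, here is one way the application might keep that gauge up to date; the worker loop, port, and `fetch_queue_depth` helper are illustrative.
```python
import time

from prometheus_client import Gauge, start_http_server

order_queue_depth = Gauge(
    "order_queue_depth",
    "Number of orders waiting in queue",
)

def fetch_queue_depth() -> int:
    # Stand-in for a real queue query (SQS ApproximateNumberOfMessages, Redis LLEN, ...)
    return 42

if __name__ == "__main__":
    start_http_server(9090)  # exposes /metrics for Prometheus to scrape
    while True:
        order_queue_depth.set(fetch_queue_depth())
        time.sleep(15)  # roughly match the Prometheus scrape interval
```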
4.2: Chaos Engineering – Learning from Failures
Chaos engineering is the discipline of injecting failures intentionally to test system resilience and discover weaknesses before they cause real outages.
Controlled Failure Injection Methodology
Rather than hoping failures don't happen, chaos engineers deliberately break things in controlled ways:
- Hypothesis: "If the payment service goes down, orders should fail gracefully with user-friendly error message"
- Experiment: Kill the payment service
- Observation: What actually happens?
- Learning: Use observations to improve architecture
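This hypothesis/experiment/observe/learn loop can be scripted end to end. Below is a rough sketch using the official Kubernetes Python client, with a hypothetical `error_rate()` helper standing in for a Prometheus query; it illustrates the loop, not a production chaos tool.
```python
import random
import time

from kubernetes import client, config

def error_rate(service: str) -> float:
    """Hypothetical helper: query Prometheus for the service's current 5xx ratio."""
    return 0.0

def kill_random_pod(namespace: str, label_selector: str) -> str:
    """Experiment: delete one pod matching the selector."""
    config.load_kube_config()  # use load_incluster_config() when running inside the cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=label_selector).items
    victim = random.choice(pods)
    v1.delete_namespaced_pod(victim.metadata.name, namespace)
    return victim.metadata.name

# Hypothesis: killing one payment-service pod keeps the order error rate below 1%
baseline = error_rate("order-service")
victim = kill_random_pod("production", "app=payment-service")
time.sleep(120)  # give the system time to react

observed = error_rate("order-service")
print(f"Killed {victim}: error rate {baseline:.2%} -> {observed:.2%}")
print("PASS" if observed < 0.01 else "FAIL: investigate missing fallbacks or retries")
```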
Chaos Mesh for Kubernetes
Chaos Mesh is a chaos engineering platform for Kubernetes:
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: kill-order-service-pod
namespace: chaos
spec:
action: pod-failure # Make the selected pod unavailable for the duration (pod-kill deletes it outright)
mode: fixed
value: 1 # Kill 1 pod
selector:
namespaces:
- production
labelSelectors:
app: order-service
duration: 5m # Run for 5 minutes
scheduler:
cron: "0 11 * * 1-5" # Run on weekdays at 11 AM (business hours)
---
# Network latency chaos
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: add-order-service-latency
namespace: chaos
spec:
action: delay # Add network delay
mode: fixed-percent
value: "50" # Affect 50% of the matching pods
selector:
namespaces:
- production
labelSelectors:
app: order-service
delay:
latency: "500ms" # Add 500ms latency
jitter: "100ms"
duration: 10m
---
# Packet loss chaos
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: packet-loss-payment-service
namespace: chaos
spec:
action: loss # Drop packets
mode: fixed-percent
value: "20" # Affect 20% of the matching pods (the packet-loss rate is set below)
selector:
namespaces:
- production
labelSelectors:
app: payment-service
loss:
loss: "20%"
duration: 5m
---
# Disk failure chaos
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
name: database-io-error
namespace: chaos
spec:
action: latency # Add I/O latency
mode: fixed
value: 1
selector:
namespaces:
- databases
labelSelectors:
app: postgres
latency: "500ms"
duration: 3m
Game Days and Disaster Recovery Drills
Game day: Scheduled chaos engineering event where team practices incident response.
Example game day schedule:
9:00 AM - Team briefing, review objectives
9:15 AM - Chaos Mesh starts killing random pods
9:30 AM - First alert fires, team responds
10:00 AM - Escalated to VP of Engineering (test escalation path)
10:30 AM - Chaos stops, assessment begins
11:00 AM - Retrospective: What went wrong? What worked? What to improve?
12:00 PM - Post-mortem writeup and action items
Chaos scenarios for 99.95% uptime:
| Scenario | Chaos | Expected Behavior | Result |
|---|---|---|---|
| Pod failure | Kill 1 order-service pod | Service remains available (8 other pods) | ✅ PASS |
| Service dependency failure | Kill entire payment-service | Orders fail with graceful error | ✅ PASS |
| Database primary failure | Kill primary PostgreSQL | Automatic failover to replica (<30s) | ✅ PASS / ⚠️ Manual failover needed |
| Network latency | Add 1s latency to API calls | Requests time out, circuit breaker trips | ❌ FAIL - No fallback |
| Cascading failure | Kill API gateway + Auth service | Entire system unavailable | ❌ FAIL - Need backup gateway |
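The two failed scenarios are typical findings from early game days. The network-latency failure, for instance, is usually closed by adding an explicit timeout plus a degraded fallback around the dependency call. A minimal sketch follows; the payment endpoint URL and `enqueue_for_retry` helper are placeholders.
```python
import requests

PAYMENT_SERVICE = "http://payment-service/api/v1/charge"  # illustrative URL

def enqueue_for_retry(order: dict) -> None:
    # Stand-in for pushing the charge onto a durable queue for asynchronous retry
    print(f"queued order {order.get('id')} for asynchronous charge")

def charge_with_fallback(order: dict) -> dict:
    """Call payment-service with a hard timeout; degrade gracefully if it is slow or down."""
    try:
        resp = requests.post(PAYMENT_SERVICE, json=order, timeout=0.5)  # 500 ms budget
        resp.raise_for_status()
        return {"status": "charged", "details": resp.json()}
    except requests.exceptions.RequestException:
        # Fallback: accept the order, queue the charge, and tell the user it is pending
        enqueue_for_retry(order)
        return {"status": "pending", "details": "payment queued for retry"}

print(charge_with_fallback({"id": 123, "amount_cents": 4999}))
```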
4.3: Blue-Green & Canary Deployments
Deployments are a major source of incidents. Blue-green and canary strategies reduce deployment risk to near-zero.
Zero-Downtime Deployment Strategies
Blue-Green Deployment:
Before:
┌─────────────────────────┐
│ Load Balancer │
└───────────┬─────────────┘
│
┌───────┴────────┐
↓ ↓
┌────────────┐ ┌────────────┐
│ Blue v1.2 │ │ Green v1.2│ (inactive)
│ (active) │ │ │
└────────────┘ └────────────┘
After deployment (v1.3 ready):
┌─────────────────────────┐
│ Load Balancer │
└───────────┬─────────────┘
│
┌───────┴────────┐
↓ ↓
┌────────────┐ ┌────────────┐
│ Blue v1.2 │ │ Green v1.3│ (new version)
│ (active) │ │ (validated)│
└────────────┘ └────────────┘
After switch:
(single atomic switch via load balancer)
┌─────────────────────────┐
│ Load Balancer │
└───────────┬─────────────┘
│
┌───────┴────────┐
↓ ↓
┌────────────┐ ┌────────────┐
│ Blue v1.2  │ │ Green v1.3 │ (active)
│ (standby,  │ │            │
│  rollback  │ │            │
│  target)   │ │            │
└────────────┘ └────────────┘
If issue found: One switch back to Blue (v1.2)
No gradual shift → no partial failures
Kubernetes blue-green:
# Blue deployment (active)
apiVersion: apps/v1
kind: Deployment
metadata:
name: order-service-blue
namespace: production
spec:
replicas: 10
selector:
matchLabels:
app: order-service
slot: blue
template:
metadata:
labels:
app: order-service
slot: blue
spec:
containers:
- name: order-service
image: order-service:v1.2.3 # Current version
---
# Green deployment (new version, not receiving traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
name: order-service-green
namespace: production
spec:
replicas: 10
selector:
matchLabels:
app: order-service
slot: green
template:
metadata:
labels:
app: order-service
slot: green
spec:
containers:
- name: order-service
image: order-service:v1.2.4 # New version
---
# Service routes to blue (active)
apiVersion: v1
kind: Service
metadata:
name: order-service
namespace: production
spec:
selector:
app: order-service
slot: blue # Currently points to blue
ports:
- port: 80
targetPort: 8080
---
# Deployment procedure:
# 1. Deploy green with new version (no traffic yet)
# kubectl apply -f green-deployment.yaml
#
# 2. Test green (internal traffic, synthetic tests)
# kubectl port-forward svc/order-service-green 8080:80
#
# 3. Once validated, switch service to green:
# kubectl patch service order-service -p '{"spec":{"selector":{"slot":"green"}}}'
#
# 4. If issue: switch back to blue
# kubectl patch service order-service -p '{"spec":{"selector":{"slot":"blue"}}}'
#
# 5. Once confident: update blue to new version, switch to blue, retire green
Canary Releases with Automated Rollback
Canary deployment gradually shifts traffic instead of atomic switch:
0 min: Blue 100%, Green 0%
5 min: Blue 95%, Green 5%
10 min: Blue 90%, Green 10%
15 min: Blue 75%, Green 25%
20 min: Blue 50%, Green 50%
25 min: Blue 25%, Green 75%
30 min: Blue 0%, Green 100%
During this time, monitor Green's error rate, latency, etc.
If metrics degrade, automatically rollback
Flagger + Istio canary:
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: order-service
namespace: production
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: order-service
# Service mesh control
service:
port: 80
targetPort: 8080
analysis:
interval: 1m # Check metrics every minute
threshold: 5 # Max 5 consecutive failed checks before rollback
maxWeight: 50 # Max 50% traffic to canary
stepWeight: 10 # Increase traffic by 10% every iteration
# Success criteria
metrics:
- name: request-success-rate # Flagger built-in success-rate metric
thresholdRange:
min: 99 # Must stay >99% success
interval: 1m
- name: request-duration # Flagger built-in latency metric (milliseconds)
thresholdRange:
max: 500 # p99 latency must stay <500ms
interval: 1m
# Custom metrics
- name: error_rate
query: |
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
thresholdRange:
max: 0.05 # Error rate <5%
# Set skipAnalysis to true only to promote without analysis (keep false in production)
skipAnalysis: false
# Alert notifications (Flagger alerts reference an AlertProvider resource)
alerts:
  - name: pagerduty-oncall
    severity: error # Flagger severities: info | warning | error
    providerRef:
      name: pagerduty # assumes an AlertProvider named "pagerduty" exists
      namespace: flagger
How Flagger works:
- New version (canary) is deployed alongside current version (stable)
- Service mesh (Istio) gradually shifts traffic to canary
- Flagger queries metrics (success rate, latency) from Prometheus
- If metrics stay healthy: continue shifting traffic
- If metrics degrade: immediately rollback to previous version
- If full shift succeeds: promote canary to stable, retire old version
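Each analysis interval ultimately reduces to a Prometheus query and a threshold comparison. A simplified sketch of the success-rate check Flagger effectively performs; the Prometheus address and job label here are assumptions.
```python
import requests

PROMETHEUS = "http://prometheus.monitoring:9090"  # assumed in-cluster address

def canary_success_rate(job: str = "order-service-canary") -> float:
    """Ratio of non-5xx requests over the last minute for the canary workload."""
    query = (
        f'sum(rate(http_requests_total{{job="{job}",status!~"5.."}}[1m])) / '
        f'sum(rate(http_requests_total{{job="{job}"}}[1m]))'
    )
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query}, timeout=5)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 1.0  # no traffic yet: treat as healthy

if canary_success_rate() < 0.99:   # mirrors thresholdRange.min: 99 above
    print("Canary unhealthy: roll back")
else:
    print("Canary healthy: shift more traffic")
```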
A/B Testing Infrastructure
While canary tests for regressions, A/B testing validates new features:
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: order-service
namespace: production
spec:
hosts:
- order-service
http:
# Treatment cohort gets the new version (v1.3); everyone else stays on stable (v1.2)
- match:
  - headers:
      x-user-cohort: # header set by the edge/auth layer for users in the A/B test
        exact: "treatment"
route:
- destination:
host: order-service
port:
number: 80
subset: v1-3
weight: 100 # All treatment users get v1.3
# Control group gets stable version
- route:
- destination:
host: order-service
port:
number: 80
subset: v1-2 # Stable version
weight: 100
---
# Track A/B test metrics
apiVersion: apps/v1
kind: Deployment
metadata:
name: order-service-v1-3
spec:
template:
metadata:
annotations:
prometheus.io/path: /metrics
prometheus.io/port: "9090"
spec:
containers:
- name: order-service
env:
# Tag all metrics with experiment ID
- name: EXPERIMENT_ID
value: "exp_new_checkout_flow_2025-12"
- name: EXPERIMENT_GROUP
value: "treatment"
Analyzing A/B test results:
# Compare conversion rate between control and treatment
# Treatment (new checkout flow)
conversion_rate_treatment =
sum(rate(order_completed{experiment_group="treatment"}[1h])) /
sum(rate(order_initiated{experiment_group="treatment"}[1h]))
# Control (stable)
conversion_rate_control =
sum(rate(order_completed{experiment_group="control"}[1h])) /
sum(rate(order_initiated{experiment_group="control"}[1h]))
# If conversion_rate_treatment > conversion_rate_control:
# Roll out new flow to all users
# Else:
# Investigate why new flow underperforms
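Before acting on a raw comparison like this, it is worth checking that the difference is statistically significant. Here is a small two-proportion z-test using only the standard library; the counts are illustrative.
```python
import math

# Illustrative counts collected from the queries above over the same window
control = {"initiated": 20_000, "completed": 2_300}     # 11.5% conversion
treatment = {"initiated": 20_000, "completed": 2_480}   # 12.4% conversion

p_c = control["completed"] / control["initiated"]
p_t = treatment["completed"] / treatment["initiated"]

# Pooled proportion and standard error for a two-proportion z-test
pooled = (control["completed"] + treatment["completed"]) / (
    control["initiated"] + treatment["initiated"]
)
se = math.sqrt(pooled * (1 - pooled) * (1 / control["initiated"] + 1 / treatment["initiated"]))
z = (p_t - p_c) / se

print(f"control={p_c:.3%} treatment={p_t:.3%} z={z:.2f}")
if z > 1.96:  # roughly the 95% confidence threshold
    print("Treatment wins: roll out the new checkout flow")
else:
    print("No significant difference yet: keep collecting data")
```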
5: Incident Response & Postmortems
Despite best efforts, incidents still occur. How you respond determines whether an hour of downtime becomes a catastrophe or a learning opportunity.
Incident Command Structure
Effective incident response requires clear roles:
| Role | Responsibilities |
|---|---|
| Incident Commander (IC) | Coordinates response, makes decisions, communicates status |
| Technical Lead | Diagnoses and fixes the problem |
| Communications Lead | Updates stakeholders (customers, management, team) |
| Scribe | Documents timeline, decisions, and actions taken |
Incident levels:
SEV-1 (Critical)
├─ Complete service outage or severe degradation
├─ Customer-facing impact
├─ Requires immediate response
└─ Page all on-call engineers
SEV-2 (High)
├─ Service degraded but partially functional
├─ Impact limited to subset of customers/features
├─ Requires quick response
└─ Page on-call engineer
SEV-3 (Medium)
├─ Service has issues but workaround exists
├─ Minimal impact to customers
└─ Can wait for next business day if after-hours
SEV-4 (Low)
├─ Informational or very limited impact
└─ Log and address during normal work
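Declaring incidents consistently is easier with a small helper that maps severity to who gets paged and where updates go. The sketch below uses placeholder webhook and team names.
```python
import json
import urllib.error
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/EXAMPLE"  # placeholder webhook URL

SEVERITY_POLICY = {
    "SEV-1": {"page": ["all-oncall", "incident-commander"], "channel": "#inc-sev1"},
    "SEV-2": {"page": ["primary-oncall"], "channel": "#inc-sev2"},
    "SEV-3": {"page": [], "channel": "#inc-backlog"},
    "SEV-4": {"page": [], "channel": "#inc-backlog"},
}

def declare_incident(severity: str, summary: str) -> None:
    """Page the right people and announce the incident channel for the given severity."""
    policy = SEVERITY_POLICY[severity]
    for target in policy["page"]:
        print(f"paging {target}")  # stand-in for a PagerDuty/Opsgenie API call
    message = {"text": f"[{severity}] {summary} (IC, comms lead, and scribe needed in {policy['channel']})"}
    req = urllib.request.Request(
        SLACK_WEBHOOK,
        data=json.dumps(message).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        urllib.request.urlopen(req, timeout=5)
    except urllib.error.URLError as exc:
        print(f"could not post to Slack: {exc}")  # the webhook above is only a placeholder

declare_incident("SEV-2", "Order processing latency >5s, error rate climbing")
```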
Runbook Development
Runbooks are step-by-step procedures for responding to common incidents.
Example: Database Replication Lag Runbook
# Database Replication Lag Incident
## Alert
- Alert: `DatabaseReplicationLagHigh`
- Threshold: >30 seconds
- Severity: P2
## Quick diagnosis
1. Check current replication lag:
   ```sql
   SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
   ```
2. Check replication slot state (run on the primary):
   ```sql
   SELECT slot_name, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots;
   ```
3. Check network traffic between primary and replica:
   ```bash
   iftop -i eth0 -n   # see bandwidth to the replica
   ```
## Common causes and fixes
### Cause 1: Network latency
- Symptom: high replication lag but the replica is responsive
- Fix: check network connectivity
  ```bash
  ping <replica-ip>
  mtr <replica-ip>   # traceroute with packet loss
  ```
### Cause 2: Replica falling behind due to heavy writes
- Symptoms: high write volume on the primary; replica CPU/disk I/O maxed out
- Fix:
  - Reduce write load on the primary (pause non-critical writes)
  - Scale up the replica (bigger instance type)
### Cause 3: Replica crashed or hung
- Symptoms: lag keeps increasing; replica unresponsive
- Fix:
  - Kill the hung process: `kill -9 <pid>`
  - Restart PostgreSQL: `systemctl restart postgresql`
  - Monitor recovery: `tail -f /var/log/postgresql/postgresql.log`
## When to escalate
- Cannot reduce replication lag below 10 seconds within 15 minutes
- Replica completely unreachable (network issue)
- Data corruption detected
## Escalation path
1. Escalate to the Database Team Lead if not resolved in 15 minutes
2. Contact AWS Support if an infrastructure issue is suspected
### Real Incident Case Study (Anonymized)
**Incident: Order Processing Delayed - Case Study**
**Timeline:**
14:32 - Monitoring alert: "Order processing latency >5s" (P2). Engineering on-call (Alice) is paged.
14:35 - Alice acknowledges the alert and starts investigating. The dashboard shows average order processing time of 8.5s (vs. a normal 200ms). She opens an incident in Slack: #incident-channel.
14:37 - Severity escalates to P1: "Order success rate dropping." Error rate hits 15% (threshold: 5%). Communications lead (Bob) is paged to notify customers.
14:39 - Alice identifies the bottleneck: payment-service latency is 10s+. She pulls a trace in Jaeger to see where time is spent.
14:42 - Alice calls the technical lead (Charlie) about payment-service. Charlie: "We haven't changed anything. Let me check." The payment-service logs show "Database connection pool exhausted."
14:45 - Charlie: "order-service might be leaking database connections." order-service shows 485/500 max connections in use; connections are not being returned to the pool.
14:47 - Root cause identified: a database connection leak in order-service. The pool slowly fills up and starves new requests.
14:48 - Quick mitigation: restart the order-service pods to clear the connection pool. Immediate improvement: latency drops to 500ms, error rate to 0.5%.
14:51 - Second-order issue: payment-service is still slow. Charlie: "payment-service has the same connection leak." The payment-service pods are restarted.
14:54 - System fully recovered; all metrics normal. Alice: "Incident resolved. Postmortem in 2 hours."
15:00 - Bob notifies customers: "Issue resolved, thank you for your patience."
18:00 - Postmortem meeting (Alice, Charlie, Bob, and two engineers from each team)
**Timeline diagram:**
14:32 ├─ Latency alert fires, Alice paged
14:35 ├─ Alice investigates
14:37 ├─ Error rate climbs, P1 alert, customers notified
14:47 ├─ Root cause identified: connection leak
14:48 ├─ Mitigation: restart order-service
14:51 ├─ payment-service also restarted
14:54 ├─ Full recovery
      └─ Total incident duration: 22 minutes
### Blameless Postmortem Process
**Postmortem goal**: Learn from incident, improve systems so it doesn't happen again.
**NOT to assign blame** - blame erodes psychological safety and prevents honest learning.
**Postmortem template:**
# Postmortem: Order Processing Outage - 2025-12-10 14:32-14:54
## Summary
- Duration: 22 minutes
- Impact: Order processing delayed, 15% error rate
- Affected customers: ~2,000 active users
- Estimated revenue impact: ~$15,000
## Timeline
- 14:32: Latency alert fires
- 14:37: Error rate exceeds threshold, P1 alert triggered
- 14:42: Root cause identified (connection leak)
- 14:48: Payment service restarted, recovery begins
- 14:54: Full recovery, all metrics normal
## Root Cause Analysis
### Primary cause
Connection pool leak in order-service and payment-service:
```python
# BEFORE (buggy code)
def process_order(order_id):
    connection = db_pool.get_connection()
    try:
        return process_payment(order_id, connection)
    except Exception:
        pass  # BUG: exception swallowed and connection never returned to the pool

# AFTER (fixed)
def process_order(order_id):
    connection = db_pool.get_connection()
    try:
        return process_payment(order_id, connection)
    finally:
        db_pool.return_connection(connection)  # Always return the connection
```
### Why it wasn't caught earlier
- Connection leak is gradual (accumulates over hours)
- No automated pool exhaustion alert (added today)
- Code review didn't catch exception path
- No load testing to trigger the issue
## Impact
- Customers unable to place orders for 22 minutes
- ~500 orders failed
- Customer support team fielded complaints
- Reputation impact: Some customers expressed frustration on Twitter
## What went well
- Alert response: Paged on-call engineer within 2 minutes
- Investigation: Root cause identified within 10 minutes
- Mitigation: Restart pods quickly restored service
- Communication: Customer communication timely and honest
- Postmortem: Team focused on learning, not blame
## What could have been better
- No connection pool exhaustion alert (NEW: Added P2 alert)
- Connection pool leak not caught in code review
- No integration tests for exception paths (NEW: Added test)
- No load testing in staging before deploy
## Action items
| Priority | Action | Owner | Due |
|---|---|---|---|
| P0 | Add connection pool exhaustion alert | SRE-Alice | 2025-12-11 |
| P1 | Add integration test for connection cleanup on exception | Eng-Charlie | 2025-12-13 |
| P1 | Add load testing to staging pipeline | Infra-Dave | 2025-12-17 |
| P2 | Review all database connection pools for similar leaks | Eng-Team | 2025-12-20 |
| P3 | Add connection pool metrics to dashboard | SRE-Alice | 2025-12-24 |
## Lessons learned
- Connection leaks are insidious: gradual failure, no obvious cause
- Missing metrics (pool exhaustion) delayed diagnosis
- Multiple services with same bug = need systematic code review
- Load testing in staging would have caught this
## Process improvements
- Implement automatic load testing before deploy
- Add resource exhaustion alerts for all pools (DB, HTTP, thread, etc.)
- Require exception handling code review checklist
- Monthly "connection leak" audit across all services
6: Results & Reliability Metrics
After 12 months of implementing the practices outlined above, the organization achieved:
Uptime Improvement Timeline: 98.2% → 99.95%
Month 1 (Baseline): 98.2% uptime (~13 hours of downtime that month)
├─ Incidents: 3 major
├─ MTTR: 42 minutes
└─ MTBF: 6 days
Month 2-3 (Architecture): 98.7% uptime
├─ Multi-AZ deployment completed
├─ Database failover automated
├─ Incidents: 2 major
└─ MTTR: 35 minutes
Month 4-6 (Monitoring): 99.1% uptime
├─ Prometheus + Grafana deployed
├─ Alerting rules refined
├─ Incidents: 1 major, 2 minor
└─ MTTR: 18 minutes
Month 7-9 (Self-Healing): 99.5% uptime
├─ Automated pod restart policies
├─ Custom operators for complex recovery
├─ Incidents: 1 minor, several prevented
└─ MTTR: 8 minutes
Month 10-12 (Chaos & Culture): 99.95% uptime
├─ Chaos engineering practices
├─ Blameless postmortems
├─ Preventive measures
├─ Incidents: 0 major (prevented)
└─ MTTR: 3 minutes (when incidents occur)
MTTR and MTBF Optimization
| Metric | Baseline | After 12 Months | Improvement |
|---|---|---|---|
| MTTR (Mean Time To Recovery) | 45 minutes | 3 minutes | 93% reduction |
| MTBF (Mean Time Between Failures) | 6 days | 90 days | 1,400% improvement |
| Incident frequency (per month) | 4–5 | 0–1 | 80% reduction |
| Incident severity (avg) | P1/P2 | P3/P4 | Mostly prevented |
| Customers affected per incident | 2,000–5,000 | <100 | 95% reduction |
Monthly Uptime Comparison
| Month | Uptime | Downtime | Incidents | Status |
|---|---|---|---|---|
| 2025-01 | 98.2% | 13.0h | 4 | Baseline |
| 2025-02 | 98.5% | 10.8h | 3 | Arch changes |
| 2025-03 | 98.9% | 7.9h | 2 | Monitoring |
| 2025-04 | 99.0% | 7.2h | 2 | Alerting tuning |
| 2025-05 | 99.2% | 5.8h | 1 | Self-healing |
| 2025-06 | 99.3% | 5.0h | 1 | Operators |
| 2025-07 | 99.5% | 3.6h | 1 | Chaos |
| 2025-08 | 99.6% | 2.9h | 0 | Culture |
| 2025-09 | 99.7% | 2.2h | 0 | Prevented |
| 2025-10 | 99.85% | 1.1h | 0 | Sustained |
| 2025-11 | 99.9% | 0.7h | 0 | Sustained |
| 2025-12 | 99.95% | 0.4h | 0 | Target achieved |
Cost of Reliability Investment vs Downtime Prevented
Investment breakdown:
| Category | Cost |
|---|---|
| Engineering (5 FTE × $150K × 1 year) | $750,000 |
| Infrastructure (HA, multi-region, redundancy) | $200,000 |
| Tools (Prometheus, Grafana, Jaeger, Chaos Mesh) | $50,000 |
| Training and hiring | $30,000 |
| Total investment | $1,030,000 |
Downtime prevented:
| Metric | Value |
|---|---|
| Downtime reduced (126h → 26.3h) | 99.7h |
| Cost per hour of downtime | $5,600 |
| Downtime cost prevented | $558,320 |
| Also prevented | |
| Lost customer subscription revenue | $800,000+ |
| Brand reputation damage | $500,000+ |
| SLA penalties | $200,000+ |
| Total value created | $2,058,320+ |
ROI: (Total value - Investment) / Investment = ($2,058,320 - $1,030,000) / $1,030,000 ≈ 100% ROI in the first year
Plus: the ongoing value of maintaining 99.95% uptime, on the order of $2 million per year in prevented downtime, churn, and SLA penalties.
Conclusion: Building a Reliability Culture
Achieving 99.95% uptime is not primarily a technical problem. While this article covered technology (Kubernetes, Prometheus, Istio, etc.), the real foundation is cultural.
Key Patterns with Biggest Impact
1. Shared ownership of reliability across all teams, not just "ops"
- Product teams own their service's reliability SLO
- Engineering teams write reliability tests
- Decision-making includes reliability trade-offs
2. Error budgets make reliability economically rational
- Teams no longer view reliability as constraint on velocity
- Instead: "We have 26 hours of downtime budget per year. How do we use it?"
- Natural incentive: spend that budget deliberately on feature velocity and planned change, not on accidents (a small budget calculator is sketched after this list)
3. Psychological safety enables honest incident postmortems
- Blame culture → engineers hide failures → repeat mistakes
- Blameless culture → engineers learn from failures → prevent recurrence
- Blame-focused postmortems are consistently associated with slower recovery and repeated incidents
4. Automation over heroics
- Don't hire better on-call engineers; reduce incidents and MTTR via automation
- Heroic 3am incident recovery is exhausting and unsustainable
- Self-healing systems allow engineers to focus on preventing problems
5. Observability enables prevention
- Traditional monitoring detects fires; observability predicts them
- Traces show bottlenecks before users complain
- Logs reveal root causes faster
- Metrics enable proactive scaling
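To make the error-budget pattern above concrete, here is a small calculator for a rolling 30-day window against the 99.95% target; the observed downtime figure is illustrative.
```python
SLO = 0.9995                   # 99.95% availability target
WINDOW_MINUTES = 30 * 24 * 60  # rolling 30-day window

budget_minutes = WINDOW_MINUTES * (1 - SLO)  # ~21.6 minutes per 30 days
observed_downtime_minutes = 7.5              # illustrative: taken from uptime monitoring

remaining = budget_minutes - observed_downtime_minutes
burn_rate = observed_downtime_minutes / budget_minutes

print(f"budget: {budget_minutes:.1f} min, used: {observed_downtime_minutes} min "
      f"({burn_rate:.0%}), remaining: {remaining:.1f} min")

# A common policy: if a large share of the budget is burned mid-window,
# pause risky releases and spend the rest of the window on reliability work.
if burn_rate > 0.66:
    print("Freeze risky deploys; prioritize reliability work")
else:
    print("Budget healthy; keep shipping features")
```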
When to Accept Lower Reliability
99.95% uptime is not always optimal. Consider lower targets when:
Internal/non-critical systems: Development infrastructure, analytics dashboards, internal tools
- Accept 99% uptime (87 hours/year downtime)
- Focus resources on customer-facing systems
- Still need some reliability (prevent major outages) but not 99.95%
Prototype/MVP stages: New product being validated with early customers
- Accept 99% uptime initially
- Graduate to 99.95% once product-market fit confirmed
- Avoid over-engineering before product is proven
Cost-benefit doesn't justify: For very low-traffic services
- 99.95% uptime costs $1M/year
- If service generates only $500K/year, 99.0% is more rational
- Optimize resource allocation to high-value services
User base can't absorb: Niche products with <1,000 users
- 99.95% uptime = 4.4 hours downtime/year = 0.37 hours/month
- For small user base, manual fixes adequate
- Automate only when scale justifies investment
Future Reliability Investments
Beyond 99.95%, moving toward 99.99% requires:
1. Global distribution (eliminate single-region failures)
- Multi-region active-active setup
- Automatic failover between regions
- Cost: ~3-5x infrastructure
2. Deeper observability (prevent invisible failures)
- Continuous synthetic testing (every second from multiple locations)
- Automated chaos engineering (daily controlled failures)
- Distributed tracing for every request (not sampled)
3. Smarter automation (faster remediation)
- ML-based anomaly detection (find problems humans miss)
- Predictive scaling (scale before metrics degrade)
- Automated root cause analysis
4. Business continuity (survive even extreme failures)
- Cross-region database consistency
- Backup payment processors
- Manual override procedures for when automation fails
Appendix: Technical Artifacts
Complete Prometheus Configuration
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: 'us-east-1'
scrape_configs:
# Kubernetes API server
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
# All pods (service discovery)
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: 'true'
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
# Alert rules
rule_files:
- '/etc/prometheus/alert-rules.yaml'
Complete Alert Rules
groups:
  - name: kubernetes-alerts
    rules:
      # Pod not ready
      - alert: PodNotReady
        expr: min(kube_pod_status_ready{condition="false"}) by (pod, namespace) == 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} not ready"
      # High error rate
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
            /
            sum(rate(http_requests_total[5m])) by (job)
          ) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} error rate > 5%"
          value: "{{ $value | humanizePercentage }}"
      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.job }} p95 latency > 1s"
          value: "{{ $value }}s"
      # Database connection pool exhausted
      - alert: DBConnectionPoolExhausted
        expr: |
          pg_stat_activity_count{state="active"} / pg_settings_max_connections > 0.8
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database {{ $labels.datname }} using > 80% of max_connections"
Final Thoughts
Achieving 99.95% uptime with 200+ microservices is hard work. It requires:
- Technical excellence: Multi-AZ redundancy, automated failover, observability
- Operational discipline: Clear incident procedures, blameless postmortems, continuous improvement
- Cultural foundation: Shared responsibility, psychological safety, error budgets
- Sustained investment: Not a one-time project but ongoing discipline
The organizations that achieve this level of reliability gain:
- Competitive advantage: Reliability becomes a selling point
- Team morale: Engineers proud of systems, not exhausted by incidents
- Financial stability: Downtime prevented = revenue protected + brand protected
For any team operating critical infrastructure, the principles in this article apply regardless of scale. Start small (99.9% uptime), master the fundamentals, then evolve toward higher targets.