⭐ Featured Article · Infrastructure & Operations · Advanced

Achieving 99.95% Uptime: Building Self-Healing Infrastructure for 200+ Microservices

A complete technical guide to architecting, deploying, and operating 200+ microservices with 99.95% uptime (4.4 hours downtime per year). Covers reliability engineering principles, multi-region architecture, observability at scale, self-healing automation, chaos engineering, and incident response. Includes detailed code examples, diagrams, and a proven roadmap from 98.2% to 99.95% uptime.

Yogesh Bhandari · 📅 December 3, 2025 · ⏱️ 150 min read
Code Examples · Implementation Guide

Tech Stack:

Kubernetes, Istio, Prometheus, Grafana, Jaeger, PostgreSQL, Redis, Kafka, RabbitMQ, PagerDuty, Elasticsearch, Logstash, Kibana, Vault, Consul, ArgoCD, Flagger, Chaos Mesh, Python, Go
#SRE · #Site Reliability Engineering · #Uptime · #Microservices · #Kubernetes · #High Availability · #Observability · #Monitoring · #Distributed Systems · #Self-Healing · #Infrastructure · #Incident Response · #Chaos Engineering · #Prometheus · #Jaeger · #Istio · #Service Mesh · #Resilience · #Disaster Recovery · #Production Operations

Introduction: The Cost of Downtime and the Path to 99.95% Uptime

In modern digital commerce and software-as-a-service (SaaS) businesses, downtime is not merely an operational inconvenience—it is a direct financial loss with cascading consequences. Every minute of unavailability translates into lost transactions, damaged customer relationships, brand reputation impact, and regulatory non-compliance.

The Financial Impact of Downtime

Consider a typical enterprise SaaS application serving thousands of customers:

  • Enterprise customers generating $50K–$500K in annual recurring revenue (ARR)
  • Average downtime impact: $5,600 per minute across the customer base
  • Broken down: 1–2 minutes of downtime = $5,600–$11,200 in lost revenue, customer churn risk, support costs, and brand damage

For a global platform with operations across multiple regions:

  • 5 minutes of unplanned downtime = $28,000 in immediate impact
  • 1 hour of downtime = $336,000 in direct and indirect costs
  • 8 hours of downtime (worst case) = $2.7 million

Beyond financial loss, downtime causes:

  • Customer churn: 23% of customers will switch to competitors after a single hour of downtime
  • Support load: Spike in support tickets, angry customers, brand reputation damage
  • Employee morale: Engineering teams stress, blame culture, burnout
  • Regulatory implications: SLA breaches, penalty clauses, potential legal liability

This financial reality makes reliability engineering not just a technical practice, but a core business imperative.

Baseline State: 98.2% Uptime and the Cost of Drift

The organization at the center of this case study began with 98.2% uptime—a respectable number on the surface but revealing when examined in detail:

98.2% uptime translates to:

  • ~158 hours of downtime per year (approximately 6.6 days)
  • ~13 hours per month of unplanned outages
  • ~3 hours per week of service disruption on average

For a company operating 200+ interdependent microservices, this level of reliability was insufficient for:

  • Enterprise SLA commitments (customers demanding 99.9%+ uptime)
  • Competitive positioning (competitors achieving 99.99%)
  • Board-level confidence (C-suite viewing reliability as competitive disadvantage)
  • Employee satisfaction (on-call teams exhausted by frequent incidents)

Target: 99.95% Uptime and the Acceptable Downtime Budget

The organization set a bold but achievable target: 99.95% uptime, sometimes described as "three and a half nines."

99.95% uptime translates to:

  • 4.4 hours of downtime per year (263 minutes)
  • ~22 minutes per month of allowed downtime
  • ~5 minutes per week of acceptable downtime

This is a critical distinction from 98.2%: the organization effectively cut its acceptable downtime from roughly 158 hours to 4.4 hours per year, a reduction of about 97%.

For context:

| Uptime % | Minutes/Year | Hours/Year | Days/Year | Days/Month |
|----------|--------------|------------|-----------|------------|
| 99.0%    | 5,256        | 87.6       | 3.65      | 0.30       |
| 99.5%    | 2,628        | 43.8       | 1.83      | 0.15       |
| 99.9%    | 526          | 8.76       | 0.37      | 0.03       |
| 99.95%   | 263          | 4.4        | 0.18      | 0.015      |
| 99.99%   | 52.6         | 0.88       | 0.04      | 0.003      |

Reaching 99.95% required architectural changes, operational discipline, and engineering rigor across every layer of the infrastructure.

The Challenge: Complexity at Scale

The organization operated 200+ microservices across multiple environments:

  • Production microservices: 180+ services
  • Infrastructure services: 35+ (databases, message queues, caches, monitoring)
  • Deployment frequency: 400–600 deployments per week
  • Peak traffic: 50,000 requests per second
  • Geographic distribution: 4 regions, 12 availability zones
  • Engineering teams: 150+ engineers across 20+ teams

Key challenges in this environment:

  1. Distributed complexity: 200+ services = 200+ potential points of failure
  2. Dependency chains: Critical requests touch 8–15 services; failure at any point breaks the chain
  3. Unequal reliability: Some services are battle-tested; others relatively new and fragile
  4. Organizational silos: Teams own services independently without visibility into how their reliability affects others
  5. Operational burden: Growing on-call load, alert fatigue, manual remediation consuming thousands of engineering hours annually

Why Traditional Monitoring Isn't Enough

Before addressing the solution, it's important to understand why traditional monitoring and alerting, while necessary, is insufficient for achieving 99.95% uptime:

Problem 1: Reactive vs Proactive

Traditional monitoring detects problems after they occur. A service fails, Prometheus alert fires, on-call engineer wakes up, investigates, and fixes.

Timeline: Failure → Detection (2–5 min) → Diagnosis (5–15 min) → Remediation (10–30 min) = 17–50 minute MTTR (Mean Time To Recovery)

At 99.95% uptime, you have only 5 minutes of acceptable downtime per week. A single incident often exceeds this budget.

Problem 2: Alert Fatigue

Monitoring 200+ services at traditional thresholds generates hundreds of alerts per day. Teams ignore alerts (alert fatigue), miss critical ones, and spend time managing noise instead of fixing root causes.

Problem 3: Invisible Dependencies

A service's degradation may not trigger its own alerts but affects downstream services. Traditional monitoring doesn't show the dependency graph or explain why a customer is experiencing poor performance.

Problem 4: Manual Remediation

Most outages are resolved through manual intervention—restarts, failovers, config updates. This process is:

  • Slow (10–60 minute MTTR)
  • Error-prone (risk of wrong commands, incomplete fixes)
  • Inconsistent (different engineers follow different procedures)

The Solution: Self-Healing, Autonomous Infrastructure

To achieve 99.95% uptime with 200+ microservices, the organization implemented an autonomous, self-healing infrastructure with these characteristics:

  1. Automatic failure detection and remediation (eliminating manual steps)
  2. Redundancy at every layer (no single point of failure)
  3. Graceful degradation (system prioritizes critical services, sheds load gracefully)
  4. Observability at scale (visibility into every service, request, and failure)
  5. Predictive health management (fixing problems before they cause outages)
  6. Rapid rollback (automated or one-click recovery from bad deployments)

The rest of this article details the architecture, systems, practices, and tools that enable this level of reliability.


1: Reliability Engineering Fundamentals

Achieving 99.95% uptime requires a foundation of reliability engineering principles and practices. This section covers the conceptual framework that guides all subsequent architectural and operational decisions.

SRE Principles Applied to a 200+ Microservices Organization

Site Reliability Engineering (SRE), pioneered at Google, is a discipline that treats operations as an engineering problem. Instead of viewing operations as distinct from development, SRE integrates reliability into every stage of system design and operation.

Core SRE principles applicable to this case:

Principle 1: Embrace Risk

Perfect reliability is impossible and economically unwise. A system with 100% uptime SLA requires:

  • Redundancy on every component (10x cost multiplier)
  • Extreme over-provisioning to handle any failure scenario
  • Inability to deploy or change (frozen code path)

Instead, SRE explicitly embraces acceptable risk through error budgets: "If we have a 99.95% uptime SLA, we can afford about 4.4 hours (263 minutes) of downtime per year. How do we use that budget wisely?"

Principle 2: Prioritize Reliability Over Velocity

When a team is operating within their error budget, they can deploy rapidly and innovate. When they've used their error budget (exceeded downtime quota for the month), they shift focus to reliability: more testing, less feature development, careful deployments.

This creates a natural incentive structure where teams optimize for reliability because it enables velocity.

Principle 3: Automate Toil

SRE teams cap toil, the repetitive manual work of operations (incident response, runbook execution, deployment orchestration, alert triage), at no more than 50% of their time.

The remaining time goes into eliminating that toil through:

  • Automation of common remediation tasks
  • Better tools and dashboards
  • Architectural improvements that reduce failure modes

The goal: reduce operational burden and free humans for higher-value work (architecture, capacity planning, incident prevention).

Principle 4: Measure Everything

"What gets measured gets managed." SRE emphasizes quantitative metrics for reliability:

  • SLOs and SLIs (Service Level Objectives and Indicators)
  • MTBF and MTTR (Mean Time Between Failures and Mean Time To Recovery)
  • Error budget consumption
  • Alert response time and resolution time

This creates a data-driven culture where reliability decisions are based on evidence, not gut feel.

Error Budgets and SLO/SLI Definitions

An error budget is the inverse of your SLA. If you commit to 99.95% uptime, you have a 0.05% error budget—or about 4.4 hours (263 minutes) of "allowed" downtime per year.

This budget is not spent haphazardly. Instead, it guides decision-making:

  • Within budget: Teams can deploy frequently, experiment, take calculated risks
  • Approaching budget: Deployment rate slows, focus shifts to testing and stability
  • Exhausted budget: Deployment freeze, all effort on stability improvements
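
The budget states above map naturally onto an automated policy check. Below is a minimal, illustrative sketch (not the organization's actual tooling) that turns a measured uptime figure into a deployment-policy recommendation; the measured value is assumed to come from your monitoring system.

# Minimal error-budget policy check (illustrative sketch, not production tooling)

def error_budget_status(slo_target: float, measured_uptime: float,
                        period_minutes: int = 30 * 24 * 60) -> dict:
    """Compare consumed downtime against the budget implied by the SLO."""
    budget_minutes = (1 - slo_target) * period_minutes         # allowed downtime
    consumed_minutes = (1 - measured_uptime) * period_minutes  # actual downtime
    remaining = budget_minutes - consumed_minutes

    if remaining > 0.5 * budget_minutes:
        policy = "within budget: deploy freely"
    elif remaining > 0:
        policy = "approaching budget: slow deployments, add testing"
    else:
        policy = "budget exhausted: freeze deployments, focus on stability"

    return {
        "budget_minutes": round(budget_minutes, 1),
        "consumed_minutes": round(consumed_minutes, 1),
        "remaining_minutes": round(remaining, 1),
        "policy": policy,
    }

# Example: 99.95% SLO with 99.97% measured uptime this month
print(error_budget_status(slo_target=0.9995, measured_uptime=0.9997))
python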

Defining SLIs and SLOs

SLI (Service Level Indicator): A measurable metric indicating whether your service is functioning as expected.

Examples:

  • Availability SLI: Percentage of requests that succeeded (non-5xx responses)
  • Latency SLI: Percentage of requests completed within 100ms
  • Error rate SLI: Percentage of requests without errors or timeouts

SLO (Service Level Objective): A target for your SLI, defining the acceptable level of service.

Examples:

  • Availability SLO: 99.95% of requests succeed
  • Latency SLO: 99.5% of requests complete within 100ms
  • Error rate SLO: <0.1% error rate

Hierarchy:

SLA (Service Level Agreement)
  ↓
  Commits to customer (legal, contractual)
  Typically: 99.5%–99.99% depending on service tier
  
SLO (Service Level Objective)
  ↓
  Internal target (what we aim for)
  Typically: 99.95%–99.99% (higher than SLA for buffer)
  
SLI (Service Level Indicator)
  ↓
  Measurement (how we know we're meeting SLO)
  Measured continuously via monitoring

Calculating Acceptable Downtime

For the organization's target of 99.95% uptime, the acceptable downtime is calculated as:

Total minutes/year = 365 days × 24 hours × 60 min = 525,600 minutes

Acceptable downtime = (1 - 99.95%) × 525,600 = 0.0005 × 525,600 = 262.8 minutes

In hours: 262.8 / 60 = 4.38 hours (~4 hours 23 minutes)

Per month: 4.38 / 12 = 0.365 hours (~22 minutes)

Per week: 4.38 / 52 = 0.084 hours (~5 minutes)

This calculation is crucial because it creates discipline around incident management:

  • If you have 5 minutes of downtime budget per week and experience a 7-minute outage, you've already exceeded your budget
  • This forces critical analysis: did you need to deploy that change? Could you have tested better? Was failover implemented correctly?
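
The same arithmetic is easy to wrap in a small helper, which makes checks like the 7-minute-outage example above trivial to script. A minimal sketch:

# Sketch: turn an uptime target into downtime allowances (minutes per period)

MINUTES = {"year": 525_600, "month": 43_200, "week": 10_080}

def allowed_downtime_minutes(slo_target: float) -> dict:
    """Downtime budget per period implied by an uptime SLO."""
    return {period: (1 - slo_target) * mins for period, mins in MINUTES.items()}

budget = allowed_downtime_minutes(0.9995)
print(budget)  # roughly {'year': 262.8, 'month': 21.6, 'week': 5.04}

# A single 7-minute outage already exceeds the weekly budget
print(7 > budget["week"])  # True
python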

Service Dependency Mapping

For 200+ microservices, understanding dependencies is critical. A failure in an upstream service can cascade through the system, causing apparent failures in unrelated downstream services.

The challenge: Capturing and maintaining accurate dependency maps with so many services.

The solution: Automated dependency mapping through distributed tracing and service mesh observability.

Dependency Graph Structure

User Requests
    ↓
API Gateway
    ├→ Authentication Service
    │  └→ Identity Provider
    ├→ Product Service
    │  ├→ Inventory Service
    │  │  └→ PostgreSQL (primary DB)
    │  │     └→ Redis Cache
    │  └→ Pricing Service
    │     └→ Configuration Service
    ├→ Order Service
    │  ├→ Product Service (see above)
    │  ├→ Payment Service
    │  │  └→ Payment Provider API
    │  ├→ Notification Service
    │  │  └→ Message Queue (Kafka)
    │  └→ Order Database
    │     └→ PostgreSQL (primary DB)
    └→ Recommendation Service
       └→ ML Model Service
          └→ Model Store (S3)

This dependency graph reveals:

  • Critical path services: Failures in API Gateway, Auth, or Order Service cascade to many downstream services
  • Single points of failure: If PostgreSQL primary goes down, multiple services are affected
  • Circular dependencies: Sometimes exist (e.g., Service A calls B, B calls A indirectly), which can cause cascading failures

Automated Dependency Discovery

Using a service mesh (Istio) with distributed tracing, dependencies are automatically discovered:

# Query Jaeger or Kiali (Istio dashboard) to get dependency graph
# This shows all services that call each other in production

# Example: Get all services that call "order-service" (upstream callers)
upstream_services = jaeger_client.query_services(called_by="order-service")
# Returns: [api-gateway]

# Get all services that order-service itself calls (downstream dependencies)
downstream_services = jaeger_client.query_services(calls="order-service")
# Returns: [product-service, payment-service, notification-service, order-db]
python

Single Points of Failure Identification

SPOF (Single Point of Failure) analysis is the process of identifying components whose failure would cause system-wide outage.

Common SPOFs in microservices architectures:

  1. Primary database (single master, no replica failover)
  2. API gateway (single instance handling all traffic)
  3. Message queue (single cluster, no replication)
  4. Cache layer (single Redis instance, no sentinel)
  5. Configuration service (centralized, no fallback)
  6. DNS (misconfigured or single provider)
  7. Load balancer (single NLB with no redundancy)

Identification methodology:

For each critical service:

  1. Trace the dependency path from user to service and back
  2. Identify each component in the path
  3. Ask: If this component fails, will the entire system fail?
  4. If yes: It's a SPOF and needs redundancy
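
Step 3 can be roughly automated by walking the dependency graph and flagging any component on a critical request path that has fewer than two healthy instances. The sketch below uses a hand-written graph and replica map purely for illustration; in practice both would come from the service mesh and the Kubernetes API.

# Sketch: flag single points of failure on critical request paths.
# The dependency graph and replica counts below are illustrative; real data
# would come from tracing (Kiali/Jaeger) and the Kubernetes API.

dependencies = {
    "api-gateway": ["auth-service", "order-service"],
    "auth-service": ["identity-provider"],
    "order-service": ["payment-service", "inventory-service", "order-db"],
    "payment-service": ["payment-provider-api"],
    "inventory-service": ["order-db"],
}

replicas = {
    "api-gateway": 4, "auth-service": 3, "order-service": 6,
    "payment-service": 3, "inventory-service": 2,
    "order-db": 1,                   # single primary, no failover -> SPOF
    "identity-provider": 2,
    "payment-provider-api": 1,       # external dependency outside our control
}

def critical_path(entry):
    """All components reachable from the entry point (i.e. on some request path)."""
    seen, stack = set(), [entry]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(dependencies.get(node, []))
    return seen

spofs = sorted(svc for svc in critical_path("api-gateway") if replicas.get(svc, 0) < 2)
print(spofs)  # ['order-db', 'payment-provider-api']
python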

Remediation for identified SPOFs:

| SPOF | Solution | Implementation |
|------|----------|----------------|
| Primary Database | Multi-master replication or hot standby | PostgreSQL HA with Patroni; automated failover |
| API Gateway | Multiple instances behind load balancer | Terraform + ALB/NLB across AZs |
| Message Queue | Cluster with replication | Kafka with 3+ brokers, replication factor 3 |
| Cache (Redis) | Redis Sentinel or Redis Cluster | Sentinel for failover; Cluster for sharding |
| Config Service | Multiple replicas with eventual consistency | Consul or etcd with 3+ nodes |
| DNS | Multiple DNS providers | Route53 + Cloudflare or similar |
| Load Balancer | Multi-region failover | AWS ALB/NLB across regions |

Service Dependency and Reliability Table

| Service | Role | Dependencies | Failure Impact | Criticality | SPOF Status |
|---------|------|--------------|----------------|-------------|-------------|
| API Gateway | Entry point for all traffic | Load balancer, Auth service | All user requests fail | CRITICAL | YES - requires redundancy |
| Auth Service | User authentication | Identity Provider, Token cache | Auth failures cascade | CRITICAL | Partial (depends on ID provider) |
| Order Service | Process orders | Payment Service, Inventory, DB | Orders cannot be placed | CRITICAL | Depends on DB failover |
| Product Service | Retrieve product data | Inventory Service, Cache | Browse/search broken | HIGH | Cache failover needed |
| Inventory Service | Track stock levels | Product DB | Stock visibility broken | HIGH | DB failover critical |
| Notification Service | Send emails, SMS | Message Queue, SMTP provider | Delayed notifications | MEDIUM | Depends on queue clustering |
| Recommendation Service | ML-based recommendations | ML Model Service, Cache | Recommendations unavailable | LOW | Can degrade gracefully |
| Payment Service | Process payments | Payment Provider API, Payment DB | Transactions fail | CRITICAL | Retry logic + DB failover |

2: Architecture for Resilience

With reliability principles established, we turn to architecture design that enables 99.95% uptime across 200+ microservices.

2.1: High Availability Design Patterns

High availability (HA) means designing systems to be continuously operational, even when individual components fail.

Multi-AZ Deployment Architecture

The first line of defense against downtime is geographic redundancy within a region.

AWS provides Availability Zones (AZs): isolated datacenters within a region with independent power, networking, and cooling. When properly architected, failure of one AZ should not affect your application.

Multi-AZ architecture principles:

Region (us-east-1)
├─ AZ 1 (us-east-1a)
│  ├─ Kubernetes Node 1 (m5.2xlarge)
│  ├─ Kubernetes Node 2 (m5.2xlarge)
│  ├─ RDS Primary Database
│  ├─ ElastiCache Redis Primary
│  └─ NAT Gateway
├─ AZ 2 (us-east-1b)
│  ├─ Kubernetes Node 3 (m5.2xlarge)
│  ├─ Kubernetes Node 4 (m5.2xlarge)
│  ├─ RDS Standby Database (synchronous replication)
│  ├─ ElastiCache Redis Replica
│  └─ NAT Gateway
└─ AZ 3 (us-east-1c)
   ├─ Kubernetes Node 5 (m5.2xlarge)
   ├─ Kubernetes Node 6 (m5.2xlarge)
   └─ (Services spread across AZs)

Key requirements:

  1. No single point of failure per AZ (each service has replicas across AZs)
  2. Synchronous replication for data (writes must be replicated before acknowledged)
  3. Load balancer spans all AZs (distributes traffic evenly)
  4. Stateless services where possible (any server can handle any request)

Cost of multi-AZ: ~2–3x infrastructure cost (due to redundancy), but worth it for critical applications.

Database Replication and Failover Strategies

Databases are often the most critical infrastructure component and the most complex to keep highly available.

Replication topologies:

1. Primary-Standby (Master-Slave) Replication

Primary DB (us-east-1a)  [writes + reads]
    ↓ (async replication)
Standby DB (us-east-1b)  [reads only]
    ↓ (promotion on failure)
Promoted Primary (us-east-1b)  [new writes]

Characteristics:

  • RPO (Recovery Point Objective): ~1–5 seconds (async replication lag)
  • RTO (Recovery Time Objective): ~30–60 seconds (failover + DNS propagation)
  • Asymmetric: Primary handles writes; standby handles reads
  • Failover must be triggered (manual or automated)

2. Multi-Master (Active-Active) Replication

Primary DB 1 (us-east-1a)  [writes + reads]
    ↕ (bidirectional replication)
Primary DB 2 (us-east-1b)  [writes + reads]

Characteristics:

  • RPO: ~1–10 seconds (asynchronous or quorum-based)
  • RTO: ~0 seconds (automatic failover, no switchover needed)
  • Both nodes can accept writes (risk of conflicts)
  • Higher complexity (conflict resolution needed)

3. Quorum-Based Replication (PostgreSQL with Patroni)

PostgreSQL Primary (us-east-1a)
    ↓
PostgreSQL Standby 1 (us-east-1b)  
    ↓
PostgreSQL Standby 2 (us-east-1c)

Patroni (distributed consensus)
├─ Monitors all replicas
├─ Detects primary failure
└─ Promotes best standby automatically

For the organization, PostgreSQL HA with Patroni was chosen:

  • RPO: effectively 0 for acknowledged writes (synchronous replication with quorum)
  • RTO: ~10–30 seconds (automated failover)
  • Automatic failover (no manual intervention needed)
  • Consistent read-after-write semantics

Load Balancing and Health Check Design

Load balancers distribute traffic across multiple servers and remove unhealthy servers from the pool.

Critical for HA: Health checks must be accurate and fast.

Bad health check: Checks only if TCP port is open (server could be hung but port is open)

Good health check: Sends actual HTTP request, expects specific response

// Good health check implementation
package main

import (
    "encoding/json"
    "log"
    "net/http"
)

// healthCheckDB, healthCheckRedis and healthCheckKafka are the service's own
// dependency checks, each returning nil when the dependency is reachable.

func healthHandler(w http.ResponseWriter, r *http.Request) {
    // Check critical dependencies
    checks := map[string]error{
        "database": healthCheckDB(),
        "cache":    healthCheckRedis(),
        "queue":    healthCheckKafka(),
    }
    
    allHealthy := true
    statusCode := http.StatusOK
    results := make(map[string]string, len(checks))
    
    for name, err := range checks {
        if err != nil {
            allHealthy = false
            statusCode = http.StatusServiceUnavailable
            results[name] = err.Error()
            log.Printf("Health check failed for %s: %v", name, err)
        } else {
            results[name] = "ok"
        }
    }
    
    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(statusCode)
    json.NewEncoder(w).Encode(map[string]interface{}{
        "healthy": allHealthy,
        "checks":  results, // report strings; raw error values don't marshal usefully
    })
}
go

Load balancer configuration (ALB):

# Terraform for ALB health checks
resource "aws_lb_target_group" "api_servers" {
  name     = "api-servers-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5  # seconds
    interval            = 10 # seconds, how often to check
    path                = "/health"
    matcher             = "200"  # expect 200 OK
  }

  deregistration_delay = 30  # graceful shutdown period
}
hcl

Circuit Breaker Implementation with Istio

A circuit breaker is a pattern that stops sending traffic to a failing service, allowing it to recover without being overwhelmed.

States:

  1. Closed: Normal operation, requests flow through
  2. Open: Too many errors, requests fail immediately (fast failure)
  3. Half-open: Test if service has recovered, allow some requests through

Istio implementation:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: order-service-circuit-breaker
  namespace: production
spec:
  host: order-service.production.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100  # max concurrent connections
      http:
        http1MaxPendingRequests: 50
        http2MaxRequests: 100
        maxRequestsPerConnection: 2
    outlierDetection:
      consecutive5xxErrors: 5  # open circuit after 5 consecutive 5xx
      interval: 30s  # check interval
      baseEjectionTime: 30s  # how long to keep service out
      maxEjectionPercent: 50  # max % of hosts that can be ejected
yaml

When the circuit breaker opens, clients get fast failures (immediately, no timeout waiting) and can fallback to alternate behavior:

# Client-side fallback
def get_order_details(order_id):
    try:
        return fetch_from_order_service(order_id, timeout=5)
    except CircuitBreakerOpen:
        # Service is failing, return cached data or degraded response
        return get_cached_order_details(order_id) or \
               {"id": order_id, "cached": True, "details_incomplete": True}
python

Bulkhead Pattern for Resource Isolation

The bulkhead pattern isolates critical resources to prevent a failure in one service from affecting others.

Example: Separate thread pools for different request types

// Without bulkheads: If order service has slow database queries,
// its thread pool fills up and blocks unrelated requests

// With bulkheads: Separate pools for different request types
ExecutorService criticalPool   = Executors.newFixedThreadPool(50);  // Orders, payments
ExecutorService backgroundPool = Executors.newFixedThreadPool(20);  // Analytics, recommendations
ExecutorService internalPool   = Executors.newFixedThreadPool(10);  // Internal admin requests

// Route requests to the appropriate pool (java.util.concurrent)
if (request.isCritical()) {
    return criticalPool.submit(request);
} else if (request.isBackground()) {
    return backgroundPool.submit(request);
} else {
    return internalPool.submit(request);
}
java

In Kubernetes, bulkheads are implemented via:

  • Namespaces: Separate pods by function or team
  • Resource quotas: Limit CPU/memory per namespace
  • Pod priority: Critical pods get resources first
  • Network policies: Restrict traffic between services

2.2: Kubernetes Reliability Features

For the organization running 200+ microservices on Kubernetes, leveraging Kubernetes's built-in reliability features was essential.

Pod Disruption Budgets (PDB)

Kubernetes frequently needs to voluntarily disrupt pods for:

  • Node maintenance (OS updates, security patches)
  • Node scaling (removing underutilized nodes)
  • Pod evictions (enforcing resource quotas)

Pod Disruption Budget defines how many pod replicas can be disrupted simultaneously without violating availability guarantees.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: order-service-pdb
  namespace: production
spec:
  minAvailable: 2  # at least 2 order-service pods must be running
  selector:
    matchLabels:
      app: order-service
  unhealthyPodEvictionPolicy: AlwaysAllow
  
  # Alternative: specify maxUnavailable
  # maxUnavailable: 1  # at most 1 replica can be disrupted at a time
yaml

How PDBs work:

  1. Before evicting a pod, Kubernetes checks the associated PDB
  2. If evicting would violate minAvailable, Kubernetes postpones the eviction
  3. If evicting is allowed, pod is gracefully shut down (terminationGracePeriod window)

Example scenario:

  • Order service has 3 replicas, PDB minAvailable=2
  • Node needs maintenance
  • Kubernetes can evict only 1 pod (leaving 2 available)
  • Other 2 pods stay running, serving requests
  • Service remains available with reduced capacity
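
A quick way to audit this before planned maintenance is to list PDBs and check how many disruptions each currently allows; a PDB showing zero allowed disruptions will stall node drains until more replicas become healthy. A small sketch using the official kubernetes Python client (the production namespace is illustrative):

# Sketch: list PDBs and flag those that currently allow zero disruptions,
# which will block node drains. Uses the official `kubernetes` client;
# the "production" namespace is illustrative.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
policy = client.PolicyV1Api()

for pdb in policy.list_namespaced_pod_disruption_budget("production").items:
    allowed = pdb.status.disruptions_allowed or 0
    marker = "BLOCKS DRAINS" if allowed == 0 else "ok"
    print(f"{pdb.metadata.name}: minAvailable={pdb.spec.min_available}, "
          f"disruptionsAllowed={allowed} [{marker}]")
python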

Liveness and Readiness Probes: Tuning for Reliability

Kubernetes uses health checks to know when a pod is healthy and ready for traffic.

Two probe types:

Liveness Probe: Is the pod alive? If not, restart it.

apiVersion: v1
kind: Pod
metadata:
  name: order-service-pod
spec:
  containers:
  - name: order-service
    image: order-service:v1.2.3
    
    # Liveness: restart if unhealthy
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
      initialDelaySeconds: 30  # wait 30s before first check
      periodSeconds: 10        # check every 10 seconds
      timeoutSeconds: 5        # wait 5s for response
      failureThreshold: 3      # restart after 3 consecutive failures
yaml

Readiness Probe: Is the pod ready to accept traffic?

    # Readiness: remove from load balancer if not ready
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      initialDelaySeconds: 5   # fast readiness check
      periodSeconds: 5         # check frequently
      timeoutSeconds: 3
      failureThreshold: 2      # remove from load balancer after 2 failures
yaml

Critical difference:

  • Liveness failure → pod is killed and restarted
  • Readiness failure → pod remains running but traffic is removed

Tuning considerations for 99.95% uptime:

  • initialDelaySeconds: Too short causes false failures on slow startups; too long delays detection. Set to 50–100% of typical startup time.
  • periodSeconds: Too long delays failure detection; too short creates probe traffic overhead. Typically 5–10 seconds.
  • failureThreshold: Lower threshold (2–3) reacts faster to failures; higher threshold (5+) tolerates transient hiccups. For 99.95%, recommend 2–3.

Bad example (causes false positives):

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 1  # WAY too aggressive, creates probe traffic storm
  failureThreshold: 1  # WAY too aggressive, restarts on single false positive
yaml

Resource Requests and Limits Optimization

Kubernetes schedules pods based on resource requests (guaranteed allocation) and limits (maximum).

Requests must be accurate for proper scheduling.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 10
  template:
    spec:
      containers:
      - name: order-service
        image: order-service:v1.2.3
        
        resources:
          requests:
            cpu: "500m"      # 0.5 CPU cores guaranteed
            memory: "512Mi"  # 512 MB guaranteed
          limits:
            cpu: "1000m"     # burst up to 1 CPU
            memory: "1Gi"    # burst up to 1 GB
yaml

Consequences of incorrect requests:

If requests are too low:

- Kubernetes thinks pod needs little resources
- Schedules too many pods on single node
- Node becomes CPU/memory constrained
- All pods on that node slow down (noisy neighbor problem)
- Service becomes unreliable due to starvation

If requests are too high:

- Kubernetes thinks pod needs lots of resources
- Schedules fewer pods per node
- Many nodes underutilized
- Higher infrastructure cost
- Less fault tolerance (fewer replicas fit in cluster)

Right-sizing methodology:

  1. Deploy service and let it run for 1 week
  2. Collect CPU/memory metrics (p95 utilization)
  3. Set request = p95 actual usage + 20% buffer
  4. Monitor for OOMKilled events (memory too low) or throttling (CPU too low)
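
Steps 2 and 3 can be scripted against the Prometheus HTTP API. The sketch below assumes a Prometheus reachable at http://prometheus:9090 and an illustrative pod-name pattern; treat its output as a starting point for requests, not a final answer.

# Sketch: derive request sizes from a week of p95 usage via the Prometheus
# HTTP API. The Prometheus URL and pod-name regex are illustrative.
import requests

PROM = "http://prometheus:9090"
POD_RE = "order-service-.*"

def prom_scalar(query):
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

# p95 over 7 days of the busiest pod's CPU (cores) and working-set memory (bytes)
cpu_p95 = prom_scalar(
    f'quantile_over_time(0.95, '
    f'max(rate(container_cpu_usage_seconds_total{{pod=~"{POD_RE}"}}[5m]))[7d:5m])'
)
mem_p95 = prom_scalar(
    f'quantile_over_time(0.95, '
    f'max(container_memory_working_set_bytes{{pod=~"{POD_RE}"}})[7d:5m])'
)

# request = p95 actual usage + 20% buffer
print(f"suggested cpu request:    {int(cpu_p95 * 1.2 * 1000)}m")
print(f"suggested memory request: {int(mem_p95 * 1.2 / 1024**2)}Mi")
python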

Anti-Affinity Rules for Pod Distribution

Anti-affinity rules, together with topology spread constraints, ensure pods are spread across nodes and AZs, preventing all replicas from landing in the same failure domain.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 10
  template:
    spec:
      affinity:
        podAntiAffinity:
          # Try hard to spread pods across nodes (preferred)
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - order-service
              topologyKey: kubernetes.io/hostname
          
      # Spread pods evenly across AZs. A required zone-level anti-affinity rule
      # would allow only one pod per zone (far fewer than the 10 replicas here),
      # so use a topology spread constraint for a hard, even spread instead.
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: order-service
      
      containers:
      - name: order-service
        image: order-service:v1.2.3
        # ... rest of config
yaml

Topology keys explained:

  • kubernetes.io/hostname: Spread across nodes (host-level failure)
  • topology.kubernetes.io/zone: Spread across AZs (AZ-level failure)
  • topology.kubernetes.io/region: Spread across regions (region-level failure)

Comprehensive Kubernetes Deployment with All Reliability Features

apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: production
  labels:
    app: order-service
    team: platform
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2        # max 2 extra pods during update
      maxUnavailable: 0  # zero pods unavailable during update
  
  selector:
    matchLabels:
      app: order-service
  
  template:
    metadata:
      labels:
        app: order-service
        version: v1.2.3
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    
    spec:
      serviceAccountName: order-service
      securityContext:
        runAsNonRoot: true
      
      # Spread pods across nodes (anti-affinity) and evenly across AZs (topology spread)
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: order-service
              topologyKey: kubernetes.io/hostname
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: order-service
      
      # Init container for migrations (before main container starts)
      initContainers:
      - name: db-migration
        image: order-service-migrate:v1.2.3
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: order-service-secrets
              key: database-url
      
      # Main application container
      containers:
      - name: order-service
        image: order-service:v1.2.3
        imagePullPolicy: IfNotPresent
        
        ports:
        - name: http
          containerPort: 8080
          protocol: TCP
        - name: metrics
          containerPort: 9090
          protocol: TCP
        
        # Environment configuration
        env:
        - name: LOG_LEVEL
          value: "info"
        - name: DATABASE_POOL_SIZE
          value: "20"
        - name: CACHE_TTL
          value: "3600"
        
        # Secrets
        envFrom:
        - secretRef:
            name: order-service-secrets
        - configMapRef:
            name: order-service-config
        
        # Resource requests (for scheduling)
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "1000m"
            memory: "1Gi"
        
        # Startup probe (for slow-starting apps)
        startupProbe:
          httpGet:
            path: /health/startup
            port: http
          initialDelaySeconds: 0
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 30  # fail after 150 seconds
        
        # Liveness probe (restart if unhealthy)
        livenessProbe:
          httpGet:
            path: /health/live
            port: http
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        
        # Readiness probe (remove from load balancer if not ready)
        readinessProbe:
          httpGet:
            path: /health/ready
            port: http
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 2
        
        # Graceful shutdown
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 15"]  # Allow in-flight requests to complete
        
        # Volume mounts
        volumeMounts:
        - name: config
          mountPath: /etc/order-service/config
          readOnly: true
        - name: tmp
          mountPath: /tmp
        
        # Security
        securityContext:
          runAsNonRoot: true
          runAsUser: 1000
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
      
      # Volumes
      volumes:
      - name: config
        configMap:
          name: order-service-config
      - name: tmp
        emptyDir: {}
      
      # Graceful termination period
      terminationGracePeriodSeconds: 30
      
      # DNS policy
      dnsPolicy: ClusterFirst

---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: order-service-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: order-service

---
apiVersion: v1
kind: Service
metadata:
  name: order-service
  namespace: production
  labels:
    app: order-service
spec:
  type: ClusterIP
  selector:
    app: order-service
  ports:
  - name: http
    port: 80
    targetPort: http
    protocol: TCP
  - name: metrics
    port: 9090
    targetPort: metrics
    protocol: TCP
  sessionAffinity: None
yaml

2.3: Stateful Service Reliability

While Kubernetes excels at managing stateless services, stateful services (databases, caches, message queues) require special attention.

PostgreSQL HA with Patroni

PostgreSQL is the organization's primary relational database. Patroni provides automatic failover and replicas management for high availability.

Architecture:

PostgreSQL Primary (postgres-0)
├─ Synchronous replica 1 (postgres-1)
├─ Synchronous replica 2 (postgres-2)
└─ Patroni cluster manager
   ├─ Distributed consensus (etcd)
   ├─ Automatic failover
   └─ Replica management

How Patroni works:

  1. Primary writes to replicas synchronously
  2. Patroni monitors primary health via distributed consensus (etcd)
  3. If primary becomes unresponsive (no heartbeat), Patroni automatically promotes best replica
  4. Newly promoted primary starts accepting writes
  5. Old primary (if it recovers) rejoins as replica
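
On the client side, applications do not need to track which node is currently primary: libpq (and therefore psycopg2) accepts a comma-separated host list plus target_session_attrs=read-write, so connections follow a Patroni failover automatically. A minimal sketch with illustrative hostnames and credentials:

# Sketch: connect through a Patroni failover using libpq multi-host support.
# Hostnames, database name and credentials are illustrative.
import psycopg2

conn = psycopg2.connect(
    host="postgres-0.postgres.databases,"
         "postgres-1.postgres.databases,"
         "postgres-2.postgres.databases",
    port=5432,
    dbname="production",
    user="app",
    password="change-me",
    target_session_attrs="read-write",  # only accept whichever node is primary
    connect_timeout=3,
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT pg_is_in_recovery()")
    print("connected to primary:", cur.fetchone()[0] is False)
python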

StatefulSet configuration:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: databases
spec:
  serviceName: postgres
  replicas: 3
  
  selector:
    matchLabels:
      app: postgres
  
  template:
    metadata:
      labels:
        app: postgres
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - postgres
            topologyKey: topology.kubernetes.io/zone
      
      serviceAccountName: postgres
      
      containers:
      - name: postgres
        image: postgres:15
        ports:
        - name: postgresql
          containerPort: 5432
        
        env:
        - name: POSTGRES_DB
          value: production
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        
        resources:
          requests:
            cpu: "2"
            memory: "4Gi"
          limits:
            cpu: "4"
            memory: "8Gi"
        
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - pg_isready -U postgres
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
        
        readinessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - pg_isready -U postgres
          initialDelaySeconds: 5
          periodSeconds: 5
        
        volumeMounts:
        - name: postgresql-storage
          mountPath: /var/lib/postgresql/data
      
      - name: patroni
        image: patroni:latest
        env:
        - name: ETCD_HOSTS
          value: "etcd-0.etcd.databases:2379,etcd-1.etcd.databases:2379,etcd-2.etcd.databases:2379"
        - name: PATRONI_POSTGRESQL_PARAMETERS
          value: "synchronous_commit=on"  # Synchronous replication
        
        ports:
        - name: patroni
          containerPort: 8008
      
  volumeClaimTemplates:
  - metadata:
      name: postgresql-storage
    spec:
      accessModes:
      - ReadWriteOnce
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 100Gi
yaml

Key parameters for 99.95% uptime:

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Replicas | 3 | Can tolerate 1 failure; quorum requires 2/3 |
| Synchronous replicas | 2 | Writes wait for 2 replicas; RPO = 0 |
| Failover timeout | 10–15 seconds | Auto-promotes best replica |
| Health check interval | 5 seconds | Detects failure within 5 seconds |
| Replication lag monitoring | <1 second | Triggers alert if replicas fall behind |

Redis Sentinel for Cache High Availability

Redis caches frequently accessed data. While Redis failure doesn't corrupt data, it causes:

  • Increased database load (no cache)
  • Degraded performance
  • Potential cascading failures

Redis Sentinel provides automatic failover for Redis clusters.

Architecture:

Redis Master (redis-master)
├─ Replica 1 (redis-replica-1)
├─ Replica 2 (redis-replica-2)
└─ Sentinel cluster (3 sentinels)
   ├─ Monitors master health
   ├─ Detects failure
   └─ Promotes best replica automatically

Helm deployment:

apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: redis-sentinel
  namespace: databases
spec:
  repo: https://charts.bitnami.com/bitnami
  chart: redis
  version: "17.x"
  values:
    architecture: "replication"
    replica:
      replicaCount: 2
    sentinel:
      enabled: true
      quorum: 2
      downAfterMilliseconds: 5000
      failoverTimeout: 10000
    persistence:
      enabled: true
      size: "50Gi"
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "1000m"
        memory: "2Gi"
yaml

Client connection (automatic failover):

from redis.sentinel import Sentinel

# Sentinel automatically routes to current master
sentinel = Sentinel([('redis-sentinel-0', 26379),
                     ('redis-sentinel-1', 26379),
                     ('redis-sentinel-2', 26379)])

# Automatically fails over if master goes down
redis_master = sentinel.master_for('mymaster', socket_timeout=0.1)

# Use like normal Redis
redis_master.set('key', 'value')
value = redis_master.get('key')
python

Message Queue Clustering (Kafka/RabbitMQ)

Message queues must be highly available to avoid breaking asynchronous workflows.

Kafka cluster (recommended for this architecture):

apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-config
  namespace: queues
data:
  server.properties: |
    broker.id=0
    listeners=PLAINTEXT://0.0.0.0:9092
    advertised.listeners=PLAINTEXT://kafka-0.kafka.queues.svc.cluster.local:9092
    log.dirs=/var/lib/kafka-logs
    num.network.threads=8
    num.io.threads=8
    socket.send.buffer.bytes=102400
    socket.receive.buffer.bytes=102400
    socket.request.max.bytes=104857600
    
    # Replication
    default.replication.factor=3
    min.insync.replicas=2  # Wait for 2 replicas before acking write
    
    # Retention
    log.retention.hours=168
    log.segment.bytes=1073741824
    
    # ZooKeeper
    zookeeper.connect=zookeeper-0.zookeeper.queues.svc.cluster.local:2181,zookeeper-1.zookeeper.queues.svc.cluster.local:2181,zookeeper-2.zookeeper.queues.svc.cluster.local:2181

---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
  namespace: queues
spec:
  serviceName: kafka
  replicas: 3
  
  selector:
    matchLabels:
      app: kafka
  
  template:
    metadata:
      labels:
        app: kafka
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - kafka
            topologyKey: topology.kubernetes.io/zone
      
      containers:
      - name: kafka
        image: confluentinc/cp-kafka:7.4.0
        
        ports:
        - name: plaintext
          containerPort: 9092
        
        env:
        - name: KAFKA_BROKER_ID
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: KAFKA_ADVERTISED_LISTENERS
          value: "PLAINTEXT://$(HOSTNAME).kafka.queues.svc.cluster.local:9092"
        
        resources:
          requests:
            cpu: "2"
            memory: "4Gi"
          limits:
            cpu: "4"
            memory: "8Gi"
        
        volumeMounts:
        - name: datadir
          mountPath: /var/lib/kafka-logs
  
  volumeClaimTemplates:
  - metadata:
      name: datadir
    spec:
      accessModes:
      - ReadWriteOnce
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 100Gi
yaml

Kafka producer (at-least-once delivery):

from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers=['kafka-0.kafka.queues:9092', 
                       'kafka-1.kafka.queues:9092',
                       'kafka-2.kafka.queues:9092'],
    acks='all',  # Wait for all in-sync replicas
    retries=3,
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),  # dicts -> JSON bytes
    compression_type='snappy'
)

# Publish with callback
def on_send_success(record_metadata):
    print(f"Sent to {record_metadata.topic} partition {record_metadata.partition}")

def on_send_error(exc):
    print(f"Error: {exc}")

future = producer.send('orders', {'order_id': 123, 'amount': 99.99})
future.add_callback(on_send_success)
future.add_errback(on_send_error)

producer.flush()
python
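
On the consuming side, at-least-once semantics come from disabling auto-commit and committing offsets only after a message has been processed successfully. A minimal kafka-python sketch (the consumer group and handler are illustrative):

from kafka import KafkaConsumer
import json

def handle_order(order):
    # Illustrative processing step
    print("processing order", order.get("order_id"))

consumer = KafkaConsumer(
    'orders',
    bootstrap_servers=['kafka-0.kafka.queues:9092',
                       'kafka-1.kafka.queues:9092',
                       'kafka-2.kafka.queues:9092'],
    group_id='order-workers',            # illustrative consumer group
    enable_auto_commit=False,            # commit only after successful processing
    auto_offset_reset='earliest',
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

for message in consumer:
    handle_order(message.value)
    consumer.commit()                    # at-least-once: commit after processing
python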

3: Observability Stack – Seeing Everything at Scale

Achieving 99.95% uptime requires complete visibility into the system. You cannot fix what you cannot see. This section covers the observability stack that enables visibility across 200+ microservices.

3.1: Metrics Collection & Analysis with Prometheus

Prometheus is a time-series database and monitoring system designed for Kubernetes and microservices.

Prometheus Architecture for 200+ Services

Scrape Targets (200+ services)
    ↓
Prometheus Server (time-series DB)
    ├─ Scrapes metrics every 15-30 seconds
    ├─ Stores in local TSDB
    ├─ Retains 15-30 days of data
    └─ Evaluates alerting rules
         ↓
         PagerDuty/Slack (firing alerts)

Remote Storage (optional)
    ↓ (long-term retention)
    S3/GCS (Thanos, Cortex, etc.)

Scale considerations:

  • 200 services × ~100 metrics each, multiplied by label combinations and pod replicas, comes to roughly 1–1.5 million active time series
  • At a 15-second scrape interval: ~86,000 samples ingested per second
  • Storage: ~50–100 GB per 15 days of retention (with TSDB compression)

For reliability:

  • Multiple Prometheus replicas (3+) scraping same targets
  • Remote storage for long-term retention
  • Thanos for multi-replica deduplication and long-term querying

Prometheus federation (hierarchical scraping for large scale):

# Leaf Prometheus instances (one per team/product)
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-leaf-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      external_labels:
        cluster: us-east-1
        team: platform
    
    scrape_configs:
    # Scrape services owned by platform team
    - job_name: 'platform-services'
      kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
          - platform
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_scrape]
        action: keep
        regex: 'true'

---
# Central Prometheus (scrapes from leaf instances)
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-central-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 30s
      external_labels:
        cluster: global
    
    scrape_configs:
    # Federate aggregated series from the leaf Prometheus instances
    - job_name: 'federate'
      honor_labels: true
      metrics_path: /federate
      params:
        'match[]':
        - '{__name__=~"job:.*"}'  # pull pre-aggregated series (e.g. recording rules named job:*)
      static_configs:
      - targets:
        - 'prometheus-platform.us-east-1:9090'
        - 'prometheus-products.us-east-1:9090'
        - 'prometheus-data.us-east-1:9090'
        - 'prometheus-infrastructure.us-east-1:9090'
yaml

Service-Level Metrics: RED Method

RED method defines three metrics for every service:

  1. Rate: Requests per second
  2. Errors: Error rate (failed requests)
  3. Duration: Request latency

These three metrics cover most reliability scenarios.

Instrumentation example in Python:

from prometheus_client import Counter, Histogram, Gauge
import time

# RED metrics
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

request_latency = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]
)

# Helper decorator
def track_metrics(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            result = func(*args, **kwargs)
            status = 200
            return result
        except Exception as e:
            status = 500
            raise
        finally:
            duration = time.time() - start
            endpoint = func.__name__
            request_latency.labels(method='POST', endpoint=endpoint).observe(duration)
            request_count.labels(method='POST', endpoint=endpoint, status=status).inc()
    return wrapper

@track_metrics
def create_order(order_data):
    # Business logic
    return process_order(order_data)
python

Prometheus queries (PromQL) for RED metrics:

# Rate (requests per second)
rate(http_requests_total{job="order-service"}[5m])

# Error rate (errors as % of total)
rate(http_requests_total{job="order-service",status=~"5.."}[5m]) / 
rate(http_requests_total{job="order-service"}[5m])

# Latency (p95)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Combined SLO check: availability (error rate < 0.1%)
(
  1 - (
    rate(http_requests_total{status=~"5.."}[5m]) / 
    rate(http_requests_total[5m])
  )
) > 0.999
promql

Infrastructure Metrics: USE Method

USE method tracks:

  1. Utilization: % of time resource is in use (0-100%)
  2. Saturation: Queue depth/tasks waiting
  3. Errors: Errors encountered

Key infrastructure metrics:

# CPU utilization
container_cpu_usage_seconds_total
container_cpu_cfs_throttled_seconds_total  # CPU throttling (CPU limit too low / oversubscribed)

# Memory utilization
container_memory_usage_bytes
container_memory_max_usage_bytes

# Disk usage
node_filesystem_avail_bytes  # Available disk space
container_fs_usage_bytes      # Container disk usage

# Network utilization
container_network_receive_bytes_total
container_network_transmit_bytes_total

# Disk I/O
node_disk_reads_completed_total
node_disk_writes_completed_total
node_disk_io_time_seconds_total

# Error metrics
node_disk_io_errs_total
node_network_receive_errs_total
yaml

PromQL for USE method monitoring:

# CPU utilization (should be <70% for safety)
rate(container_cpu_usage_seconds_total[5m]) * 100

# Memory utilization (should be <80%)
container_memory_usage_bytes / container_spec_memory_limit_bytes * 100

# Disk space still available as a percentage (alert when it drops below ~15%)
node_filesystem_avail_bytes / node_filesystem_size_bytes * 100

# Network saturation (dropped packets indicate saturation)
rate(node_network_receive_drop_total[5m])
promql

Custom Application Metrics

Beyond RED and USE, applications expose domain-specific metrics.

from prometheus_client import Gauge, Counter

# Domain-specific metrics for order service
orders_in_progress = Gauge(
    'orders_in_progress',
    'Number of orders currently being processed'
)

orders_completed_total = Counter(
    'orders_completed_total',
    'Total orders completed',
    ['status']  # 'success', 'cancelled'
)

payment_processing_time = Histogram(
    'payment_processing_seconds',
    'Time spent processing payments'
)

inventory_stock_level = Gauge(
    'inventory_stock_level',
    'Current stock level',
    ['product_id']
)

class OrderService:
    def process_order(self, order_id):
        orders_in_progress.inc()
        
        try:
            # Process payment
            start = time.time()
            payment_result = self.process_payment(order_id)
            payment_processing_time.observe(time.time() - start)
            
            # Update inventory
            self.update_inventory(order_id)
            
            orders_completed_total.labels(status='success').inc()
        except Exception as e:
            orders_completed_total.labels(status='failed').inc()
            raise
        finally:
            orders_in_progress.dec()
python

Metric Cardinality Management at Scale

A high-cardinality metric has many unique label value combinations, consuming excessive memory and disk space.

Bad example:

# ANTI-PATTERN: Using user ID as metric label
user_latency = Histogram(
    'request_latency_seconds',
    'Request latency per user',
    ['user_id']  # Could be millions of unique values!
)

# This creates millions of time series (one per user)
# Memory and disk explode
python

Good practice:

# Use bounded labels (few unique values)
request_latency = Histogram(
    'request_latency_seconds',
    'Request latency',
    ['service', 'endpoint', 'status_code'],  # Limited cardinality
    buckets=[...]
)

# For per-user metrics, use a separate index/database
# Query user metrics separately if needed
python

Cardinality monitoring:

# Alert when metric cardinality is too high
- alert: PrometheusHighMetricCardinality
  expr: count(count by (__name__) ({__name__=~".+"}) > 10000) > 50
  for: 5m
  annotations:
    summary: "Prometheus instance has {{ $value }} high-cardinality metrics"
yaml

3.2: Centralized Logging with ELK Stack

While metrics answer "WHAT is happening?", logs answer "WHY?"

A single error might show up as a failed request metric, but logs reveal the root cause.

ELK Stack Architecture

Application Logs
    ↓
Filebeat (log shipper)
    ↓
Logstash (log processor/enricher)
    ↓
Elasticsearch (full-text search database)
    ↓
Kibana (visualization/query UI)

For 200+ services:

  • Each service writes logs to stdout (12-factor app)
  • Kubelet captures container logs
  • Filebeat reads logs and ships to Logstash
  • Logstash parses, enriches, filters logs
  • Elasticsearch stores for search
  • Kibana provides UI for searching and analyzing

Log volume at scale:

  • 200 services × 100-1000 log lines per second = 20K-200K logs/sec
  • At 500 bytes per log = 10-100 MB/sec ingestion rate
  • Per day: ~900 GB to 9 TB

Elasticsearch cluster sizing:

  • 10-30 data nodes (depending on retention)
  • Replication factor 2 (redundancy)
  • Retention: 7-30 days (older logs archival to S3)

Structured Logging Standards

Unstructured logs ("Something went wrong") are hard to search and analyze. Structured logs (JSON with fields) are queryable.

Example: Before (unstructured)

2025-12-10T12:34:56Z order-service: Error processing order 12345 for user john@example.com: database connection timeout

Hard to query: "How many order failures per user?"

Example: After (structured)

{
  "timestamp": "2025-12-10T12:34:56Z",
  "service": "order-service",
  "level": "ERROR",
  "request_id": "req-abc123def456",
  "order_id": 12345,
  "user_id": "user-789",
  "error_type": "database_timeout",
  "error_message": "database connection timeout after 5s",
  "duration_ms": 5234,
  "metadata": {
    "region": "us-east-1",
    "pod": "order-service-pod-5",
    "node": "k8s-node-12"
  }
}
json

Now easily queryable.

Structured logging in Python:

import logging
import sys
from pythonjsonlogger import jsonlogger

# Configure JSON logging to stdout
logHandler = logging.StreamHandler(sys.stdout)
formatter = jsonlogger.JsonFormatter(
    '%(asctime)s %(levelname)s %(name)s %(message)s'
)
logHandler.setFormatter(formatter)

logger = logging.getLogger()
logger.addHandler(logHandler)
logger.setLevel(logging.INFO)

# Log with structured fields (everything passed via `extra` is added to the JSON)
logger.info(
    "Order processed",
    extra={
        'request_id': 'req-abc123',
        'order_id': 12345,
        'user_id': 'user-789',
        'duration_ms': 245,
        'status': 'success'
    }
)
python

Log Retention and Cost Optimization

Storage is the largest cost component of ELK. Optimize:

1. Log sampling: For high-volume services, log only a percentage of requests.

import hashlib
import os

env = os.environ.get("APP_ENV", "production")  # environment name, however your deployment provides it

def should_log(request_id):
    """Log 10% of requests in production, 100% in dev."""
    if env == 'dev':
        return True
    # Hash-based sampling: the decision is consistent for a given request ID
    hash_val = int(hashlib.md5(request_id.encode()).hexdigest(), 16)
    return (hash_val % 100) < 10  # 10% sample

if should_log(request_id):
    logger.info("Request details", extra=log_data)
python

2. Log retention tiers:

# Recent logs (7 days): Hot storage (fast, expensive)
# Warm logs (8-30 days): Warm storage (slower, cheaper)
# Archive (31+ days): S3 (very cheap)

index settings (applied via an index template):
  index.lifecycle.name: order-service-logs
  index.lifecycle.rollover_alias: order-service-logs-write

ilm_policy:
  phases:
    hot:
      min_age: 0d
      actions:
        rollover:
          max_primary_shard_size: 50GB
    warm:
      min_age: 7d
      actions:
        set_priority:
          priority: 50
        forcemerge:
          max_num_segments: 1
    cold:
      min_age: 30d
      actions:
        searchable_snapshot:
          snapshot_repository: s3-repository
    delete:
      min_age: 90d
      actions:
        delete: {}
yaml

3. Field filtering: Don't log everything.

# ANTI-PATTERN: Logs credit card numbers
logger.info(f"Payment: {credit_card_number}")

# GOOD: Log hashed or masked values
logger.info(f"Payment: {credit_card_last_4}")
python

Search and Analysis Patterns

Common queries on logs for reliability:

Query 1: Errors by service (last hour)

GET logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"level": "ERROR"}},
        {"range": {"timestamp": {"gte": "now-1h"}}}
      ]
    }
  },
  "aggs": {
    "by_service": {
      "terms": {"field": "service", "size": 50}
    }
  }
}
json

Query 2: Slow requests (p99 latency, from metrics rather than logs)

histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
promql

Query 3: Trace all requests for a user during an incident

GET logs-*/_search
{
  "query": {
    "match": {"user_id": "user-123"}
  },
  "sort": [{"timestamp": {"order": "asc"}}]
}
json

3.3: Distributed Tracing with Jaeger

With 200+ microservices, a single user request touches many services. Distributed tracing shows the entire path through all services.

Example trace:

User Request (1,250ms total)
├─ API Gateway (100ms)
├─ Auth Service (150ms)
├─ Order Service (800ms)
│  ├─ Validate Order (100ms)
│  ├─ Product Service (250ms)  ← slow!
│  │  └─ Database Query (200ms)
│  ├─ Inventory Service (150ms)
│  └─ Payment Service (300ms)
│     └─ External Payment API (280ms)  ← very slow!
└─ Notification Service (200ms)

Distributed tracing shows exactly where latency comes from.

Jaeger Implementation

```python
from jaeger_client import Config
import logging

def init_tracer(service_name):
    config = Config(
        config={
            'sampler': {
                'type': 'const',  # Sample 100% (can reduce in prod)
                'param': 1,
            },
            'logging': True,
            'local_agent': {
                'reporting_host': 'jaeger-agent.monitoring',
                'reporting_port': 6831,
            },
        },
        service_name=service_name,
        validate=True,
    )
    return config.initialize_tracer()

tracer = init_tracer('order-service')

# Instrument a function
def process_order(order_id):
    with tracer.start_active_span('process_order') as scope:
        span = scope.span
        span.set_tag('order_id', order_id)
        
        # Nested spans
        with tracer.start_active_span('validate_order') as nested:
            validate_order(order_id)
        
        with tracer.start_active_span('call_payment_service') as nested:
            payment_result = call_payment_service(order_id)
            nested.span.set_tag('payment_status', payment_result['status'])
        
        return {'success': True}
```

Trace sampling strategy:

```python
import random

class ProbabilisticSampler:
    """Sample a small fraction of normal traces, but keep every error and slow request."""

    def __init__(self, initial_rate=0.001):
        self.rate = initial_rate

    def should_sample(self, trace_id, error_occurred=False):
        if error_occurred:
            return True  # Always sample errors

        if is_slow_request(trace_id):  # is_slow_request(): your own latency lookup
            return True  # Always sample slow requests

        # Sample normal requests at a low rate
        return random.random() < self.rate

    def adjust_rate(self, error_rate):
        # If the overall error rate is elevated, sample more aggressively
        if error_rate > 0.01:
            self.rate = min(0.1, self.rate * 2)
        else:
            self.rate = max(0.001, self.rate / 2)
```

Finding Performance Bottlenecks

Jaeger dashboards show:

  1. Critical path: the longest span in the trace (the bottleneck; see the sketch below)
  2. Span dependency: Which services call which
  3. Latency heatmap: Distribution of request latencies
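As a rough illustration of the first point, the bottleneck span can be picked out of raw span data in a few lines. This is a toy sketch; the span dictionaries and their fields are illustrative, not the Jaeger client API.

```python
# Toy example: find the slowest span in a trace (the likely bottleneck).
spans = [
    {"operation": "api-gateway", "duration_ms": 100},
    {"operation": "auth-service", "duration_ms": 150},
    {"operation": "product-service.db-query", "duration_ms": 200},
    {"operation": "payment-service.external-api", "duration_ms": 280},
]

def slowest_span(spans):
    """Return the span with the largest duration."""
    return max(spans, key=lambda s: s["duration_ms"])

print(slowest_span(spans))  # payment-service.external-api, 280ms
```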

3.4: Alerting Strategy – Right Sizing Alerts

With proper monitoring, the next step is alerting. But bad alerting creates alert fatigue.

Alert Fatigue Prevention

Alert fatigue occurs when:

  • Too many alerts fire → engineers ignore them → critical alerts missed
  • Alerts are flaky (fire on transient issues) → engineers lose trust
  • Alerts require manual investigation (no context) → slow MTTR

Consequences of alert fatigue:

  • 60% of alerts are acknowledged but not acted on
  • 70% of ignored alerts were not actionable
  • Alert fatigue correlates with worse incident response

Tiered Alerting System (P0–P4)

Not all alerts are equal. Severity tiers ensure critical issues get attention while non-critical issues don't create noise.

| Severity | Examples | Action | Escalation |
|---|---|---|---|
| P0 (Critical) | Service completely unavailable, production data loss, SLA breach | Page on-call immediately (wake up) | Escalate to on-call manager after 10 min |
| P1 (Urgent) | Service degraded >50%, SLO breached, high error rate | Page on-call, respond within 15 min | Escalate if not acknowledged in 10 min |
| P2 (High) | Service degraded 10-50%, elevated error rate, elevated latency | Page on-call, respond within 30 min | Create ticket if not acknowledged in 20 min |
| P3 (Medium) | Elevated but within SLO, resource warning, non-critical path affected | Send Slack notification, investigate when available | Create ticket |
| P4 (Low) | Informational, suggestions for improvement, FYI | Log to dashboards, review during office hours | No immediate action required |
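The routing implied by this table is easy to encode. A minimal sketch; page(), notify_slack(), create_ticket() and log_to_dashboard() are stand-ins for real PagerDuty, Slack, and ticketing integrations.

```python
# Placeholder integrations; swap in PagerDuty/Slack/Jira/Grafana clients.
def page(alert, escalate_after_min): print(f"PAGE: {alert['summary']} (escalate after {escalate_after_min} min)")
def notify_slack(alert): print(f"SLACK: {alert['summary']}")
def create_ticket(alert): print(f"TICKET: {alert['summary']}")
def log_to_dashboard(alert): print(f"DASHBOARD: {alert['summary']}")

ESCALATE_AFTER_MIN = {"P0": 10, "P1": 10, "P2": 20}

def route_alert(alert):
    severity = alert["severity"]
    if severity in ("P0", "P1", "P2"):
        # Page the on-call engineer; escalate if not acknowledged in time
        page(alert, escalate_after_min=ESCALATE_AFTER_MIN[severity])
    elif severity == "P3":
        notify_slack(alert)      # investigate when available
        create_ticket(alert)
    else:  # P4
        log_to_dashboard(alert)  # review during office hours

route_alert({"severity": "P1", "summary": "Order service error rate >5%"})
```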

Example alert configuration:

```yaml
groups:
- name: order-service-alerts
  rules:
  
  # P0: Complete outage
  - alert: OrderServiceDown
    expr: up{job="order-service"} == 0
    for: 1m  # Alert after 1 minute (quick response)
    labels:
      severity: P0
    annotations:
      summary: "Order service is completely down"
      description: "No order-service instances are responding"
      runbook: "https://wiki.company.com/runbooks/order-service-down"
  
  # P1: High error rate
  - alert: OrderServiceHighErrorRate
    expr: |
      rate(http_requests_total{job="order-service",status=~"5.."}[5m]) /
      rate(http_requests_total{job="order-service"}[5m]) > 0.05
    for: 3m  # Allow transient spikes
    labels:
      severity: P1
    annotations:
      summary: "Order service error rate >5%"
      description: "Error rate: {{ $value | humanizePercentage }}"
  
  # P2: Elevated latency
  - alert: OrderServiceHighLatency
    expr: |
      histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="order-service"}[5m])) > 1.0
    for: 5m
    labels:
      severity: P2
    annotations:
      summary: "Order service p95 latency >1s"
  
  # P3: Database connection pool exhausted
  - alert: OrderServiceDBPoolExhausted
    expr: |
      pg_stat_activity_count{database="orders",state="active"} /
      pg_settings_max_connections > 0.8
    for: 10m
    labels:
      severity: P3
    annotations:
      summary: "Order service database connection pool 80% exhausted"
  
  # P4: Informational
  - alert: OrderServiceHighMemory
    expr: |
      container_memory_usage_bytes{pod=~"order-service.*"} /
      container_spec_memory_limit_bytes > 0.7
    for: 15m
    labels:
      severity: P4
    annotations:
      summary: "Order service memory usage >70%"
```

On-Call Rotation and Escalation

Proper on-call management:

  1. Rotation: Fair distribution of on-call duties
  2. Escalation: If primary on-call doesn't acknowledge in X minutes, escalate to manager
  3. Runbooks: Clear procedures for common alerts
  4. Blameless culture: Focus on fixing problems, not blaming on-call engineer

Example on-call schedule:

Week 1: Alice (primary), Bob (backup)
Week 2: Bob (primary), Charlie (backup)
Week 3: Charlie (primary), Dave (backup)
Week 4: Dave (primary), Alice (backup)

Escalation:
- Alert fires
- Page primary on-call (Alice)
- Alice has 5 minutes to acknowledge
- After 5 min, page manager
- Manager can escalate further to VP of Engineering
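The escalation chain above is essentially a loop with an acknowledgement window. A sketch, with placeholder page() and acknowledged() functions standing in for a real paging provider:

```python
import time

def page(responder, alert):
    print(f"Paging {responder} about {alert}")   # placeholder for a PagerDuty/Opsgenie call

def acknowledged(alert):
    return False                                 # placeholder: query the alert's current state

ESCALATION_CHAIN = ["primary-oncall", "oncall-manager", "vp-engineering"]
ACK_WINDOW_SECONDS = 5 * 60  # each level gets 5 minutes to acknowledge

def escalate(alert):
    """Page each level in turn until someone acknowledges the alert."""
    for responder in ESCALATION_CHAIN:
        page(responder, alert)
        deadline = time.time() + ACK_WINDOW_SECONDS
        while time.time() < deadline:
            if acknowledged(alert):
                return responder
            time.sleep(10)
    raise RuntimeError("Alert never acknowledged; manual intervention required")
```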

Actionable vs Informational Alerts

Bad alert (not actionable):

```yaml
- alert: HighMemory
  expr: node_memory_MemAvailable_bytes < 1000000000
  annotations:
    summary: "High memory usage detected"
    # Engineer sees this alert: Now what?
```

Good alert (actionable):

```yaml
- alert: HighMemoryPressure
  expr: node_memory_MemAvailable_bytes < 1000000000
  for: 10m
  labels:
    severity: P2
  annotations:
    summary: "Node {{ $labels.node }} has less than 1 GB of memory available"
    description: "This usually indicates a memory leak or misconfiguration"
    runbook: "https://wiki/memory-pressure-runbook"
    dashboard: "https://grafana/d/memory-details/{{ $labels.node }}"
    context: "Check the top 3 containers by memory on this node (see the dashboard)"
```

Actionable runbook:

```markdown
# Memory Pressure Runbook

## What's happening?
Node {{ $labels.node }} has less than 1 GB available memory, indicating:
- Pods are using more memory than expected
- Possible memory leak in a container
- Insufficient node capacity for workload

## Quick investigation
1. SSH to node: `ssh {{ $labels.node }}`
2. Check memory usage: `free -h`
3. Check top memory consumers: `docker stats --no-stream`
4. Check for memory leaks: `docker logs <container> | grep OOM`

## Actions to take
- **Short term**: Kill largest container or evict least critical pod
- **Medium term**: Add more nodes or resize existing nodes
- **Long term**: Right-size pod resource requests based on actual usage

## Escalate if
- Multiple nodes under memory pressure
- Unable to free memory
```

4: Self-Healing Automation – Reducing MTTR

Traditional monitoring detects problems; self-healing fixes them automatically without human intervention.

4.1: Automated Remediation

The goal: Reduce MTTR (Mean Time To Recovery) from 30+ minutes to <5 minutes through automation.

Kubernetes Self-Healing via Restart Policies

Kubernetes natively handles many failure scenarios:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      # Restart policy for automatic recovery (Always is the default for Deployments)
      restartPolicy: Always

      containers:
      - name: order-service
        image: order-service:v1.2.3

        # Readiness probe: remove from load balancer if unhealthy
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          failureThreshold: 2
          periodSeconds: 5

        # Liveness probe: restart if unhealthy
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          failureThreshold: 3
          periodSeconds: 10
```

How this self-heals:

  1. Pod becomes unhealthy (readiness probe fails)
  2. Kubernetes immediately removes pod from service load balancer (users not affected)
  3. If pod still unhealthy after failureThreshold × periodSeconds:
    • The liveness probe fails and the kubelet kills the container
    • restartPolicy: Always restarts it immediately
    • A fresh container starts up
  4. If new container becomes ready → back in load balancer
  5. User impact: <5 seconds outage for that pod
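The probes above assume the service actually exposes /health/ready and /health/live endpoints. A minimal sketch of what they might look like (Flask assumed; check_database_connection() is a placeholder for a cheap dependency check):

```python
from flask import Flask, jsonify

app = Flask(__name__)

def check_database_connection():
    """Placeholder: replace with a cheap real dependency check (e.g. SELECT 1)."""
    pass

@app.route("/health/live")
def live():
    # Liveness: the process is up and able to serve HTTP at all.
    return jsonify(status="ok"), 200

@app.route("/health/ready")
def ready():
    # Readiness: the service can do useful work (its dependencies are reachable).
    try:
        check_database_connection()
    except Exception:
        return jsonify(status="not ready"), 503
    return jsonify(status="ok"), 200
```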

Automatic recovery of common failures:

| Failure | Self-Healing Mechanism | MTTR |
|---|---|---|
| Pod crash | restartPolicy: Always restarts the container | <10 seconds |
| OOM error | Container is OOM-killed and restarted by the kubelet | <30 seconds |
| Deadlock in app | Liveness probe detects the stuck pod, restart | <20 seconds |
| Database connection timeout | Readiness probe fails, traffic removed from the pod | <5 seconds |
| Memory leak (gradual) | Liveness probe based on a memory threshold (custom health endpoint), restart | Depends on threshold |

Custom Operators for Complex Recovery

For scenarios beyond Kubernetes defaults, Kubernetes operators (custom controllers) automate recovery.

Example: a database failover operator, sketched with the kopf framework (helper functions such as is_primary_healthy and promote_replica are placeholders for cluster-specific logic):

```python
# Custom operator sketch: automatically promote a replica if the primary fails.
import logging

import kopf

log = logging.getLogger(__name__)

@kopf.timer('postgresql.io', 'v1', 'postgresqlclusters',
            interval=30.0, labels={'ha-enabled': 'true'})
def ensure_primary_alive(spec, name, namespace, **kwargs):
    """Check whether the primary is alive; promote a replica if it is not."""

    primary_pod = f"{name}-primary"

    if not is_primary_healthy(primary_pod, namespace):
        log.info(f"Primary {primary_pod} is unhealthy, promoting a replica")

        # Find the best replica (e.g. lowest replication lag)
        replicas = get_replicas(name, namespace)
        best_replica = select_best_replica(replicas)

        # Promote it
        promote_replica(best_replica, namespace)

        # Update DNS/load balancer to point at the new primary
        update_route(f"{name}-primary", best_replica)

        # The old primary rejoins as a replica when it recovers
        demote_to_replica(primary_pod, namespace)

        log.info(f"Promoted {best_replica} to primary")


@kopf.timer('postgresql.io', 'v1', 'postgresqlclusters',
            interval=60.0, labels={'ha-enabled': 'true'})
def monitor_replication_lag(spec, name, namespace, **kwargs):
    """Alert if replication lag gets too high."""

    for replica in get_replicas(name, namespace):
        lag = get_replication_lag(replica)

        if lag > 10000:  # 10 seconds, in milliseconds
            log.warning(f"High replication lag on {replica}: {lag}ms")
            trigger_alert('high_replication_lag', {
                'replica': replica,
                'lag_ms': lag,
            })
```

Automated Rollback on Health Check Failure

When a new deployment causes errors, automatically roll back to the previous version:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  annotations:
    # Custom annotations read by the rollback operator below
    auto-rollback: "true"
    health-check-threshold: "0.05"  # Roll back if error rate >5%
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 0  # Zero unavailability during the rollout
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
      - name: order-service
        image: order-service:v1.2.4
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          failureThreshold: 3
```

Operator that monitors deployment health (again a kopf sketch; get_current_error_rate, get_previous_revision, rollback_to_revision and trigger_alert are placeholders for Prometheus and Kubernetes API calls):

```python
import logging

import kopf

log = logging.getLogger(__name__)

@kopf.on.field('apps', 'v1', 'deployments',
               field='status.conditions',
               annotations={'auto-rollback': 'true'})
def rollback_on_failure(old, new, name, namespace, annotations, **kwargs):
    """Roll back a deployment whose rollout is stuck and whose error rate is high."""

    # Only act once the rollout has stopped progressing
    progressing = next((c for c in (new or []) if c.get('type') == 'Progressing'), {})
    if progressing.get('status') != 'False':
        return

    error_rate = get_current_error_rate(name)
    threshold = float(annotations.get('health-check-threshold', '0.05'))

    if error_rate > threshold:
        log.warning(f"Deployment {name} has a high error rate: {error_rate:.2%}")

        prev_revision = get_previous_revision(name, namespace)
        if prev_revision:
            log.info(f"Rolling back to revision {prev_revision}")

            # Equivalent of `kubectl rollout undo deployment/<name> --to-revision=<prev>`
            rollback_to_revision(name, namespace, prev_revision)

            trigger_alert('deployment_rollback', {
                'deployment': name,
                'error_rate': error_rate,
                'prev_revision': prev_revision,
            })
```

Pod Autoscaling with Custom Metrics

Horizontal Pod Autoscaler (HPA) automatically scales pods based on metrics.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  
  minReplicas: 5   # Minimum instances
  maxReplicas: 50  # Maximum instances
  
  metrics:
  
  # Scale based on CPU utilization
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # Scale up at 70% CPU
  
  # Scale based on memory
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80  # Scale up at 80% memory
  
  # Scale based on custom metric (queue depth)
  - type: Pods
    pods:
      metric:
        name: queue_depth
      target:
        type: AverageValue
        averageValue: "50"  # Scale up when avg queue > 50 items
  
  # Behavior (scaling speed)
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
      - type: Percent
        value: 50  # Scale down by max 50% at a time
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
      - type: Percent
        value: 100  # Scale up by max 100% at a time
        periodSeconds: 30
```

Custom metric collection (Prometheus → HPA):

```python
# The application exposes a queue-depth gauge on its /metrics endpoint.
from prometheus_client import Gauge

order_queue_depth = Gauge(
    'order_queue_depth',
    'Number of orders waiting in queue'
)

# The HPA reads this metric through the Prometheus Adapter (custom metrics API)
# and adds pods when the average queue depth per pod exceeds the target above.
```

4.2: Chaos Engineering – Learning from Failures

Chaos engineering is the discipline of injecting failures intentionally to test system resilience and discover weaknesses before they cause real outages.

Controlled Failure Injection Methodology

Rather than hoping failures don't happen, chaos engineers deliberately break things in controlled ways:

  1. Hypothesis: "If the payment service goes down, orders should fail gracefully with user-friendly error message"
  2. Experiment: Kill the payment service
  3. Observation: What actually happens?
  4. Learning: Use observations to improve architecture

Chaos Mesh for Kubernetes

Chaos Mesh is a chaos engineering platform for Kubernetes:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-order-service-pod
  namespace: chaos
spec:
  action: pod-failure  # Kill pod
  mode: fixed
  value: 1  # Kill 1 pod
  
  selector:
    namespaces:
    - production
    labelSelectors:
      app: order-service
  
  duration: 5m  # Run for 5 minutes
  scheduler:
    cron: "0 11 * * 1-5"  # Run on weekdays at 11 AM (business hours)

---
# Network latency chaos
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: add-order-service-latency
  namespace: chaos
spec:
  action: delay  # Add network delay
  mode: fixed-percent
  value: "50"  # Affect 50% of the matching pods
  
  selector:
    namespaces:
    - production
    labelSelectors:
      app: order-service
  
  delay:
    latency: "500ms"  # Add 500ms latency
    jitter: "100ms"
  
  duration: 10m

---
# Packet loss chaos
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: packet-loss-payment-service
  namespace: chaos
spec:
  action: loss  # Drop packets
  mode: all  # Apply to all selected payment-service pods (loss rate is set below)
  
  selector:
    namespaces:
    - production
    labelSelectors:
      app: payment-service
  
  loss:
    loss: "20%"
  
  duration: 5m

---
# Disk failure chaos
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: database-io-error
  namespace: chaos
spec:
  action: latency  # Add I/O latency
  mode: fixed
  value: 1
  
  selector:
    namespaces:
    - databases
    labelSelectors:
      app: postgres
  
  latency: "500ms"
  duration: 3m
```

Game Days and Disaster Recovery Drills

A game day is a scheduled chaos engineering event during which the team practices incident response.

Example game day schedule:

9:00 AM - Team briefing, review objectives
9:15 AM - Chaos Mesh starts killing random pods
9:30 AM - First alert fires, team responds
10:00 AM - Escalated to VP of Engineering (test escalation path)
10:30 AM - Chaos stops, assessment begins
11:00 AM - Retrospective: What went wrong? What worked? What to improve?
12:00 PM - Post-mortem writeup and action items

Chaos scenarios for 99.95% uptime:

| Scenario | Chaos | Expected Behavior | Result |
|---|---|---|---|
| Pod failure | Kill 1 order-service pod | Service remains available (remaining pods absorb traffic) | ✅ PASS |
| Service dependency failure | Kill the entire payment-service | Orders fail with a graceful error | ✅ PASS |
| Database primary failure | Kill the primary PostgreSQL | Automatic failover to a replica (<30s) | ✅ PASS / ⚠️ manual failover was needed |
| Network latency | Add 1s latency to API calls | Requests time out, circuit breaker trips | ❌ FAIL - no fallback |
| Cascading failure | Kill API gateway + Auth service | System degrades gracefully | ❌ FAIL - entire system unavailable; need a backup gateway |

4.3: Blue-Green & Canary Deployments

Deployments are a major source of incidents. Blue-green and canary strategies reduce deployment risk to near-zero.

Zero-Downtime Deployment Strategies

Blue-Green Deployment:

Before:
┌─────────────────────────┐
│   Load Balancer         │
└───────────┬─────────────┘
            │
    ┌───────┴────────┐
    ↓                ↓
┌────────────┐  ┌────────────┐
│  Blue v1.2 │  │  Green v1.2│  (inactive)
│ (active)   │  │            │
└────────────┘  └────────────┘

After deployment (v1.3 ready):
┌─────────────────────────┐
│   Load Balancer         │
└───────────┬─────────────┘
            │
    ┌───────┴────────┐
    ↓                ↓
┌────────────┐  ┌────────────┐
│  Blue v1.2 │  │  Green v1.3│  (new version)
│ (active)   │  │ (validated)│
└────────────┘  └────────────┘

After switch:
    (single atomic switch via load balancer)
┌─────────────────────────┐
│   Load Balancer         │
└───────────┬─────────────┘
            │
    ┌───────┴────────┐
    ↓                ↓
┌────────────┐  ┌────────────┐
│  Blue v1.2 │  │  Green v1.3│  (active)
│ (standby,  │  │            │
│  rollback  │  │            │
│  target)   │  │            │
└────────────┘  └────────────┘

If issue found: One switch back to Blue (v1.2)
No gradual shift → no partial failures

Kubernetes blue-green:

```yaml
# Blue deployment (active)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service-blue
  namespace: production
spec:
  replicas: 10
  selector:
    matchLabels:
      app: order-service
      slot: blue
  template:
    metadata:
      labels:
        app: order-service
        slot: blue
    spec:
      containers:
      - name: order-service
        image: order-service:v1.2.3  # Current version

---
# Green deployment (new version, not receiving traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service-green
  namespace: production
spec:
  replicas: 10
  selector:
    matchLabels:
      app: order-service
      slot: green
  template:
    metadata:
      labels:
        app: order-service
        slot: green
    spec:
      containers:
      - name: order-service
        image: order-service:v1.2.4  # New version

---
# Service routes to blue (active)
apiVersion: v1
kind: Service
metadata:
  name: order-service
  namespace: production
spec:
  selector:
    app: order-service
    slot: blue  # Currently points to blue
  ports:
  - port: 80
    targetPort: 8080

---
# Deployment procedure:
# 1. Deploy green with new version (no traffic yet)
# kubectl apply -f green-deployment.yaml
#
# 2. Test green (internal traffic, synthetic tests)
# kubectl port-forward svc/order-service-green 8080:80
#
# 3. Once validated, switch service to green:
# kubectl patch service order-service -p '{"spec":{"selector":{"slot":"green"}}}'
#
# 4. If issue: switch back to blue
# kubectl patch service order-service -p '{"spec":{"selector":{"slot":"blue"}}}'
#
# 5. Once confident: update blue to new version, switch to blue, retire green
```
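The switch in step 3 is a one-line Service patch, which is also easy to script. A sketch using the official kubernetes Python client (kubeconfig access and the Service above are assumed):

```python
from kubernetes import client, config

def switch_slot(service_name="order-service", namespace="production", slot="green"):
    """Point the Service selector at the given slot (blue or green)."""
    config.load_kube_config()   # use load_incluster_config() when running inside the cluster
    v1 = client.CoreV1Api()
    patch = {"spec": {"selector": {"app": "order-service", "slot": slot}}}
    v1.patch_namespaced_service(service_name, namespace, patch)
    print(f"Service {service_name} now routes to slot={slot}")

# Roll forward: switch_slot(slot="green")
# Roll back:    switch_slot(slot="blue")
```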

Canary Releases with Automated Rollback

Canary deployment gradually shifts traffic instead of atomic switch:

0 min: Blue 100%, Green 0%
5 min: Blue 95%, Green 5%
10 min: Blue 90%, Green 10%
15 min: Blue 75%, Green 25%
20 min: Blue 50%, Green 50%
25 min: Blue 25%, Green 75%
30 min: Blue 0%, Green 100%

During this time, monitor Green's error rate, latency, etc.
If metrics degrade, automatically rollback

Flagger + Istio canary:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: order-service
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  
  # Service mesh control
  service:
    port: 80
    targetPort: 8080
  
  analysis:
    interval: 1m  # Check metrics every minute
    threshold: 5  # Max 5 consecutive failed checks before rollback
    maxWeight: 50  # Max 50% traffic to canary
    stepWeight: 10  # Increase traffic by 10% every iteration
    
    # Success criteria
    metrics:
    - name: request-success-rate  # Flagger built-in: percentage of successful requests
      thresholdRange:
        min: 99  # Must stay >99% success
      interval: 1m
    
    - name: request-duration  # Flagger built-in: p99 latency in milliseconds
      thresholdRange:
        max: 500  # p99 latency must stay <500ms
      interval: 1m
    
    # Custom metrics
    - name: error_rate
      query: |
        sum(rate(http_requests_total{status=~"5.."}[5m])) /
        sum(rate(http_requests_total[5m]))
      thresholdRange:
        max: 0.05  # Error rate <5%
  
  # Run the full metric analysis during the rollout; rollback is automatic when checks fail
  skipAnalysis: false
  
  # Alert notification
  alerts:
  - name: pagerduty
    severity: error
    providerRef:
      name: pagerduty      # references a Flagger AlertProvider (configured separately)
      namespace: flagger
```

How Flagger works:

  1. New version (canary) is deployed alongside current version (stable)
  2. Service mesh (Istio) gradually shifts traffic to canary
  3. Flagger queries metrics (success rate, latency) from Prometheus
  4. If metrics stay healthy: continue shifting traffic
  5. If metrics degrade: immediately rollback to previous version
  6. If full shift succeeds: promote canary to stable, retire old version

A/B Testing Infrastructure

While canary tests for regressions, A/B testing validates new features:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service
  namespace: production
spec:
  hosts:
  - order-service
  http:
  # Users in the treatment cohort are routed to v1.3; everyone else stays on stable v1.2.
  # The v1-2 / v1-3 subsets are defined in an accompanying DestinationRule (not shown).
  - match:
    - headers:
        x-user-cohort:
          exact: "treatment"  # cohort header set by an upstream service (assumed)
    route:
    - destination:
        host: order-service
        port:
          number: 80
        subset: v1-3
      weight: 100  # All treatment users get v1.3
  
  # Control group gets stable version
  - route:
    - destination:
        host: order-service
        port:
          number: 80
        subset: v1-2  # Stable version
      weight: 100

---
# Track A/B test metrics
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service-v1-3
spec:
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "9090"
    spec:
      containers:
      - name: order-service
        env:
        # Tag all metrics with experiment ID
        - name: EXPERIMENT_ID
          value: "exp_new_checkout_flow_2025-12"
        - name: EXPERIMENT_GROUP
          value: "treatment"
```

Analyzing A/B test results:

```promql
# Conversion rate, treatment group (new checkout flow)
sum(rate(order_completed{experiment_group="treatment"}[1h]))
  /
sum(rate(order_initiated{experiment_group="treatment"}[1h]))

# Conversion rate, control group (stable flow)
sum(rate(order_completed{experiment_group="control"}[1h]))
  /
sum(rate(order_initiated{experiment_group="control"}[1h]))

# If the treatment conversion rate is meaningfully higher, roll the new flow
# out to all users; otherwise, investigate why it underperforms.
```
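Comparing the two raw rates is rarely enough on its own; a quick significance check helps decide whether the difference is real. A sketch of a two-proportion z-test (the counts would come from the queries above; the example numbers are made up):

```python
import math

def two_proportion_z(conversions_a, total_a, conversions_b, total_b):
    """Z-score for the difference between two conversion rates."""
    p_a, p_b = conversions_a / total_a, conversions_b / total_b
    p_pool = (conversions_a + conversions_b) / (total_a + total_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
    return (p_a - p_b) / se

# Example: treatment converts 520/4800 sessions, control converts 470/4750
z = two_proportion_z(520, 4800, 470, 4750)
print(f"z = {z:.2f}")  # |z| > 1.96 roughly corresponds to 95% confidence
```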

5: Incident Response & Postmortems

Despite best efforts, incidents still occur. How you respond determines whether an hour of downtime becomes a catastrophe or a learning opportunity.

Incident Command Structure

Effective incident response requires clear roles:

| Role | Responsibilities |
|---|---|
| Incident Commander (IC) | Coordinates the response, makes decisions, communicates status |
| Technical Lead | Diagnoses and fixes the problem |
| Communications Lead | Updates stakeholders (customers, management, team) |
| Scribe | Documents the timeline, decisions, and actions taken |

Incident levels:

SEV-1 (Critical)
├─ Complete service outage or severe degradation
├─ Customer-facing impact
├─ Requires immediate response
└─ Page all on-call engineers

SEV-2 (High)
├─ Service degraded but partially functional
├─ Impact limited to subset of customers/features
├─ Requires quick response
└─ Page on-call engineer

SEV-3 (Medium)
├─ Service has issues but workaround exists
├─ Minimal impact to customers
└─ Can wait for next business day if after-hours

SEV-4 (Low)
├─ Informational or very limited impact
└─ Log and address during normal work

Runbook Development

Runbooks are step-by-step procedures for responding to common incidents.

Example: Database Replication Lag Runbook

````markdown
# Database Replication Lag Incident

## Alert
- Alert: `DatabaseReplicationLagHigh`
- Threshold: >30 seconds
- Severity: P2

## Quick diagnosis
1. Check current replication lag (run on the replica):
   ```sql
   SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
   ```
2. Check replication slot state (run on the primary):
   ```sql
   SELECT slot_name, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots;
   ```
3. Check network traffic between primary and replica:
   ```bash
   iftop -i eth0 -n  # see bandwidth to the replica
   ```

## Common causes and fixes

### Cause 1: Network latency
- High replication lag but the replica is responsive
- Fix: check network connectivity
  ```bash
  ping <replica-ip>
  mtr <replica-ip>  # traceroute with packet loss
  ```

### Cause 2: Replica falling behind due to heavy writes
- High write volume on the primary
- Replica CPU or disk I/O maxed out
- Fix:
  1. Reduce write load on the primary (pause non-critical writes)
  2. Scale up the replica (bigger instance type)

### Cause 3: Replica crashed or hung
- Lag keeps increasing
- Replica unresponsive
- Fix:
  1. Kill the hung process: `kill -9 <pid>`
  2. Restart PostgreSQL: `systemctl restart postgresql`
  3. Monitor recovery: `tail -f /var/log/postgresql/postgresql.log`

## When to escalate
- Cannot reduce replication lag below 10 seconds within 15 minutes
- Replica completely unreachable (network issue)
- Data corruption detected

## Escalation path
- Escalate to the Database Team Lead if not resolved in 15 min
- Contact AWS Support if an infrastructure issue is suspected
````

Real Incident Case Study (Anonymized)

Incident: Order Processing Delayed - Case Study

Timeline:

14:32 - Monitoring alert: "Order processing latency >5s" (P2). Engineering on-call (Alice) is paged.

14:35 - Alice acknowledges the alert and starts investigating. The dashboard shows an average order processing time of 8.5s (vs. a normal 200ms). She opens an incident in Slack: #incident-channel.

14:37 - Alert severity escalates to P1: "Order success rate dropping." Error rate is 15% (threshold: 5%). Communications lead (Bob) is paged to notify customers.

14:39 - Alice identifies the bottleneck: payment-service is showing 10s+ latency. She pulls up the request trace in Jaeger to see what's happening.

14:42 - Alice calls the Technical Lead (Charlie) to dig into payment-service. Charlie: "We haven't changed anything. Let me check." The payment-service logs show "Database connection pool exhausted."

14:45 - Charlie: "Order-service might be leaking database connections." Order-service is holding 485 of 500 max connections, and connections are not being returned to the pool.

14:47 - Root cause identified: a database connection leak in order-service. The pool slowly fills up and starves new requests.

14:48 - Quick mitigation: restart the order-service pods to clear the connection pool. Immediate improvement: latency drops to 500ms, error rate drops to 0.5%.

14:51 - Second-order issue: payment-service is still slow. Charlie: "Payment-service also has a connection leak." Its pods are restarted as well.

14:54 - System fully recovered; all metrics normal. Alice: "Incident resolved. Postmortem in 2 hours."

15:00 - Bob notifies customers: "Issue resolved, thank you for your patience."

18:00 - Postmortem meeting (Alice, Charlie, Bob, 2 engineers from both teams)


Timeline diagram:

14:32 ├─ Latency alert fires, Alice paged
14:35 ├─ Investigation starts
14:37 ├─ Error rate climbs, P1 declared; customers notified
14:47 ├─ Root cause identified: connection leak
14:48 ├─ Mitigation: order-service pods restarted
14:51 ├─ payment-service pods restarted
14:54 ├─ Full recovery
      └─ Total incident duration: 22 minutes


Blameless Postmortem Process

Postmortem goal: learn from the incident and improve the system so it doesn't happen again.

The goal is NOT to assign blame - blame erodes psychological safety and prevents honest learning.

Postmortem template:

````markdown
# Postmortem: Order Processing Outage - 2025-12-10 14:32-14:54

## Summary
- Duration: 22 minutes
- Impact: Order processing delayed, 15% error rate
- Affected customers: ~2,000 active users
- Estimated revenue impact: ~$15,000

## Timeline
- 14:32: Latency alert fires
- 14:37: Error rate exceeds threshold, P1 alert triggered
- 14:47: Root cause identified (connection leak)
- 14:48: order-service pods restarted, recovery begins
- 14:51: payment-service pods restarted
- 14:54: Full recovery, all metrics normal

## Root Cause Analysis

### Primary cause
Connection pool leak in order-service and payment-service:

```python
# BEFORE (buggy code)
def process_order(order_id):
    connection = db_pool.get_connection()
    try:
        return process_payment(order_id, connection)
    except Exception:
        pass  # BUG: the connection is never returned to the pool

# AFTER (fixed)
def process_order(order_id):
    connection = db_pool.get_connection()
    try:
        return process_payment(order_id, connection)
    finally:
        db_pool.return_connection(connection)  # Always return the connection
```

### Why it wasn't caught earlier
1. The leak is gradual (connections accumulate over hours)
2. No automated pool-exhaustion alert existed (added today)
3. Code review didn't catch the exception path
4. No load testing that would have triggered the issue

## Impact
- Customers unable to place orders for 22 minutes
- ~500 orders failed
- Customer support team fielded complaints
- Reputation impact: some customers expressed frustration on Twitter

## What went well
1. Alert response: on-call engineer paged within 2 minutes
2. Investigation: root cause identified within 15 minutes of the first alert
3. Mitigation: restarting pods quickly restored service
4. Communication: customer updates were timely and honest
5. Postmortem: the team focused on learning, not blame

## What could have been better
1. No connection pool exhaustion alert (NEW: P2 alert added)
2. The connection pool leak was not caught in code review
3. No integration tests for exception paths (NEW: test added)
4. No load testing in staging before the deploy

## Action items

| Priority | Action | Owner | Due |
|---|---|---|---|
| P0 | Add connection pool exhaustion alert | SRE-Alice | 2025-12-11 |
| P1 | Add integration test for connection cleanup on exception | Eng-Charlie | 2025-12-13 |
| P1 | Add load testing to the staging pipeline | Infra-Dave | 2025-12-17 |
| P2 | Review all database connection pools for similar leaks | Eng-Team | 2025-12-20 |
| P3 | Add connection pool metrics to the dashboard | SRE-Alice | 2025-12-24 |

## Lessons learned
1. Connection leaks are insidious: gradual failure, no obvious cause
2. Missing metrics (pool exhaustion) delayed diagnosis
3. Multiple services with the same bug means code review needs to be systematic
4. Load testing in staging would have caught this

## Process improvements
- Implement automatic load testing before deploys
- Add resource exhaustion alerts for all pools (DB, HTTP, thread, etc.)
- Require an exception-handling checklist in code review
- Monthly "connection leak" audit across all services
````
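The "integration test for connection cleanup" action item could look roughly like this sketch, assuming a pytest fixture for the pool that exposes an in_use count and a hypothetical orders.service module:

```python
import pytest

from orders.service import process_order  # hypothetical module path


def test_connection_returned_when_payment_fails(db_pool, monkeypatch):
    """The pool must return to its idle size even when process_payment raises."""

    def boom(order_id, connection):
        raise RuntimeError("simulated payment failure")

    # Force the payment step to fail (hypothetical attribute path)
    monkeypatch.setattr("orders.service.process_payment", boom)

    checked_out_before = db_pool.in_use  # assumed pool accounting attribute

    with pytest.raises(RuntimeError):
        process_order(order_id=12345)

    # The connection must have been returned despite the exception
    assert db_pool.in_use == checked_out_before
```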

6: Results & Reliability Metrics

After 12 months of implementing the practices outlined above, the organization achieved:

Uptime Improvement Timeline: 98.2% → 99.95%

Month 1 (Baseline): 98.2% uptime (19.4 hours downtime)
├─ Incidents: 3 major
├─ MTTR: 42 minutes
└─ MTBF: 6 days

Month 2-3 (Architecture): 98.7% uptime
├─ Multi-AZ deployment completed
├─ Database failover automated
├─ Incidents: 2 major
└─ MTTR: 35 minutes

Month 4-6 (Monitoring): 99.1% uptime
├─ Prometheus + Grafana deployed
├─ Alerting rules refined
├─ Incidents: 1 major, 2 minor
└─ MTTR: 18 minutes

Month 7-9 (Self-Healing): 99.5% uptime
├─ Automated pod restart policies
├─ Custom operators for complex recovery
├─ Incidents: 1 minor, several prevented
└─ MTTR: 8 minutes

Month 10-12 (Chaos & Culture): 99.95% uptime
├─ Chaos engineering practices
├─ Blameless postmortems
├─ Preventive measures
├─ Incidents: 0 major (prevented)
└─ MTTR: 3 minutes (when incidents occur)

MTTR and MTBF Optimization

| Metric | Baseline | After 12 Months | Improvement |
|---|---|---|---|
| MTTR (Mean Time To Recovery) | 45 minutes | 3 minutes | 93% reduction |
| MTBF (Mean Time Between Failures) | 6 days | 90 days | 1,400% improvement |
| Incident frequency (per month) | 4-5 | 0-1 | 80% reduction |
| Average incident severity | P1/P2 | P3/P4 | Most incidents prevented |
| Customers affected per incident | 2,000-5,000 | <100 | 95% reduction |
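MTTR and MTBF tie back to availability through the standard steady-state relation availability = MTBF / (MTBF + MTTR). A quick check of the before/after numbers (idealized; it will not match measured uptime exactly, since real incidents vary in length and count):

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures and mean time to recovery."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

before = availability(mtbf_hours=6 * 24, mttr_hours=45 / 60)    # 6 days, 45 minutes
after = availability(mtbf_hours=90 * 24, mttr_hours=3 / 60)     # 90 days, 3 minutes

print(f"before: {before:.4%}")   # ~99.48%
print(f"after:  {after:.6%}")    # ~99.9977%
```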

Monthly Uptime Comparison

| Month | Uptime | Downtime | Incidents | Status |
|---|---|---|---|---|
| 2025-01 | 98.2% | 19.4h | 4 | Baseline |
| 2025-02 | 98.5% | 14.2h | 3 | Arch changes |
| 2025-03 | 98.9% | 10.6h | 2 | Monitoring |
| 2025-04 | 99.0% | 8.7h | 2 | Alerting tuning |
| 2025-05 | 99.2% | 5.8h | 1 | Self-healing |
| 2025-06 | 99.3% | 5.0h | 1 | Operators |
| 2025-07 | 99.5% | 3.6h | 1 | Chaos |
| 2025-08 | 99.6% | 2.9h | 0 | Culture |
| 2025-09 | 99.7% | 2.2h | 0 | Prevented |
| 2025-10 | 99.85% | 1.1h | 0 | Sustained |
| 2025-11 | 99.9% | 0.7h | 0 | Sustained |
| 2025-12 | 99.95% | 0.4h | 0 | Target achieved |

Cost of Reliability Investment vs Downtime Prevented

Investment breakdown:

| Category | Cost |
|---|---|
| Engineering (5 FTE × $150K × 1 year) | $750,000 |
| Infrastructure (HA, multi-region, redundancy) | $200,000 |
| Tools (Prometheus, Grafana, Jaeger, Chaos Mesh) | $50,000 |
| Training and hiring | $30,000 |
| Total investment | $1,030,000 |

Downtime prevented:

| Metric | Value |
|---|---|
| Downtime reduced (126h → 26.3h) | 99.7h |
| Cost per hour of downtime | $5,600 |
| Downtime cost prevented | $558,320 |
| Lost customer subscription revenue avoided | $800,000+ |
| Brand reputation damage avoided | $500,000+ |
| SLA penalties avoided | $200,000+ |
| Total value created | $2,058,320+ |

ROI: (Total value - Investment) / Investment = ($2,058,320 - $1,030,000) / $1,030,000 ≈ 100% in the first year

Plus: Ongoing value of maintaining 99.95% uptime = $5+ million/year in prevented downtime.
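Spelled out, the arithmetic behind those two tables:

```python
# Numbers taken from the investment and downtime-prevented tables above.
investment = 750_000 + 200_000 + 50_000 + 30_000            # engineering + infra + tools + training
downtime_cost_prevented = 99.7 * 5_600                      # hours saved x cost per hour
total_value = downtime_cost_prevented + 800_000 + 500_000 + 200_000

roi = (total_value - investment) / investment
print(f"investment     = ${investment:,.0f}")               # $1,030,000
print(f"total value    = ${total_value:,.0f}")              # $2,058,320
print(f"first-year ROI = {roi:.0%}")                        # ~100%
```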


Conclusion: Building a Reliability Culture

Achieving 99.95% uptime is not primarily a technical problem. While this article covered technology (Kubernetes, Prometheus, Istio, etc.), the real foundation is cultural.

Key Patterns with Biggest Impact

1. Shared ownership of reliability across all teams, not just "ops"

  • Product teams own their service's reliability SLO
  • Engineering teams write reliability tests
  • Decision-making includes reliability trade-offs

2. Error budgets make reliability economically rational

  • Teams no longer view reliability as a constraint on velocity
  • Instead: "We have 4.4 hours of downtime budget per year. How do we want to spend it?" (see the calculator sketch below)
  • Natural incentive: spend the budget deliberately on shipping faster, not accidentally on outages
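Error budgets are easy to make concrete; a small calculator sketch:

```python
HOURS_PER_YEAR = 24 * 365

def error_budget_hours(slo_target):
    """Allowed downtime per year for a given availability target."""
    return (1 - slo_target) * HOURS_PER_YEAR

for target in (0.99, 0.999, 0.9995, 0.9999):
    print(f"{target:.2%} -> {error_budget_hours(target):.1f} hours/year")

# 99.00% -> 87.6 hours/year
# 99.90% -> 8.8 hours/year
# 99.95% -> 4.4 hours/year
# 99.99% -> 0.9 hours/year
```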

3. Psychological safety enables honest incident postmortems

  • Blame culture → engineers hide failures → repeat mistakes
  • Blameless culture → engineers learn from failures → prevent recurrence
  • Google found 5x higher MTTR when postmortems are blame-focused

4. Automation over heroics

  • Don't hire better on-call engineers; reduce incidents and MTTR via automation
  • Heroic 3am incident recovery is exhausting and unsustainable
  • Self-healing systems allow engineers to focus on preventing problems

5. Observability enables prevention

  • Traditional monitoring detects fires; observability predicts them
  • Traces show bottlenecks before users complain
  • Logs reveal root causes faster
  • Metrics enable proactive scaling

When to Accept Lower Reliability

99.95% uptime is not always optimal. Consider lower targets when:

Internal/non-critical systems: Development infrastructure, analytics dashboards, internal tools

  • Accept 99% uptime (87 hours/year downtime)
  • Focus resources on customer-facing systems
  • Still need some reliability (prevent major outages) but not 99.95%

Prototype/MVP stages: New product being validated with early customers

  • Accept 99% uptime initially
  • Graduate to 99.95% once product-market fit confirmed
  • Avoid over-engineering before product is proven

Cost-benefit doesn't justify: For very low-traffic services

  • 99.95% uptime costs $1M/year
  • If service generates only $500K/year, 99.0% is more rational
  • Optimize resource allocation to high-value services

User base can't absorb: Niche products with <1,000 users

  • 99.95% uptime = 4.4 hours downtime/year = 0.37 hours/month
  • For small user base, manual fixes adequate
  • Automate only when scale justifies investment

Future Reliability Investments

Beyond 99.95%, moving toward 99.99% requires:

1. Global distribution (eliminate single-region failures)

  • Multi-region active-active setup
  • Automatic failover between regions
  • Cost: ~3-5x infrastructure

2. Deeper observability (prevent invisible failures)

  • Continuous synthetic testing (every second from multiple locations)
  • Automated chaos engineering (daily controlled failures)
  • Distributed tracing for every request (not sampled)

3. Smarter automation (faster remediation)

  • ML-based anomaly detection (find problems humans miss)
  • Predictive scaling (scale before metrics degrade)
  • Automated root cause analysis

4. Business continuity (survive even extreme failures)

  • Cross-region database consistency
  • Backup payment processors
  • Manual override procedures for when automation fails

Appendix: Technical Artifacts

Complete Prometheus Configuration

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'us-east-1'

scrape_configs:

  # Kubernetes API server
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
    - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
    - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
      action: keep
      regex: default;kubernetes;https

  # All pods (service discovery)
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
    - role: pod
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: 'true'
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__
    - action: labelmap
      regex: __meta_kubernetes_pod_label_(.+)
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: kubernetes_pod_name

# Alert rules
rule_files:
  - '/etc/prometheus/alert-rules.yaml'
```

Complete Alert Rules

```yaml
groups:
- name: kubernetes-alerts
  rules:
  
  # Pod not ready
  - alert: PodNotReady
    expr: min(kube_pod_status_ready{condition="false"}) by (pod, namespace) == 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.pod }} not ready"
  
  # High error rate
  - alert: HighErrorRate
    expr: |
      (
        sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
        /
        sum(rate(http_requests_total[5m])) by (job)
      ) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.job }} error rate > 5%"
      value: "{{ $value | humanizePercentage }}"
  
  # High latency
  - alert: HighLatency
    expr: |
      histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)) > 1.0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.job }} p95 latency > 1s"
      value: "{{ $value }}s"
  
  # Database connection pool exhausted
  - alert: DBConnectionPoolExhausted
    expr: |
      pg_stat_activity_count{state="active"} / pg_settings_max_connections > 0.8
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Database connection pool {{ $labels.datname }} > 80% full"
```

Final Thoughts

Achieving 99.95% uptime with 200+ microservices is hard work. It requires:

  • Technical excellence: Multi-AZ redundancy, automated failover, observability
  • Operational discipline: Clear incident procedures, blameless postmortems, continuous improvement
  • Cultural foundation: Shared responsibility, psychological safety, error budgets
  • Sustained investment: Not a one-time project but ongoing discipline

The organizations that achieve this level of reliability gain:

  • Competitive advantage: Reliability becomes a selling point
  • Team morale: Engineers proud of systems, not exhausted by incidents
  • Financial stability: Downtime prevented = revenue protected + brand protected

For any team operating critical infrastructure, the principles in this article apply regardless of scale. Start small (99.9% uptime), master the fundamentals, then evolve toward higher targets.
