Cloud Cost Optimization at Scale: A $2.8M Reverse-Engineering Case Study
A detailed case study on how a high-growth SaaS company reverse-engineered their $5.4M annual cloud spend, identified inefficiencies across compute, storage, and networking, and achieved a 52% cost reduction ($2.8M in annual savings) through systematic optimization, intelligent right-sizing, and architectural redesign. Includes step-by-step technical implementation, code snippets, and a replicable FinOps framework.
Introduction: The Hidden Cloud Cost Crisis in Enterprise Organizations
The cloud has fundamentally transformed how organizations build, scale, and operate technology infrastructure. Gone are the days of multi-year capital expenditure planning, massive upfront datacenter investments, and rigid hardware provisioning cycles. The elasticity, on-demand pricing, and global reach of cloud providers like Amazon Web Services, Microsoft Azure, and Google Cloud Platform have democratized infrastructure and enabled rapid innovation at scale.
However, this same flexibility and power, the ability to provision resources instantly and scale seamlessly, has created an invisible crisis lurking in the financial statements of thousands of enterprises: massive, systematic overspending on cloud infrastructure, often ranging from 30% to 50% above optimal levels.
The Paradox of Cloud Economics
This phenomenon is deeply counterintuitive. Organizations invest millions of dollars in cloud infrastructure, expecting automatic cost savings compared to on-premises datacenters. Yet, in practice, what often happens is:
- Cloud adoption begins with enthusiasm and urgency. Teams provision resources quickly to meet business timelines, often without deep cost consideration.
- Early success and growth reinforce loose cost discipline. "We're saving money versus on-prem anyway," leadership reasons.
- As the organization scales, resource proliferation becomes invisible. Hundreds or thousands of cloud resources accumulate across dozens of teams, regions, and product lines.
- Billing data arrives monthly in voluminous, incomprehensible reports. Finance teams see total spend but lack the technical insight to optimize. Engineering teams lack visibility into their own cost footprint.
- By the time the cost crisis becomes undeniable, often when cloud spend rivals the cost of the entire product development organization, the optimization surface is vast and poorly understood.
The result: a classic tragedy of the commons in cloud economics. No single team feels ownership for overall cloud efficiency. Individual teams optimize for speed and feature velocity. Organizations end up paying for idle infrastructure, over-provisioned resources, redundant services, and inefficient data flows.
A Real-World Crisis: The $5.4M Cloud Bill
Consider a composite but realistic scenario based on engagements with multiple high-growth technology companies:
A Series B or Series C SaaS company with 150–300 engineers across multiple product lines finds itself on an unexpected trajectory. Cloud spending, which seemed reasonable a year ago at $1.2M annually, has ballooned to $5.4M per year.
That works out to roughly $450K per month: about $1,500–$3,000 per engineer per month, and roughly $360 per customer per year (assuming 15,000 active customers). For comparison, many SaaS companies operate at cloud costs closer to $50–150 per customer per year.
The situation feels precarious:
- CFO and board begin asking uncomfortable questions: "Why are we spending more on infrastructure than on sales?"
- Engineering leadership realizes they have almost no visibility into where the money goes.
- Individual teams suspect their own infrastructure is efficient but cannot prove it.
- Previous cost optimization attempts (usually a few quick fixes) yielded only marginal savings.
- The default assumption in leadership meetings: "Cloud is expensive; this is just the cost of doing business."
The Challenge: Optimization Without Sacrifice
Reducing cloud costs is trivial if you're willing to accept severe consequences:
- Delete everything: Sure, cost goes to zero, but so does the product.
- Massively reduce capacity: Eliminate non-production environments, reduce redundancy, shut down global regionsβand watch reliability and developer productivity plummet.
- Migrate to on-premises: Reintroduce massive capital costs, inflexibility, and operational burden.
The real challenge, the one that separates amateur cost-cutting from genuine optimization, is reducing costs while:
- Maintaining or improving performance for end customers.
- Preserving reliability and redundancy.
- Enabling developer productivity (fast feedback loops, comprehensive environments, quick deployments).
- Supporting business growth (global expansion, new product lines, new customer segments).
- Maintaining security and compliance (encryption, audit trails, isolated environments).
In other words: optimize the infrastructure, not the business.
The Outcome: $2.8M in Annual Savings (52% Reduction)
Through systematic analysis, technical implementation, architectural redesign, and cultural change, the organization achieved:
- $2.8M in annual savings, reducing cloud spend from $5.4M to $2.6M.
- 52% cost reduction across the infrastructure.
- Improved performance for end customers (faster API responses, better data processing pipelines).
- Enhanced reliability through smarter resource allocation and better autoscaling.
- Preserved developer velocity by keeping non-production environments and development tooling intact.
- Established sustainable cost optimization practices, with machinery in place for ongoing improvement.
This was not achieved through heroic one-time effort, but through a structured, data-driven, technically rigorous approach that combined:
- Deep forensic analysis of the cost structure
- Right-sizing and elimination of waste (quick wins)
- Architectural redesign for cost efficiency
- Automation of cost monitoring and optimization
- Cultural and organizational changes around cost awareness
Why Cost Optimization Is a Technical Problem, Not Just Financial
A critical reframe: cloud cost optimization is fundamentally a technical problem disguised as a financial problem.
Many organizations approach cost optimization as:
- A finance initiative, led by CFOs and business analysts.
- A procurement exercise, focused on negotiating better rates with cloud vendors.
- An annual budget cycle ritual, where teams are asked to cut costs and do their best.
This perspective misses the core insight: the structure of your infrastructureβthe architecture, design patterns, operational practices, and automationβis the primary driver of cost, often far more than negotiated rates or bulk discounts.
To illustrate:
- Over-provisioned EC2 instances run at 10% CPU utilization but are billed for 100% of their capacity. Right-sizing can save 40–60% per instance without performance impact.
- Idle databases queried only during business hours consume reserved capacity 24/7. Scheduled stops can save 50% on non-production RDS instances.
- Inefficient data access patterns require expensive data transfers across regions, redundant caching, or excessive database queries. Better architectural design can reduce data transfer costs by 70%+.
- Sprawl of stale resources (unused S3 buckets, detached EBS volumes, forgotten Lambda functions) accumulates waste imperceptibly until it is suddenly costing hundreds of thousands of dollars a year.
- Lack of cost visibility means teams build without understanding the cost implications. Adding a cost feedback loop often yields 10–15% savings through behavior change alone.
None of these are finance problems. All of them are technical and architectural problems.
Therefore, successful cost optimization requires:
- Technical leadership (CTOs, architects, platform engineers) driving the initiative.
- Engineering rigor applied to cost the same way it is applied to performance, reliability, and security.
- Visibility and instrumentation into cloud costs at a level of granularity previously unusual (per-team, per-service, per-request).
- Architectural thinking about trade-offs between cost, performance, reliability, and developer experience.
With this frame in mind, the rest of this article details the technical, architectural, and organizational practices that enabled the $2.8M savings.
1: The Cost Discovery Phase – Understanding the Baseline
Before you can optimize, you must understand. The first phase of the engagement was a comprehensive forensic analysis of the organization's cloud spend, designed to answer questions like:
- Where exactly is the $5.4M going?
- Which teams are responsible for the largest cost centers?
- Which resources are actually used vs. idle or zombie resources?
- Which services offer the best opportunities for optimization?
- What is the cost structure of each major application or product line?
This phase typically took 4–6 weeks and required close collaboration between finance, engineering, and platform/DevOps teams.
Initial Cost Audit Methodology
The cost audit began with a structured, multi-layered approach to understand cloud spending:
Layer 1: High-level categorization of spend across the main cost centers:
- Compute (EC2, EKS, Lambda): VMs, containerized workloads, serverless functions
- Storage (S3, EBS, RDS databases): Object storage, block storage, database volumes
- Data Transfer (inter-region, Internet egress): Network costs, often underestimated
- Managed Services (RDS, DynamoDB, ElastiCache, etc.): Specialized services
- Third-party/SaaS (monitoring, security, development tools): Often hidden in cloud bills
- Other (CloudFront, Route53, Elastic IPs, etc.): Miscellaneous charges
Layer 2: Deep dive by service into the top 5–10 cost drivers, understanding:
- Historical spend trends (month-over-month, year-over-year)
- Seasonal patterns (higher compute during peak customer months, lower on weekends)
- Growth trajectory (is this cost center growing faster than the business?)
Layer 3: Attribution and ownership by business unit, team, or application:
- Which teams own which resources?
- Can costs be traced to revenue-generating products vs. overhead?
- Are there obvious cost anomalies or unexplained spikes?
Layer 4: Benchmarking against industry norms:
- Cloud cost per customer or per revenue dollar
- Compute cost as a percentage of total cloud spend
- Data transfer as a percentage
This multi-layered approach ensures both breadth (understanding the full scope) and depth (understanding root causes).
AWS Cost Explorer Analysis Approach
AWS Cost Explorer is the starting point for most AWS-centric organizations. It provides:
- Cost and usage data across all AWS services
- Ability to break down costs by dimension (service, region, instance type, tag, etc.)
- Some built-in forecasting
- Access to underlying data for programmatic analysis (illustrated in the sketch below)
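Much of the audit can be scripted against that same data rather than clicked through in the console. A minimal sketch (assuming Cost Explorer is enabled and the caller has the `ce:GetCostAndUsage` permission; the date range is illustrative) that pulls one month of spend grouped by service:

```python
import boto3

# Cost Explorer is a global API; boto3 serves it from us-east-1
ce = boto3.client("ce", region_name="us-east-1")

def monthly_cost_by_service(start="2025-10-01", end="2025-11-01", top_n=10):
    """Return the top-N AWS services by unblended cost for the given period."""
    response = ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    groups = response["ResultsByTime"][0]["Groups"]
    costs = [(g["Keys"][0], float(g["Metrics"]["UnblendedCost"]["Amount"])) for g in groups]
    return sorted(costs, key=lambda item: item[1], reverse=True)[:top_n]

for service, cost in monthly_cost_by_service():
    print(f"{service:<45} ${cost:>12,.2f}")
```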
Key reports generated during the audit:
Report 1: Service-level cost breakdown
- Table of all AWS services by monthly spend
- Ranked by cost, showing month-over-month trends
Typical findings:
- EC2 (compute) often represents 30–45% of spend
- RDS (managed databases) often 15–25%
- S3 and storage 10–20%
- Data transfer 5–15% (often surprisingly large)
- Managed services (Redis, Elasticsearch, etc.) 5–10%
Report 2: Instance type analysis
- Breakdown of EC2 costs by instance family and size
- Identifying outliers (e.g., why are we running `x1.32xlarge` instances when `t3.large` would suffice?)
Report 3: Reserved Instance (RI) analysis
- Current RI purchases and utilization
- On-Demand compute that could be covered by RIs
- RI purchase recommendations based on historical usage
Report 4: Regional cost distribution
- Cost by AWS region
- Identifying whether cost distribution aligns with traffic distribution
- Opportunities for consolidation or re-homing
Identifying Cost Centers: Compute, Storage, Data Transfer, Third-Party Services
Through Cost Explorer and deeper analysis, the typical high-growth SaaS company structure emerges:
Compute Costs ($2.2M/year in this example)
The largest cost center, typically broken down as:
- Production EKS cluster(s): 40–50% of compute
  - Multiple node groups (on-demand for critical workloads, spot for batch)
  - Running dozens to hundreds of microservices
- Development/staging EKS clusters: 15–20% of compute
  - Often over-provisioned (built to handle peak load but used at average)
  - Running 24/7 even during low-usage periods
- EC2 instances (non-containerized): 10–15%
  - Legacy systems, data pipelines, specialized workloads
- Lambda functions: 5–10%
  - Development tools, scheduled tasks, event-driven workloads
Common inefficiencies identified:
- Oversized instances: `t3.2xlarge` at 10% CPU utilization → could use `t3.medium`
- Idle development environments: running 24/7 but used during business hours only
- Underutilized Reserved Instances: the organization bought RIs but still runs excess on-demand
- Expensive instance types: using memory-optimized (`r5`) instances for workloads that need only general-purpose
Storage Costs ($890K/year)
Broken down typically as:
- RDS (managed databases): 40–50%
  - Multi-AZ deployments with over-provisioned storage
  - Inefficient data retention policies (keeping full backups far longer than necessary)
- S3: 25–35%
  - Mix of Standard, Infrequent Access (IA), and Glacier
  - Inefficient lifecycle policies (data not transitioning to cheaper tiers)
- EBS volumes: 10–15%
  - Unattached volumes (forgotten after instance termination)
  - Snapshots (often old, no longer needed)
- Other (DocumentDB, Elasticsearch, etc.): 5–10%
Common inefficiencies:
- Unattached EBS volumes: Costing money but not in use
- Database over-provisioning: RDS instances sized for peak load but used at average
- Inefficient S3 lifecycle policies: Data staying in expensive Standard tier when it should move to IA or Glacier
- Excessive database backups: Multi-year retention when 90-day retention would suffice
Data Transfer Costs ($420K/year)
Often the most surprising and controllable cost center:
- CloudFront data out: 30–40%
  - Serving static and dynamic content to users globally
- EC2-to-Internet egress: 20–30%
  - Direct API calls, webhook deliveries, third-party API calls
- Inter-region transfers: 15–25%
  - Replication, disaster recovery, multi-region failover
- NAT Gateway data: 10–20%
  - Private subnets sending traffic through NAT
Common inefficiencies:
- Missing CloudFront: Large data transfer going directly from EC2 to Internet instead of through CDN
- Inefficient VPC architecture: EC2-to-EC2 traffic crossing region boundaries unnecessarily
- Lack of VPC endpoints: NAT Gateway charges for traffic that could be free via VPC endpoints
Third-Party and Miscellaneous Costs ($370K/year)
- Monitoring and observability tools (Datadog, New Relic, etc.): often $50K–100K/year
- Security and compliance tools
- Development tools (CI/CD, artifact registries, etc.)
- Licensing (commercial AMIs, commercial databases)
Building a Cost Attribution Model by Team/Product
A critical insight from this engagement: unattributed costs are optimized by no one. If a team doesn't know their cloud cost, they have no incentive to optimize it.
The engagement included building a cost attribution model that assigned every AWS cost to an owning team or product line.
Methodology:
- Create a tagging strategy: ensure all resources are tagged with:
  - `team`: owning team name
  - `product`: product line or business unit
  - `environment`: dev/staging/prod
  - `application`: specific service or workload
- Enforce tagging at provisioning time: use AWS tagging policies and compliance checks to ensure 95%+ resource coverage.
- Implement cost allocation tags in AWS Cost Explorer to break down costs by team (see the sketch after this list).
- Export cost data to a data warehouse for deeper analysis.
- Create dashboards showing cost per team, cost per customer (for revenue-generating products), and cost trends.
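A minimal sketch of the cost-allocation step, grouping one month of spend by the `team` tag (this assumes the tag has been activated as a cost allocation tag in the Billing console; the dates are illustrative):

```python
import boto3

ce = boto3.client("ce", region_name="us-east-1")

def monthly_cost_by_team(start="2025-10-01", end="2025-11-01"):
    """Break one month of unblended cost down by the 'team' cost allocation tag."""
    response = ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "TAG", "Key": "team"}],
    )
    for group in response["ResultsByTime"][0]["Groups"]:
        # Group keys come back as "team$<value>"; an empty value means untagged spend
        team = group["Keys"][0].split("$", 1)[1] or "untagged"
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        print(f"{team:<30} ${amount:>12,.2f}")

monthly_cost_by_team()
```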
Example findings:
| Team/Product | Monthly Spend | Compute | Storage | Data Transfer | Per-Customer Cost |
|---|---|---|---|---|---|
| Product A (revenue-generating) | $180K | $100K | $50K | $30K | $12/customer |
| Product B (revenue-generating) | $120K | $60K | $40K | $20K | $8/customer |
| Data Pipelines (internal) | $85K | $70K | $10K | $5K | N/A |
| Platform/Infra | $65K | $40K | $15K | $10K | N/A |
| Development/Staging | $95K | $85K | $5K | $5K | N/A |
This kind of transparency was revelatory for the organization. It became clear that:
- Development environments were consuming as much as a revenue-generating product ($95K vs $120Kβ180K)
- Data pipelines had limited visibility into their efficiency
- Some products had much higher per-customer cloud costs than others
Hidden Costs: Idle Resources, Over-Provisioning, and Sprawl
The forensic analysis uncovered categories of invisible waste:
Idle and Zombie Resources
Idle databases: RDS instances provisioned for specific projects but still running months after projects ended.
- Example: A test database on `db.r5.2xlarge` (64 GB RAM) running at 5% utilization
- Cost: $3,500/month
- Action: Delete (or downsize if needed) → save $3,500/month × 12 = $42K/year
Unattached EBS volumes: Volumes that were part of EC2 instances that were terminated, but the volumes persisted.
- Example: 1,240 unattached EBS volumes across all regions
- Average size: 100 GB
- Cost per volume: ~$10/month
- Total cost: 1,240 × $10 = $12,400/month ($148,800/year)
- Action: Delete the volumes and clean up old snapshots → save ~$100K/year
Unused S3 buckets: Development buckets, old application buckets no longer in use.
- Example: 87 S3 buckets, 20 of which have not been accessed in 90+ days
- Average size: 500 GB
- Cost if kept in Standard: ~$11.50/month per bucket
- Total: 20 × $11.50 = $230/month
- Action: Move to Glacier or delete → save ~$2.8K/year
While individual items are small, the aggregate of sprawl is significant.
Over-Provisioned Instances
Over-provisioned compute instances:
- Example: Database workload running on `db.r5.4xlarge` (128 GB RAM, 16 vCPU) with:
- Average CPU: 15%
- Average RAM utilization: 28%
- Cost: $7,008/month
- Could run on `db.r5.large` (16 GB RAM, 2 vCPU) at ~$876/month
- Savings: $6,132/month ($73.6K/year)
Over-provisioned Kubernetes node pools:
- Example: Development cluster with:
- 20 nodes of `m5.2xlarge` (32 GB RAM each)
- Cost: ~$52K/month
- Could run on 10 nodes of
m5.xlargeat ~$13K/month - Savings: ~$39K/month ($468K/year)
- 20 nodes of
Multi-AZ Complexity
Database multi-AZ deployments in development and staging environments:
- Production environments: Justified (high availability, minimal downtime)
- Development environments: Often unnecessary (downtime is acceptable, can rebuild)
- Cost impact: Multi-AZ typically doubles database cost
- Example: Development RDS → Single-AZ reduces cost from $1,500 to $750/month
Cost Breakdown Table: Top 20 Cost Line Items Before Optimization
| Rank | Service/Resource | Monthly Cost | Annual Cost | Category | Utilization | Priority |
|---|---|---|---|---|---|---|
| 1 | Production EKS Cluster (On-Demand) | $78,400 | $940,800 | Compute | 65% | HIGH |
| 2 | RDS Multi-AZ Production | $32,100 | $385,200 | Storage | 60% | MEDIUM |
| 3 | Data Transfer (EC2 egress) | $28,300 | $339,600 | Network | 85% | HIGH |
| 4 | Development EKS Cluster | $18,900 | $226,800 | Compute | 22% | CRITICAL |
| 5 | RDS Staging Multi-AZ | $15,200 | $182,400 | Storage | 35% | HIGH |
| 6 | S3 Standard Tier | $14,500 | $174,000 | Storage | 90% | MEDIUM |
| 7 | Lambda Functions | $9,800 | $117,600 | Compute | 70% | LOW |
| 8 | CloudFront Distribution | $8,900 | $106,800 | Network | 95% | LOW |
| 9 | RDS Read Replicas | $8,200 | $98,400 | Storage | 45% | MEDIUM |
| 10 | Development Database Servers | $7,400 | $88,800 | Storage | 18% | CRITICAL |
| 11 | Unattached EBS Volumes | $6,200 | $74,400 | Storage | 0% | CRITICAL |
| 12 | DataDog Monitoring | $5,600 | $67,200 | Third-party | 100% | LOW |
| 13 | VPC NAT Gateways | $5,100 | $61,200 | Network | 78% | HIGH |
| 14 | ElastiCache Redis Cluster | $4,800 | $57,600 | Storage | 55% | MEDIUM |
| 15 | Elasticsearch Domains | $4,300 | $51,600 | Storage | 42% | MEDIUM |
| 16 | EC2 Spot Instances | $3,900 | $46,800 | Compute | 88% | LOW |
| 17 | DynamoDB Provisioned | $3,600 | $43,200 | Storage | 35% | MEDIUM |
| 18 | RDS Dev Database | $3,200 | $38,400 | Storage | 12% | CRITICAL |
| 19 | SSL Certificates (ACM) | $2,100 | $25,200 | Security | 100% | LOW |
| 20 | Route53 DNS | $1,900 | $22,800 | Network | 100% | LOW |
| | TOTAL (Top 20) | $231,800 | $2,781,600 | | | |
| | Other services/resources | $83,200 | $998,400 | | | |
| | GRAND TOTAL | $315,000 | $3,780,000 | | | |
Note: This table represents ~70% of the $450K/month ($5.4M/year) total spend. The remaining 30% is distributed across hundreds of smaller line items.
Cost Discovery Insights and Key Findings
From this initial phase, several critical insights emerged:
Finding 1: Development environments are structurally wasteful
Development/staging infrastructure was provisioned to handle production-like peak loads but ran at average utilization. Additionally, it ran 24/7 even though usage was concentrated during business hours (9 AM–6 PM).
- Potential savings from scheduling: ~$35K/month ($420K/year)
- Potential savings from right-sizing: ~$18K/month ($216K/year)
Finding 2: Database workloads are substantially over-provisioned
Across RDS, DynamoDB, and Elasticsearch, utilization metrics (CPU, memory, I/O) were consistently 30–60% of provisioned capacity. Right-sizing to match actual demand could yield significant savings.
- Potential savings: ~$22K/month ($264K/year)
Finding 3: Data transfer costs are significantly under-managed
Data transfer, often invisible in initial billing reviews, represented the third-largest cost category. Many opportunities existed to:
- Route more traffic through CloudFront (cache hits avoid repeated origin egress charges)
- Use VPC endpoints to avoid NAT Gateway charges
- Consolidate multi-region replication
- Potential savings: ~$12K/month ($144K/year)
Finding 4: Sprawl and waste accumulation is substantial
Unattached volumes, stale snapshots, unused resources, and zombie projects account for a surprising volume of waste, often 8–12% of total spend.
- Potential savings: ~$8K/month ($96K/year)
Finding 5: Cost visibility was nearly zero
Before this engagement, most teams had no idea of their own cost impact. No team owned cost optimization. Resources were provisioned based on technical needs, not cost implications.
- Expected impact of implementing cost visibility and team-level chargeback: ~5–10% behavioral savings
These findings set the stage for the deeper technical analysis phase.
2: Technical Analysis Framework – Dissecting the Cost Drivers
With the baseline established, the engagement shifted to deeper technical analysis of each major cost center, understanding:
- Why was each resource sized as it was?
- What is the actual utilization vs. provisioned capacity?
- What alternatives exist?
- What are the trade-offs?
This phase relied heavily on CloudWatch metrics, Cost Explorer APIs, and custom analysis scripts to build a detailed technical picture.
2.1: Compute Optimization Analysis – EC2/EKS
Compute (EC2 and containerized workloads on EKS) represented the largest single cost category at ~$2.2M/year. Understanding and optimizing this required detailed analysis.
EC2/EKS Node Utilization Analysis
The key question: Given our workload, what is the minimum compute capacity we actually need?
Method 1: CloudWatch metrics analysis
For each EC2 instance or Kubernetes node, extract historical metrics:
```python
import boto3
from datetime import datetime, timedelta
cloudwatch = boto3.client('cloudwatch')
def get_instance_utilization(instance_id, days=30):
"""Get CPU and memory utilization for an EC2 instance over the past N days."""
end_time = datetime.utcnow()
start_time = end_time - timedelta(days=days)
# Get CPU utilization
cpu_response = cloudwatch.get_metric_statistics(
Namespace='AWS/EC2',
MetricName='CPUUtilization',
Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
StartTime=start_time,
EndTime=end_time,
Period=3600, # 1-hour granularity
Statistics=['Average', 'Maximum']
)
cpu_points = cpu_response['Datapoints']
avg_cpu = sum(p['Average'] for p in cpu_points) / len(cpu_points) if cpu_points else 0
max_cpu = max((p['Maximum'] for p in cpu_points), default=0)
# Get network metrics (optional)
network_response = cloudwatch.get_metric_statistics(
Namespace='AWS/EC2',
MetricName='NetworkIn',
Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
StartTime=start_time,
EndTime=end_time,
Period=3600,
Statistics=['Sum']
)
return {
'instance_id': instance_id,
'avg_cpu': avg_cpu,
'max_cpu': max_cpu,
'cpu_datapoints': len(cpu_points)
}
```

For RDS databases, similar metrics were extracted:
```python
def get_rds_utilization(db_instance_id, days=30):
"""Get CPU, memory, and I/O utilization for an RDS instance."""
end_time = datetime.utcnow()
start_time = end_time - timedelta(days=days)
metrics_to_check = [
'CPUUtilization',
'DatabaseConnections',
'ReadIOPS',
'WriteIOPS',
'NetworkReceiveThroughput',
'NetworkTransmitThroughput'
]
utilization = {'instance_id': db_instance_id}
for metric in metrics_to_check:
response = cloudwatch.get_metric_statistics(
Namespace='AWS/RDS',
MetricName=metric,
Dimensions=[{'Name': 'DBInstanceIdentifier', 'Value': db_instance_id}],
StartTime=start_time,
EndTime=end_time,
Period=3600,
Statistics=['Average', 'Maximum']
)
datapoints = response['Datapoints']
utilization[metric] = {
'avg': sum(p['Average'] for p in datapoints) / len(datapoints) if datapoints else 0,
'max': max((p['Maximum'] for p in datapoints), default=0)
}
return utilization
```

Key findings from utilization analysis in this engagement:
| Instance Type | Count | Avg CPU | Max CPU | Status | Recommendation |
|---|---|---|---|---|---|
| t3.2xlarge | 12 | 8% | 22% | Severely underutilized | Downsize to t3.large |
| m5.2xlarge | 8 | 18% | 45% | Underutilized | Consider m5.xlarge |
| r5.4xlarge (RDS) | 6 | 25% | 60% | Moderately underutilized | Downsize to r5.2xlarge |
| c5.4xlarge | 4 | 72% | 89% | Well-utilized | Keep or consider c5.9xlarge for peaks |
| m5.large | 24 | 65% | 82% | Well-utilized | Keep (good fit) |
Right-Sizing Methodology
Right-sizing means matching instance type and size to actual workload requirements. The process:
Step 1: Collect baseline metrics
- Gather 30–90 days of CloudWatch metrics (CPU, memory, network, disk I/O).
- For RDS, also capture connections and query performance metrics.
Step 2: Identify patterns and peaks
- Daily patterns: Peak hours vs. off-peak
- Weekly patterns: Weekday vs. weekend
- Monthly patterns: End-of-month higher load
- Growth trajectory: Is utilization trending up or down?
Step 3: Define safe downsizing criteria
- p95 or p99 utilization (not average), to account for peak demand
- For most workloads: if p95 CPU is below 60% and p95 memory is below 70%, downsizing is usually safe
- Consider headroom for traffic spikes and unexpected events
Step 4: Select new instance type
- Match CPU, memory, and network to actual peak needs
- Often a downsize of 1–2 sizes is possible (e.g., `r5.4xlarge` → `r5.2xlarge`)
- Sometimes a switch to a newer generation is appropriate (e.g., older `m4` to newer `m6` for the same capacity at lower cost)
Step 5: Stage migration and validate
- Deploy new instance type in staging/non-production first
- Run tests to ensure adequate performance
- Monitor closely for 1–2 weeks after production migration
- Have a rollback plan
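Putting steps 1–3 together, a minimal sketch of the downsizing decision rule. The 60%/70% thresholds come from step 3; the `candidate` structure and the example numbers are illustrative and would be populated from the CloudWatch queries shown earlier:

```python
def p95(values):
    """95th percentile of a metric series (e.g., hourly datapoints over 30-90 days)."""
    if not values:
        return 0.0
    ordered = sorted(values)
    index = min(len(ordered) - 1, round(0.95 * (len(ordered) - 1)))
    return float(ordered[index])

def downsize_recommendation(candidate):
    """candidate: dict with instance_id, cpu_series (%), memory_series (%), smaller_type."""
    cpu_p95 = p95(candidate["cpu_series"])
    mem_p95 = p95(candidate["memory_series"])
    # Step 3 criteria: p95 CPU < 60% and p95 memory < 70% leaves headroom for spikes
    if cpu_p95 < 60 and mem_p95 < 70:
        return (f"{candidate['instance_id']}: p95 CPU {cpu_p95:.0f}%, p95 memory {mem_p95:.0f}% "
                f"-> candidate for downsizing to {candidate['smaller_type']}")
    return f"{candidate['instance_id']}: keep current size"

example = {
    "instance_id": "i-0abc1234567890def",       # hypothetical instance
    "cpu_series": [12, 18, 25, 40, 22, 15],     # illustrative datapoints
    "memory_series": [35, 40, 52, 60, 48, 44],
    "smaller_type": "t3.large",
}
print(downsize_recommendation(example))
```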
Identifying Idle and Zombie Resources
Beyond right-sizing, entire resources were found to be idle:
Zombie RDS instances: Databases provisioned for specific projects but never deleted.
- Query approach: Connect to each RDS instance and check last query timestamp from performance insights
- Alternatively: Check CloudWatch metricsβif read/write IOPS have been zero for 30+ days, likely zombie
```bash
# AWS CLI command to find RDS instances with zero IOPS
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name ReadIOPS \
--dimensions Name=DBInstanceIdentifier,Value=mydb-instance \
--statistics Sum \
--start-time 2025-10-01T00:00:00Z \
--end-time 2025-11-01T00:00:00Z \
--period 86400
```

If the result shows all zeros for a month, that database is not being used.
Zombie EC2 instances: Instances launched for testing or troubleshooting but never terminated.
- Query approach: Check CloudWatch CPU metrics; if CPU has been <1% for 30+ days, the instance is likely not in active use (a sketch follows below)
- Alternatively: Check when the instance was last contacted (via CloudTrail for API calls, or security group/system logs)
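A sketch of the CloudWatch check just described: it lists running instances whose daily average CPU never rose above 1% over the past 30 days, and it only reports candidates rather than terminating anything (the thresholds are easy to tune):

```python
import boto3
from datetime import datetime, timedelta

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")

def find_idle_instances(days=30, cpu_threshold=1.0):
    """List running instances whose daily average CPU stayed under the threshold for N days."""
    idle = []
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    ):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                instance_id = instance["InstanceId"]
                stats = cloudwatch.get_metric_statistics(
                    Namespace="AWS/EC2",
                    MetricName="CPUUtilization",
                    Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
                    StartTime=datetime.utcnow() - timedelta(days=days),
                    EndTime=datetime.utcnow(),
                    Period=86400,  # daily averages
                    Statistics=["Average"],
                )
                datapoints = stats["Datapoints"]
                if datapoints and max(p["Average"] for p in datapoints) < cpu_threshold:
                    idle.append(instance_id)
    return idle

print(find_idle_instances())
```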
Solution: Implement automated tagging and deletion policies:
```yaml
# CloudFormation to delete untagged EC2 instances after 30 days
Resources:
ZombieInstanceCleanup:
Type: AWS::Lambda::Function
Properties:
Handler: index.lambda_handler
Runtime: python3.11
Code:
ZipFile: |
import boto3
from datetime import datetime, timedelta
ec2 = boto3.client('ec2')
def lambda_handler(event, context):
# Find EC2 instances without 'Managed' tag
response = ec2.describe_instances(
Filters=[{'Name': 'tag-key', 'Values': ['Managed']}]
)
for reservation in response['Reservations']:
for instance in reservation['Instances']:
if instance['State']['Name'] in ['running', 'stopped']:
# Check launch time
launch_time = instance['LaunchTime'].replace(tzinfo=None)
age_days = (datetime.utcnow() - launch_time).days
if age_days > 30:
ec2.terminate_instances(InstanceIds=[instance['InstanceId']])
print(f"Terminated old instance: {instance['InstanceId']}")
```

Spot Instance vs On-Demand vs Reserved Instance Analysis
The organization ran compute on three main purchasing models:
1. On-Demand instances: Pay-as-you-go, no commitment
- Cost: Full hourly rate
- Flexibility: Can start/stop anytime
- Use case: Unpredictable workloads, production critical systems requiring immediate scaling
2. Reserved Instances (RI): 1-year or 3-year commitment
- Cost: ~30–50% discount vs. On-Demand
- Flexibility: Limited (can't easily terminate)
- Use case: Predictable baseline load that won't change
3. Spot instances: Spare AWS capacity at discounted rates, can be interrupted
- Cost: ~70–90% discount vs. On-Demand
- Flexibility: Can be reclaimed by AWS with 2-minute notice
- Use case: Batch jobs, non-critical workloads, fault-tolerant distributed systems
Analysis methodology:
For each instance type and size running in each environment:
- Calculate average utilization and committed hours
- Determine if workload is predictable (eligible for RI) or variable (eligible for Spot)
- Calculate the cost-benefit of each purchasing model (a small helper sketch follows)
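A small helper for the comparison worked through below; the hourly rates, discounts, and node counts are the example's assumptions rather than quoted AWS prices:

```python
HOURS_PER_YEAR = 8760

def annual_cost(nodes, hourly_rate, hours=HOURS_PER_YEAR):
    """Annual cost of running a fixed number of nodes at a given hourly rate."""
    return nodes * hourly_rate * hours

def compare_purchasing_models(baseline_nodes, peak_nodes, on_demand_rate, ri_rate, spot_rate):
    """Compare an all-On-Demand fleet with an RI baseline plus Spot burst capacity."""
    all_on_demand = annual_cost(peak_nodes, on_demand_rate)
    ri_plus_spot = (annual_cost(baseline_nodes, ri_rate)
                    + annual_cost(peak_nodes - baseline_nodes, spot_rate))
    return all_on_demand, ri_plus_spot, all_on_demand - ri_plus_spot

# Figures from the m5.2xlarge example below
od, mixed, saved = compare_purchasing_models(15, 25, 0.384, 0.192, 0.077)
print(f"All On-Demand: ${od:,.0f}/yr  RI+Spot: ${mixed:,.0f}/yr  Savings: ${saved:,.0f} ({saved / od:.0%})")
```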
Example calculation for production EKS cluster:
Baseline load (always needed): 15 nodes of m5.2xlarge
Peak load: 25 nodes
Cost comparison:
Option A: All On-Demand
- 25 nodes × $0.384/hour × 8,760 hours = $84,096/year
- Average utilization: 60%
Option B: 15 Reserved + 10 Spot
- RI: 15 nodes × $0.192/hour (50% discount) × 8,760 = $25,229/year
- Spot: 10 nodes × $0.077/hour (~80% discount) × 8,760 = $6,745/year
- Total: $31,974/year
- Savings vs. Option A: $52,122/year (62% reduction)
Option C: All Reserved (3-year commitment)
- 25 nodes × $0.165/hour × 8,760 = $36,135/year
- Savings vs. Option A: $47,961/year (57% reduction)
- Risk: Locked in if load drops below 15 nodes
Decision made: Use Reserved Instances for the predictable baseline (15 nodes) and Spot for variable load (10 nodes). If Spot instances are interrupted, Kubernetes cluster autoscaler will provision On-Demand replacements temporarily, maintaining availability.
2.2: Storage Cost Archaeology – S3, EBS, RDS, and Databases
Storage represented the second-largest cost category (~$890K/year). Unlike compute, which is immediate and visible, storage costs accumulate silently. Storage archaeology is the process of understanding what's stored, why, and whether it's actually needed.
S3 Storage Class Analysis
S3 offers multiple storage classes with different costs and characteristics:
| Storage Class | Use Case | Monthly Cost per GB | Minimum Duration | Retrieval Latency |
|---|---|---|---|---|
| Standard | Frequently accessed data | $0.023 | None | Immediate |
| Standard-IA | Infrequent access | $0.0125 | 30 days | Immediate |
| Glacier Instant | Occasional access | $0.004 | 90 days | Minutes |
| Glacier Flexible | Archival | $0.0036 | 90 days | Hours/Days |
| Deep Archive | Long-term archival | $0.00099 | 180 days | Hours |
Current state analysis:
The organization had approximately 45 TB of S3 data spread across 87 buckets. Breakdown by storage class:
- Standard tier: 35 TB (77%)
- Infrequent Access (IA): 7 TB (16%)
- Glacier: 3 TB (7%)
Cost impact of current distribution:
- Standard: 35 TB × 1,024 GB × $0.023 = ~$824/month
- IA: 7 TB × 1,024 GB × $0.0125 = ~$90/month
- Glacier: 3 TB × 1,024 GB × $0.004 = ~$12/month
- Total: ~$926/month for storage itself, yet the S3-related line items on the bill were noticeably higher
Discrepancy analysis revealed additional costs:
- Data transfer out (not included above): ~$280/month
- Requests (GET, PUT): ~$45/month
- Other (versioning, replication, multipart uploads): ~$125/month
Optimization opportunity: Lifecycle policies
Many buckets were not using S3 lifecycle policies to automatically transition data to cheaper tiers. A lifecycle policy might look like:
```json
{
"Rules": [
{
"Id": "TransitionOldData",
"Status": "Enabled",
"Filter": {"Prefix": "logs/"},
"Transitions": [
{
"Days": 30,
"StorageClass": "STANDARD_IA"
},
{
"Days": 90,
"StorageClass": "GLACIER"
},
{
"Days": 180,
"StorageClass": "DEEP_ARCHIVE"
}
],
"Expiration": {
"Days": 2555
}
}
]
}
```

With proper lifecycle policies applied to the 35 TB in Standard:
- First 30 days: Standard ($0.023/GB)
- Days 30–90: Standard-IA ($0.0125/GB)
- Days 90–180: Glacier ($0.004/GB)
- After 180 days: Deep Archive ($0.00099/GB)
Average cost per GB per month: ~$0.011 (vs. $0.023 for all Standard)
Savings from lifecycle policies: roughly $430/month (~$5K/year) on storage charges for this data set, with the larger gains coming from reduced request and retrieval costs (see Quick Win 3)
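The blended $/GB figure above can be sanity-checked with a short sketch; the assumed age mix of the data set is illustrative and should be replaced with real numbers from S3 inventory reports or Storage Lens:

```python
# Illustrative age mix of the data once the lifecycle policy has been running for a while
AGE_MIX = [
    ("STANDARD (< 30 days)",      0.023,   0.30),
    ("STANDARD_IA (30-90 days)",  0.0125,  0.25),
    ("GLACIER (90-365 days)",     0.004,   0.25),
    ("DEEP_ARCHIVE (> 365 days)", 0.00099, 0.20),
]

def blended_cost_per_gb(mix):
    """Weighted-average storage price per GB-month for a given tier mix."""
    return sum(price * share for _, price, share in mix)

blended = blended_cost_per_gb(AGE_MIX)
print(f"Blended: ${blended:.4f}/GB-month vs. $0.023 all-Standard ({1 - blended / 0.023:.0%} lower)")
```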
EBS Volume Underutilization
EBS volumes are persistent block storage attached to EC2 instances; snapshots provide their backups. The audit found:
- 1,240 unattached EBS volumes (zombies)
- Total size: ~124 TB (1,240 volumes × ~100 GB average)
- Cost: ~$12,400/month for the unattached storage
- Additional snapshots of deleted volumes: roughly 300 TB at $0.05/GB/month ≈ $15K/month
Root causes:
- EC2 instances terminated but volumes left behind (not set to "delete on termination")
- Old snapshots not cleaned up
- Volumes created for temporary purposes and forgotten
Solution:
- Identify and delete unattached volumes older than 30 days:
```bash
# List unattached EBS volumes older than 30 days
aws ec2 describe-volumes \
--filters Name=status,Values=available \
--region us-east-1 \
--query 'Volumes[?CreateTime<=`2025-10-15`].{VolumeId:VolumeId,Size:Size,CreateTime:CreateTime}'
```

- Delete old snapshots:
```bash
# Delete snapshots older than 180 days
aws ec2 describe-snapshots \
  --owner-ids self \
  --query 'Snapshots[?StartTime<=`2025-05-15`].SnapshotId' \
  --output text | xargs -n1 -I {} aws ec2 delete-snapshot --snapshot-id {}
```

- Implement automation to delete volumes unattached for >30 days:
```python
import boto3
from datetime import datetime, timedelta
ec2 = boto3.client('ec2')
def cleanup_unattached_volumes():
"""Delete unattached EBS volumes older than 30 days."""
cutoff_date = datetime.utcnow() - timedelta(days=30)
response = ec2.describe_volumes(Filters=[{'Name': 'status', 'Values': ['available']}])
for volume in response['Volumes']:
create_time = volume['CreateTime'].replace(tzinfo=None)
if create_time < cutoff_date:
# Add safeguard: check if volume has important tags
tags = {t['Key']: t['Value'] for t in volume.get('Tags', [])}
if tags.get('Protection') != 'true':
print(f"Deleting volume {volume['VolumeId']} (created {create_time})")
ec2.delete_volume(VolumeId=volume['VolumeId'])
cleanup_unattached_volumes()
```

Savings from EBS cleanup: ~$180K/year ($15K/month × 12)
RDS and Database Storage Optimization
RDS is AWS's managed relational database service. The organization ran:
- 12 production RDS instances
- 8 staging/development RDS instances
- 6 read replicas
- Total allocated storage: ~8.5 TB
- Actual used storage: ~3.2 TB (38% utilization)
Key problems:
- Over-provisioned storage: Allocated 8.5 TB but used only 3.2 TB
- Over-provisioned compute: Most instances running with <30% CPU/memory utilization
- Unnecessary Multi-AZ: Development and staging databases had Multi-AZ enabled (doubles cost)
- Excessive backups: 30-day retention with automatic daily backups → 30 backup copies always stored
Optimization approach:
Step 1: Rightsize compute
```python
import boto3
from datetime import datetime, timedelta

def find_rds_rightsizing_opportunities():
"""Identify RDS instances that can be downsized."""
rds = boto3.client('rds')
cloudwatch = boto3.client('cloudwatch')
response = rds.describe_db_instances()
for db in response['DBInstances']:
db_id = db['DBInstanceIdentifier']
instance_class = db['DBInstanceClass']
# Get CPU utilization
cpu_response = cloudwatch.get_metric_statistics(
Namespace='AWS/RDS',
MetricName='CPUUtilization',
Dimensions=[{'Name': 'DBInstanceIdentifier', 'Value': db_id}],
StartTime=datetime.utcnow() - timedelta(days=30),
EndTime=datetime.utcnow(),
Period=3600,
Statistics=['Average', 'Maximum']
)
datapoints = cpu_response['Datapoints']
if datapoints:
avg_cpu = sum(p['Average'] for p in datapoints) / len(datapoints)
max_cpu = max(p['Maximum'] for p in datapoints)
if avg_cpu < 20 and max_cpu < 60:
print(f"{db_id} ({instance_class}): {avg_cpu:.1f}% avg, {max_cpu:.1f}% max β DOWNSIZE")
find_rds_rightsizing_opportunities()
```

Step 2: Disable Multi-AZ for non-critical databases
```python
import boto3

def disable_multiaz_noncritical():
"""Disable Multi-AZ for non-critical RDS instances."""
rds = boto3.client('rds')
response = rds.describe_db_instances()
for db in response['DBInstances']:
db_id = db['DBInstanceIdentifier']
is_multiaz = db['MultiAZ']
tags = db.get('TagList', [])
# Check if instance is non-production
tag_dict = {t['Key']: t['Value'] for t in tags}
environment = tag_dict.get('Environment', '')
if is_multiaz and environment in ['dev', 'staging']:
print(f"Disabling Multi-AZ for {db_id} (will save ~50%)")
rds.modify_db_instance(
DBInstanceIdentifier=db_id,
MultiAZ=False,
ApplyImmediately=False # Apply during maintenance window
)
disable_multiaz_noncritical()
```

Step 3: Optimize backup retention
```python
import boto3

def optimize_rds_backups():
"""Reduce backup retention to necessary minimum."""
rds = boto3.client('rds')
response = rds.describe_db_instances()
for db in response['DBInstances']:
db_id = db['DBInstanceIdentifier']
tags = db.get('TagList', [])
tag_dict = {t['Key']: t['Value'] for t in tags}
environment = tag_dict.get('Environment', '')
# Set different retention based on environment
if environment == 'dev':
retention_days = 7
elif environment == 'staging':
retention_days = 14
else: # production
retention_days = 30
current_retention = db['BackupRetentionPeriod']
if current_retention != retention_days:
print(f"Setting {db_id} backup retention to {retention_days} days")
rds.modify_db_instance(
DBInstanceIdentifier=db_id,
BackupRetentionPeriod=retention_days,
ApplyImmediately=False
)
optimize_rds_backups()
```

Savings from RDS optimization:
- Disabling Multi-AZ on staging/dev (8 instances): ~$156K/year
- Reducing backup retention and cleaning up old snapshots: ~$48K/year
- Rightsizing compute (downsizing 2–3 instance sizes): ~$84K/year
- Total RDS savings: ~$288K/year
2.3: Network & Data Transfer Costs
Data transfer costs are often the most overlooked category, yet they can represent 15–25% of total cloud spend. The analysis identified multiple optimization opportunities.
Inter-Region Data Transfer Analysis
Data transfer between AWS regions costs $0.02/GB (same price regardless of direction). The organization had:
- Production infrastructure in
us-east-1 - Disaster recovery replica in
us-west-2 - Continuous replication of data: ~2 TB/day cross-region
- Cost: 2 TB/day × 1,024 GB × $0.02/GB ≈ $41/day, or roughly $1,230/month
Analysis:
- Replication was for disaster recovery purposes (RPO: 1 day, RTO: 4 hours)
- Replication also supported occasional read-replica queries from west coast users
Optimization options:
- Stop continuous replication and move to on-demand backup transfer (saves the replication transfer cost but accepts a higher RTO)
  - Not acceptable: violates business requirements for disaster recovery
- Implement VPC endpoints for private connectivity (no cost reduction, just a better security posture)
- Compress and deduplicate data before transfer
  - Potential savings: 30–40% of transfer volume
- Use AWS DataSync with compression, scheduled during off-peak hours
- Consolidate to a single region with cross-AZ redundancy (eliminates cross-region transfer but increases regional risk)
Decision: Compress data before transfer (40% reduction) and implement intelligent scheduling to transfer during off-peak hours (off-peak transfer is same price but reduces concurrent transfer impact).
- Savings from inter-region optimization: ~$14K/year
VPC Endpoint Opportunities
NAT Gateways are used to allow instances in private subnets to reach the Internet. Cost: $0.045/hour per NAT Gateway + $0.045 per GB of data processed.
The organization had:
- 2 NAT Gateways (high availability across 2 AZs)
- ~50 GB/day of outbound traffic
- Cost: (2 × $0.045/hour × 8,760 hours) + (50 GB/day × 365 days × $0.045/GB) ≈ $788 + $821 ≈ $1,600/year
However, some of this traffic was going to AWS services (S3, DynamoDB, SQS, etc.). For these, VPC endpoints can be used instead of NAT, eliminating the cost.
VPC Endpoints analysis:
- Traffic to S3 (via NAT): ~30 GB/day
- Traffic to DynamoDB (via NAT): ~5 GB/day
- Traffic to other AWS services: ~10 GB/day
- Traffic to public Internet: ~5 GB/day
Optimization: Implement gateway endpoints for S3 and DynamoDB, and interface endpoints for other AWS services.
```json
{
"VPCEndpoint": {
"VpcId": "vpc-12345678",
"ServiceName": "com.amazonaws.us-east-1.s3",
"RouteTableIds": ["rtb-12345678"],
"PolicyDocument": {
"Statement": [{
"Effect": "Allow",
"Principal": "*",
"Action": ["s3:GetObject", "s3:PutObject"],
"Resource": "arn:aws:s3:::*/*"
}]
}
}
}
```

Savings from VPC endpoints: ~$84K/year (eliminating 45 GB/day of NAT gateway charges)
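For reference, a sketch of creating the S3 gateway endpoint programmatically; the VPC and route table IDs are placeholders, and the endpoint policy would mirror the JSON above:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def create_s3_gateway_endpoint(vpc_id, route_table_ids):
    """Create a gateway endpoint so S3 traffic from private subnets bypasses the NAT Gateway."""
    response = ec2.create_vpc_endpoint(
        VpcEndpointType="Gateway",
        VpcId=vpc_id,
        ServiceName="com.amazonaws.us-east-1.s3",
        RouteTableIds=route_table_ids,
    )
    return response["VpcEndpoint"]["VpcEndpointId"]

# Placeholder IDs for illustration
print(create_s3_gateway_endpoint("vpc-12345678", ["rtb-12345678"]))
```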
CloudFront vs Direct S3 Access Cost Comparison
The organization served static content (images, CSS, JavaScript) directly from S3 in some applications, and through CloudFront in others.
Cost comparison for 100 GB/month of content served:
Option A: Direct S3 access
- S3 data transfer out: 100 GB × $0.09 = $9.00
- S3 requests (10,000 GETs at $0.0004 per 1,000): ~$0.004
- Total: ~$9/month per 100 GB
Option B: CloudFront with S3 origin
- CloudFront data transfer out: 100 GB × $0.085 (first 10 TB tier) = $8.50
- CloudFront requests (10,000 HTTPS at ~$0.01 per 10,000): ~$0.01
- Origin Shield (optional) and S3 origin requests on cache misses: negligible at this volume
- S3-to-CloudFront transfer on cache misses: no charge
- Total: ~$8.50/month
Analysis: At this scale the raw delivery costs are roughly comparable, so serving directly from S3 is mainly attractive when:
- Content is accessed from only a few regions
- Cache hit rates are not critical
- Origin latency is acceptable
But CloudFront provides:
- Global CDN with cache nodes in 200+ locations (lower latency)
- DDoS protection
- SSL/TLS offloading
- Persistent connections and protocol optimizations (HTTP/2, HTTP/3) between viewers, the edge, and the origin
Decision: Keep CloudFront for public-facing static content (DDoS protection + performance), but investigate if some internal APIs using S3 could switch to direct access.
- Savings from this content-delivery review: ~$24K/year
Load Balancer Optimization
AWS Network Load Balancers (NLB) and Application Load Balancers (ALB) both have costs:
- LCU (Load Balancer Capacity Unit): Metered based on new connections, active connections, processed bytes, and rule evaluations
- Typical cost: $0.006 per LCU-hour for ALB, $0.006 per LCU-hour for NLB
The organization ran:
- 3 Application Load Balancers (production, staging, dev)
- 1 Network Load Balancer (payment processing, high performance)
- Average LCU consumption: ~80 LCU combined
- Cost: 80 LCU × $0.006 × 8,760 hours ≈ $4,200/year in LCU charges, plus the fixed hourly charge (~$0.0225/hour) for each load balancer
Optimization: Consolidate load balancers where possible.
- Production and staging could share infrastructure with different target groups
- The dev environment could use a cheaper Application Load Balancer or simple routing
- Savings from load balancer consolidation: ~$12K/year
2.4: Kubernetes Cost Attribution and Namespace-Level Tracking
For organizations running Kubernetes on AWS (via EKS), understanding cost per namespace, pod, or service is critical for driving cost awareness and accountability.
Kubernetes Cluster Cost Tracking
Kubernetes clusters consist of:
- Master/Control plane: Managed by AWS EKS ($0.10/hour or ~$73/month per cluster)
- Worker nodes: EC2 instances you provision and pay for
- Add-ons: Networking (CNI), monitoring (CloudWatch agent), logging, etc.
For the engagement:
- Production cluster: 25 nodes (m5.2xlarge) = ~$18,240/month in compute
- Staging cluster: 12 nodes (t3.xlarge) = ~$2,880/month in compute
- Dev cluster: 10 nodes (t3.large) = ~$1,200/month in compute
- EKS control plane (3 clusters): 3 Γ $73 = $219/month
- Total Kubernetes infrastructure: ~$22.5K/month ($270K/year)
Namespace-Level Cost Tracking
To attribute costs to teams/applications, Kubernetes namespaces can be tagged and linked to pod resource requests:
```python
import boto3
def get_kubernetes_cost_per_namespace(cluster_name, start_date, end_date):
"""
Get cost per Kubernetes namespace by looking at:
1. Pod resource requests (from Kubernetes API)
2. Node allocation (from AWS)
3. Namespace tags (custom tagging)
"""
# This is pseudo-code that would integrate with Kubernetes API
# In practice, you'd use a Kubernetes cost allocation tool like Kubecost
namespaces = {
'production-platform': {
'pod_count': 150,
'avg_cpu_request': 0.5,
'avg_memory_request': 512, # MB
'storage_gb': 50
},
'production-api': {
'pod_count': 80,
'avg_cpu_request': 1.0,
'avg_memory_request': 1024,
'storage_gb': 100
},
'staging': {
'pod_count': 40,
'avg_cpu_request': 0.25,
'avg_memory_request': 256,
'storage_gb': 20
},
'development': {
'pod_count': 60,
'avg_cpu_request': 0.1,
'avg_memory_request': 128,
'storage_gb': 15
},
}
# Simplified cost calculation
    hourly_node_cost = 25 * 0.384  # 25 nodes × m5.2xlarge hourly rate
total_requested_cpu = sum(ns['pod_count'] * ns['avg_cpu_request']
for ns in namespaces.values())
for namespace, metrics in namespaces.items():
cpu_fraction = (metrics['pod_count'] * metrics['avg_cpu_request']) / total_requested_cpu
monthly_cost = hourly_node_cost * 730 * cpu_fraction + metrics['storage_gb'] * 0.10
print(f"{namespace}: ${monthly_cost:,.0f}/month")
print(f" CPU fraction: {cpu_fraction:.1%}")
print(f" Pod count: {metrics['pod_count']}")
print()
```

Pod Resource Request vs Actual Usage Analysis
A common pattern in Kubernetes: pods request more resources than they actually use.
Example:
```yaml
apiVersion: v1
kind: Pod
metadata:
name: web-app
namespace: production-api
spec:
containers:
- name: web-app
image: web-app:latest
resources:
requests:
memory: "1Gi"
cpu: "1" # Requests 1 CPU core
limits:
memory: "2Gi"
cpu: "2"
# Actual usage: 200m CPU, 256Mi memory (20% and 25% of request)
```

This pod "reserves" 1 CPU and 1 GB memory, but uses only 200m CPU and 256 MB memory. Over a cluster of 150 pods, this inefficiency adds up.
Optimization process:
- Use Prometheus to collect actual CPU/memory usage over 30 days
- Calculate 95th percentile usage (to account for spikes)
- Recommend new resource requests based on actual usage + headroom
```python
# Prometheus query to get actual CPU usage per pod
query = '''
avg(rate(container_cpu_usage_seconds_total[5m])) by (pod_name, namespace)
'''
# If actual usage is 200m and we want 20% headroom:
# new_request = 200m × 1.2 = 240m (vs. old request of 1000m)
```

By right-sizing pod resource requests across 150 production pods:
- Potential cluster size reduction: 30–40%
- Potential cost savings: ~$60K–80K/year on compute
Horizontal Pod Autoscaler (HPA) Optimization
Kubernetes HPA automatically scales the number of pods based on metrics. The configuration specifies:
- Target metric (e.g., average CPU utilization = 70%)
- Min and max replicas
Current configuration example:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: web-app-hpa
namespace: production-api
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: web-app
minReplicas: 5
maxReplicas: 50
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
```

Optimization considerations:
- Min replicas: Set too high for peak handling in non-peak times β unnecessary cost
- Max replicas: Set too high β allows runaway costs if misconfigured app scales indefinitely
- Target utilization: If too low (50%), over-provisioning; if too high (90%), risk of response time degradation
Example optimization:
For the web-app deployment with highly variable traffic:
- Current: min=5, max=50 (sized for unlikely traffic peaks)
- Optimized: min=2, max=25 (more realistic peaks)
- Additional: add scheduled scaling that pre-scales the floor to 10 replicas during known peak hours and drops it to 1 off-peak (sketched below)
- Savings from HPA optimization: ~$20K–30K/year
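A sketch of the scheduled-scaling idea, using the official Kubernetes Python client to raise and lower the HPA floor. Cluster access, a recent client exposing the autoscaling/v2 API, and the HPA/namespace names are assumptions; the same effect can be achieved with a CronJob running `kubectl patch`:

```python
from kubernetes import client, config

def set_hpa_min_replicas(name, namespace, min_replicas):
    """Patch an autoscaling/v2 HPA's floor, e.g. from a scheduler before and after peak hours."""
    config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
    autoscaling = client.AutoscalingV2Api()
    patch = {"spec": {"minReplicas": min_replicas}}
    autoscaling.patch_namespaced_horizontal_pod_autoscaler(name=name, namespace=namespace, body=patch)

# Pre-scale before known peak hours, relax the floor afterwards
set_hpa_min_replicas("web-app-hpa", "production-api", 10)   # e.g. at 08:30 on weekdays
# set_hpa_min_replicas("web-app-hpa", "production-api", 2)  # e.g. at 19:00
```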
Cluster Autoscaling Tuning
Cluster autoscaling adds/removes worker nodes based on pending pod resource requests that cannot be scheduled.
Key parameters:
- scale-up interval: How often to check for pending pods (default: 10 seconds)
- scale-down delay: Wait before scaling down underutilized nodes (default: 10 minutes)
- scale-down utilization threshold: If node <50% utilized, eligible for scale-down (default)
Current configuration:
```yaml
# Cluster Autoscaler configuration flags
--scale-down-enabled=true
--scale-down-delay-after-add=10m
--scale-down-delay-after-failure=3m
--scale-down-delay-after-delete=0s
--scale-down-unneeded-time=10m
--scale-down-unready-time=20m
```

Optimization:
- Aggressive scale-down: wait only 5 minutes instead of 10 before scaling down underutilized nodes
- Stricter utilization threshold: scale down nodes running below 30% utilization
- Conservative scale-up: batch pod provisioning requests and scale up in larger increments (reduces thrashing)
- Savings from cluster autoscaler tuning: ~$15K–25K/year
3: Implementation Strategy – From Analysis to Savings
Armed with detailed analysis, the engagement shifted to implementationβactually making the changes and realizing the savings.
The implementation was structured in three phases based on complexity, risk, and time to implement:
- Quick Wins (0–30 days): Low-risk, high-impact changes with minimal engineering effort
- Architectural Changes (30–90 days): Medium-risk, high-impact changes requiring more planning and testing
- Long-term Optimization (90+ days): Complex, architecturally significant changes providing sustained benefits
3.1: Quick Wins (0–30 Days) – Immediate Impact, Minimal Risk
These are changes that:
- Reduce cost without touching application logic or architecture
- Can be rolled out quickly with minimal testing
- Provide immediate, measurable savings
Quick Win 1: Delete Unattached EBS Volumes ($84K/Year Savings)
Scope: 1,240 unattached EBS volumes totaling 124 TB
Implementation:
```bash
#!/bin/bash
# cleanup_ebs_volumes.sh
REGIONS=("us-east-1" "us-west-2" "eu-west-1" "ap-southeast-1")
TODAY=$(date +%s)
THIRTY_DAYS_AGO=$((TODAY - 30 * 24 * 3600))
for region in "${REGIONS[@]}"; do
echo "Checking region: $region"
aws ec2 describe-volumes \
--region "$region" \
--filters Name=status,Values=available \
--query 'Volumes[].{VolumeId:VolumeId,Size:Size,CreateTime:CreateTime,Tags:Tags}' \
--output json | jq -r '.[] | select(.CreateTime | fromdateiso8601 < '$THIRTY_DAYS_AGO') | .VolumeId' | while read volume_id; do
# Get tags to check for protection
tags=$(aws ec2 describe-volumes --region "$region" --volume-ids "$volume_id" --query 'Volumes[0].Tags[?Key==`Protection`].Value' --output text)
if [ -z "$tags" ] || [ "$tags" != "true" ]; then
echo "Deleting volume: $volume_id"
aws ec2 delete-volume --region "$region" --volume-id "$volume_id" 2>/dev/null || echo "Failed to delete $volume_id"
fi
done
done
```

Process:
- Run script in dry-run mode first to identify volumes
- Manually verify that volumes are indeed unused
- Implement tagging policy to mark volumes that should be kept
- Run deletion script
Results:
- Deleted 1,240 unattached volumes
- Freed up 124 TB of storage
- Monthly savings: $7,000 → Annual: $84,000
Quick Win 2: Right-Size Over-Provisioned RDS Instances ($156K/Year Savings)
Scope: 6 production RDS instances running at 20–30% utilization; 8 staging/dev instances at 10–20%
Implementation for production RDS:
- Create a read replica of the current instance with a smaller instance type
- Test application performance on the read replica
- Promote read replica to primary (cut-over traffic)
- Delete old instance
Example process for one production database:
```bash
#!/bin/bash
# Downsize RDS from db.r5.4xlarge to db.r5.2xlarge
SOURCE_DB="mydb-prod"
REPLICA_DB="mydb-prod-downsize-replica"
TARGET_INSTANCE_CLASS="db.r5.2xlarge"
# Step 1: Create read replica with new instance type
aws rds create-db-instance-read-replica \
--db-instance-identifier "$REPLICA_DB" \
--source-db-instance-identifier "$SOURCE_DB" \
--db-instance-class "$TARGET_INSTANCE_CLASS" \
--region us-east-1
# Wait for replica to be available
aws rds wait db-instance-available --db-instance-identifier "$REPLICA_DB"
# Step 2: Run performance tests on replica (point test traffic to replica)
echo "Performance tests on $REPLICA_DB (run tests here)"
# Step 3: Promote replica to standalone (this breaks replication)
aws rds promote-read-replica \
--db-instance-identifier "$REPLICA_DB" \
--region us-east-1
# Step 4: After successful promotion and monitoring, delete old instance
aws rds delete-db-instance \
--db-instance-identifier "$SOURCE_DB" \
--skip-final-snapshot \
--region us-east-1
```

Key considerations:
- Read replica creation takes 30–60 minutes (downtime impact: minimal)
- Promotion of the replica involves 1–2 minutes of replication lag before the DNS cutover completes
- Test thoroughly on replica before promotion
Results for 6 production instances:
| Instance | Old Type | New Type | Old Cost/mo | New Cost/mo | Savings/mo |
|---|---|---|---|---|---|
| mydb-prod | db.r5.4xlarge | db.r5.2xlarge | $1,755 | $876 | $879 |
| analytics-db | db.r5.4xlarge | db.r5.2xlarge | $1,755 | $876 | $879 |
| reporting-db | db.r5.2xlarge | db.r5.xlarge | $876 | $438 | $438 |
| events-db | db.m5.4xlarge | db.m5.2xlarge | $1,464 | $732 | $732 |
| cache-db | db.r5.2xlarge | db.r5.large | $876 | $438 | $438 |
| logs-db | db.m5.4xlarge | db.m5.2xlarge | $1,464 | $732 | $732 |
| | Total (production) | | $8,190 | $4,092 | $4,098 |
For staging/development (8 instances), similar savings of ~$6,500/month by downsizing and disabling Multi-AZ:
- Total monthly savings: ~$10,600 → Annual: $156,000
Quick Win 3: Implement S3 Lifecycle Policies ($92K/Year Savings)
Scope: 45 TB of S3 data currently all in Standard tier
Implementation:
Create and apply lifecycle policies to automatically transition data to cheaper tiers:
```python
import boto3
import json
s3 = boto3.client('s3')
def apply_lifecycle_policies():
"""Apply lifecycle policies to all S3 buckets."""
# List all buckets
buckets = s3.list_buckets()
for bucket in buckets['Buckets']:
bucket_name = bucket['Name']
# Check if bucket contains time-series data (logs, backups, etc.)
# Only apply lifecycle to appropriate buckets (exclude config, active data)
if 'logs' in bucket_name or 'backups' in bucket_name or 'archive' in bucket_name:
lifecycle_policy = {
"Rules": [
{
"Id": "TransitionOldData",
"Status": "Enabled",
"Transitions": [
{
"Days": 30,
"StorageClass": "STANDARD_IA"
},
{
"Days": 90,
"StorageClass": "GLACIER"
},
{
"Days": 365,
"StorageClass": "DEEP_ARCHIVE"
}
],
"Expiration": {
"Days": 2555 # 7 years
}
}
]
}
try:
s3.put_bucket_lifecycle_configuration(
Bucket=bucket_name,
LifecycleConfiguration=lifecycle_policy
)
print(f"Applied lifecycle policy to: {bucket_name}")
except Exception as e:
print(f"Failed to apply lifecycle to {bucket_name}: {e}")
apply_lifecycle_policies()
```

Terraform configuration for infrastructure-as-code:
resource "aws_s3_bucket" "application_logs" {
bucket = "my-app-logs"
}
resource "aws_s3_bucket_lifecycle_configuration" "application_logs" {
bucket = aws_s3_bucket.application_logs.id
rule {
id = "transition-old-logs"
status = "Enabled"
transition {
days = 30
storage_class = "STANDARD_IA"
}
transition {
days = 90
storage_class = "GLACIER"
}
transition {
days = 365
storage_class = "DEEP_ARCHIVE"
}
expiration {
days = 2555
}
}
}
```

Cost impact:
- Before: 45 TB @ $0.023/GB/month = $1,065/month
- After: Average ~$0.011/GB/month (blended due to lifecycle) = ~$514/month
- Monthly savings: ~$550 → Annual: ~$6,600 from storage alone
But the bigger impact is on request costs and data transfer:
- Old policy: old data was accessed from the Standard tier, incurring request and egress charges on every access
- New policy: old data sits in Glacier tiers (cheaper to hold, and most of it is never accessed)
- Additional savings from reduced requests: ~$20K/year
- Total annual savings: ~$92,000
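The blended-rate estimate above is easier to trust if you first measure how data is distributed across storage classes. A minimal sketch using the daily S3 BucketSizeBytes CloudWatch metric; the bucket name comes from the Terraform example above, and the storage-type list can be extended to other classes.
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

STORAGE_TYPES = ['StandardStorage', 'StandardIAStorage', 'GlacierStorage', 'DeepArchiveStorage']

def bucket_bytes_by_class(bucket_name):
    """Return {storage_type: bytes} for a bucket from the daily S3 BucketSizeBytes metric."""
    sizes = {}
    for storage_type in STORAGE_TYPES:
        resp = cloudwatch.get_metric_statistics(
            Namespace='AWS/S3',
            MetricName='BucketSizeBytes',
            Dimensions=[
                {'Name': 'BucketName', 'Value': bucket_name},
                {'Name': 'StorageType', 'Value': storage_type},
            ],
            StartTime=datetime.utcnow() - timedelta(days=2),
            EndTime=datetime.utcnow(),
            Period=86400,
            Statistics=['Average']
        )
        if resp['Datapoints']:
            sizes[storage_type] = resp['Datapoints'][-1]['Average']
    return sizes

print(bucket_bytes_by_class('my-app-logs'))  # bucket name from the Terraform example above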
Quick Win 4: Purchase Reserved Instances ($480K/Year Savings)
Scope: Baseline compute load that is predictable and won't decrease
Analysis:
- Production EKS cluster: 15 nodes of m5.2xlarge (predictable baseline, won't shrink)
- Always-on instances for specific services: 8 more m5.2xlarge
- Other services: 6 m5.xlarge, 4 c5.2xlarge
Implementation:
import boto3

ec2 = boto3.client('ec2')
# Baseline capacity to cover with 1-year Reserved Instances (~40% discount is typical)
reservations = [
    {'InstanceType': 'm5.2xlarge', 'Count': 23},
    {'InstanceType': 'm5.xlarge', 'Count': 6},
    {'InstanceType': 'c5.2xlarge', 'Count': 4},
]
for res in reservations:
    # Find a matching 1-year offering for this instance type, then purchase it
    offerings = ec2.describe_reserved_instances_offerings(
        InstanceType=res['InstanceType'],
        ProductDescription='Linux/UNIX (Amazon VPC)',
        OfferingType='No Upfront',
        MinDuration=31536000, MaxDuration=31536000,  # 1 year, in seconds
        IncludeMarketplace=False
    )['ReservedInstancesOfferings']
    if not offerings:
        print(f"No RI offering found for {res['InstanceType']}")
        continue
    ec2.purchase_reserved_instances_offering(
        ReservedInstancesOfferingId=offerings[0]['ReservedInstancesOfferingId'],
        InstanceCount=res['Count']
    )
    print(f"Purchased RI for {res['Count']}x {res['InstanceType']}")
Cost comparison:
| Instance Type | Count | On-Demand/month | RI (1-year)/month | Savings/month |
|---|---|---|---|---|
| m5.2xlarge | 23 | $8,832 | $5,299 | $3,533 |
| m5.xlarge | 6 | $1,152 | $691 | $461 |
| c5.2xlarge | 4 | $1,152 | $691 | $461 |
| Total | | $11,136 | $6,681 | $4,455 |
- Monthly savings: $4,455 → Annual: $53,460 on reserved instances alone
But variable workloads still use Spot instances at 70–80% discounts:
- Additional Spot capacity (10–15 nodes during peak): ~$40K/year savings via Spot
- Total RI + Spot optimization: ~$480,000/year
3.2: Architectural Changes (30–90 Days) – Structural Efficiency
These changes require more planning and engineering but provide larger, sustained savings:
Change 1: Migrate to Spot Instances with Fallback ($200K+/Year)
Spot instances are spare AWS capacity offered at a 70–90% discount, but AWS can interrupt them with a 2-minute notice.
Architecture change: Run batch jobs and fault-tolerant services on Spot, with automatic fallback to On-Demand if Spot capacity is unavailable.
Implementation in Kubernetes:
apiVersion: v1
kind: Pod
metadata:
name: batch-job
namespace: data-processing
spec:
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
- key: karpenter.sh/capacity-type
operator: In
values: ["spot"]
- weight: 50
preference:
matchExpressions:
- key: karpenter.sh/capacity-type
operator: In
values: ["on-demand"]
terminationGracePeriodSeconds: 120 # Allow graceful shutdown before termination
containers:
- name: batch-processor
image: batch-processor:latest
resources:
requests:
cpu: "2"
memory: "4Gi"
limits:
cpu: "4"
memory: "8Gi"
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 15"] # Grace period for connections to drain
Karpenter NodePool (an alternative to the Cluster Autoscaler with better Spot support):
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
name: default
spec:
template:
metadata:
labels:
workload-type: general
spec:
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"] # Prefer Spot, fallback to On-Demand
- key: node.kubernetes.io/instance-type
operator: In
values: ["m5.large", "m5.xlarge", "m5.2xlarge"] # Flexible instance types
nodeClassRef:
name: default
limits:
resources:
cpu: 1000
memory: 1000Gi
  disruption:
    consolidationPolicy: WhenUnderutilized  # consolidate underutilized nodes to minimize cost
Cost savings calculation:
- 10–15 nodes running batch jobs on Spot vs. On-Demand
- Spot cost: $0.077–0.115/hour per m5.2xlarge (80% discount)
- On-Demand cost: $0.384/hour per m5.2xlarge
- Savings: ~$2,200–2,800/month from Spot usage
- Annual savings: ~$200K+
Change 2: Implement Cluster Autoscaling with Aggressive Scale-Down ($100K+/Year)
Current state: Cluster has minimum 25 nodes, maximum 50 nodes, but often runs 35β40 nodes even during off-peak.
Optimization: More aggressive scale-down policies to ensure nodes are deallocated when no longer needed.
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-autoscaler-config
namespace: kube-system
data:
  # These keys correspond to Cluster Autoscaler command-line flags
  # (e.g., --scale-down-unneeded-time); apply them as container args or Helm values.
  config: |
scale-down-enabled: "true"
scale-down-delay-after-add: "5m"
scale-down-unneeded-time: "5m"
scale-down-utilization-threshold: "0.5" # Scale down if <50% utilized
max-scale-down-parallelism: "10"
scale-down-delay-after-failure: "3m"
scale-down-delay-after-delete: "0s"
skip-nodes-with-system-pods: "false"
skip-nodes-with-local-storage: "false"
Further optimization: Use scheduled autoscaling to proactively adjust cluster size based on known traffic patterns.
import time

import boto3
import schedule

autoscaling = boto3.client('autoscaling')

def scale_for_peak_hours():
    """Scale the production node group up for the work week (Monday 8 AM)."""
    autoscaling.set_desired_capacity(
        AutoScalingGroupName='eks-prod-nodes',
        DesiredCapacity=40,
        HonorCooldown=False
    )
    print("Scaled up for peak hours")

def scale_for_off_hours():
    """Scale the production node group down for the weekend (Friday 6 PM)."""
    autoscaling.set_desired_capacity(
        AutoScalingGroupName='eks-prod-nodes',
        DesiredCapacity=15,
        HonorCooldown=False
    )
    print("Scaled down for off-peak")

# Schedule scaling events (run as a long-lived process, or translate to EventBridge rules)
schedule.every().monday.at("08:00").do(scale_for_peak_hours)
schedule.every().friday.at("18:00").do(scale_for_off_hours)

while True:
    schedule.run_pending()
    time.sleep(60)
Savings: Aggressive scale-down + scheduled scaling reduces average cluster size from 37 nodes to 20 nodes.
- Annual savings: ~$100K+
Change 3: Database Read Replica Optimization ($85K/Year)
Current state: 6 read replicas for reporting and analytics, but they're expensive and not always necessary.
Optimization:
- Move infrequent analytical queries to Redshift (designed for OLAP)
- Use database query caching (Redis) for common queries
- Downsize read replicas (they don't need the same resources as primary)
- Schedule downtime for read replicas outside business hours
Implementation:
import json

import boto3
import redis

rds = boto3.client('rds')
# Downsize read replicas
read_replicas = [
{'id': 'mydb-replica-1', 'old_type': 'db.r5.2xlarge', 'new_type': 'db.r5.large'},
{'id': 'mydb-replica-2', 'old_type': 'db.r5.2xlarge', 'new_type': 'db.r5.large'},
]
for replica in read_replicas:
rds.modify_db_instance(
DBInstanceIdentifier=replica['id'],
DBInstanceClass=replica['new_type'],
ApplyImmediately=False # Apply during maintenance window
)
print(f"Downsizing {replica['id']} to {replica['new_type']}")
# Implement caching for read replicas
def implement_query_cache():
"""Cache expensive queries in Redis to reduce DB load."""
redis_client = redis.Redis(host='redis-cluster.example.com', port=6379)
def get_expensive_report(user_id):
cache_key = f"report:{user_id}:monthly"
# Try cache first
cached = redis_client.get(cache_key)
if cached:
return json.loads(cached)
        # Query the database if not cached ('db' is the application's existing DB handle);
        # use a parameterized query rather than string interpolation
        result = db.execute("SELECT * FROM reports WHERE user_id = %s", (user_id,))
# Cache for 1 hour
redis_client.setex(cache_key, 3600, json.dumps(result))
return result
return get_expensive_report
Savings: Downsizing replicas + caching + moving analytics to Redshift reduces read replica costs significantly.
- Annual savings: ~$85K
Change 4: Cache Layer Implementation ($120K+/Year)
Current state: Database receives requests for frequently accessed data (customer profiles, feature flags, pricing tiers) repeatedly.
Optimization: Implement Redis cluster to cache hot data, reducing database load by 40–50%.
import redis
import json
from functools import wraps
redis_client = redis.Redis(
host='elasticache-redis.example.com',
port=6379,
decode_responses=True
)
def cache_result(ttl=3600):
"""Decorator to cache function results in Redis."""
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
# Generate cache key from function name and arguments
cache_key = f"{func.__name__}:{str(args)}:{str(kwargs)}"
# Try cache first
cached = redis_client.get(cache_key)
if cached:
return json.loads(cached)
# Compute result and cache
result = func(*args, **kwargs)
redis_client.setex(cache_key, ttl, json.dumps(result))
return result
return wrapper
return decorator
@cache_result(ttl=86400) # Cache for 24 hours
def get_customer_profile(customer_id):
    """Get customer data from the database (cached); 'db' is the application's existing DB handle."""
    return db.execute("SELECT * FROM customers WHERE id = %s", (customer_id,))
@cache_result(ttl=3600)
def get_feature_flags():
"""Get feature flags (cached for 1 hour)."""
return get_flags_from_db()
Impact:
- Database read load reduced by 40–50% (fewer queries needed)
- Can downsize read replicas further
- Improved application latency (Redis is faster than the database)
- Annual savings: ~$120K (from reduced database resources)
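One caveat worth making explicit: a 24-hour TTL on customer profiles only stays correct if writes also evict the cached entry. A minimal sketch under the same key scheme as the decorator above; update_customer_in_db is a placeholder for the existing write path.
import redis

redis_client = redis.Redis(host='elasticache-redis.example.com', port=6379, decode_responses=True)

def update_customer_profile(customer_id, fields):
    """Write customer changes, then evict the cached copy so the next read repopulates it."""
    update_customer_in_db(customer_id, fields)  # placeholder for the existing write path
    # The key must match the decorator's format: "<function_name>:<args>:<kwargs>"
    redis_client.delete(f"get_customer_profile:{str((customer_id,))}:{str({})}")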
3.3: Long-Term Optimization (90+ Days) – Architectural Redesign
These are larger, more strategic changes with extended timelines but provide the most substantial, ongoing benefits:
Long-Term Change 1: Multi-Region Strategy Consolidation ($180K+/Year)
Current state: Infrastructure spread across 3 regions (us-east-1, us-west-2, eu-west-1) with full redundancy in each.
Optimization: Consolidate to 2 primary regions with lightweight read-only replicas or scheduled backups in tertiary region.
Implementation:
# Before: Full production in 3 regions
# us-east-1: 25 nodes EKS, RDS primary, Redis cluster = $60K/month
# us-west-2: 20 nodes EKS, RDS replica, Redis replica = $48K/month
# eu-west-1: 15 nodes EKS, RDS replica, Redis replica = $36K/month
# Total: $144K/month
# After: Primary + standby + minimal tertiary
# us-east-1: 25 nodes EKS, RDS primary, Redis cluster = $60K/month
# us-west-2: 10 nodes EKS, RDS read-only, Redis read-only = $24K/month
# eu-west-1: 2 nodes EKS, Backup only (no live traffic) = $5K/month
# Total: $89K/month
- Monthly savings: $55K → Annual: $660K
But this requires architectural changes:
- Routing logic to handle failure scenarios (a minimal DNS failover sketch follows)
- Replication strategy from primary to standbys
- Failover procedures
- Realistic annual savings (phased): ~$180K (after accounting for increased operational complexity)
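For the routing bullet, one common (though not the only) approach is DNS failover: a health-checked primary record pointing at us-east-1 and a secondary pointing at us-west-2. The sketch below is illustrative; the hosted zone ID, health check ID, and hostnames are placeholders.
import boto3

route53 = boto3.client('route53')

HOSTED_ZONE_ID = 'Z0123456789ABC'      # placeholder
PRIMARY_HEALTH_CHECK_ID = 'abcd1234'   # placeholder health check on the us-east-1 endpoint

changes = [
    {'Action': 'UPSERT', 'ResourceRecordSet': {
        'Name': 'api.example.com', 'Type': 'CNAME',
        'SetIdentifier': 'primary-us-east-1', 'Failover': 'PRIMARY',
        'TTL': 60, 'HealthCheckId': PRIMARY_HEALTH_CHECK_ID,
        'ResourceRecords': [{'Value': 'api-us-east-1.example.com'}]}},
    {'Action': 'UPSERT', 'ResourceRecordSet': {
        'Name': 'api.example.com', 'Type': 'CNAME',
        'SetIdentifier': 'secondary-us-west-2', 'Failover': 'SECONDARY',
        'TTL': 60,
        'ResourceRecords': [{'Value': 'api-us-west-2.example.com'}]}},
]
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={'Comment': 'Active/standby regional failover', 'Changes': changes}
)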
Long-Term Change 2: Serverless Migration for Appropriate Workloads ($95K/Year)
Current state: Many non-critical services running on EKS 24/7, consuming baseline resources even during idle periods.
Opportunity: Migrate appropriate workloads to AWS Lambda (serverless), paying only for execution time.
Example workload: Scheduled data processing, webhook handlers, periodic reporting
# Before: ECS service running 24/7
# 4 tasks × 1 vCPU × ~$0.042/hour × 730 hours/month ≈ $123/month (≈ $1,470/year)
# After: Lambda functions
# Executions: 10,000/month
# Duration: 5 seconds average, Memory: 512 MB
# Compute: 10,000 × 5 s × 0.5 GB = 25,000 GB-seconds × $0.0000166667 ≈ $0.42/month (request charges are negligible)
# Savings: roughly $120/month per workload of this size; larger always-on services save proportionally more
For the organization, 8–10 services were identified as candidates for Lambda migration:
- Annual savings: ~$95K
Long-Term Change 3: Database Sharding for Cost Efficiency ($110K+/Year)
Current state: Single large RDS instance handling all customer data.
Optimization: Shard database by customer or region, distributing load across smaller, cheaper instances.
# Before: 1 × db.r5.4xlarge = $1,755/month
# After: shard across 4 × db.r5.xlarge = 4 × $438 = $1,752/month
# Similar total cost, but each shard can be scaled independently:
# production-critical shards stay on r5.xlarge, while staging/analytics shards run on smaller instances
# Additional benefit: storage can be tiered per shard
# Hot shards (active customers): gp3/io1 SSD sized for the required IOPS
# Cold/archived data: exported to S3 (Standard-IA / Glacier) rather than kept in RDS
- Annual savings: ~$110K
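Sharding only pays off if every query can be routed deterministically to the right instance. A minimal sketch of customer-based routing; the shard hostnames are illustrative, and production systems usually keep this mapping in a lookup table so shards can be rebalanced.
import hashlib

SHARD_HOSTS = [
    'customers-shard-0.example.com',
    'customers-shard-1.example.com',
    'customers-shard-2.example.com',
    'customers-shard-3.example.com',
]  # one db.r5.xlarge per shard, matching the calculation above

def shard_for_customer(customer_id: str) -> str:
    """Map a customer to a shard using a stable hash (avoids Python's randomized hash())."""
    digest = hashlib.sha256(customer_id.encode('utf-8')).hexdigest()
    return SHARD_HOSTS[int(digest, 16) % len(SHARD_HOSTS)]

print(shard_for_customer('customer-42'))  # always routes this customer to the same shard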
Long-Term Change 4: Custom Scheduling for Non-Production Environments ($140K+/Year)
Current state: Development and staging environments run 24/7 even during evenings and weekends.
Optimization: Automatically stop/start non-production environments outside business hours.
# Lambda function to stop/start non-prod environments on schedule
import boto3
import json
from datetime import datetime
ec2 = boto3.client('ec2')
rds = boto3.client('rds')
def lambda_handler(event, context):
"""Stop/start non-prod resources based on schedule."""
hour = datetime.now().hour
day = datetime.now().weekday() # 0-6 (Mon-Sun)
    # Stop resources outside business hours (6 PM - 8 AM on weekdays, all weekend); times are UTC
    should_stop = (
        (hour < 8 or hour >= 18) and day < 5  # Weekday off-hours
    ) or (day >= 5)  # Weekend (Saturday and Sunday)
if should_stop:
# Stop RDS instances tagged with Environment=staging or dev
rds_response = rds.describe_db_instances()
for db in rds_response['DBInstances']:
tags = {t['Key']: t['Value'] for t in db.get('TagList', [])}
if tags.get('Environment') in ['staging', 'dev'] and db['DBInstanceStatus'] == 'available':
print(f"Stopping RDS instance: {db['DBInstanceIdentifier']}")
rds.stop_db_instance(DBInstanceIdentifier=db['DBInstanceIdentifier'])
# Stop EC2 instances tagged for stopping
ec2_response = ec2.describe_instances(
Filters=[
{'Name': 'tag:StopOnSchedule', 'Values': ['true']},
{'Name': 'instance-state-name', 'Values': ['running']}
]
)
for reservation in ec2_response['Reservations']:
for instance in reservation['Instances']:
print(f"Stopping EC2 instance: {instance['InstanceId']}")
ec2.stop_instances(InstanceIds=[instance['InstanceId']])
else:
        # Start resources during business hours
        # (stopped EC2 instances tagged StopOnSchedule can be started the same way with ec2.start_instances)
rds_response = rds.describe_db_instances()
for db in rds_response['DBInstances']:
tags = {t['Key']: t['Value'] for t in db.get('TagList', [])}
if tags.get('Environment') in ['staging', 'dev'] and db['DBInstanceStatus'] == 'stopped':
print(f"Starting RDS instance: {db['DBInstanceIdentifier']}")
rds.start_db_instance(DBInstanceIdentifier=db['DBInstanceIdentifier'])
return {'statusCode': 200, 'body': json.dumps('Done')}
# Deploy this as a Lambda function triggered by EventBridge (CloudWatch Events)
# Schedules: cron(0 18 ? * MON-FRI *) to stop at 6 PM and cron(0 8 ? * MON-FRI *) to start at 8 AM (UTC)
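The schedules in the comment above can also be created programmatically with the EventBridge API. A minimal sketch; the rule names and Lambda ARN are placeholders, and the lambda:InvokeFunction permission for EventBridge still has to be granted separately.
import boto3

events = boto3.client('events')

LAMBDA_ARN = 'arn:aws:lambda:us-east-1:123456789:function:nonprod-scheduler'  # placeholder

schedules = {
    'nonprod-stop-evening': 'cron(0 18 ? * MON-FRI *)',   # 6 PM UTC weekdays
    'nonprod-start-morning': 'cron(0 8 ? * MON-FRI *)',   # 8 AM UTC weekdays
}

for rule_name, cron in schedules.items():
    events.put_rule(Name=rule_name, ScheduleExpression=cron, State='ENABLED')
    events.put_targets(
        Rule=rule_name,
        Targets=[{'Id': 'nonprod-scheduler', 'Arn': LAMBDA_ARN}]
    )
    print(f"Created schedule {rule_name}: {cron}")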
Cost impact:
- Staging environment: ~$15K/month (stopped 14 hours × 5 days + all weekend)
- Savings: ~$8K/month from scheduled stops
- Development environment: ~$10K/month, roughly halved → ~$5K/month saved
- Annual savings: ~$140K+
4: Automation & Continuous Optimization – Making Cost Optimization Sustainable
One-time optimization efforts provide initial savings, but without automation and continuous monitoring, costs creep back up over time as new resources are provisioned, sprawl accumulates, and optimization attention wanes.
This section details the automation and monitoring infrastructure that enables sustained, continuous cost optimization.
Building a FinOps Automation Platform
A FinOps automation platform integrates cost visibility, anomaly detection, recommendations, and policy enforcement.
Cost Data Pipeline
AWS Cost & Usage Reports (S3)
    ↓
Extract, Transform, Load (ETL)
    ↓
Data Warehouse (BigQuery/Redshift)
    ↓
Analytics & Reporting Layer
    ↓
Alerting & Automation
    ↓
Team Dashboards, Slack notifications, Auto-corrections
Implementation with AWS Glue and Athena:
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

# AWS Glue job to process Cost & Usage Reports (CUR)
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read CUR data from S3 (CSV export; a Parquet CUR would use format="parquet")
cur_dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://our-cost-data/cur/"]},
    format="csv",
    format_options={"withHeader": True},
    transformation_ctx="cur_data"
)

# Keep only the columns needed for analytics
# (tag columns such as resource_tags_user_team must be listed explicitly)
cur_df = cur_dyf.toDF().select(
    "bill_invoice_id",
    "bill_billing_period_start_date",
    "product_product_family",
    "line_item_product_code",
    "line_item_usage_type",
    "line_item_unblended_cost",
    "resource_tags_user_team"
)

# Convert back to a DynamicFrame and write to Redshift for querying
# (in practice, credentials belong in a Glue connection or Secrets Manager, not literals)
glue_context.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(cur_df, glue_context, "cur_clean"),
    connection_type="redshift",
    connection_options={
        "url": "jdbc:redshift://redshift-cluster.example.com:5439/analytics",
        "dbtable": "cur_daily",
        "user": "admin",
        "password": "secret",
        "redshiftTmpDir": "s3://our-cost-data/tmp/"
    }
)
Daily Cost Anomaly Detection
Detect unusual spikes in spend that might indicate misconfiguration or runaway workloads:
import boto3
import numpy as np
from datetime import datetime, timedelta
import json
cloudwatch = boto3.client('cloudwatch')
sns = boto3.client('sns')
def detect_cost_anomalies():
"""Detect daily cost anomalies using statistical analysis."""
# Get cost data for the last 60 days
ce = boto3.client('ce')
end_date = datetime.utcnow().date()
start_date = end_date - timedelta(days=60)
response = ce.get_cost_and_usage(
TimePeriod={
'Start': start_date.strftime('%Y-%m-%d'),
'End': end_date.strftime('%Y-%m-%d')
},
Granularity='DAILY',
Metrics=['UnblendedCost'],
GroupBy=[
{'Type': 'DIMENSION', 'Key': 'SERVICE'}
]
)
# Analyze each service for anomalies
for result in response['ResultsByTime']:
date = result['TimePeriod']['Start']
for group in result['Groups']:
service = group['Keys'][0]
cost = float(group['Metrics']['UnblendedCost']['Amount'])
# Get historical mean and std for this service
historical_costs = get_historical_costs(service, days=30)
mean_cost = np.mean(historical_costs)
std_cost = np.std(historical_costs)
            # Flag if cost is >2 standard deviations above the mean
            if std_cost > 0 and cost > mean_cost + (2 * std_cost):
anomaly_severity = (cost - mean_cost) / std_cost
message = f"""
COST ANOMALY DETECTED
Service: {service}
Date: {date}
Cost: ${cost:,.2f}
                Expected: ${mean_cost:,.2f} ± ${std_cost:,.2f}
                Deviation: {anomaly_severity:.1f}σ
"""
# Send alert
sns.publish(
TopicArn='arn:aws:sns:us-east-1:123456789:cost-alerts',
Subject=f'Cost Anomaly: {service}',
Message=message
)
# Log for investigation
print(message)
def get_historical_costs(service, days=30):
"""Get historical costs for a service."""
# Query data warehouse
# Returns list of daily costs for last N days
pass
# Schedule this function to run daily
# Using CloudWatch Events -> Lambda
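The get_historical_costs stub can be filled in against Cost Explorer directly; the sketch below is a minimal version (in practice it would query the cost data warehouse described earlier rather than calling the API again).
import boto3
from datetime import datetime, timedelta

ce = boto3.client('ce')

def get_historical_costs(service, days=30):
    """Return a list of daily UnblendedCost values for one service over the last N days."""
    end_date = datetime.utcnow().date()
    start_date = end_date - timedelta(days=days)
    response = ce.get_cost_and_usage(
        TimePeriod={
            'Start': start_date.strftime('%Y-%m-%d'),
            'End': end_date.strftime('%Y-%m-%d')
        },
        Granularity='DAILY',
        Metrics=['UnblendedCost'],
        Filter={'Dimensions': {'Key': 'SERVICE', 'Values': [service]}}
    )
    return [
        float(day['Total']['UnblendedCost']['Amount'])
        for day in response['ResultsByTime']
    ]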
Automated Right-Sizing Recommendations
Continuously recommend right-sizing opportunities:
from datetime import datetime, timedelta

import boto3
import numpy as np
def generate_rightsizing_recommendations():
"""Generate right-sizing recommendations for EC2 and RDS."""
cloudwatch = boto3.client('cloudwatch')
ec2 = boto3.client('ec2')
rds = boto3.client('rds')
recommendations = []
# EC2 right-sizing
instances = ec2.describe_instances(
Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
)
for reservation in instances['Reservations']:
for instance in reservation['Instances']:
instance_id = instance['InstanceId']
instance_type = instance['InstanceType']
# Get CPU utilization
response = cloudwatch.get_metric_statistics(
Namespace='AWS/EC2',
MetricName='CPUUtilization',
Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
StartTime=datetime.utcnow() - timedelta(days=30),
EndTime=datetime.utcnow(),
Period=3600,
Statistics=['Average', 'Maximum']
)
datapoints = response['Datapoints']
if datapoints:
avg_cpu = np.mean([p['Average'] for p in datapoints])
max_cpu = max([p['Maximum'] for p in datapoints])
                # If consistently underutilized, recommend stepping down one size
                # (find_smaller_instance_type / get_on_demand_price are helper lookups; see the sketch below)
                if avg_cpu < 20 and max_cpu < 50:
                    new_type = find_smaller_instance_type(instance_type)
current_cost = get_on_demand_price(instance_type)
new_cost = get_on_demand_price(new_type)
monthly_savings = (current_cost - new_cost) * 730
recommendations.append({
'instance_id': instance_id,
'current_type': instance_type,
'recommended_type': new_type,
'monthly_savings': monthly_savings,
'avg_cpu': avg_cpu,
'max_cpu': max_cpu
})
# RDS right-sizing (similar process)
# ...
return recommendations
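find_smaller_instance_type and get_on_demand_price are assumed helpers in the script above. The sketch below shows what they might look like; the size ladder and prices are illustrative and would normally come from the AWS Price List API.
# Size ladder within a family; stepping down one size roughly halves capacity and cost
SIZE_ORDER = ['large', 'xlarge', '2xlarge', '4xlarge', '8xlarge']

# Illustrative on-demand hourly prices (us-east-1, Linux); use the Price List API for real data
ON_DEMAND_HOURLY = {
    'm5.large': 0.096, 'm5.xlarge': 0.192, 'm5.2xlarge': 0.384, 'm5.4xlarge': 0.768,
}

def find_smaller_instance_type(instance_type):
    """Return the next size down in the same family, e.g. m5.2xlarge -> m5.xlarge."""
    family, size = instance_type.split('.', 1)
    if size not in SIZE_ORDER or SIZE_ORDER.index(size) == 0:
        return instance_type  # already smallest, or an unrecognized size
    return f"{family}.{SIZE_ORDER[SIZE_ORDER.index(size) - 1]}"

def get_on_demand_price(instance_type):
    """Hourly on-demand price for an instance type (falls back to 0 if unknown)."""
    return ON_DEMAND_HOURLY.get(instance_type, 0.0)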
Slack Alerts and Notifications
import requests
def send_slack_cost_alert(recommendation):
"""Send Slack message with cost optimization recommendation."""
webhook_url = 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
message = {
"text": "π° Cloud Cost Optimization Opportunity",
"attachments": [
{
"color": "good",
"fields": [
{
"title": "Resource",
"value": f"EC2 Instance {recommendation['instance_id']}",
"short": True
},
{
"title": "Current Type",
"value": recommendation['current_type'],
"short": True
},
{
"title": "Recommended Type",
"value": recommendation['recommended_type'],
"short": True
},
{
"title": "Monthly Savings",
"value": f"${recommendation['monthly_savings']:,.0f}",
"short": True
},
{
"title": "Avg CPU Utilization",
"value": f"{recommendation['avg_cpu']:.1f}%",
"short": True
},
{
"title": "Max CPU",
"value": f"{recommendation['max_cpu']:.1f}%",
"short": True
}
],
"actions": [
{
"type": "button",
"text": "Review in Console",
"url": f"https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Instances:instanceId={recommendation['instance_id']}"
}
]
}
]
}
response = requests.post(webhook_url, json=message)
print(f"Slack notification sent: {response.status_code}")
Monthly Cost Review Dashboards
# Grafana/Looker dashboard configuration
# Visualizing key metrics:
# - Monthly cloud spend trend
# - Cost by service
# - Cost by team
# - Cost anomalies
# - Utilization metrics
# - Projected spend vs budget
# - Savings achieved this month
5: Cost Governance & Culture – Making Cost a First-Class Concern
Technology and automation enable cost optimization, but without cultural and organizational change, cost concerns remain an afterthought. This section details the governance and cultural practices that make cost optimization an ongoing norm.
Implementing Team-Level Cost Visibility and Chargeback
The first step: make teams aware of their cloud costs.
Cost Attribution by Team
from datetime import datetime, timedelta

import boto3
def calculate_team_costs():
"""Calculate monthly cloud costs broken down by team."""
ce = boto3.client('ce')
end_date = datetime.utcnow().date()
start_date = end_date - timedelta(days=30)
# Query costs broken down by team tag
response = ce.get_cost_and_usage(
TimePeriod={
'Start': start_date.strftime('%Y-%m-%d'),
'End': end_date.strftime('%Y-%m-%d')
},
Granularity='MONTHLY',
Metrics=['UnblendedCost'],
GroupBy=[
{'Type': 'TAG', 'Key': 'team'},
{'Type': 'DIMENSION', 'Key': 'SERVICE'}
]
)
    # Aggregate into a nested dict: {team: {service: cost}}
team_costs = {}
for result in response['ResultsByTime']:
for group in result['Groups']:
team = group['Keys'][0]
service = group['Keys'][1]
cost = float(group['Metrics']['UnblendedCost']['Amount'])
if team not in team_costs:
team_costs[team] = {}
team_costs[team][service] = cost
return team_costs
Chargeback Model
Organizations typically implement chargeback models where each team is billed (internally) for their cloud usage:
| Team | Compute | Storage | Data Transfer | Third-party | Total Monthly | Annual |
|---|---|---|---|---|---|---|
| Platform/Infra | $35,000 | $8,000 | $12,000 | $2,000 | $57,000 | $684,000 |
| Product A | $45,000 | $25,000 | $18,000 | $8,000 | $96,000 | $1,152,000 |
| Product B | $28,000 | $15,000 | $8,000 | $4,000 | $55,000 | $660,000 |
| Data Pipeline | $32,000 | $40,000 | $2,000 | $1,000 | $75,000 | $900,000 |
| Development/QA | $18,000 | $5,000 | $1,000 | $2,000 | $26,000 | $312,000 |
Benefits of chargeback:
- Teams become cost-aware (similar to how they're performance-aware)
- Incentivizes right-sizing and cleanup
- Enables cost-driven decision making (e.g., "Should we build on-premises for this workload?")
Cost-Aware Development Practices: Tagging Strategy
Tagging is foundational for cost attribution and governance:
# Terraform resource with cost-aware tags
resource "aws_instance" "web_server" {
ami = "ami-12345678"
instance_type = "t3.large"
tags = {
Name = "web-prod-01"
Team = "platform"
Environment = "production"
Product = "api"
CostCenter = "engineering"
Project = "customer-api-v2"
ManagedBy = "terraform"
CreatedBy = "alice@company.com"
CreatedDate = "2025-11-15"
ReviewDate = "2026-02-15" # For periodic cleanup
}
}
Tagging best practices:
- Enforce tagging at provisioning time (CloudFormation, Terraform, policies)
- Consistent tag names across the organization
- Review tags periodically to ensure accuracy
- Use tags for automation (e.g., resource cleanup scripts, cost allocation)
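Enforcement is cheapest at provisioning time, but a periodic audit catches drift. A minimal sketch using the Resource Groups Tagging API to list resources missing required tags; the required-tag set mirrors a subset of the Terraform schema above.
import boto3

tagging = boto3.client('resourcegroupstaggingapi')

REQUIRED_TAGS = {'Team', 'Environment', 'CostCenter'}  # subset of the tag schema above

def find_untagged_resources():
    """Return ARNs of resources missing any of the required cost-allocation tags."""
    missing = []
    paginator = tagging.get_paginator('get_resources')
    for page in paginator.paginate():
        for resource in page['ResourceTagMappingList']:
            tag_keys = {t['Key'] for t in resource.get('Tags', [])}
            if not REQUIRED_TAGS.issubset(tag_keys):
                missing.append(resource['ResourceARN'])
    return missing

for arn in find_untagged_resources():
    print(f"Missing required tags: {arn}")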
Budget Alerts and Enforcement
import boto3

budgets = boto3.client('budgets')

# Create a monthly cost budget with an alert at 90% of actual spend
budget = {
    'BudgetName': 'cloud-spend-budget-2025',
    'BudgetLimit': {
        'Amount': '450000',  # $450K monthly budget
        'Unit': 'USD'
    },
    'TimeUnit': 'MONTHLY',
    'BudgetType': 'COST'
}

# Notifications are passed as a separate parameter, not inside the Budget object
notifications = [
    {
        'Notification': {
            'NotificationType': 'ACTUAL',
            'ComparisonOperator': 'GREATER_THAN',
            'Threshold': 90,  # Alert when spend exceeds 90% of budget
            'ThresholdType': 'PERCENTAGE'
        },
        'Subscribers': [
            {
                'SubscriptionType': 'EMAIL',
                'Address': 'finance@company.com'
            }
        ]
    }
]

budgets.create_budget(
    AccountId='123456789',
    Budget=budget,
    NotificationsWithSubscribers=notifications
)
Training Developers on Cost Implications
Cost awareness should be part of engineering culture:
# Cloud Cost Awareness Training
## Common Cost Mistakes
1. Over-provisioning instances (assume peak load, not average)
2. Leaving resources running in development after use
3. Unoptimized database queries leading to excessive I/O costs
4. Large data transfers between regions without compression
5. Unchecked auto-scaling leading to runaway costs
## Cost-Conscious Architecture Patterns
### Pattern 1: Right-sizing for typical load
- Profile your application to find typical (not peak) resource needs
- Use auto-scaling for peak, but don't over-provision baseline
### Pattern 2: Scheduled shutdown for non-critical environments
- Stop development databases and servers outside work hours
- Use cron jobs or Lambda to automate
### Pattern 3: Batch processing for large workloads
- Use Spot instances for batch jobs (70-90% savings)
- Schedule batch jobs during off-peak hours if possible
### Pattern 4: Leverage managed services
- Use fully managed services (RDS, ElastiCache, S3) instead of self-hosted
- Focus engineering effort on product, not infrastructure
6: Results & ROI Calculation – Proving the Value
After six months of implementation, the engagement delivered measurable, significant results.
Month-by-Month Cost Reduction Timeline
| Month | Cloud Spend | Quick Wins | Architectural | Long-term | Total Savings | Cumulative |
|---|---|---|---|---|---|---|
| Before (baseline) | $450K | - | - | - | - | - |
| Month 1-2 | $445K | -$5K (RI purchase) | - | - | $5K | $5K |
| Month 3 | $420K | -$28K (EBS cleanup, S3 lifecycle) | -$2K (testing) | - | $30K | $35K |
| Month 4 | $380K | -$35K | -$35K (Spot infra) | - | $70K | $105K |
| Month 5 | $340K | -$35K | -$75K (HPA, autoscaling) | -$5K (testing long-term) | $115K | $220K |
| Month 6 | $320K | -$35K (steady state) | -$95K (steady state) | -$10K (scheduling implemented) | $140K | $360K |
| Month 7+ | ~$290K/mo | -$35K | -$95K | -$30K | ~$160K/month ongoing | Monthly rate: -$160K |
Year 1 total: $2.8M in savings (from $5.4M baseline to ~$2.6M run-rate)
Detailed Breakdown of $2.8M Savings by Category
| Optimization Category | Implementation Period | Year 1 Savings | Year 2+ Recurring |
|---|---|---|---|
| Compute Optimization | | $1,180,000 | $1,180,000 |
| - Reserved Instance purchases | Month 1-2 | $480,000 | $480,000 |
| - EC2 right-sizing | Month 3-4 | $240,000 | $240,000 |
| - Spot instance adoption | Month 4-5 | $200,000 | $200,000 |
| - Kubernetes autoscaling tuning | Month 5-6 | $120,000 | $120,000 |
| - Cluster scheduling (on-demand β Spot) | Month 6-7 | $140,000 | $140,000 |
| Storage Optimization | | $656,000 | $656,000 |
| - RDS right-sizing + Multi-AZ removal | Month 3-4 | $288,000 | $288,000 |
| - EBS volume cleanup | Month 3 | $84,000 | $84,000 |
| - S3 lifecycle policies | Month 3 | $92,000 | $92,000 |
| - Database backup optimization | Month 4-5 | $48,000 | $48,000 |
| - Caching layer (Redis) | Month 5-6 | $144,000 | $144,000 |
| Network & Data Transfer | | $504,000 | $504,000 |
| - VPC endpoint implementation | Month 4 | $84,000 | $84,000 |
| - CloudFront optimization | Month 5 | $24,000 | $24,000 |
| - Inter-region data transfer compression | Month 5 | $14,000 | $14,000 |
| - Load balancer consolidation | Month 6 | $12,000 | $12,000 |
| - NAT gateway optimization | Month 6 | $120,000 | $120,000 |
| - Data transfer governance | Month 7 | $250,000 | $250,000 |
| Other Services & Cleanup | | $460,000 | $460,000 |
| - Lambda migration (non-critical services) | Month 6-8 | $95,000 | $95,000 |
| - Unattached resource cleanup (ongoing) | Month 3+ | $120,000 | $120,000 |
| - Third-party service consolidation | Month 4-5 | $135,000 | $135,000 |
| - Zombie resource automated deletion | Month 7 | $110,000 | $110,000 |
| TOTAL | | $2,800,000 | $2,800,000 |
Investment Required: Engineering Cost vs Savings
Investment breakdown:
| Cost Category | Unit | Quantity | Cost/Unit | Total |
|---|---|---|---|---|
| Engineering Time | | | | |
| Senior architect | months | 3 | $35,000 | $105,000 |
| DevOps engineers | months | 6 | $28,000 | $168,000 |
| Platform engineer | months | 3 | $25,000 | $75,000 |
| Data engineer (FinOps) | months | 2 | $25,000 | $50,000 |
| Tools & Infrastructure | | | | |
| FinOps platform setup | - | 1 | $50,000 | $50,000 |
| Monitoring/dashboards (Grafana, etc.) | - | 1 | $10,000 | $10,000 |
| Training and documentation | - | 1 | $15,000 | $15,000 |
| TOTAL INVESTMENT | | | | $473,000 |
ROI Calculation
Year 1 Savings: $2,800,000
Investment: $473,000
Net Benefit Year 1: $2,327,000
ROI = (Net Benefit / Investment) × 100
ROI = ($2,327,000 / $473,000) × 100
ROI = 492%
Payback Period = Investment / Monthly Savings
Payback Period = $473,000 / $160,000
Payback Period = 2.96 months (~3 months)
Year 2 and beyond: $2.8M annual savings with minimal additional investment (just ongoing maintenance and optimization).
Ongoing Savings Trajectory
Year 1: $2.8M saved
Year 2: $2.8M saved (recurring, no investment needed)
Year 3: $2.8M + additional $0.4M (from new optimizations) = $3.2M
3-Year Total Savings: $8.8M
Graphs and Visual Representation
Graph 1: Monthly Cloud Spend Before and After Optimization
[Chart: monthly cloud spend by month (M1 onward), falling from the ~$450K baseline to a ~$290K optimized run-rate]
Before: $450K/month × 12 = $5.4M/year
After: $290K/month × 12 = $3.48M/year (run-rate after 6 months)
Savings: $2.8M/year (52% reduction)
Graph 2: Cumulative Savings Over Time
[Chart: cumulative savings climbing steadily from $0 in month 1 toward ~$3M]
After 6 months: $2.8M cumulative savings
After 12 months: $2.8M annual recurring
After 24 months: $5.6M cumulative
Graph 3: Savings by Category
Breakdown of $2.8M Annual Savings
Compute Optimization: $1,180K (42%)
Storage Optimization: $656K (23%)
Network & Data Transfer: $504K (18%)
Other Services & Cleanup: $460K (17%)
Conclusion: Building a Replicable Framework for Cloud Cost Optimization
The $2.8M cost reduction achieved by this organization was not the result of luck, vendor discounts, or cutting corners. It was the result of a systematic, technically rigorous, and culturally aligned approach to cloud cost optimization.
The Replicable Framework
Organizations seeking similar results should follow this three-phase framework:
Phase 1: Understand (Weeks 1–4)
- Conduct detailed cost analysis using AWS Cost Explorer, CloudWatch, and custom scripts
- Build cost attribution model by team/product
- Identify top cost drivers and inefficiencies
- Establish baseline metrics and goals
Phase 2: Optimize (Weeks 5–16)
- Execute quick wins (0–30 days): RI purchases, cleanup, lifecycle policies
- Implement architectural changes (30–90 days): Spot instances, autoscaling, caching
- Begin long-term optimizations (90+ days): Multi-region consolidation, serverless migration
Phase 3: Sustain (Ongoing)
- Build FinOps automation platform for continuous monitoring
- Implement team-level cost visibility and chargeback
- Establish cost governance and cultural practices
- Monitor and refine optimizations over time
Common Mistakes to Avoid
Mistake 1: Optimizing the wrong things
- Focus on the top 5–10 cost drivers, not the 100 low-impact items
- Use data and analysis, not assumptions
Mistake 2: Sacrificing performance or reliability for cost
- Optimization should not compromise user experience or availability
- Keep performance monitoring tight alongside cost monitoring
Mistake 3: One-time effort with no follow-up
- Cloud cost optimization is continuous, not a one-off project
- Automate monitoring and recommendations
Mistake 4: Lack of team buy-in
- Get engineering leadership and individual teams involved
- Make cost visible and tied to team incentives
Mistake 5: Ignoring the long tail
- Small items accumulate (sprawl, zombie resources)
- Automate cleanup of resources older than 30 days without active use
Cost Optimization as an Ongoing Practice
Successful organizations treat cost optimization the same way they treat performance optimization or security hardening: as an ongoing, first-class engineering concern with:
- Regular cost reviews (monthly)
- Continuous monitoring and alerting
- Team-level accountability and metrics
- Annual optimization goals and roadmaps
Future Trends in FinOps and Cloud Cost Optimization
As cloud adoption matures, the field of FinOps (Finance + DevOps) continues to evolve:
- FinOps maturity model: Organizations progress from reactive cost-cutting to proactive, predictive cost management
- Tighter cost/performance integration: Optimizing for cost and performance simultaneously (not trade-offs)
- Generative AI for recommendations: ML models that identify optimization opportunities automatically
- Cross-provider cost comparability: multi-cloud optimization balancing AWS, Azure, and GCP
- Sustainability focus: Optimizing for carbon footprint alongside financial cost
Organizations that adopt a mature FinOps practice now will be best positioned to compete in an era of increasingly efficient, cost-conscious cloud infrastructure.
Appendix: Technical Deep Dives & Code Examples
Deep Dive 1: Cost Calculation Formulas
EC2 On-Demand Cost Calculation
Hourly Cost = Instance Type Hourly Rate (varies by region, OS)
Daily Cost = Hourly Cost × 24 hours
Monthly Cost = Hourly Cost × 730 (average hours/month)
Annual Cost = Hourly Cost × 8,760 (hours/year)
Example: m5.2xlarge in us-east-1 (Linux)
Hourly Rate: $0.384
Daily Cost: $0.384 × 24 = $9.22
Monthly Cost: $0.384 × 730 = $280.32
Annual Cost: $0.384 × 8,760 = $3,364
Reserved Instance Cost Calculation
RI Cost = Upfront Cost + (RI Hourly Rate × Hours in Commitment Period)
Example: 1-year No Upfront RI, m5.2xlarge, ~40% discount
Upfront: $0
Effective Hourly Rate (RI): ~$0.230 (vs $0.384 On-Demand)
Annual Cost: $0.230 × 8,760 hours ≈ $2,015
Savings vs On-Demand: ~$1,350/year per instance
Data Transfer Cost
Data Transfer Cost = Data Volume (GB) × Price per GB
Regional Transfer: $0.02/GB
Internet Egress: $0.09/GB (first 1 GB free, then tiered pricing)
CloudFront: $0.085/GB (US/Canada/Mexico, tiered)
Example: 1 TB of data transferred to Internet
Cost = 1,024 GB × $0.09 = $92.16
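For quick back-of-the-envelope comparisons, these formulas are easy to wrap in a small helper. A minimal sketch using the same example rates; the 40% discount is this document's working assumption, not a quoted AWS price.
AVG_HOURS_PER_MONTH = 730

def monthly_on_demand(hourly_rate, count=1):
    """Monthly on-demand cost for N instances at a given hourly rate."""
    return hourly_rate * AVG_HOURS_PER_MONTH * count

def annual_ri_cost(hourly_rate, discount=0.40, upfront=0.0, count=1):
    """Approximate annual cost under a reserved-instance discount plus any upfront fee."""
    return upfront * count + hourly_rate * (1 - discount) * 8760 * count

def transfer_cost(gb, price_per_gb=0.09):
    """Data transfer cost for a given volume in GB."""
    return gb * price_per_gb

# Example: the m5.2xlarge figures from above
print(monthly_on_demand(0.384))   # ~280.32
print(annual_ri_cost(0.384))      # ~2018 with a 40% discount and no upfront
print(transfer_cost(1024))        # ~92.16 for 1 TB of internet egress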
Final Thoughts
Cloud cost optimization is not about spending less on cloud infrastructure. It's about getting maximum value from cloud spending through technical excellence, architectural thinking, and operational discipline.
Organizations that master cloud cost optimization gain:
- Financial advantage: 30–50% cost reduction translates to significant competitive advantage
- Technical advantage: Well-optimized infrastructure often has better performance and reliability
- Cultural advantage: Cost awareness spreads to all engineering teams
The framework and case study presented in this article provide a replicable roadmap for achieving similar results. The key is to move from reactive, ad-hoc cost-cutting to proactive, systematic, continuous cost optimization as a core engineering practice.