How to Reduce Observability Costs by 70%
Your monitoring bill doesn't have to grow faster than your infrastructure. Here are battle-tested strategies to cut costs without losing visibility.
Observability costs are spiraling out of control. Engineering teams that once paid a few hundred dollars a month now face bills in the tens of thousands—sometimes hundreds of thousands. And the kicker? Most of that data is never looked at.
We've helped companies reduce their observability spend by 70% or more. Not by sacrificing visibility, but by being smarter about what data they collect, how they store it, and what platform they use.
Here's the playbook.
Why Observability Costs Explode
Before we fix the problem, let's understand it. Observability costs compound for three reasons:
- Log volume scales with traffic — Every request generates log lines. 10x traffic = 10x logs.
- Pricing is per-GB — Most vendors charge $0.10-$3.00 per GB ingested. Add cardinality charges for metrics, and costs compound.
- Retention requirements grow — Compliance often requires 90+ days of retention, multiplying storage costs.
The result? A company processing 100GB/day of logs might pay:
| Vendor | Ingestion Price | Retention | Monthly Cost |
|---|---|---|---|
| Datadog | $0.10/GB | Included (15 days) | $3,000+ |
| Splunk Cloud | $150/GB (indexed) | Extra cost | $15,000+ |
| New Relic | $0.30/GB (pay-as-you-go) | Included | $9,000+ |
| Self-hosted | Compute costs only | Storage costs | $500-2,000 |
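For a sanity check against your own bill, a rough cost model helps. Here's a minimal sketch in Python; the function and the example rates are illustrative assumptions, and real bills add indexing, per-host, and per-user fees on top:

def monthly_log_cost(gb_per_day, ingest_price_per_gb, retention_days, storage_price_per_gb_month=0.0):
    """Back-of-envelope estimate: ingestion charges plus charges for retained data."""
    ingestion = gb_per_day * 30 * ingest_price_per_gb       # GB ingested per month
    retained_gb = gb_per_day * retention_days               # steady-state data kept around
    storage = retained_gb * storage_price_per_gb_month
    return ingestion + storage

# Example: 100 GB/day at $0.10/GB ingested, 30 days retained at an assumed $0.03/GB-month
print(monthly_log_cost(100, 0.10, 30, 0.03))  # 390.0 (before indexing, host, and user fees)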
Let's fix this.
Strategy 1: Intelligent Sampling
Not all data is equally valuable. A health check that runs every 5 seconds doesn't need the same treatment as an error in your payment system.
Head- and Tail-Based Sampling for Traces
Sample the bulk of traces probabilistically at the entry point, and add tail-based policies so errors and slow requests are always kept:
# OpenTelemetry Collector config
processors:
  probabilistic_sampler:
    sampling_percentage: 10  # Keep 10% of normal traces
  tail_sampling:
    decision_wait: 10s
    policies:
      # Always keep errors
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Always keep slow requests
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 1000}
      # Sample everything else
      - name: default
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
This approach typically reduces trace volume by 80-90% while keeping 100% of the traces that matter.
Log Level Filtering
Debug and info logs are useful during development but rarely in production:
# Keep errors and warnings, sample info/debug
processors:
  filter:
    logs:
      exclude:
        match_type: regexp
        bodies:
          - ".*health.*check.*"
          - ".*DEBUG.*"
  transform:
    log_statements:
      - context: log
        conditions:
          - severity_number < 13  # INFO and below (WARN starts at 13)
        statements:
          - set(attributes["sampled"], true) where Random() > 0.1
Filtering health check logs alone can reduce log volume by 20-40% for most applications.
Strategy 2: Data Tiering
Not all data needs to be instantly queryable. Implement a tiered storage strategy:
- Hot tier (0-7 days) — Fast, expensive storage. Full query capabilities.
- Warm tier (7-30 days) — Slower queries acceptable. Compressed storage.
- Cold tier (30-365 days) — Archive. Restore before querying.
With ClickHouse (which powers Qorrelate), this is built-in:
-- Automatic data tiering by age (new data lands on the 'hot' volume via the storage policy)
ALTER TABLE logs MODIFY TTL
    timestamp + INTERVAL 7 DAY TO VOLUME 'warm',
    timestamp + INTERVAL 30 DAY TO VOLUME 'cold',
    timestamp + INTERVAL 365 DAY DELETE;
Cold storage on S3 costs ~$0.023/GB/month vs. $0.10+/GB/month for hot SSD storage—a 75% reduction for older data.
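To see how that plays out over a full year of retention, here's a quick sketch; the hot and cold rates are the ones just quoted, and the $0.05/GB-month warm rate is an assumption:

GB_PER_DAY = 100
HOT, WARM, COLD = 0.10, 0.05, 0.023   # $/GB-month; warm rate is an assumed midpoint

# Steady-state GB sitting on each tier, times that tier's monthly rate
all_hot = GB_PER_DAY * 365 * HOT
tiered = GB_PER_DAY * (7 * HOT + 23 * WARM + 335 * COLD)

print(f"all-hot: ${all_hot:,.0f}/month vs. tiered: ${tiered:,.0f}/month")  # ~$3,650 vs. ~$956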
Strategy 3: Cardinality Control
Metrics cardinality is the silent killer of observability budgets. A single poorly-labeled metric can generate millions of unique time series.
Bad Example
# This creates a new time series for every request ID!
metrics.increment('request_processed',
                  tags={'request_id': request.id})  # 💀 Unbounded cardinality
Good Example
# Fixed set of labels = bounded cardinality
metrics.increment('request_processed',
                  tags={
                      'service': 'checkout',
                      'endpoint': '/api/v1/order',
                      'status': '200'
                  })
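A useful rule of thumb: the number of time series a metric produces is the product of the distinct values of each label. The counts below are made up, but they show why a single unbounded label swamps everything else:

from math import prod

# Bounded labels: series count is the product of distinct values per label
bounded = {"service": 12, "endpoint": 40, "status": 8}
print(prod(bounded.values()))   # 3840 series, cheap to store and query

# Swap in an unbounded label like request_id and the product grows with traffic:
# 3840 * (unique request IDs per day) = millions of short-lived series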
Audit your metrics regularly:
-- Find high-cardinality metrics in ClickHouse
SELECT
metric_name,
uniq(attributes) as unique_combinations
FROM metrics
WHERE timestamp > now() - INTERVAL 1 DAY
GROUP BY metric_name
ORDER BY unique_combinations DESC
LIMIT 20;
Strategy 4: Aggregate Before You Ship
The OpenTelemetry Collector can aggregate data before it leaves your infrastructure, dramatically reducing egress and ingestion costs:
processors:
  # Batch telemetry into larger export requests
  batch:
    timeout: 60s
    send_batch_size: 10000
  # Regroup records that share these attributes to compact batches
  groupbyattrs:
    keys:
      - service.name
      - severity
      - message
  # Pre-aggregate metrics: keep only the listed labels, summing the rest away
  metricstransform:
    transforms:
      - include: http.request.duration
        action: update
        operations:
          - action: aggregate_labels
            label_set: [service.name, http.status_code]
            aggregation_type: sum
Pre-aggregation can reduce metric volume by 10-50x depending on your cardinality.
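The same idea works inside your application: emit one histogram per time window instead of one data point per request. A minimal sketch, with made-up bucket boundaries:

from collections import Counter

BUCKETS_MS = [50, 100, 250, 500, 1000, float("inf")]   # assumed bucket boundaries

def bucketize(latencies_ms):
    """Collapse raw per-request latencies into per-bucket counts for one window."""
    counts = Counter()
    for ms in latencies_ms:
        counts[next(b for b in BUCKETS_MS if ms <= b)] += 1
    return dict(counts)

# Thousands of raw measurements per window shrink to a handful of numbers
print(bucketize([23, 87, 430, 61, 95, 1200]))   # {50: 1, 100: 3, 500: 1, inf: 1}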
Strategy 5: Choose the Right Platform
Platform choice matters more than anything else. Here's a realistic comparison for a mid-sized company (500GB/day of logs, 100M metric data points/day, 10M spans/day):
| Platform | Monthly Cost | Notes |
|---|---|---|
| Datadog | $25,000-50,000 | Per-host + ingestion + add-ons |
| Splunk Cloud | $40,000-100,000 | Indexed GB pricing |
| New Relic | $15,000-30,000 | User-based + data |
| Grafana Cloud | $8,000-15,000 | Usage-based |
| Qorrelate | $2,000-5,000 | ClickHouse-powered, transparent pricing |
Why the Difference?
Traditional observability vendors use storage and compute architectures designed in the 2010s. They're optimized for flexibility, not cost.
Modern platforms like Qorrelate use ClickHouse, a columnar database that:
- Compresses log data 10-20x better than Elasticsearch
- Queries billions of rows in milliseconds
- Scales horizontally without vendor lock-in
- Supports OpenTelemetry natively (no translation overhead)
Many vendors charge extra for: custom metrics, session replay, synthetic monitoring, extra users, API access, and enterprise features. Always calculate your "all-in" cost.
Strategy 6: Set Budgets and Alerts
You can't manage what you don't measure. Set up cost alerts:
# Cost alerting example
alerts:
  - name: log_ingestion_spike
    condition: rate(logs_bytes_ingested[1h]) > 2 * avg(logs_bytes_ingested[7d])
    action: slack_notify
  - name: approaching_monthly_budget
    condition: sum(logs_bytes_ingested[this_month]) > 0.8 * budget_bytes
    action: email_finance
Many cost overruns happen because a deployment introduced a new chatty log or a bug caused a retry storm. Catching these early saves thousands.
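If your platform doesn't support budget alerts natively, a simple month-to-date projection catches most overruns early. A sketch; where the ingestion numbers come from (vendor usage API, collector metrics) is up to you:

def projected_overrun(gb_so_far, day_of_month, days_in_month, monthly_budget_gb):
    """Linear projection of month-end ingestion against a budget."""
    projected = gb_so_far / day_of_month * days_in_month
    return projected, projected > monthly_budget_gb

# Example: 1,800 GB ingested by day 12 of a 30-day month, against a 4,000 GB budget
projected, over = projected_overrun(1800, 12, 30, 4000)
print(f"projected: {projected:.0f} GB, over budget: {over}")   # projected: 4500 GB, over budget: True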
Real-World Case Study
A Series B startup came to us spending $42,000/month on Datadog. After our optimization:
- Sampling reduced trace volume by 85%
- Log filtering removed 40% of noise
- Cardinality fixes cut metrics by 60%
- Platform migration to Qorrelate
New monthly cost: $8,500—an 80% reduction with no loss in debugging capability.
Action Items for This Week
- Audit your logs — Find the noisiest sources. Health checks? Debug statements? Eliminate or sample them.
- Check metric cardinality — Look for unbounded labels. Fix the top 5 offenders.
- Review retention — Do you really need 90 days of hot storage? Tier your data.
- Calculate your real cost — Add up all observability vendors. Include hidden fees.
- Evaluate alternatives — Get quotes from multiple vendors. The market has changed.
We offer free observability cost audits. Send us your current bill and data volumes, and we'll show you exactly where you can cut costs. Book a call →
Conclusion
Observability is essential. Overpaying for it isn't.
By implementing intelligent sampling, data tiering, cardinality control, and choosing a modern platform, most companies can reduce their observability spend by 50-80% while maintaining—or even improving—their debugging capabilities.
Your infrastructure is growing. Your observability bill doesn't have to grow with it.