Category: CloudWatch

  • How to Reduce CloudWatch Costs

    Most CloudWatch bills are driven by high-volume logs, overly granular custom metrics, and too many alarms. Cut costs by logging less and smarter, shortening retention, using the Infrequent Access log class for rarely queried logs, avoiding metric filters on hot log groups, right-sizing metrics and alarms, and pushing bulky access/flow logs to S3 instead of CloudWatch.

    Key Takeaways

    – Logs dominate spend: reduce verbosity, drop/trim big fields, sample, and set short retention per log group.
    – Use the CloudWatch Logs Infrequent Access (IA) log class for rarely queried logs (ingestion costs roughly half of Standard) and export to S3 for long-term analytics; queries still bill per GB scanned, so query IA groups only when needed.
    – Replace metric filters on busy logs with direct custom metrics (batched PutMetricData or EMF) and keep dimensions low cardinality.
    – Consolidate alarms with metric math and composite alarms; avoid high-resolution metrics unless required.
    – Keep queries cheap: narrow time windows and pre-filter before parse; store long-term, high-volume logs (ALB/CloudFront/VPC Flow) in S3, not CloudWatch.
    – Automate governance so new log groups get sane defaults for retention, log class, and subscription filters.

    Main explanation

    What you pay for in CloudWatch

    – Logs: charged per GB ingested and per GB-month stored, plus per GB scanned by Logs Insights and by some processing features (subscription filters, data protection).
    – Metrics: charged per custom metric time series (each unique combination of metric name and dimensions), plus per-request charges for PutMetricData API calls.
    – Alarms and dashboards: charged per alarm and per dashboard; anomaly detection and composite alarms follow alarm pricing.
    – Extras: Synthetics canaries, RUM, Contributor Insights, and Metric Streams each carry their own line item.

    1) Reduce log ingestion at the source

    – Lower verbosity: default to INFO/WARN in production; enable DEBUG only with sampling or temporary overrides.
    – Sample noisy events: log 1/N requests, or probabilistically log based on trace ID. Log every error; sample successes (see the sketch after this list).
    – Trim payloads: never dump full HTTP bodies, JWTs, or stack traces on every call. Truncate to a few hundred bytes and include a request ID to fetch the full copy elsewhere if needed.
    – Drop chatty frameworks: configure web servers, SDK retries, and health checks not to spam logs.
    – Filter before shipping: with Fluent Bit/CloudWatch Agent, use parsers and filters to drop unneeded fields/lines. For EKS/ECS, add filter rules per namespace or service.
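
    The sampling and truncation ideas above can live in an ordinary logging filter. A minimal Python sketch, assuming a hypothetical SAMPLE_RATE and a "payload" extra field; tune both per service:

      import logging
      import random

      SAMPLE_RATE = 100      # hypothetical: keep 1 in 100 success logs
      MAX_FIELD_BYTES = 256  # hypothetical: cap on bulky payload fields

      class SampleAndTrimFilter(logging.Filter):
          """Drop most routine success logs; truncate oversized payloads.

          Warnings and errors always pass through at full fidelity.
          """
          def filter(self, record: logging.LogRecord) -> bool:
              if record.levelno >= logging.WARNING:
                  return True                      # never sample away errors
              if random.randrange(SAMPLE_RATE) != 0:
                  return False                     # drop this success log
              payload = getattr(record, "payload", None)
              if isinstance(payload, str) and len(payload) > MAX_FIELD_BYTES:
                  record.payload = payload[:MAX_FIELD_BYTES] + "...[truncated]"
              return True

      logger = logging.getLogger("app")
      logger.addFilter(SampleAndTrimFilter())
      logger.info("request ok", extra={"payload": "x" * 10_000})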

    Gotcha: Lambda’s default START/END/REPORT lines add up across high-QPS functions. Use a structured logger, avoid echoing large context objects, and consider telemetry extensions if you route logs elsewhere.

    Real-world: We cut ~35% of a client’s CloudWatch Logs bill by sampling 1/100 successful requests and truncating response payloads to 256 bytes; error logs remained full fidelity.

    2) Right-size retention and storage class

    – Set per-log-group retention: most app logs don’t need “Never expire.” Common defaults: 7–30 days for apps, 90 days for security/audit, longer only where required.
    – Use CloudWatch Logs Infrequent Access (IA): route high-volume, rarely queried logs (debug output, verbose audit trails) to log groups created with the IA class, which ingests at roughly half the Standard price; keep hot data in Standard. The class is fixed at creation, so plan it up front (see the boto3 sketch below).
    – Export to S3 for archival: for year-scale retention or heavy analytics, schedule export tasks or stream via a subscription filter to Kinesis Data Firehose into S3. Query with Athena when needed.

    Gotcha: IA-class log groups give up several Standard features (metric filters, subscription filters, S3 export, Live Tail), and Logs Insights queries against them still bill per GB scanned. Scope queries tightly.
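
    A minimal boto3 sketch of both knobs from the list above; the log group names are hypothetical, and the logGroupClass parameter requires a reasonably recent boto3:

      import boto3

      logs = boto3.client("logs")

      # Cap retention on an existing, chatty app log group
      # (valid values include 1, 3, 5, 7, 14, 30, 60, 90, ... days).
      logs.put_retention_policy(
          logGroupName="/app/checkout",
          retentionInDays=30,
      )

      # The log class is fixed at creation, so rarely queried logs go to
      # a separate group created up front with the IA class.
      logs.create_log_group(
          logGroupName="/app/checkout/debug",
          logGroupClass="INFREQUENT_ACCESS",
      )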

    3) Keep Logs Insights queries cheap

    – Narrow time windows first, then expand if needed.
    – Filter early, parse late: select fields and apply filter on @timestamp, @logStream, or @message first; only then run | parse and | stats (see the query sketch after this list).
    – Target specific log groups instead of “All log groups.”
    – Save frequently used queries with narrow default time ranges so a careless rerun doesn't scan days of data by accident.
    – Prefer metrics for dashboards: don't back dashboards with wide Logs Insights queries that scan GBs every minute.
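
    Putting those rules together, a boto3 sketch of a cheap Logs Insights query: one specific (hypothetical) log group, a one-hour window, and filter before parse:

      import time
      import boto3

      logs = boto3.client("logs")

      end = int(time.time())
      start = end - 3600  # narrow window first; widen only if needed

      resp = logs.start_query(
          logGroupName="/app/checkout",   # specific group, not "all"
          startTime=start,
          endTime=end,
          queryString=(
              "fields @timestamp, @message"
              " | filter @message like /ERROR/"         # cheap filter first
              ' | parse @message "status=*" as status'  # parse survivors only
              " | stats count() by status"
          ),
          limit=100,
      )

      # Logs Insights queries run asynchronously; poll for completion.
      while True:
          results = logs.get_query_results(queryId=resp["queryId"])
          if results["status"] in ("Complete", "Failed", "Cancelled"):
              break
          time.sleep(1)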

    4) Replace metric filters on hot logs

    – Avoid metric filters on high-volume groups (e.g., API access logs). CloudWatch has to inspect every event.
    – Emit custom metrics directly: batch PutMetricData calls (the API currently accepts up to 1,000 MetricDatum items or 1 MB per request) from your service, or use Embedded Metric Format (EMF) so CloudWatch extracts metrics automatically from structured log events (see the sketch after the gotcha below).
    – Pre-aggregate metrics: count at source (e.g., error_count{service,endpoint}) instead of per-request dimensions.

    Gotcha: High-cardinality dimensions (user_id, request_id) explode custom metric count. Stick to low-cardinality sets like {service, endpoint, status} or use distributions/histograms for latency percentiles instead of per-path metrics.
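
    A sketch of the direct-metric pattern, assuming a hypothetical "MyApp" namespace: one batched PutMetricData call with low-cardinality dimensions, plus a Values/Counts datum that pre-aggregates latency so percentiles work without per-request datapoints:

      import boto3

      cloudwatch = boto3.client("cloudwatch")

      cloudwatch.put_metric_data(
          Namespace="MyApp",  # hypothetical namespace
          MetricData=[
              {
                  "MetricName": "ErrorCount",
                  "Dimensions": [
                      {"Name": "Service", "Value": "checkout"},
                      {"Name": "Endpoint", "Value": "/pay"},
                  ],
                  "Value": 17,  # pre-aggregated count for this interval
                  "Unit": "Count",
              },
              {
                  "MetricName": "LatencyMs",
                  "Dimensions": [{"Name": "Service", "Value": "checkout"}],
                  # One datum carries a small latency histogram.
                  "Values": [12.0, 48.0, 230.0],
                  "Counts": [950, 45, 5],
                  "Unit": "Milliseconds",
              },
          ],
      )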

    5) Consolidate and right-size alarms

    – Use metric math: one alarm on an expression that combines a service's key SLO signals beats dozens of per-metric alarms (see the sketch after this list).
    – Composite alarms: suppress noise and reduce count by grouping related alarms.
    – Stick to standard 1-minute resolution; use high-resolution (10s/1s) custom metrics and alarms only where truly latency sensitive, since high-resolution alarms bill at a higher rate.
    – Delete orphaned alarms tied to retired metrics or ASGs.
    – For EC2, disable Detailed Monitoring (1-minute) if 5-minute granularity is fine.
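
    A sketch of both consolidation patterns; every alarm, metric, and dimension name here is illustrative:

      import boto3

      cloudwatch = boto3.client("cloudwatch")

      # One metric-math alarm on error *rate* replaces separate alarms
      # on raw error and request counts.
      cloudwatch.put_metric_alarm(
          AlarmName="checkout-error-rate",
          EvaluationPeriods=3,
          Threshold=5.0,
          ComparisonOperator="GreaterThanThreshold",
          Metrics=[
              {"Id": "errors", "ReturnData": False, "MetricStat": {
                  "Metric": {"Namespace": "MyApp", "MetricName": "ErrorCount",
                             "Dimensions": [{"Name": "Service", "Value": "checkout"}]},
                  "Period": 60, "Stat": "Sum"}},
              {"Id": "requests", "ReturnData": False, "MetricStat": {
                  "Metric": {"Namespace": "MyApp", "MetricName": "RequestCount",
                             "Dimensions": [{"Name": "Service", "Value": "checkout"}]},
                  "Period": 60, "Stat": "Sum"}},
              # The alarm evaluates only this expression.
              {"Id": "error_rate", "Expression": "100 * errors / requests",
               "ReturnData": True},
          ],
      )

      # A composite alarm pages only when both underlying alarms fire
      # (assumes a "checkout-latency-p99" alarm already exists).
      cloudwatch.put_composite_alarm(
          AlarmName="checkout-slo-breach",
          AlarmRule='ALARM("checkout-error-rate") AND ALARM("checkout-latency-p99")',
      )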

    6) Log routing: when CloudWatch isn’t the right sink

    – Put bulky, append-only logs in S3: ALB/NLB access logs, CloudFront logs, and VPC Flow Logs belong in S3 for cheap storage and Athena queries (see the flow-log sketch after the gotcha below). Only mirror “hot” signals (error counts) to CloudWatch as metrics.
    – For EKS/ECS, consider dual routing: concise app logs to CloudWatch; verbose debug/trace logs to S3/OTel backend.
    – For Lambda, evaluate sending application logs to an external destination via Telemetry API if CloudWatch search isn’t needed.

    Gotcha: Subscribing a log group to Lambda/Kinesis adds extra processing costs (plus Lambda invokes). If your goal is S3 archival, prefer Firehose directly from the agent or source when available.
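
    For the flow-log case above, delivery to S3 is a creation-time choice. A minimal boto3 sketch with placeholder VPC and bucket names:

      import boto3

      ec2 = boto3.client("ec2")

      # Send VPC Flow Logs straight to S3 (cheap storage, Athena-queryable)
      # instead of a CloudWatch Logs group.
      ec2.create_flow_logs(
          ResourceIds=["vpc-0123456789abcdef0"],  # placeholder VPC ID
          ResourceType="VPC",
          TrafficType="ALL",
          LogDestinationType="s3",
          LogDestination="arn:aws:s3:::my-flow-logs-bucket/flow-logs/",
      )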

    7) Governance and automation

    – Auto-enforce retention and log class: a small Lambda triggered by CreateLogGroup events applies default retention to new log groups and flags ones that should have been created in the IA class, preventing “never expire” drift (see the sketch after this list); SCPs can additionally deny logs:DeleteRetentionPolicy.
    – Tag log groups and metrics with owner/cost-center; build a cost-by-tag view to find top talkers.
    – Periodic cleanup: delete stale dashboards, contributor insights rules, and alarms with no recent evaluations.
    – Cap concurrency on chatty producers (e.g., batch workers) to avoid unexpected log spikes.
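
    A sketch of the retention guardrail, assuming an EventBridge rule that matches the CloudTrail CreateLogGroup event and targets this Lambda; the 30-day default is illustrative:

      import boto3

      logs = boto3.client("logs")

      DEFAULT_RETENTION_DAYS = 30  # assumed org-wide default

      def handler(event, context):
          """Apply default retention to newly created log groups."""
          name = event["detail"]["requestParameters"]["logGroupName"]
          groups = logs.describe_log_groups(logGroupNamePrefix=name)["logGroups"]
          group = next((g for g in groups if g["logGroupName"] == name), None)
          # Only touch groups left at "Never expire".
          if group and "retentionInDays" not in group:
              logs.put_retention_policy(
                  logGroupName=name,
                  retentionInDays=DEFAULT_RETENTION_DAYS,
              )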

    Cost quick wins checklist

    – Set app log retention to 14–30 days; create IA-class log groups for rarely queried, high-volume streams (the class can't be changed later).
    – Move ALB/CloudFront access logs and VPC Flow Logs to S3; stop sending them to CloudWatch where possible.
    – Remove metric filters on hot log groups; emit batched custom metrics instead.
    – Consolidate alarms with metric math/composite alarms; drop high-res metrics you don’t use.
    – Tighten Logs Insights queries and stop auto-refresh dashboards that scan large windows.
    – Turn off EC2 Detailed Monitoring where 5-minute is acceptable.
    – Add sampling and payload truncation to application loggers.

    Real-world: One team saved ~60% on CloudWatch by (1) setting 30-day retention and routing verbose logs to IA-class log groups, (2) archiving older logs to S3, (3) replacing 12 metric filters on API logs with 4 batched custom metrics, and (4) merging 80 alarms into 12 composite alarms tied to SLOs.

    Conclusion

    To shrink CloudWatch costs, attack logs first (volume, retention, IA), then metrics (cardinality, batching), then alarms (consolidation). Keep heavy access/flow logs in S3, use metrics for dashboards, and query narrowly with Logs Insights. Automate guardrails so savings stick as new services launch.