How to Prevent and Recover from Prometheus Cardinality Explosions?

Learn how to identify, prevent, and recover from cardinality explosions in Prometheus and Thanos. Master ingestion guardrails, TSDB audits, and WAL recovery.

David George Hope

Key Takeaways

  • Unbounded Labels Drive RAM Exhaustion: Cardinality explosion occurs when high-entropy label values (like request_id or user_id) create an exponential number of unique time series, forcing the monitoring system to allocate memory beyond physical limits.
  • Linear Memory Scaling: Prometheus and Thanos Sidecar memory usage scales linearly with the number of active time series because the inverted index and head block must reside in RAM for fast lookups and writes.
  • Essential Guardrails: Configuring hard limits like sample_limit (maximum series per scrape) and label_limit (maximum labels per metric) prevents a single misconfigured service from crashing the entire monitoring backend.
  • Audit Mechanisms: The TSDB Status page is the primary tool for identifying high-cardinality metrics before they trigger OOM kills, specifically by analyzing the "Top 10 series count by metric name."
  • WAL Corruption Recovery: Out-of-Memory kills frequently corrupt the Write-Ahead Log (WAL). Recovery often requires manually removing the corrupted WAL segments, resulting in the loss of recent in-memory data (typically the last 2 hours).

What is a Cardinality Explosion and Why Does It Cause OOM Kills?

A cardinality explosion is an uncontrolled, rapid increase in the number of unique time series caused by high-entropy label values, which exhausts system RAM and triggers the Linux kernel's Out-of-Memory (OOM) killer.

Definition: Cardinality in monitoring refers to the number of unique combinations of label names and label values associated with a metric. A cardinality explosion happens when a label with unbounded values (e.g., UUIDs) is added to a metric, causing the total number of series to grow multiplicatively ($Metric \times LabelA \times LabelB$).

The Mechanism of Memory Exhaustion

Prometheus uses a Time Series Database (TSDB) that optimizes for write throughput and query speed. To achieve this, the "Head Block"—which contains the most recent data (usually the last 2 to 3 hours)—is held almost entirely in memory.

  1. Inverted Index: Prometheus maintains an inverted index in RAM to map labels to series IDs. Every new unique label value creates a new entry.
  2. Chunk Management: Each active series requires a memory structure to hold the current "chunk" of samples before they are compressed and flushed to disk.
  3. Symbol Table: String values for label names and values are stored in a symbol table to deduplicate storage, but unique strings (like random IDs) defeat this deduplication.

When a developer introduces a label like session_id, the number of series does not grow additively; it grows multiplicatively. If you have 100 metrics, 50 pods, and introduce a session_id label with 10,000 unique values, you theoretically generate $100 \times 50 \times 10,000 = 50,000,000$ series.

Since Prometheus requires approximately 1KB to 2KB of RAM per active series, 50 million series would require roughly 50GB to 100GB of RAM. If the container limit is 32GB, the process will hit the limit, panic, or be killed by the OS kernel.
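The arithmetic above is easy to script as a back-of-the-envelope capacity check (a sketch using the example numbers; the 1KB and 2KB figures are rough averages, and decimal GB is used for simplicity):

```shell
# Estimate series count and head-block RAM for the example above
METRICS=100
PODS=50
SESSIONS=10000                      # unique session_id values
SERIES=$((METRICS * PODS * SESSIONS))
LOW_GB=$((SERIES * 1 / 1000000))    # ~1KB per active series
HIGH_GB=$((SERIES * 2 / 1000000))   # ~2KB per active series
echo "series=${SERIES} ram=${LOW_GB}-${HIGH_GB}GB"
```

With the article's numbers this prints series=50000000 ram=50-100GB, far beyond a 32GB container limit.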

How Do You Identify High-Cardinality Metrics in Your Infrastructure?

To identify high-cardinality metrics, you must audit the TSDB statistics to find which specific metric names and label pairs are consuming the most memory.

Using the TSDB Status Page

The most direct method is accessing the built-in status page at /status/tsdb on your Prometheus or Thanos instance. This page queries the internal index without executing a heavy PromQL query. Focus on these three tables:

  1. Top 10 series count by metric name: Identifies the specific metrics (e.g., http_requests_total) that have the most permutations.
  2. Top 10 label names with high memory usage: Identifies which label names (e.g., url, id) exist across the most series.
  3. Top 10 label value pairs by series count: Shows the specific label=value combinations driving growth.
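The same statistics are exposed over HTTP, which is handy for scripted audits (a sketch; it assumes a locally reachable instance on port 9090 and jq installed):

```shell
# Pull the cardinality tables from the status API instead of the UI
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName'
```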

Querying High-Cardinality via PromQL

If the instance is still responsive, use PromQL to inspect series counts dynamically.

Find the top 10 metrics by series count:

topk(10, count by (__name__) ({__name__=~".+"}))

Find the highest cardinality labels on a specific metric: If you identify http_request_duration_seconds as the culprit, drill down to find the problematic label:

topk(5, count by (le, path, method, status) (http_request_duration_seconds))
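To measure how many distinct values a single suspect label contributes, count the groups produced by that label alone (shown here with the same example metric and the path label):

```promql
count(count by (path) (http_request_duration_seconds))
```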

Monitoring Series Growth in Real-Time

You should alert on the rate of series growth to catch explosions before they cause a crash. Monitor the prometheus_tsdb_head_series metric.

# Alert if series count exceeds a safe threshold (e.g., 1 million)
prometheus_tsdb_head_series > 1000000
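A fuller version as Prometheus alerting rules might look like this (a sketch; the thresholds and the 10m window are illustrative, and prometheus_tsdb_head_series_created_total is the counter of newly created head series):

```yaml
groups:
  - name: cardinality-guardrails
    rules:
      - alert: PrometheusSeriesCountHigh
        expr: prometheus_tsdb_head_series > 1000000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Head series count above 1M on {{ $labels.instance }}"
      - alert: PrometheusSeriesChurnHigh
        # New series created per second, averaged over 10 minutes
        expr: rate(prometheus_tsdb_head_series_created_total[10m]) > 1000
        for: 10m
        labels:
          severity: warning
```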

How Can You Implement Guardrails to Prevent Cardinality Crashes?

You can prevent cardinality crashes by implementing ingestion-side limits that drop dangerous data before it enters the memory-resident index.

1. Scrape Limits (sample_limit)

The most effective hard stop is the sample_limit configuration in your scrape_configs. This limit is applied per target. If a single pod exposes more than the allowed number of series (e.g., 5000), Prometheus will fail the scrape entirely rather than ingest the data and crash.

scrape_configs:
  - job_name: 'kubernetes-pods'
    sample_limit: 5000  # Fail scrape if target exposes > 5000 series
    static_configs:
      - targets: ['10.0.1.5:9100']
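When a scrape is failed this way, Prometheus records it in its own self-monitoring metrics, so you can alert on rejected targets rather than discovering them by their missing data (scrape_samples_scraped is a per-target synthetic metric; the 4000 threshold is an assumed 80% of the 5000 limit above):

```promql
# Any scrape has exceeded its sample_limit recently
increase(prometheus_target_scrapes_exceeded_sample_limit_total[15m]) > 0

# Targets approaching the 5000-series limit
scrape_samples_scraped > 4000
```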

2. Label Limits

To prevent developers from attaching too many metadata labels, use label_limit and label_name_length_limit.

scrape_configs:
  - job_name: 'app-service'
    label_limit: 20             # Max 20 labels per metric
    label_name_length_limit: 50 # Max 50 chars per label name
    label_value_length_limit: 100 # Max 100 chars per label value

3. Metric Relabeling to Normalize Data

If you cannot change the application code, use metric_relabel_configs to strip or normalize high-cardinality labels at ingestion time.

Example: Dropping a high-cardinality label:

metric_relabel_configs:
  - action: labeldrop
    regex: (uid|session_id|token)

Example: Normalizing a URL path: This replaces specific ID segments in a path (e.g., /api/user/1234/details) with a generic placeholder (/api/user/:id/details), collapsing thousands of series into one.

metric_relabel_configs:
  - source_labels: [path]
    regex: '/api/user/[0-9]+/(.*)'
    target_label: path
    replacement: '/api/user/:id/$1'
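You can sanity-check the regex outside Prometheus before deploying it (a sketch using sed; Prometheus fully anchors relabel regexes with RE2, so ^ and $ are added here to mimic that):

```shell
# Mimic the relabel rule: collapse user IDs in the path label
normalize() { echo "$1" | sed -E 's#^/api/user/[0-9]+/(.*)$#/api/user/:id/\1#'; }
normalize '/api/user/1234/details'   # -> /api/user/:id/details
normalize '/healthz'                 # unmatched paths pass through unchanged
```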

4. Leveraging Native Histograms (Prometheus 2.40+)

Traditional histograms create a separate time series for every bucket (le="0.1", le="0.5", etc.). A histogram with 10 buckets therefore multiplies cardinality by at least 10, before counting the additional _sum and _count series.

Native Histograms (also known as sparse histograms) shipped as an experimental feature in Prometheus 2.40 and are still gated behind a feature flag in recent releases. They store bucket data within a single series rather than spreading it across multiple series. Migrating to native histograms can reduce the index memory footprint of histogram metrics by 80-90%.
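Enabling them requires an explicit feature flag on the server, plus client-library support for emitting the new format (the flag below exists in Prometheus 2.40+; exact scrape behavior may vary by version):

```shell
./prometheus --enable-feature=native-histograms
```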

How Do You Recover from WAL Corruption After an OOM Kill?

Recovering from WAL corruption requires identifying the corrupted segments and removing them so the Write-Ahead Log replay can proceed, usually at the cost of recent data.

When Prometheus OOMs, it often terminates while writing to the WAL. Upon restart, it attempts to replay the WAL to restore the in-memory state. If the last segment is half-written, Prometheus will panic and crash loop with errors like "checksum mismatch" or "unexpected EOF".

Step-by-Step Recovery Procedure

1. Analyze the Failure Log Check the startup logs for specific corruption errors:

level=error msg="Opening storage failed" err="log series: replay: unexpected EOF"

2. Use promtool to Analyze (Optional) If you have access to the volume without starting the main process, use promtool to report per-block statistics and cardinality. Note that it analyzes the persisted blocks rather than the WAL itself, but it helps confirm which metrics drove the growth.

promtool tsdb analyze /path/to/data/

3. Remove Corrupted WAL Segments The WAL is stored in the /data/wal directory. The files are numbered sequentially (e.g., 000055, 000056).

  • Method A (Conservative): Move the highest numbered file out of the directory. This usually sacrifices the last few minutes of data.
  • Method B (Aggressive): If Method A fails, delete the entire wal directory. Warning: This causes the loss of all data currently in the Head block (typically the last 2-3 hours) that has not yet been compacted into a persistent block.
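Method A can be scripted so the suspect segment is quarantined rather than destroyed (a sketch against a stand-in directory; point DATA_DIR at your real Prometheus data path and stop Prometheus first):

```shell
# Quarantine the newest (likely half-written) WAL segment instead of deleting it
DATA_DIR=/tmp/demo-prometheus        # stand-in; use your real data directory
mkdir -p "$DATA_DIR/wal" /tmp/wal-quarantine
touch "$DATA_DIR/wal/000055" "$DATA_DIR/wal/000056"   # simulated segments
# NB: real WAL dirs also contain checkpoint.* subdirectories; exclude those
LAST=$(ls "$DATA_DIR/wal" | sort | tail -n 1)
mv "$DATA_DIR/wal/$LAST" /tmp/wal-quarantine/
echo "quarantined segment $LAST"
```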

4. Let Prometheus Attempt Automatic Repair On startup, recent Prometheus versions try to repair corruption found at the tail of the WAL, logging a message such as "Encountered WAL read error, attempting repair" before truncating the unreadable portion. Manual segment removal (step 3) is only required when this automatic repair itself crash loops.

Note: Repair behavior varies by version; check your startup logs before deleting anything.

Why Should You Avoid Unbounded Labels Like pod_id and request_id?

Unbounded labels effectively turn time-series metrics into event logs, a data pattern that the Prometheus TSDB architecture is not designed to support efficiently.

Metrics vs. Logs

  • Metrics are for aggregating the overall health and performance of a system (e.g., "What is the 99th percentile latency?").
  • Logs are for inspecting individual events (e.g., "Why did request ID 555 fail?").

When you add request_id to a metric, you create a unique time series for every request. Since Prometheus indexes every series, this results in an index that grows without bound.

Definition: Series Churn refers to the rate at which old series become inactive and new series are created. High churn creates "zombie" series that sit in memory (and the index) until the head block is compacted and they eventually age out, typically after 2 to 4 hours.

The Problem with Ephemeral IDs

Labels like pod_id or container_id in Kubernetes environments cause massive churn. Every time a deployment occurs, the old pod_id series become inactive, and new ones are created.

  1. Memory Bloat: The "dead" series from the previous deployment remain in the inverted index for hours.
  2. Slow Queries: Queries that aggregate over time (e.g., rate(http_requests[1d])) must scan thousands of discontinuous series, causing query timeouts.

The Solution: Exemplars

Instead of adding high-cardinality IDs as labels, use Exemplars. Exemplars allow you to attach a trace ID or request ID to a metric value without making it part of the series identity (labels). This links your metrics to your traces without exploding the cardinality of the TSDB index.
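In the OpenMetrics exposition format, an exemplar rides alongside the sample after a # marker, so the trace ID never becomes part of the series identity (an illustrative fragment; the trace ID and timestamp are made up):

```
http_requests_total{path="/api/user/:id/details",method="GET"} 1027 # {trace_id="4bf92f3577b34da6"} 1027 1700000000.0
```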

Frequently Asked Questions

How much RAM does Prometheus need per series?

On average, Prometheus requires approximately 1KB to 2KB of RAM per active time series in the head block. This varies based on label length and scrape interval, but 2KB is a safe capacity planning metric.

Does Thanos handle high cardinality better than Prometheus?

No, the ingestion components of Thanos (Sidecar and Receiver) rely on the same TSDB code as Prometheus and face the exact same memory constraints. While Thanos allows for long-term storage of high-cardinality data in object storage, the initial ingestion layer will still OOM if cardinality explodes.

Can I delete high-cardinality data without restarting?

You can use the Admin API (POST /api/v1/admin/tsdb/delete_series) to delete series, but this does not immediately reclaim memory. It marks the series as deleted (tombstoned), but the RAM is usually only freed after a restart or during the next compaction cycle.
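For reference, the calls look like this (a sketch; the Admin API must be enabled with --web.enable-admin-api, and the matcher is illustrative; curl's -g flag stops it from globbing the brackets in match[]):

```shell
# Tombstone every series of the offending metric that carries a session_id label
curl -g -X POST 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]=http_request_duration_seconds{session_id!=""}'

# Ask TSDB to remove tombstoned data from disk
curl -X POST 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones'
```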

What is the difference between series count and churn rate?

Series count is the total number of active time series currently in memory. Churn rate is the number of new series created per second or minute. A system can have a low total series count but a high churn rate (e.g., short-lived batch jobs), which still stresses the index and WAL.

Why do queries trigger OOMs even if ingestion is stable?

Querying high-cardinality metrics requires loading all matching series into memory to perform aggregations (like sum or rate). If a query matches 1 million unique series, the query engine must allocate memory for all of them simultaneously, often spiking usage well above the baseline required for ingestion.