How to Resolve Silent Metric Capping in OpenTelemetry SDKs
Learn how to detect and fix the default 2,000 cardinality cap in OpenTelemetry SDKs. Master the Views API and alerting strategies for otel.metric.overflow.
Key Takeaways
- Default Safety Limits: OpenTelemetry SDKs enforce a default limit of 2,000 unique attribute combinations (cardinality) per metric instrument to prevent application memory exhaustion and Out-Of-Memory (OOM) crashes.
- Silent Overflow Aggregation: When the cardinality limit is reached, the SDK does not throw an exception; instead, it aggregates all subsequent unique attribute sets into a single bucket labeled `otel.metric.overflow=true`.
- Detection Strategy: Because the failure is silent, detection requires proactive monitoring of your metrics backend for the `otel.metric.overflow` attribute or enabling SDK debug logging in non-production environments.
- Resolution via Views: The primary mechanism for managing cardinality is the OpenTelemetry Views API, which allows developers to filter specific attributes or define allow-lists before aggregation occurs.
- Data Architecture: High-cardinality data points, such as `user_id`, `order_id`, or `session_id`, should be recorded as Exemplars or Logs, not as Metric Attributes.
Why Does the OpenTelemetry SDK Silently Cap Metric Cardinality?
The OpenTelemetry SDK enforces a hard limit of 2,000 unique attribute combinations per metric instrument to protect the host application from memory exhaustion caused by cardinality explosion. This "fail-safe" mechanism prioritizes application stability over telemetry granularity.
Definition: Cardinality Explosion occurs when a metric dimension (attribute) accepts a high volume of unique values (e.g., UUIDs), causing the number of unique time series to grow combinatorially. This forces the telemetry SDK to allocate excessive memory to track state for every unique combination.
The Mechanics of Silent Failure
In metric systems, the SDK must maintain a stateful record in memory for every unique combination of attributes associated with a metric (e.g., a Counter or Histogram). If a developer accidentally adds a high-cardinality attribute like a request_id to a counter, the SDK would theoretically need to store millions of unique records in the application heap.
To prevent this, the SDK stops creating new storage slots once the limit (default 2,000) is hit. Instead of dropping the data entirely—which would result in inaccurate total counts—the SDK maps all new attribute combinations to a special "overflow" bucket.
This process is silent by design. If the SDK were to log an error for every incoming metric point after the limit was reached, it would trigger a "log explosion," consuming I/O resources and potentially destabilizing the application further. Therefore, the application continues running smoothly, and the total metric count remains accurate, but the dimensional data for the excess points is lost.
It is important to note that the OpenTelemetry Collector will appear healthy during this event. The bottleneck and data transformation happen inside the application process (the SDK instrumentation layer) before the data ever reaches the wire/exporter.
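The mechanics described above can be sketched without any SDK dependency. The following minimal Python model is illustrative only (it is not the real OpenTelemetry implementation, and the exact slot-accounting rules vary by SDK), but it shows how an aggregator stops allocating new series and folds the excess into a single overflow bucket while keeping totals accurate:

```python
# Minimal model of SDK-side cardinality capping (illustrative, not the real SDK).
DEFAULT_CARDINALITY_LIMIT = 2000
OVERFLOW_KEY = (("otel.metric.overflow", "true"),)

class CappedCounter:
    def __init__(self, limit=DEFAULT_CARDINALITY_LIMIT):
        self.limit = limit
        self.points = {}  # attribute set -> running sum

    def add(self, value, attributes):
        key = tuple(sorted(attributes.items()))
        # Reserve one slot for the overflow bucket itself.
        if key not in self.points and len(self.points) >= self.limit - 1:
            key = OVERFLOW_KEY  # silently collapse new series; no exception, no log
        self.points[key] = self.points.get(key, 0) + value

counter = CappedCounter(limit=5)
for i in range(10):
    counter.add(1, {"request_id": str(i)})  # 10 unique attribute sets, limit 5

print(sum(counter.points.values()))   # total stays accurate: 10
print(counter.points[OVERFLOW_KEY])   # excess series folded together: 6
```

Note how the total count is preserved while the per-`request_id` breakdown for the last six points is lost, which is exactly the trade-off the real SDK makes.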
How Do You Identify the 'otel.metric.overflow' Attribute in Your Dashboards?
You can identify metric capping by querying your observability backend for the specific dimension otel.metric.overflow with a value of true, or by observing "staircase" patterns where dimension counts flatline abruptly.
Detecting the Overflow Flag
When the SDK collapses data into the overflow bucket, it appends the attribute otel.metric.overflow=true. This is the definitive signal that cardinality limits have been breached.
In a Prometheus-compatible query language (PromQL), you can scan your entire infrastructure for this event using the following query:
```promql
count({otel_metric_overflow="true"}) by (__name__, job)
```
If this query returns results, the metric names listed are actively suffering from cardinality capping.
Visual Patterns of Failure
If you do not query for the flag directly, you may notice specific visual artifacts in your dashboards:
- The Flat-line: A graph showing the count of unique values for a specific attribute (e.g., `customer_type`) will rise normally and then hit a perfect ceiling (e.g., exactly 2,000).
- The "Other" Spike: If you visualize your metrics by dimension, you will see the sudden appearance of a new series (the overflow bucket) that grows rapidly while new unique dimensions fail to appear.
Internal SDK Logging
While the failure is silent at the application runtime level, the OpenTelemetry SDKs do emit internal logs. By enabling the OTel Logging SDK or configuring a StandardOutLogRecordExporter in a staging environment, you can capture warning logs indicating that the aggregator has reached its capacity.
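In the Python SDK, for example, internal diagnostics flow through the standard logging module, so surfacing them in staging can be as simple as raising the verbosity of the `opentelemetry` logger hierarchy. A sketch (the logger names and the exact warning messages vary by SDK and version):

```python
import logging

# Route all Python SDK self-diagnostics to stdout at DEBUG level.
# The "opentelemetry" logger hierarchy is what the Python SDK uses internally;
# other language SDKs expose equivalent internal-logging hooks.
logging.basicConfig(level=logging.INFO)
otel_logger = logging.getLogger("opentelemetry")
otel_logger.setLevel(logging.DEBUG)
```

With this in place, aggregator capacity warnings appear in standard output alongside your application logs, rather than being swallowed silently.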
How Do You Configure or Override Default Cardinality Limits?
You can configure cardinality limits by using the OpenTelemetry Views API to define aggregation boundaries or, in some language implementations, by setting environment variables to adjust the global limit.
Definition: View is an OpenTelemetry concept that allows the SDK to customize how a metric is processed before it is exported. It can change the name, description, aggregation type, or filter the attributes of a metric stream.
Using the Views API (Recommended)
The most robust way to handle cardinality is to use a View to explicitly select which attributes to keep. This prevents the limit from being reached by discarding high-cardinality data before aggregation.
Go Example: Filtering Attributes

```go
// Keep only "http.method" and "http.status_code" on the
// "http.server.duration" metric; all other attributes (e.g. "user_id")
// are dropped before aggregation.
import (
	"go.opentelemetry.io/otel/attribute"
	metric "go.opentelemetry.io/otel/sdk/metric"
)

view := metric.NewView(
	metric.Instrument{Name: "http.server.duration"},
	metric.Stream{
		AttributeFilter: attribute.NewAllowKeysFilter("http.method", "http.status_code"),
	},
)

provider := metric.NewMeterProvider(
	metric.WithReader(reader),
	metric.WithView(view),
)
```
Python Example: Filtering Attributes

```python
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.view import View

# Keep only the approved, low-cardinality attribute keys on this metric.
# Note: the Python SDK's View takes attribute_keys (a set of keys to keep),
# not an arbitrary filter callable.
view = View(
    instrument_name="http.server.duration",
    attribute_keys={"http.method", "http.status_code"},
)

provider = MeterProvider(views=[view])
```
Adjusting Global Limits via Environment Variables
In Java, recent versions of the SDK allow you to adjust the global cardinality limit via autoconfiguration. The setting is still marked experimental, and it is a blunt instrument that should be used with caution.

```shell
# Increase the limit from 2,000 to 4,000 (Java SDK autoconfigure, experimental)
export OTEL_EXPERIMENTAL_METRICS_CARDINALITY_LIMIT=4000
```
Memory Implications
Before increasing the limit, you must profile your application's memory usage. Each additional unique attribute set consumes heap memory. Increasing the limit to "infinity" or a very high number (e.g., 100,000) creates a significant risk of OOM crashes during traffic spikes or DDOS attacks where random attribute values might be generated.
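A rough back-of-envelope check makes the risk concrete. The per-series byte cost and instrument count below are assumptions for illustration (real per-series cost depends on the SDK, aggregation type, and attribute string sizes; measure with a profiler), but the linear scaling is the point:

```python
# Illustrative sizing only; profile your own SDK's per-series cost.
BYTES_PER_SERIES = 512   # assumed average: attribute strings + aggregator state
NUM_INSTRUMENTS = 50     # assumed number of instruments at their cardinality cap

def heap_estimate_mb(cardinality_limit):
    """Worst-case heap consumed by metric state, in MiB."""
    return cardinality_limit * BYTES_PER_SERIES * NUM_INSTRUMENTS / (1024 * 1024)

print(f"limit=2,000   -> ~{heap_estimate_mb(2_000):.0f} MiB")    # ~49 MiB
print(f"limit=100,000 -> ~{heap_estimate_mb(100_000):.0f} MiB")  # ~2441 MiB
```

Under these assumptions, raising the limit 50x turns tens of megabytes of worst-case overhead into multiple gigabytes, which is exactly the OOM exposure the default exists to prevent.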
What Are the Best Practices for Preventing Cardinality Explosions?
The best practice for preventing cardinality explosions is to strictly segregate low-cardinality metadata (Attributes) from high-cardinality identifiers (Exemplars/Logs) and to implement "Allow-list" filtering via Views.
Attributes vs. Exemplars vs. Logs
Do not treat Metrics as Logs. Metrics are for aggregating trends; Logs are for inspecting specific events.
| Feature | Cardinality | Usage | Example Data |
|---|---|---|---|
| Metric Attribute | Low (< 100s of values) | Aggregation, slicing, dicing trends. | `region`, `status_code`, `app_version` |
| Exemplar | High (unlimited) | Contextual data attached to a specific metric point. | `trace_id`, `span_id` |
| Log/Trace | High (unlimited) | High-fidelity debugging and transaction tracking. | `user_id`, `order_uuid`, `error_stack` |
Definition: Exemplar is a specific sample data point attached to a metric aggregation (like a histogram bucket) that contains high-cardinality context, such as a Trace ID, allowing you to jump from a metric chart to a relevant trace without bloating the metric cardinality.
Implement Attribute Allow-lists
Instead of trying to block bad attributes (Deny-list), use the Views API to implement an Allow-list. Explicitly define the 3-5 dimensions that are critical for aggregation (e.g., method, status, service). The SDK will drop any other unapproved attributes automatically. This protects your application from developers accidentally adding user_id to a metric in the future.
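The same allow-list discipline can also be enforced at the call site, before attributes ever reach an instrument. This helper is hypothetical (it is not part of any OpenTelemetry API) and simply demonstrates the pattern:

```python
# Pre-approved, low-cardinality keys (hypothetical application policy).
ALLOWED_KEYS = {"http.method", "http.status_code", "service.name"}

def scrub_attributes(attributes, allowed=ALLOWED_KEYS):
    """Keep only pre-approved, low-cardinality attribute keys."""
    return {k: v for k, v in attributes.items() if k in allowed}

attrs = scrub_attributes({
    "http.method": "GET",
    "http.status_code": 200,
    "user_id": "c0ffee",   # accidentally added high-cardinality key: dropped
})
print(attrs)  # {'http.method': 'GET', 'http.status_code': 200}
```

A View-based allow-list is still preferable because it is enforced centrally, but a call-site scrubber like this can act as a second line of defense in shared instrumentation helpers.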
Cardinality Budgeting
Establish a "Cardinality Budget" during the service design phase. If a team needs to track metrics per customer, and you have 10,000 customers, metrics is likely the wrong tool unless you are using a specialized backend. Push this data to logs or structure it as a distinct analytical event stream.
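Because the worst-case series count for an instrument is the product of each attribute's distinct-value count, the budget check is simple arithmetic. A sketch, with illustrative dimension counts and the SDK default as the budget:

```python
from math import prod

def worst_case_cardinality(dimension_cardinalities):
    """Worst-case unique series = product of per-attribute distinct values."""
    return prod(dimension_cardinalities.values())

dimensions = {"region": 4, "status_code": 8, "app_version": 10}  # assumed counts
budget = 2_000                                                   # SDK default limit

series = worst_case_cardinality(dimensions)
print(series, series <= budget)   # 320 True: comfortably within budget

dimensions["customer_id"] = 10_000   # one high-cardinality dimension...
series = worst_case_cardinality(dimensions)
print(series <= budget)              # False: budget blown by orders of magnitude
```

Running this check during design review makes it obvious when a proposed dimension (like a per-customer ID) pushes the instrument past what metrics can reasonably carry.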
How Should You Set Up Alerting for Metric Capping?
You should set up alerting by creating a global monitor for the otel.metric.overflow=true attribute and monitoring SDK self-telemetry metrics regarding dropped data points.
Global Overflow Alert
Configure your observability platform to trigger an alert whenever the overflow attribute appears. This is a high-fidelity signal that data quality is degrading.
- Alert Condition: `sum(rate({otel_metric_overflow="true"}[5m])) > 0`
- Severity: Warning
- Action: Investigate recent code deployments for new metric instrumentation.
Monitoring SDK Self-Telemetry
Many OpenTelemetry SDKs emit their own metrics (self-diagnostics). Look for metrics named similar to otel_sdk_metrics_dropped_data_points_total or otel_sdk_memory_usage.
Missing Data Alerts
Silent failures often manifest as "missing data" rather than error spikes. Set up alerts on your critical Key Performance Indicators (KPIs). If the volume of http_requests_total drops significantly without a corresponding drop in system traffic (or without an increase in error rates), it may indicate that the metric stream is being capped or malformed.
CI/CD Integration
Integrate cardinality checks into your CI/CD pipeline. Run load tests in a staging environment that generate randomized traffic. Query the staging metrics backend for the overflow flag during the test. If the flag is detected, fail the build to prevent the high-cardinality instrumentation from reaching production.
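The CI gate itself only needs to inspect the backend's query response. The sketch below parses the JSON shape returned by the Prometheus HTTP API (`/api/v1/query`) for the overflow query; the sample payload is fabricated for illustration, and a real pipeline would fetch it over HTTP and exit non-zero on failure:

```python
import json

def overflow_metrics(prometheus_response: dict) -> list:
    """Return the metric names that produced otel.metric.overflow series."""
    results = prometheus_response.get("data", {}).get("result", [])
    return sorted({r["metric"].get("__name__", "<unnamed>") for r in results})

# Fabricated response to: count({otel_metric_overflow="true"}) by (__name__)
sample = json.loads("""
{"status": "success",
 "data": {"resultType": "vector",
          "result": [{"metric": {"__name__": "http_server_duration"},
                      "value": [0, "2"]}]}}
""")

offenders = overflow_metrics(sample)
if offenders:
    print(f"FAIL: cardinality overflow detected in: {', '.join(offenders)}")
    # In a real pipeline: sys.exit(1) to fail the build
```

An empty `result` array means no overflow series existed during the load test and the build can proceed.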
Frequently Asked Questions
Does the 2,000 limit apply to Spans and Traces?
No, the cardinality limit primarily affects Metrics. Metrics require long-lived, stateful aggregations in memory (e.g., keeping a count for every unique key). Traces and Spans are generally stateless event streams that are sampled and exported immediately, so they do not suffer from the same memory accumulation issues.
Can I disable the overflow bucket entirely?
While technically possible by increasing limits to unreachable numbers, disabling the safety mechanism is strongly discouraged. The limit exists to prevent your application from crashing (OOM) during traffic spikes or malicious attacks that generate random attribute values. It is better to lose metric granularity than to lose the application service itself.
Is the limit shared across all metrics?
No, the 2,000-cardinality limit is typically applied per-instrument. This means http.server.duration has its own limit of 2,000, and db.client.connections has a separate limit of 2,000. One noisy metric will not cause data loss in other, well-behaved metrics.
Why doesn't the OTel Collector fix this?
The OpenTelemetry Collector cannot fix this because it sits downstream from the application. The SDK (running inside your app) aggregates the data before sending it to the Collector. By the time the Collector receives the data, the overflow aggregation has already happened.
Does this affect all SDK languages?
Yes, most production-ready OpenTelemetry SDKs (including Java, Go, Python, and JavaScript) implement this specification-defined safety limit. However, the default numeric value (2,000) and the exact configuration method (Environment Variable vs. View API) may vary slightly between language implementations.