Sampling Strategies in Observability

Balancing data collection is critical in system monitoring. Collect too much, and you risk an overflow of information and budget overrun. Collect too little, and you could miss important errors, leaving critical gaps.

Much has been written about strategies for collecting telemetry data, but it was during a presentation on common pitfalls in observability that one of our technical leaders, Dan Massey, distilled it into three concise points that offer a useful perspective:

  1. Collect 100% of error telemetry: This ensures no potential system failures are overlooked.
  2. Stay within budget for remaining telemetry: Beyond errors, choose additional data to collect wisely within the budget.
  3. Re-evaluate if errors exceed your budget: If the error volume is too high, it's a red flag indicating systemic issues that go beyond mere data collection.

Dan also stressed the importance of an organization-wide strategy for data collection, for a holistic view of system health. There's nothing more frustrating than a distributed trace with missing spans, especially when the source of an issue lies within the missing data.

In this article, we will delve into different data sampling strategies, with a focus on traces and logs. The goal is intelligent data collection decisions that provide the necessary data for understanding our distributed system without exceeding the allocated budget.

We'll explore three sampling strategies:

  • Fixed Probability Sampling
  • Adaptive Sampling
  • Tail-based Sampling

Fixed Probability Sampling

Fixed Probability Sampling, often referred to as consistent or constant sampling, is a straightforward method of data sampling in system observability. This strategy involves collecting a fixed fraction of telemetry data, such as traces or logs, without considering the data's type or value.

In a Fixed Probability Sampling strategy, each trace has an equal chance of being selected, based on a pre-set sampling rate. For instance, you might decide to collect 10% of traces, randomly distributed across all requests. This strategy ensures a broad, representative sample of your system's performance and behavior, making it straightforward to understand the overall distribution of requests that resulted in errors or latency.
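To make the mechanics concrete, here is a minimal sketch, in the spirit of (but not identical to) OpenTelemetry's trace-ID-ratio sampler, of how a fixed probability becomes a deterministic per-trace decision: the low bits of the trace ID are compared against a threshold derived from the sampling ratio, so every service that sees the same trace ID reaches the same verdict.

```java
import java.util.Random;

// Illustrative sketch only, not the actual OpenTelemetry implementation.
public class FixedProbabilitySampler {
    private final long threshold;

    public FixedProbabilitySampler(double ratio) {
        // Map the ratio onto the space of non-negative 63-bit values.
        this.threshold = (long) (ratio * Long.MAX_VALUE);
    }

    // Sample when the low 63 bits of the trace ID fall under the threshold.
    // Deterministic: the same trace ID always yields the same decision.
    public boolean shouldSample(long traceIdLowBits) {
        return (traceIdLowBits & Long.MAX_VALUE) < threshold;
    }

    public static void main(String[] args) {
        FixedProbabilitySampler sampler = new FixedProbabilitySampler(0.10);
        Random random = new Random(42);
        int sampled = 0;
        int total = 1_000_000;
        for (int i = 0; i < total; i++) {
            if (sampler.shouldSample(random.nextLong())) {
                sampled++;
            }
        }
        // Roughly 10% of randomly generated trace IDs are kept.
        System.out.println("sampled fraction = " + (double) sampled / total);
    }
}
```

Because the decision is a pure function of the trace ID, services need no coordination beyond propagating the ID itself.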

The primary advantage of fixed probability sampling is its predictability. The fixed sampling rate allows you to estimate the volume of data you'll collect, which aids in managing storage and analysis costs.
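A back-of-the-envelope estimate makes that predictability concrete. The numbers below (5 million requests per day, about 8 spans per trace) are purely hypothetical:

```java
public class VolumeEstimate {
    // Expected spans retained per day under a fixed sampling rate.
    public static long estimatedSpansPerDay(long requestsPerDay,
                                            double avgSpansPerTrace,
                                            double samplingRate) {
        return Math.round(requestsPerDay * avgSpansPerTrace * samplingRate);
    }

    public static void main(String[] args) {
        // Hypothetical: 5M requests/day, ~8 spans per trace, 10% sampling.
        long spans = estimatedSpansPerDay(5_000_000L, 8.0, 0.10);
        System.out.println(spans + " spans/day"); // prints "4000000 spans/day"
    }
}
```

Multiply that by your average span size and retention period, and the storage bill follows directly from the sampling rate.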

However, a key limitation of Fixed Probability Sampling is its lack of prioritization. Important error traces could be missed if they fall within the unsampled portion of data, making this strategy less effective in error-prone environments.

Fixed Probability Sampling decisions are typically made at the service entry point, where the trace is initially created. The sampling decision and rate are then propagated to other services through headers in the request, commonly referred to as the trace context. All services participating in a trace use this context to decide whether to sample their spans or not.
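As an illustration of that propagation (a simplified sketch, not OpenTelemetry's actual propagator), the W3C Trace Context `traceparent` header carries the upstream sampling decision in its trailing flags byte, where bit 0x01 means "sampled":

```java
// Sketch of how a downstream service might honor an upstream sampling
// decision carried in the W3C `traceparent` header
// (format: version-traceid-parentid-flags, flags bit 0x01 = sampled).
public class TraceContext {
    // Returns true when the upstream caller sampled this trace.
    public static boolean isSampled(String traceparent) {
        String[] parts = traceparent.split("-");
        if (parts.length != 4) {
            return false; // malformed header: treat as not sampled
        }
        int flags = Integer.parseInt(parts[3], 16);
        return (flags & 0x01) == 0x01;
    }

    public static void main(String[] args) {
        String sampled = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
        String dropped = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-00";
        System.out.println(isSampled(sampled)); // true
        System.out.println(isSampled(dropped)); // false
    }
}
```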

There is a detailed write-up on how OpenTelemetry encodes sampling decisions into requests here: https://opentelemetry.io/docs/specs/otel/trace/tracestate-probability-sampling

Configuring Fixed Probability Sampling is straightforward. For instance, using the OpenTelemetry SDK in Java, the code would look something like the following:

import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.samplers.Sampler;

OpenTelemetrySdk openTelemetrySdk =
    OpenTelemetrySdk.builder()
        .setTracerProvider(
            SdkTracerProvider.builder()
                .setSampler(Sampler.traceIdRatioBased(0.1))
                .build())
        .build();

This code sets the sampling rate at 10%. In real-world applications, this value of 0.1 should be externalized to a configuration, and the exporter needs to be set up to send the traces to a collector backend.

Ensuring that all services follow the sampling decision is crucial to obtaining a complete picture of a request as it flows through the system. If one or two services don't adhere to the sampling decision, it can lead to frustrating debugging sessions.

When used wisely, Fixed Probability Sampling can effectively strike a balance between data comprehensiveness and budget constraints. However, it's vital to keep its limitations in mind when dealing with error telemetry and adjust your strategies accordingly.

Adaptive Sampling

Unlike Fixed Probability Sampling, Adaptive Sampling is a dynamic sampling strategy that adjusts the sampling rate based on the volume of incoming requests and the types of traces. This strategy is designed to ensure that the most relevant data is sampled and collected for observability.

Adaptive Sampling is a sophisticated technique that modifies sampling rates based on specific conditions. In high-traffic scenarios, it may lower the sampling rate to stay within the data budget. However, it can also be set to raise the sampling rate for certain types of requests known to be error-prone or those where the impact of an error is high. This allows for a more efficient allocation of your data budget by prioritizing the collection of high-value data.

A considerable benefit of Adaptive Sampling is its ability to fine-tune data collection based on system behavior. By reducing sampling during high-traffic periods, it avoids data overload and keeps the focus on crucial information. Similarly, by increasing sampling for error-prone or high-impact requests, it ensures more detailed information is available for these critical scenarios. This flexibility is particularly valuable for catching and diagnosing intermittent issues or those that occur under unique conditions.

While Adaptive Sampling offers more control over data volume, it presents its own challenges. Understanding the distribution of conditions across requests, such as distinguishing error rates from normal requests, can be complicated due to the shifting sampling rate.

Like Fixed Probability Sampling, Adaptive Sampling decisions are made at the service entry point, so this strategy may still overlook some error conditions. Decisions are made at the start of a request, without the complete picture that only emerges once the full request and its associated traces can be analyzed.

To implement Adaptive Sampling, you would typically employ an observability framework that supports it. To my knowledge, OpenTelemetry doesn't ship an adaptive sampler out of the box. However, one might create a custom sampler that looks something like the following C# snippet.

public override SamplingResult ShouldSample(in SamplingParameters samplingParameters)
{
    var now = DateTime.UtcNow;
    if (now > lastSamplingTime.AddSeconds(1))
    {
        // Adjust the sampling rate based on the last second's data
        if (tracesThisSecond > 0)
        {
            currentSamplingRate *= targetTracesPerSecond / (double)tracesThisSecond;
        }
        currentSamplingRate = Math.Min(Math.Max(currentSamplingRate, 0.0), 1.0);

        // Reset for the next second
        lastSamplingTime = now;
        tracesThisSecond = 0;
    }

    // Honor an upstream "sampled" decision; otherwise roll against the current rate
    bool isSampled = samplingParameters.ParentContext.TraceFlags.HasFlag(ActivityTraceFlags.Recorded) ||
                     Random.Shared.NextDouble() < currentSamplingRate;

    if (isSampled)
    {
        // Count only the traces we keep, so the feedback loop compares
        // sampled throughput against the target
        tracesThisSecond++;
    }

    var decision = isSampled ? SamplingDecision.RecordAndSample : SamplingDecision.Drop;

    return new SamplingResult(decision);
}

This ShouldSample function, part of an AdaptiveSampler class, keeps track of how many traces have been sampled in the current second and adjusts its sampling rate to try to meet a target traces-per-second rate. To take this further, one might want to record the current sampling rate in the trace itself so this information can be retained for later investigation.
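The feedback loop at the heart of that sampler can be isolated into a few lines and simulated. This standalone sketch (with hypothetical traffic and target numbers) shows why scaling the rate by target/actual each second pulls the sampled throughput toward the target:

```java
// Standalone sketch of the rate-adjustment math: each "second" the rate
// is scaled by target/actual sampled traces, then clamped to [0, 1].
public class AdaptiveRate {
    public static double adjust(double currentRate, int sampledThisSecond, int targetPerSecond) {
        if (sampledThisSecond > 0) {
            currentRate *= targetPerSecond / (double) sampledThisSecond;
        }
        return Math.min(Math.max(currentRate, 0.0), 1.0);
    }

    public static void main(String[] args) {
        double rate = 1.0;              // start by sampling everything
        int arrivalsPerSecond = 1000;   // hypothetical steady traffic
        int target = 100;               // aim to keep ~100 traces/second
        for (int second = 0; second < 5; second++) {
            int sampled = (int) Math.round(arrivalsPerSecond * rate);
            rate = adjust(rate, sampled, target);
            System.out.printf("second %d: rate -> %.3f%n", second, rate);
        }
    }
}
```

With steady traffic of 1,000 arrivals per second and a target of 100, the rate settles at 0.1 after the first adjustment and stays there.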

Adaptive Sampling is an advanced tool that can maximize the efficiency of your data collection when applied correctly. However, it's important to consider its complexities and the expertise required to implement it effectively.

Tail-based Sampling

Tail-based Sampling, unlike the head-based strategies of Fixed Probability Sampling and Adaptive Sampling, makes decisions on whether to keep a trace based on the entirety of the trace data. It evaluates the data after all the spans in a trace are finished and have been collected in a central place. This means it can make decisions based on overall trace conditions such as whether an error occurred, or if the request had a high latency.

The benefit of Tail-based Sampling is the assurance that every trace collected will be useful in understanding and diagnosing issues. It avoids missing critical error conditions that could be overlooked with head-based sampling techniques, which make decisions at the start of a request.

However, Tail-based Sampling does have its own challenges. It typically requires more memory and processing power because all spans must be stored temporarily before making a sampling decision. It may also introduce latency into the data pipeline due to the wait time to gather all spans and make a decision.
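To make the idea concrete, here is a minimal sketch of such a decision function (a simplification; real collectors implement this as configurable policies): once a trace's spans have been assembled, keep the trace if any span errored or if the end-to-end latency crossed a threshold.

```java
import java.util.List;

public class TailDecision {
    public record Span(long startMillis, long endMillis, boolean error) {}

    // Decide after the whole trace is assembled, using information no
    // head-based sampler could have had at request start.
    public static boolean keepTrace(List<Span> spans, long latencyThresholdMillis) {
        long start = Long.MAX_VALUE;
        long end = Long.MIN_VALUE;
        for (Span span : spans) {
            if (span.error()) {
                return true; // always keep traces containing an error
            }
            start = Math.min(start, span.startMillis());
            end = Math.max(end, span.endMillis());
        }
        return !spans.isEmpty() && (end - start) > latencyThresholdMillis;
    }

    public static void main(String[] args) {
        List<Span> fastAndClean = List.of(new Span(0, 300, false), new Span(100, 250, false));
        List<Span> slow = List.of(new Span(0, 2500, false));
        List<Span> failed = List.of(new Span(0, 50, true));

        System.out.println(keepTrace(fastAndClean, 2000)); // false
        System.out.println(keepTrace(slow, 2000));         // true
        System.out.println(keepTrace(failed, 2000));       // true
    }
}
```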

Here's an example of how to configure Tail-based Sampling using the OpenTelemetry Collector. The configuration includes three policies:

  • A latency policy that includes all traces with a latency over 2 seconds
  • A status code policy that includes all traces where the status code was set to ERROR or UNSET
  • A rate-limiting policy that collects all other traces but limits the number of spans to 50 per second

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100
    expected_new_traces_per_sec: 10
    policies:
      [
        {
          name: slow-requests,
          type: latency,
          latency: {threshold_ms: 2000}
        },
        {
          name: error-requests,
          type: status_code,
          status_code: {status_codes: [ERROR, UNSET]}
        },
        {
          name: normal-requests,
          type: rate_limiting,
          rate_limiting: {spans_per_second: 50}
        }
      ]

In this configuration, the Collector will wait 10 seconds after receiving the first span in a trace, before making a decision. This allows time for the spans from all services to arrive. It will store up to 100 traces for sampling decision-making, expecting around 10 new traces per second (these settings allow the collector to optimize how it allocates memory). It then applies the three policies in order, limiting the final policy to 50 spans per second (a trace typically has multiple spans).

Tail-based Sampling, while requiring more resources, can be an invaluable tool in an observability strategy, ensuring the collection of crucial data for system understanding and problem-solving.

Conclusion

In system observability, collecting telemetry data is a crucial, yet complex task. It requires a fine balance of gathering adequate information for problem detection and diagnosis, while also staying within storage and processing resource limits. The strategies discussed in this article – Fixed Probability Sampling, Adaptive Sampling, and Tail-based Sampling – are essential tools in achieving this balance.

However, these are not the only strategies available. There are other methods and combinations that can be considered and explored. The choice of strategy will depend on the nature of your system, the criticality of various conditions, the volume of traffic, and the resources available for data storage and analysis.

It's crucial to remember that observability is not an afterthought but an integral part of system design. It shouldn't be left up to individual developers to make independent decisions. Instead, an organization-wide strategy for data collection should be in place to provide a holistic and comprehensive view of system health. We should make it as easy as possible for developers to do the right thing.

Strategically planned and intelligently implemented observability allows us to understand our distributed systems better. It ensures we have the necessary data for diagnosing issues without exceeding our data budget. As your system evolves and grows, so should your observability strategy. Always keep iterating and improving, just like the systems you monitor.

If you enjoyed this article, please subscribe to the newsletter and share with your network here on LinkedIn.
