Why are histogram stats all the same or inaccurate? Characteristics of Datadog histograms

More info about the histogram metric type: https://help.datadoghq.com/hc/en-us/articles/205638045-What-is-the-histogram-metric-type-

Characteristics of Datadog histograms

Percentiles and other histogram statistics are computed by each datadog-agent every 10 seconds (for configuration, see here).

So you cannot get the 95th percentile of metric values over the past X hours. Instead, you get a 95th percentile metric that is:

  1. computed by each datadog-agent (on a host-by-host basis).
  2. computed on a fixed 10-second interval.

More details below.

Global percentile beta

Datadog now offers global percentiles across your hosts and tags: percentile statistics are computed on the backend across your whole infrastructure rather than agent by agent.

More details about the beta here: https://help.datadoghq.com/hc/en-us/articles/115005362583.

If limitation 1 outlined above is a pain point for your organization, or if you're interested in the beta, please drop a note to support@datadoghq.com to have it enabled for your organization!

More details on limitation 1: what the global percentile beta solves

Without the beta activated, you cannot get the overall 95th percentile, average, etc. of the request time across hosts.

Example: host-by-host aggregation makes the average of the averages different from the global average you may be expecting.

When you have several hosts reporting histogram metrics, Datadog aggregates their data but cannot reconstruct global percentiles, averages, etc.

For instance, if you graph avg:response_time.avg{*}, our system will list all sources (a source is a unique combination of host and tags) reporting this metric. Let's use an example where data is seen coming from:

  1. host:X, env:live, request:A
  2. host:Y, env:live, request:A
  3. host:Z, env:live, request:A
  4. host:Z, env:live, request:B

 

Since the avg: aggregation has been chosen, the graph will report:
(value(source1) + value(source2) + value(source3) + value(source4)) / 4

Example with data sets source by source: source1 {10, 10, 10}, source2 {1}, source3 {1}, source4 {1}.
response_time.avg{source1} will be 10; for the other sources it will be 1.
Datadog will return (10 + 1 + 1 + 1)/4 = 3.25, which is different from the global average over all six values: (10 + 10 + 10 + 1 + 1 + 1)/6 = 5.5.

Whether the histogram value of source 1 was computed on 100 statsd values or 1, it will have the same weight in the calculation, hence the difference with a centralized aggregation system.
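The arithmetic above can be reproduced with a short Python sketch. It is only an illustration of the weighting effect, not Datadog's actual aggregation code; the variable names are made up for this example.

```python
from statistics import mean

# Raw values received per source during one flush interval
# (the same data sets as in the example above).
sources = {
    "source1": [10, 10, 10],
    "source2": [1],
    "source3": [1],
    "source4": [1],
}

# What each agent would report as response_time.avg for its source.
per_source_avg = {name: mean(values) for name, values in sources.items()}

# The avg: aggregation gives every source the same weight,
# no matter how many raw values it was computed from.
avg_of_avgs = mean(list(per_source_avg.values()))

# A centralized system would average all raw values together.
all_values = [v for values in sources.values() for v in values]
global_avg = mean(all_values)

print(avg_of_avgs)  # 3.25
print(global_avg)   # 5.5
```

Adding more raw values to source1 would move the global average but leave the average of averages unchanged, which is exactly the weighting issue described above.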

More details on limitation 2: the 10 second flush interval can make all histogram stats equal

Our histograms can give you an idea of how the 10-second 95th percentile of your host request time evolves, but this granularity makes it impossible to get the 95th percentile over any time period other than 10 seconds.

Every 10 seconds, the datadog-agent reviews all the histogram dogstatsd packets it has received, computes statistics (max/avg/95percentile/count/median) and sends the resulting statistics as metrics to Datadog.
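To see why per-flush percentiles cannot be recombined into a percentile over a longer period, here is a rough simulation in Python. The sample data is hypothetical, and the nearest-rank percentile used here is an assumption chosen for simplicity; the agent's exact percentile method may differ.

```python
import math
from statistics import mean

FLUSH_INTERVAL = 10  # seconds, the flush interval described above

def p95(values):
    """Nearest-rank 95th percentile (illustrative assumption)."""
    ordered = sorted(values)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]

# Hypothetical (timestamp_in_seconds, response_time) samples on one host:
# one slow request in the first window, 19 fast ones in the second.
samples = [(3, 100)] + [(10 + i * 0.5, 1) for i in range(19)]

# Group the samples by 10-second flush window.
windows = {}
for ts, value in samples:
    windows.setdefault(int(ts // FLUSH_INTERVAL), []).append(value)

# One 95th percentile per flush, as the agent would report it.
per_window_p95 = [p95(vals) for _, vals in sorted(windows.items())]

# The 95th percentile over the whole 20-second period.
overall_p95 = p95([value for _, value in samples])

print(per_window_p95)        # [100, 1]
print(mean(per_window_p95))  # 50.5
print(overall_p95)           # 1
```

Neither the average nor the max of the per-window values matches the 95th percentile of the full period, because each window's statistic ignores how many values went into it.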

Example: if only 1 histogram value (value = 1) with the histogram name response_time is sent to the dogstatsd server of the datadog-agent on host:X during a 10-second flush interval, that agent will report the same value to Datadog for response_time.avg / .max / .95percentile / .median = 1.

If you have more than 1 value during the 10-second flush interval, let's say values = {1, 9, 8}, the histogram statistics will start to make sense and differ from one another: .avg = 6, .median = 8, .max = 9, .count = 3 (number of points), etc.
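The two cases above can be checked with a minimal Python sketch of a single flush. The function flush_stats is hypothetical and is not the agent's implementation; in particular, the nearest-rank percentile is an assumption made for the sake of the example.

```python
import math
import statistics

def flush_stats(values):
    """Summarize the histogram values received during one flush interval."""
    ordered = sorted(values)
    # Nearest-rank 95th percentile: an assumption for illustration only.
    rank = math.ceil(95 * len(ordered) / 100)
    return {
        "avg": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "max": ordered[-1],
        "95percentile": ordered[rank - 1],
        "count": len(ordered),  # number of points, as in the text above
    }

# Only one value during the flush: every statistic collapses to it.
single = flush_stats([1])

# Several values during the flush: the statistics start to differ.
several = flush_stats([1, 9, 8])

print(single)
print(several)
```

With a single value, avg, median, max and 95percentile are all 1. With {1, 9, 8}, the sketch gives avg 6, median 8, max 9 and count 3, matching the figures above.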

 
