# Datadog

## Backend

Using the `datadog` backend class, you can query any metrics available in Datadog to create an SLO.

```yaml
backends:
  datadog:
    api_key: ${DATADOG_API_KEY}
    app_key: ${DATADOG_APP_KEY}
```
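The `${DATADOG_API_KEY}` and `${DATADOG_APP_KEY}` values are read from environment variables, so the keys themselves stay out of the config file.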

The following methods are available to compute SLOs with the `datadog` backend:

* `good_bad_ratio` for computing good / bad metrics ratios.
* `query_sli` for computing SLIs directly with Datadog.
* `query_slo` for getting the SLO value from the Datadog SLO endpoint.

Optional arguments to configure Datadog are documented in the Datadog `initialize` method. You can pass them in the `backend` section, e.g. setting `api_host: api.datadoghq.eu` in order to use the EU site.
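For instance, a backend block targeting the EU site could look like the following (a minimal sketch; `api_host` is the only optional argument shown):

```yaml
backends:
  datadog:
    api_key: ${DATADOG_API_KEY}
    app_key: ${DATADOG_APP_KEY}
    api_host: api.datadoghq.eu  # optional: point the client at the EU site
```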

### Good / bad ratio

The `good_bad_ratio` method is used to compute the ratio between two metrics:

* Good events, i.e. events we consider 'good' from the user perspective.
* Bad or valid events, i.e. events we consider either 'bad' from the user perspective, or all events we consider 'valid' for the computation of the SLO.

This method is often used for availability SLOs, but can be used for other purposes as well (see examples).
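As a quick worked example: if the good-events query counts 980 requests over the measurement window and the valid-events query counts 1,000, the SLI is 980 / 1000 = 0.98 (98% availability).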

Config example:

```yaml
backend: datadog
method: good_bad_ratio
service_level_indicator:
  filter_good: app.requests.count{http.path:/, http.status_code_class:2xx}
  filter_valid: app.requests.count{http.path:/}
```
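With these filters, the SLI is the share of requests to `/` that returned a `2xx` status code.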

Full SLO config

### Query SLI

The `query_sli` method is used to directly query the needed SLI with Datadog: Datadog's query language is powerful enough to compute ratios natively.

This method is more flexible, as it allows expressing any Datadog SLI computation directly, and it can reduce the number of queries made to Datadog.

```yaml
backend: datadog
method: query_sli
service_level_indicator:
  expression: sum:app.requests.count{http.path:/, http.status_code_class:2xx} / sum:app.requests.count{http.path:/}
```
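Note that the expression should evaluate to a single value between 0 and 1, since its result is used directly as the SLI.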

Full SLO config

### Query SLO

The `query_slo` method is used to directly query the needed SLO from Datadog: Datadog has native SLO objects that you can refer to in your config by their `slo_id`.


To query the value of a Datadog SLO, simply add a `slo_id` field in the `service_level_indicator` section:

```yaml
backend: datadog
method: query_slo
service_level_indicator:
  slo_id: ${DATADOG_SLO_ID}
```
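The id can also be hard-coded instead of injected from the environment (the id below is hypothetical):

```yaml
backend: datadog
method: query_slo
service_level_indicator:
  slo_id: 1a2b3c4d5e6f7a8b9c0d1e2f  # hypothetical Datadog SLO id
```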

Full SLO config

## Examples

Complete SLO samples using `datadog` are available in `samples/datadog`. Check them out!

## Exporter

The `datadog` exporter allows exporting SLO metrics to the Datadog API.

```yaml
exporters:
  datadog:
    api_key: ${DATADOG_API_KEY}
    app_key: ${DATADOG_APP_KEY}
```

Optional arguments to configure Datadog are documented in the Datadog `initialize` method. You can pass them in the `exporters` section, e.g. setting `api_host: api.datadoghq.eu` in order to use the EU site.

Optional fields:

* `metrics`: [optional] `list` - List of metrics to export (see docs).
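As a sketch, an exporter block restricted to a subset of metrics might look like this (the metric names shown are illustrative assumptions, not an authoritative list):

```yaml
exporters:
  datadog:
    api_key: ${DATADOG_API_KEY}
    app_key: ${DATADOG_APP_KEY}
    metrics:
      - error_budget_burn_rate  # assumption: illustrative metric name
      - sli_measurement         # assumption: illustrative metric name
```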

Full SLO config

## Datadog API considerations

The `distribution_cut` method is not currently implemented for Datadog.

The reason for this is that Datadog distributions (or histograms) do not conform to what histograms should be (see old issue), i.e. a set of configurable bins, each providing the number of events falling into it.

Standard histogram representations (see wikipedia) already implement this, but the approach Datadog took is to pre-compute (client-side) or post-compute (server-side) percentiles, resulting in a different metric for each percentile that holds the percentile value instead of the number of events falling in it.

This implementation has a couple of advantages, like making it easy to query and graph the value of the 99th, 95th, or 50th percentile; but it makes it effectively very hard to compute a standard SLI, since it is not possible to see how many requests fall in each bin, and hence there is no way to know how many good and bad events there are.
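Concretely, a latency SLI such as "requests served under 300 ms" needs the count of events in the bins below 300 ms; knowing only that, say, the 99th percentile is 450 ms does not yield that count.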

Options that could be considered to implement this:

* Implement support for standard histograms where bucketization is configurable and where it is possible to query the number of events falling into each bucket.

OR

* Design an implementation that tries to reconstitute the original distribution by assimilating it to a Gaussian distribution and estimating its parameters. This is a complex and time-consuming approach that will give approximate results, and it is not a straightforward problem (see StackExchange thread).
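As a rough sketch of that last approach (under the strong assumption that latencies are Gaussian): take μ ≈ p50 and σ ≈ (p95 − p50) / 1.645, then estimate the SLI for a threshold T as Φ((T − μ) / σ), where Φ is the standard normal CDF. The Gaussian assumption rarely holds for latency data, which is typically heavy-tailed, so the result would only be approximate.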