Maintaining service level agreements with external or internal customers often involves measuring uptime percentage. In this article we'll cover how to do this using Datadog's HTTP check and the Query Value widget.
Unlike other monitoring platforms or vendors that offer such services, Datadog's HTTP check is not limited to checking external web pages or endpoints. This is because the HTTP check is one of the agent-side integrations, meaning the checks are conducted by a Datadog agent installed on a host inside your environment. This gives it the ability to probe endpoints that you may not be exposing publicly (much like a layer 7 load balancer would). And like other agent-side integrations, the HTTP check is configured with a simple YAML file, which makes it easy to provision using configuration management.
An example of such a check would be:
- name: Amazon
Adding this content to the /etc/datadog-agent/conf.d/http_check.d/conf.yaml file and restarting the agent service instructs the agent to begin collecting metrics on that endpoint, in this case Amazon's public webpage. A full list of available options and metrics can be found here, but for the purposes of this article we will focus on one metric in particular: network.http.can_connect, which returns 1 when we receive a valid response and 0 when we do not. Notice I've used a couple of optional parameters as well: a timeout of 3 seconds (meaning return 0 if it takes longer than 3 seconds for the response to arrive) and a couple of optional tags for customer name and category.
In Datadog's Metrics explorer this would appear as the following timeseries graph.
While this view is not particularly useful, it validates that we've successfully implemented the check and that we're getting valid responses from the target URL (amazon.com is returning 200s or 300s HTTP responses within the imposed 3 second timeout).
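To make the 1-or-0 behaviour concrete, here is a minimal Python sketch of the decision described above. It is not the agent's actual code: the function name `can_connect` and its parameters are hypothetical, and it assumes a response counts as valid when it carries a 2xx or 3xx status and arrives within the timeout.

```python
from typing import Optional


def can_connect(status_code: Optional[int],
                elapsed_seconds: float,
                timeout: float = 3.0) -> int:
    """Return 1 for a valid response, 0 otherwise (illustrative only)."""
    # No response at all (for example a connection error) is a failure.
    if status_code is None:
        return 0
    # A response slower than the configured timeout is a failure.
    if elapsed_seconds > timeout:
        return 0
    # Assumption: only 2xx and 3xx status codes are treated as valid.
    return 1 if 200 <= status_code < 400 else 0


print(can_connect(200, 0.4))   # fast 200 response
print(can_connect(200, 3.5))   # 200 response, but slower than the timeout
print(can_connect(503, 0.2))   # fast response with a server error
```

Each run of the check contributes one such 1 or 0 to the timeseries, which is what makes the averaging in the following steps meaningful.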
The next step would be to display these values within the Query Value widget where we would see something similar to:
In both cases I have the widgets displaying averages of network.http.can_connect over the last hour. I have it scoped to the URL:https://www.amazon.com/ as it appeared in my conf.yaml.
Now I can begin to make some modifications to turn that simple integer value into an SLO uptime metric.
I begin by clicking on Advanced, which allows me to multiply the average value returned by network.http.can_connect by 100, making it a percentage rather than a ratio. I then hide the original value to show only that of the function. I deselect the Autoscale option, which forces the widget to always display a float with 2 decimals. I use the Custom units option to provide more context by adding the "%" sign. I then extend the widget's timeframe from 1 hour all the way to 1 month. Now that I have the true uptime percentage value with a precision of 2 decimals, I can add some better visuals. To do that I make use of conditional formatting, selecting a red background for any value under 99.99% uptime, which would imply I am not meeting my agreed-upon SLO for that customer.
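The arithmetic behind the widget can be sketched in a few lines of Python. This only illustrates the calculation, not how Datadog implements it; the function names `uptime_percentage` and `meets_slo` and the sample data are made up for the example.

```python
from typing import Sequence


def uptime_percentage(samples: Sequence[int]) -> float:
    """Average the 0/1 check results and scale to a 2-decimal percentage."""
    if not samples:
        raise ValueError("need at least one sample")
    ratio = sum(samples) / len(samples)  # same idea as the widget's average
    return round(ratio * 100, 2)         # multiply by 100, keep 2 decimals


def meets_slo(uptime: float, target: float = 99.99) -> bool:
    """Mirror the conditional formatting: values under the target turn red."""
    return uptime >= target


# Hypothetical month of results: 10,000 checks, 2 of which failed.
samples = [1] * 9998 + [0] * 2
uptime = uptime_percentage(samples)
print(f"{uptime:.2f}%")
print("SLO met" if meets_slo(uptime) else "SLO missed")
```

With two failures out of 10,000 checks the uptime falls just short of a 99.99 target, which is the situation the red background is meant to flag.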
This is a simple example, but it can be extended to many use cases. One example is reporting on a single endpoint from multiple geographically distributed agents, all of which are tasked with checking that endpoint. In the example I selected the URL tag, but I could just as easily have selected an entire category with my "ecommerce" tag, checking all the endpoints that fall under that tag. Tag combinations can be used as well to further enhance functionality, e.g.
The same logic can be applied when configuring monitors to trigger alerts when the desired SLOs are not met. All of the advanced functionality used in this widget is also available to you in Datadog's metric monitors.
The JSON for the widget in the animated GIF above is: