Monitor and alert best practices

Datadog offers many options for monitoring and alerting. This article covers common issues and best practices to help you get the most out of them.

 

Simple Alert

When setting up a monitor on a metric, by default the monitor will look at the aggregate of the metric over all the hosts reporting it. This is called a Simple Alert.

You can choose to aggregate the metric using one of the following methods: average, min, max, or sum. This is called "space aggregation": the chosen method is applied across all of the timeseries for the metric, combining them into a single value.
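
To make space aggregation concrete, here is a minimal Python sketch. It does not call Datadog; the host names, values, and function names are hypothetical, and it only illustrates how one value per host collapses into the single value that a Simple Alert compares against a threshold.

```python
from statistics import mean

# Hypothetical latest values of one metric, keyed by the host reporting it.
latest_by_host = {"web-1": 40.0, "web-2": 55.0, "web-3": 85.0}

# The four space aggregation methods named in this article.
AGGREGATORS = {"avg": mean, "min": min, "max": max, "sum": sum}


def space_aggregate(values_by_host, method):
    """Collapse one value per host into a single value for the whole scope."""
    return AGGREGATORS[method](values_by_host.values())


def simple_alert(values_by_host, method, threshold):
    """A Simple Alert evaluates the aggregated value only once."""
    return space_aggregate(values_by_host, method) > threshold


print(space_aggregate(latest_by_host, "avg"))   # 60.0
print(simple_alert(latest_by_host, "avg", 80))  # False
print(simple_alert(latest_by_host, "max", 80))  # True
```

Note how the choice of method changes the outcome: the average hides the one busy host, while the max exposes it.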

 

Multi Alert

If you change Simple Alert to Multi Alert, you can set up a monitor on each separate timeseries in a group. For example, you could monitor the system.cpu.user metric for each host individually, rather than monitoring the average of that metric across all the hosts.
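
The difference between the two modes can be sketched in a few lines of Python. This is an illustration only, not Datadog's implementation: the host names, values, threshold, and function names are all made up for the example.

```python
from statistics import mean

# Hypothetical system.cpu.user values (percent), one per host.
cpu_user_by_host = {"web-1": 40.0, "web-2": 55.0, "web-3": 95.0}


def simple_alert_triggered(values_by_host, threshold):
    """Simple Alert: a single evaluation on the average across all hosts."""
    return mean(values_by_host.values()) > threshold


def multi_alert_triggered(values_by_host, threshold):
    """Multi Alert: one evaluation per host; returns the hosts that breach."""
    return sorted(
        host for host, value in values_by_host.items() if value > threshold
    )


print(simple_alert_triggered(cpu_user_by_host, 90))  # False
print(multi_alert_triggered(cpu_user_by_host, 90))   # ['web-3']
```

Here the averaged monitor stays quiet because two idle hosts pull the mean down, while the per-host evaluation flags web-3.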

 

For more info on simple vs. multi alerts, check out this article.

 

Monitoring a sparse metric

Metrics that are reported infrequently, such as AWS CloudWatch metrics, need a larger evaluation timeframe. For example, suppose you want to monitor the CPU utilization of your RDS instances.

With the timeframe set to 5 minutes, Datadog warns that it is too small a timeframe given the sparseness of this particular metric. It is best to set the timeframe to the suggested one to ensure the monitor will work as expected.
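
The following Python sketch shows why a short timeframe is a problem for a sparse metric. It is only an illustration under stated assumptions: the timestamps, the 10-minute reporting gap, and the `min_points` rule are hypothetical, and this is not how Datadog computes its suggested timeframe.

```python
def points_in_window(timestamps, now, window_seconds):
    """Return the data points that fall inside the evaluation window."""
    return [t for t in timestamps if now - window_seconds < t <= now]


def has_enough_data(timestamps, now, window_seconds, min_points=2):
    """Assumed rule of thumb: require at least min_points to evaluate."""
    return len(points_in_window(timestamps, now, window_seconds)) >= min_points


# Hypothetical sparse metric: one point every 10 minutes (in seconds).
sparse_timestamps = [0, 600, 1200]
now = 1700

print(points_in_window(sparse_timestamps, now, 300))   # []
print(points_in_window(sparse_timestamps, now, 1800))  # [0, 600, 1200]
print(has_enough_data(sparse_timestamps, now, 300))    # False
print(has_enough_data(sparse_timestamps, now, 1800))   # True
```

With a 5-minute window the monitor has nothing to evaluate at this moment, while a 30-minute window always contains several points.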

 

Complex metric monitors using the Source tab

You can monitor more complex metric queries by navigating to the Source tab. For example, you may want to be alerted when the NTP offset on any host is greater than 100 seconds or less than -100 seconds. You can use the absolute value function in the Source tab of the monitor to achieve this with one monitor.
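
As a rough sketch of the logic, the Python below applies an absolute value before comparing against a single threshold, so positive and negative drift are both caught by one check. The host names and offsets are hypothetical, and the query string in the comment is an assumed shape rather than a verified query, so adapt the metric name and window to your own setup.

```python
# Assumed shape of a Source tab query for this case (not verified here):
#   avg(last_5m):abs(avg:ntp.offset{*} by {host}) > 100

# Hypothetical NTP offsets in seconds, one per host.
ntp_offset_by_host = {
    "db-1": 0.5,
    "db-2": -150.0,
    "db-3": 120.0,
    "db-4": -99.0,
}


def hosts_with_bad_offset(offsets_by_host, threshold=100.0):
    """Return hosts whose offset magnitude exceeds the threshold."""
    return sorted(
        host
        for host, offset in offsets_by_host.items()
        if abs(offset) > threshold
    )


print(hosts_with_bad_offset(ntp_offset_by_host))  # ['db-2', 'db-3']
```

Without the absolute value you would need two monitors, one for offsets above 100 and one for offsets below -100.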

 

Cluster monitoring 

You can use cluster-level service monitoring to be alerted only when a certain percentage of all checks have failed. This can be helpful if, for example, you want to be alerted when 30% of all your hosts have stopped reporting, but not if one or two have gone down. Read more about cluster-level service monitoring in this article.
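
The percentage rule can be sketched in Python as follows. The host names and statuses are hypothetical, and treating exactly 30% as a breach (using `>=`) is an assumption made for this example, not documented Datadog behavior.

```python
def failed_percentage(check_ok_by_host):
    """Percentage of hosts whose check is currently failing."""
    if not check_ok_by_host:
        return 0.0
    failed = sum(1 for ok in check_ok_by_host.values() if not ok)
    return 100.0 * failed / len(check_ok_by_host)


def cluster_alert(check_ok_by_host, threshold_pct=30.0):
    """Alert only when the failing share reaches the threshold."""
    return failed_percentage(check_ok_by_host) >= threshold_pct


# Hypothetical cluster of 10 hosts; True means the check passes.
two_down = {f"host-{i}": i >= 2 for i in range(10)}
three_down = {f"host-{i}": i >= 3 for i in range(10)}

print(failed_percentage(two_down), cluster_alert(two_down))      # 20.0 False
print(failed_percentage(three_down), cluster_alert(three_down))  # 30.0 True
```

Two failing hosts out of ten stay below the threshold, so no alert fires; the third failure crosses it.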

 

False alerts or false negatives

If a monitor is alerting (or not alerting) incorrectly, you may have an NTP offset issue on one or more hosts covered by the monitor. You can read more about this issue and how to fix it in this article.
