If your Monitors are alerting when you don't expect them to, or not alerting when you think they should, try the following to troubleshoot the issue:
- Make sure the clocks on all hosts are in sync with NTP. Datadog allows points up to 10 minutes in the future, but values that report too far into the future throw off the calculation. We use the "latest timestamp" across all contexts as the basis for "now" in our aggregation (except when we calculate no-data). This lets us handle small offsets in either direction (future or past) without producing incorrect values. If the offset is too large, you may see false alerts. If you're using a multi-alert and one host out of n is skewed, it causes issues for all monitored instances.
- Do you see any gaps in data when looking at a host that was sending false positives? Or is the latest point older than you expect?
- Do the Agent logs contain any errors?
Sending us the results of these steps can help us get to the bottom of your issue more quickly!
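The timestamp checks above can be scripted once you have the most recent point per host. The following is a minimal sketch, not a Datadog API: `classify_latest_points`, the host names, and both thresholds (`tolerance`, `max_age`) are hypothetical, and it assumes you have already collected the latest timestamps yourself.

```python
from datetime import datetime, timedelta, timezone


def classify_latest_points(latest_by_host, reference,
                           tolerance=timedelta(seconds=30),
                           max_age=timedelta(minutes=5)):
    """Label each host by comparing its latest timestamp with `reference`.

    Both thresholds are illustrative defaults, not Datadog settings.
    """
    report = {}
    for host, ts in latest_by_host.items():
        offset = ts - reference  # positive means the point is in the future
        if offset > tolerance:
            report[host] = "ahead"   # clock is probably running fast
        elif offset < -max_age:
            report[host] = "stale"   # gap in data, or clock running slow
        else:
            report[host] = "ok"
    return report


# Hypothetical data: one healthy host, one skewed ahead, one lagging.
reference = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
latest = {
    "web-1": reference - timedelta(seconds=20),
    "web-2": reference + timedelta(minutes=4),
    "web-3": reference - timedelta(minutes=12),
}
report = classify_latest_points(latest, reference)
```

A host labeled "ahead" is a candidate for clock skew, while "stale" points to either a gap in data or a clock that is running behind.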
If you're experiencing issues due to clock skew, contact Datadog support: on request, we can enable a feature flag for your Datadog organization that changes this logic. With the flag enabled, if any contexts are reporting in the future, we default to the current time in UTC as the "now" value. This makes the alert work as intended, but it means that the skewed hosts won't be evaluated correctly, as they'd be evaluated "in the past".
To have those hosts evaluated correctly, fix their clocks and make sure they stay in sync with NTP.
Agent 5.3 or greater checks NTP by default to help identify this issue sooner.
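
To make the two ways of choosing "now" concrete, here is an illustrative sketch. It is not Datadog's implementation: the function names, the `use_wall_clock_if_future` parameter standing in for the feature flag, and the rule that only points inside `[now - window, now]` are counted are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def evaluation_now(latest_by_context, wall_clock, use_wall_clock_if_future=False):
    """Pick the "now" used as the end of the evaluation window.

    Default: the latest timestamp across all contexts.
    With the (hypothetical) flag: fall back to the wall clock whenever
    any context reports a timestamp in the future.
    """
    latest = max(latest_by_context.values())
    if use_wall_clock_if_future and latest > wall_clock:
        return wall_clock
    return latest


def contexts_in_window(latest_by_context, now, window):
    """Contexts whose latest point falls inside [now - window, now]."""
    start = now - window
    return sorted(c for c, ts in latest_by_context.items() if start <= ts <= now)


# Hypothetical multi-alert: host-c's clock runs 8 minutes fast.
wall_clock = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
latest = {
    "host-a": wall_clock - timedelta(seconds=30),
    "host-b": wall_clock - timedelta(seconds=20),
    "host-c": wall_clock + timedelta(minutes=8),
}
window = timedelta(minutes=5)

default_now = evaluation_now(latest, wall_clock)
flagged_now = evaluation_now(latest, wall_clock, use_wall_clock_if_future=True)

# Default: the window ends at 12:08, so only the skewed host has points in it.
default_visible = contexts_in_window(latest, default_now, window)
# Flag: the window ends at 12:00, so the healthy hosts are covered instead.
flagged_visible = contexts_in_window(latest, flagged_now, window)
```

In this example the default logic shifts the window forward by the skew of a single host, which is why one bad clock can affect every instance in a multi-alert. With the fallback, the healthy hosts are evaluated normally and only the skewed host falls outside the window.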