With cluster-level service monitoring, you can set alerts that trigger when a given percentage of the servers in a cluster experience availability issues.
Warning and Critical alert thresholds
Datadog lets you set two types of alerts: a Warning alert and a Critical alert. For example, for your web cluster you might set a Warning threshold of 10% and a Critical threshold of 20%. If 10% of your web servers go down, your team automatically gets the Warning alert; if 20% go down, they get the Critical alert.
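To make the two-threshold behavior concrete, here is a minimal sketch in plain Python. It is not Datadog's implementation or API: the function name `alert_level`, its parameters, and the choice to trigger when the percentage reaches the threshold (rather than strictly exceeds it) are assumptions made for illustration.

```python
def alert_level(down, total, warning_pct=10, critical_pct=20):
    """Classify a cluster by the share of hosts that are down.

    Hypothetical helper: thresholds are percentages, and an alert fires
    once the share of down hosts reaches the threshold (an assumption).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    # Compare with integer arithmetic to avoid floating-point edge cases.
    if down * 100 >= critical_pct * total:
        return "CRITICAL"
    if down * 100 >= warning_pct * total:
        return "WARNING"
    return "OK"


# A web cluster of 20 servers with the 10% / 20% thresholds from the text.
for down in (1, 2, 4):
    print(down, "down ->", alert_level(down, 20))
```

With 20 web servers, one server down stays below both thresholds, two down (10%) reaches the Warning threshold, and four down (20%) reaches the Critical threshold.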
Monitor by availability zone, environment, roles, and other groupings
Datadog lets you group your alerts by any combination of tags you set up. If your application runs on AWS, you might want to alert when more than 40% of servers are down in any AWS Availability Zone. In this example, you can trace the problem to the alerting Zone instead of being overwhelmed by the noise of a separate alert for each server that goes down. If you use a configuration management tool like Chef, you may want to set up a role-wide alert: send a Critical alert when 20% of all nodes with the role “hadoop-hdfs” go down.
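The grouping idea can be sketched as follows. This is an illustrative stand-in in plain Python, not Datadog's API: the host records, the tag key `availability-zone`, and the helper names `percent_down_by_tag` and `groups_over_threshold` are all hypothetical.

```python
from collections import defaultdict


def percent_down_by_tag(hosts, tag_key):
    """Return {tag value: percent of hosts down} for one tag key."""
    totals = defaultdict(int)
    down = defaultdict(int)
    for host in hosts:
        group = host["tags"].get(tag_key)
        if group is None:
            continue  # hosts without this tag are not part of any group
        totals[group] += 1
        if not host["up"]:
            down[group] += 1
    return {g: 100.0 * down[g] / totals[g] for g in totals}


def groups_over_threshold(hosts, tag_key, threshold_pct):
    """List the groups where strictly more than threshold_pct are down."""
    pct = percent_down_by_tag(hosts, tag_key)
    return sorted(g for g, p in pct.items() if p > threshold_pct)


# Hypothetical inventory: six web servers across two Availability Zones.
hosts = [
    {"name": "web-1", "up": False, "tags": {"availability-zone": "us-east-1a"}},
    {"name": "web-2", "up": False, "tags": {"availability-zone": "us-east-1a"}},
    {"name": "web-3", "up": True, "tags": {"availability-zone": "us-east-1a"}},
    {"name": "web-4", "up": False, "tags": {"availability-zone": "us-east-1b"}},
    {"name": "web-5", "up": True, "tags": {"availability-zone": "us-east-1b"}},
    {"name": "web-6", "up": True, "tags": {"availability-zone": "us-east-1b"}},
]

print(groups_over_threshold(hosts, "availability-zone", 40))
```

Three individual hosts are down here, but only one Zone crosses the 40% line, so a single grouped alert points at the affected Zone.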
Different groupings can have different alert thresholds. For example, your database cluster might alert at a low percentage of unavailable hosts. Your load balancers, on the other hand, might be much more resilient: most of them could be unavailable before any performance issues are noticed, which justifies a much higher threshold of unavailable hosts before alerting.
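A per-grouping threshold table might look like the sketch below. The role names, the 5% and 60% values, the 20% default, and the helper `breached_roles` are hypothetical choices for illustration; the text does not specify concrete numbers for these clusters.

```python
# Hypothetical Critical thresholds (percent of hosts unavailable) per role.
CRITICAL_THRESHOLD_PCT = {
    "database": 5,        # fragile tier: alert early
    "load-balancer": 60,  # resilient tier: tolerate many unavailable hosts
}


def breached_roles(down_pct_by_role, thresholds, default_pct=20):
    """Return the roles whose unavailable percentage reaches their threshold."""
    return sorted(
        role
        for role, pct in down_pct_by_role.items()
        if pct >= thresholds.get(role, default_pct)
    )


# Assumed observations: 10% of databases and 50% of load balancers are down.
observed = {"database": 10.0, "load-balancer": 50.0}
print(breached_roles(observed, CRITICAL_THRESHOLD_PCT))
```

In this example the database role alerts at 10% unavailable, while the load balancers stay quiet at 50% because their threshold is set much higher.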