AWS CloudWatch Heartbeat Alarm

Anonymous (unverified), submitted 2019-12-03 08:41:19

Question:

I have an app that puts a custom CloudWatch metric to AWS every minute. This is supposed to act as a heartbeat so I know the app is alive.
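A minimal sketch of what such a heartbeat might look like, assuming Python with boto3. The namespace `MyApp` and metric name `Heartbeat` are hypothetical placeholders, not names from the original setup; `put_metric_data` is the standard CloudWatch client call.

```python
def heartbeat_metric_params(namespace="MyApp", metric_name="Heartbeat"):
    """Build the arguments for one heartbeat data point.

    The namespace and metric name are illustrative assumptions.
    """
    return {
        "Namespace": namespace,
        "MetricData": [
            {"MetricName": metric_name, "Value": 1.0, "Unit": "Count"},
        ],
    }


def send_heartbeat():
    """Publish a single heartbeat; call this once per minute from the app."""
    import boto3  # imported lazily so the helper above works without AWS access

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(**heartbeat_metric_params())
```

When the app dies, no data points arrive at all, which is why the alarm sees missing data rather than a value below the threshold.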

Now I want to put an alarm on this metric to notify me if the heartbeat stops. I have tried to accomplish this using different CloudWatch alarm statistics, including "average" and "data samples", and setting an alarm threshold of less than 1 over a given period. However, in all cases, if my app dies and stops reporting the heartbeat, the alarm only goes into the "Insufficient Data" state and never into the "Alarm" state.

I understand I can put a notification on the "Insufficient Data" state, but I want this to show up as an alarm. Is this possible in CloudWatch?

Thanks,

Matt

Answer 1:

Instead of pushing a custom metric to CloudWatch, consider the following:

Push a message onto an SNS topic on the same periodic basis as before, and set up a CloudWatch alarm on the SNS topic's NumberOfMessagesPublished metric. If the number of heartbeats falls below the expected value for the time period you specify, whether it's because the app crashed or the server crashed, the metric will go into the Alarm state.
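A rough sketch of this approach, assuming Python with boto3. The topic name, alarm name, period, and expected count are hypothetical values chosen for illustration; `AWS/SNS`, `NumberOfMessagesPublished`, and the `TopicName` dimension are the standard SNS metric identifiers.

```python
def sns_heartbeat_alarm_params(topic_name, period_seconds=300, expected_messages=5):
    """Arguments for an alarm that fires when fewer heartbeats than expected
    were published to the topic within one period.

    All default values here are assumptions for the example.
    """
    return {
        "AlarmName": f"{topic_name}-heartbeat-missing",
        "Namespace": "AWS/SNS",
        "MetricName": "NumberOfMessagesPublished",
        "Dimensions": [{"Name": "TopicName", "Value": topic_name}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": float(expected_messages),
        "ComparisonOperator": "LessThanThreshold",
    }


def publish_heartbeat(topic_arn):
    """Send one heartbeat message; call this on the app's regular schedule."""
    import boto3  # lazy import: only needed when actually talking to AWS

    boto3.client("sns").publish(TopicArn=topic_arn, Message="heartbeat")


def create_sns_heartbeat_alarm(topic_name):
    import boto3

    boto3.client("cloudwatch").put_metric_alarm(**sns_heartbeat_alarm_params(topic_name))
```

With one heartbeat per minute and a 300-second period, five messages per period would be the expected count under these assumptions.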



Answer 2:

I think the alarm going into the "Insufficient Data" state has to do with how missing data is handled. As the documentation states:

Similar to how each alarm is always in one of three states, each specific data point reported to CloudWatch falls under one of three categories:

  • Not breaching (within the threshold)
  • Breaching (violating the threshold)
  • Missing

You can specify how alarms handle missing data points. Choose whether to treat missing data points as:

  • missing (The alarm looks back farther in time to find additional data points)
  • notBreaching (Treated as a data point that is within the threshold)
  • breaching (Treated as a data point that is breaching the threshold)
  • ignore (The current alarm state is maintained)

The default behavior is missing.

So I guess that treating missing data points as breaching would do the trick :)
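As a hedged sketch of that suggestion, assuming Python with boto3: the `TreatMissingData` parameter of `put_metric_alarm` accepts the four values listed above. The alarm name, namespace `MyApp`, metric name `Heartbeat`, and the three-period window are hypothetical choices for the example.

```python
def heartbeat_alarm_params(namespace="MyApp", metric_name="Heartbeat"):
    """Arguments for an alarm that treats a silent heartbeat as a breach.

    Names and the evaluation window are illustrative assumptions.
    """
    return {
        "AlarmName": f"{metric_name}-stopped",
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "SampleCount",
        "Period": 60,               # one-minute periods, matching the heartbeat rate
        "EvaluationPeriods": 3,     # assumed: three consecutive periods
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",  # missing points count as threshold violations
    }


def create_heartbeat_alarm():
    import boto3  # lazy import: only needed when actually talking to AWS

    boto3.client("cloudwatch").put_metric_alarm(**heartbeat_alarm_params())
```

The key difference from the default configuration is the `TreatMissingData` entry; the rest mirrors the "less than 1 over a given period" alarm described in the question.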



Answer 3:

For a single application, you need to alarm on the Insufficient Data state. I've been experimenting with @BestPractices' solution (https://stackoverflow.com/a/31574866/2447670), and as far as I can tell it has the same problem: the alarm ends up in "Insufficient Data" rather than "Alarm". I may be missing something, but my understanding is that this happens because there is nothing to count or analyze. If you send a "tick" every minute and create an alarm for "below 1 tick", then when the app stops sending ticks the alarm has no data at all. If there were "half a tick" (impossible, of course), the alarm would fire, because there would be data to analyze (so no "Insufficient Data") and that data would be below the threshold.

If you have n apps (n > 1) that each send a message to SNS to show they are alive, and you set an alarm for "below n", the alarm would go into the Alarm state if one of the apps died (but again into Insufficient Data if all of them died).
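The reasoning above can be illustrated with a toy model in Python. This is only a simplified restatement of the argument in this answer, assuming the default missing-data handling; it is not how CloudWatch actually evaluates alarms, and the function name is made up for the example.

```python
def toy_alarm_state(ticks_received, n_apps):
    """Toy model: classify one period given the ticks that arrived.

    ticks_received holds one entry per tick reported in the period.
    An empty list means nothing was reported at all.
    """
    if not ticks_received:
        # Nothing to count or analyze.
        return "INSUFFICIENT_DATA"
    if sum(ticks_received) < n_apps:
        # Some data exists, and it is below the "below n" threshold.
        return "ALARM"
    return "OK"


print(toy_alarm_state([1, 1, 1], n_apps=3))  # all three apps alive
print(toy_alarm_state([1, 1], n_apps=3))     # one app died
print(toy_alarm_state([], n_apps=3))         # every app died
```

In this model, the single-app case can only ever produce OK or INSUFFICIENT_DATA, which matches the behavior described in the question.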


