How to define Alerts with exception in InfluxDB/Kapacitor

Posted by 巧了我就是萌 on 2020-01-25 09:22:45

Question


I'm trying to figure out the best, or at least a reasonable, approach to defining alerts in InfluxDB. For example, I might use the CPU batch TICKscript that comes with Telegraf. This could be set up as a global monitor/alert for all hosts being monitored by Telegraf.

What is the approach when you want to deviate from the above setup for a single host, i.e. alert on Y% for a specific server instead of X%?

I'm happy to create a distinct TICKscript for the custom values, but how do I go about excluding the host from the original 'global' one?

This is a simple scenario, but the approach needs to scale to 10,000 hosts with hundreds of exceptions, and to tens or hundreds of global alert definitions.

I'm struggling to see how you could use the platform as the primary source of monitoring/alerting.


Answer 1:


As said in the comments, you can use the sideload node to achieve that.

Say you want to ensure that your InfluxDB servers are not overloaded. By default you may want to allow 100 measurements. On one server only, which happens to receive a massive number of data points, you want to lower the limit to 10 (a value the _internal database easily exceeds, but good enough for our example).

Given the following excerpt from a tick script

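// Select the points and expose the numMeasurements field as "value".
var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |eval(lambda: "numMeasurements")
        .as('value')

// Load the per-host threshold from a file named after the hostname tag.
// Points with no matching file get the default of 100.
var customized = data
    |sideload()
        .source('file:///etc/kapacitor/customizations/demo/')
        .order('hosts/host-{{.hostname}}.yaml')
        .field('maxNumMeasurements',100)
    // Log each point so the sideloaded field can be inspected.
    |log()

// Alert when the value exceeds the (possibly overridden) threshold.
var trigger = customized
    |alert()
        .crit(lambda: "value" > "maxNumMeasurements")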

and assuming the server with the exception is named influxdb and the file /etc/kapacitor/customizations/demo/hosts/host-influxdb.yaml looks as follows

maxNumMeasurements: 10

A critical alert will be triggered if value (and hence numMeasurements) exceeds 10 and the hostname tag equals influxdb, or if value exceeds 100 on any other host.
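The lookup described above can be modelled in a few lines of Python. This is only a sketch of the decision logic, not of how Kapacitor implements sideload; the function names and the OVERRIDES dict (standing in for the YAML files on disk) are made up for illustration.

```python
from typing import Mapping

# Default passed to .field('maxNumMeasurements', 100) in the TICKscript.
DEFAULT_MAX = 100

# Stand-in for the files under hosts/: one entry per host-<hostname>.yaml.
OVERRIDES = {"influxdb": {"maxNumMeasurements": 10}}


def resolve_threshold(
    overrides: Mapping[str, Mapping[str, int]],
    hostname: str,
    key: str = "maxNumMeasurements",
    default: int = DEFAULT_MAX,
) -> int:
    """Return the per-host value if one is defined, else the default."""
    return overrides.get(hostname, {}).get(key, default)


def is_critical(value: int, hostname: str,
                overrides: Mapping[str, Mapping[str, int]]) -> bool:
    """Mirror of .crit(lambda: "value" > "maxNumMeasurements")."""
    return value > resolve_threshold(overrides, hostname)
```

With these definitions, is_critical(11, "influxdb", OVERRIDES) is true, while the same value on any other host is not critical until it exceeds 100.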

There is an example in the documentation that handles scheduled downtimes using sideload.

Furthermore, I have created an example, available on GitHub, that uses docker-compose.

Note that there is a caveat with the example: the alert flaps because a second database is generated dynamically. It should still be sufficient to show how to approach the problem.
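With hundreds of exceptions, as in the question, the per-host files read by sideload could be generated from a single list rather than written by hand. A minimal sketch, assuming the hosts/host-<hostname>.yaml layout used above and flat key/value files; write_overrides and the shape of its exceptions argument are hypothetical.

```python
from pathlib import Path
from typing import List, Mapping


def write_overrides(root: str,
                    exceptions: Mapping[str, Mapping[str, int]]) -> List[Path]:
    """Write one hosts/host-<hostname>.yaml file per host with exceptions.

    `root` corresponds to the directory given to .source() in the
    TICKscript; `exceptions` maps hostname -> {field name: value}.
    """
    hosts_dir = Path(root) / "hosts"
    hosts_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for hostname, values in sorted(exceptions.items()):
        path = hosts_dir / f"host-{hostname}.yaml"
        # Flat "key: value" lines are enough for simple numeric thresholds.
        lines = [f"{key}: {value}" for key, value in sorted(values.items())]
        path.write_text("\n".join(lines) + "\n")
        written.append(path)
    return written
```

Calling write_overrides('/etc/kapacitor/customizations/demo', {'influxdb': {'maxNumMeasurements': 10}}) would produce the single file shown in this answer.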




Answer 2:


Managing alerts manually, directly in Chronograf/Kapacitor, is not feasible for a large number of custom alerts.

At AMMP Technologies we need to manage alerts per database, customer, and customer object. The number can go into the thousands. We've opted for a custom solution where we keep a standard set of template TICKscripts (not to be confused with Kapacitor templates), and we provide the user with an interface that only exposes the relevant variables. A service (written in Python) then combines the values for those variables with a TICKscript and uses the Kapacitor API to deploy (or update, or delete) the task on the Kapacitor server. This is automated so that data for new customers/objects is combined with the templates and deployed to Kapacitor automatically.

You obviously need to design your tasks to be specific enough that they don't overlap, and generic enough that it's not too much work to create tasks for every little thing.
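The approach could look roughly like the sketch below. It is not the actual AMMP service: the template text, the variable names (host, min_idle) and the function names are assumptions for illustration. The only Kapacitor-specific part is the task definition posted to the HTTP API's /kapacitor/v1/tasks endpoint.

```python
import json
import urllib.request
from string import Template

# Hypothetical template TICKscript; $-placeholders are filled per object.
CPU_TEMPLATE = Template("""\
stream
    |from()
        .database('$db')
        .measurement('cpu')
        .where(lambda: "host" == '$host')
    |alert()
        .crit(lambda: "usage_idle" < $min_idle)
""")


def build_task(task_id: str, db: str, rp: str, variables: dict) -> dict:
    """Combine the template with per-object values into a task definition."""
    script = CPU_TEMPLATE.substitute(db=db, **variables)
    return {
        "id": task_id,
        "type": "stream",
        "dbrps": [{"db": db, "rp": rp}],
        "script": script,
        "status": "enabled",
    }


def deploy_task(base_url: str, task: dict) -> int:
    """Create the task on the Kapacitor server and return the HTTP status."""
    request = urllib.request.Request(
        f"{base_url}/kapacitor/v1/tasks",
        data=json.dumps(task).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

For example, build_task('cpu-web-01', 'telegraf', 'autogen', {'host': 'web-01', 'min_idle': 10.0}) yields one task per host, so every exception gets its own task ID. Updating or deleting an existing task would use PATCH or DELETE on /kapacitor/v1/tasks/<id> instead of POST.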




Answer 3:


What is the cost of using sideload nodes in terms of performance and computation if you have over 10 thousand servers?



Source: https://stackoverflow.com/questions/56972799/how-to-define-alerts-with-exception-in-influxdb-kapacitor
