Persist 'for' State of Alerts

Ganesh Vernekar

GSoC 2018

August 22, 2018

Introduction

This happens to be the first issue I fixed during my GSoC. You can find the PR #4061 here; it has already been merged into Prometheus master and is available from v2.4.0 onwards.

This post assumes that you have a basic understanding of what monitoring is and how alerting is related to it. If you are new to this world, this post should help you get started.

Easy ≤ Difficulty < Medium

Issue

To talk about alerting in Prometheus in layman's terms, an alerting rule consists of a condition, a 'for' duration, and a blackbox that handles the alert. The simple trick here is: if the condition stays true for the 'for' duration, we trigger the alert (called 'firing' the alert) and hand it to the blackbox to handle however it wants, which can be sending an email, a Slack message, etc.
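To make the 'for' semantics concrete, here is a minimal Go sketch of how a single alert moves from pending to firing. This is not Prometheus code; names like alertState, conditionTrue, and holdDuration are invented purely for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// alertState is a toy model of a single alert's 'for' tracking.
// It is not Prometheus code; the names are invented for illustration.
type alertState struct {
	activeAt     time.Time     // when the condition first became true (zero while inactive)
	holdDuration time.Duration // the 'for' duration from the rule
	firing       bool
}

// evaluate is called on every rule evaluation with the current truth
// value of the alert's condition.
func (a *alertState) evaluate(conditionTrue bool, now time.Time) {
	if !conditionTrue {
		// Condition no longer holds: reset the 'for' clock.
		a.activeAt = time.Time{}
		a.firing = false
		return
	}
	if a.activeAt.IsZero() {
		// Condition just became true: start the 'for' clock.
		a.activeAt = now
	}
	// Fire only once the condition has held for the full 'for' duration.
	a.firing = now.Sub(a.activeAt) >= a.holdDuration
}

func main() {
	a := &alertState{holdDuration: 24 * time.Hour}
	start := time.Now()
	a.evaluate(true, start)
	fmt.Println("firing right away?", a.firing) // false: alert is only pending
	a.evaluate(true, start.Add(25*time.Hour))
	fmt.Println("firing after 25h?", a.firing) // true: 'for' duration satisfied
}
```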

As discussed here, consider that you have an alert with a 'for' duration of 24hrs, and Prometheus crashes while that alert has been active (condition true) for 23hrs, i.e. 1hr before it would fire. When Prometheus is started again, the alert would have to wait the full 24hrs again before firing!

You can find the GitHub issue #422 here.

The Fix

Use time series to store the state! The procedure is something like this:

 1. During every evaluation of alerting rules, we record the ActiveAt (when the condition first became true) of every alert in a time series named ALERTS_FOR_STATE, with all the labels of that alert. This is like any other time series, but it is stored only locally.

 2. When Prometheus is restarted, a job restores the state of alerts after the second evaluation of the rules. We wait until the second evaluation so that enough data has been scraped to know which alerts are currently active.

 3. For each alert that is currently active, the job looks up its corresponding ALERTS_FOR_STATE time series. The timestamp and value of the last sample of that series tell us when Prometheus went down and when the alert had become active.

 4. So if the 'for' duration is, say, D, the alert became active at X, and Prometheus crashed at Y, then the alert only has to wait D-(Y-X) more before firing (Why? Think!). So the variables of the alert are adjusted to make it wait D-(Y-X) more before firing, and not D; see the sketch after this list.
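As a rough Go sketch of step 4 (not the actual Prometheus code; remainingHold and its arguments are invented names), the adjustment boils down to computing D-(Y-X) from the last ALERTS_FOR_STATE sample:

```go
package main

import (
	"fmt"
	"time"
)

// remainingHold sketches step 4: given the 'for' duration D, the ActiveAt X
// (the value of the last ALERTS_FOR_STATE sample) and the time Y at which
// Prometheus went down (the timestamp of that sample), it returns how much
// longer the alert still has to stay pending. Hypothetical helper, not the
// actual Prometheus implementation.
func remainingHold(d time.Duration, x, y time.Time) time.Duration {
	elapsed := y.Sub(x) // Y - X: how long the alert had already been active
	remaining := d - elapsed
	if remaining < 0 {
		return 0 // the 'for' duration was already satisfied before the crash
	}
	return remaining // D - (Y - X)
}

func main() {
	d := 24 * time.Hour
	x := time.Date(2018, 8, 1, 0, 0, 0, 0, time.UTC) // alert became active
	y := x.Add(23 * time.Hour)                       // Prometheus went down 23h later
	// Prints 1h0m0s: after the restart the alert waits only the remaining
	// hour, not the full 24h again.
	fmt.Println(remainingHold(d, x, y))
}
```

In Prometheus itself the same effect is achieved by adjusting the alert's variables, as described above, rather than by tracking a separate remainder.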

Things to keep in mind (if you are using Prometheus)

rules.alert.for-outage-tolerance | default=1h
This flag specifies how much downtime Prometheus will tolerate. If Prometheus has been down for longer than the time set in this flag, the state of the alerts is not restored. So make sure to either change the value of the flag to suit your needs or get Prometheus back up soon!

rules.alert.for-grace-period | default=10m
We would not like to fire an alert immediately after Prometheus comes back up, so we introduce a "grace period": if D-(Y-X) happens to be less than rules.alert.for-grace-period, we wait for the grace period duration before firing the alert.
Note: This logic applies only if the 'for' duration itself is ≥ rules.alert.for-grace-period.
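Putting the two flags together, the decision looks roughly like the following Go sketch. It is a simplified stand-in, not the exact logic in Prometheus' rule manager; shouldRestore, adjustedHold, and their parameters are made-up names.

```go
package main

import (
	"fmt"
	"time"
)

// These helpers sketch how the two flags interact; all names are
// illustrative stand-ins, not the exact Prometheus implementation.

// shouldRestore: skip restoration entirely if Prometheus was down for longer
// than rules.alert.for-outage-tolerance (default 1h).
func shouldRestore(downAt, now time.Time, outageTolerance time.Duration) bool {
	return now.Sub(downAt) <= outageTolerance
}

// adjustedHold: apply rules.alert.for-grace-period (default 10m). If the
// remaining wait D-(Y-X) is shorter than the grace period, wait for the
// grace period instead, but only when the 'for' duration itself is at least
// as long as the grace period.
func adjustedHold(remaining, forDuration, gracePeriod time.Duration) time.Duration {
	if forDuration >= gracePeriod && remaining < gracePeriod {
		return gracePeriod
	}
	return remaining
}

func main() {
	downAt := time.Now().Add(-30 * time.Minute)
	fmt.Println(shouldRestore(downAt, time.Now(), time.Hour))              // true: within tolerance
	fmt.Println(adjustedHold(2*time.Minute, 24*time.Hour, 10*time.Minute)) // 10m0s: grace period applies
}
```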

Gotchas

As the ALERTS_FOR_STATE series is stored in local storage, if you happen to lose the local TSDB data while Prometheus is down, then you lose the state of the alert permanently.
