I made 3 independent additions/fixes as part of my GSoC, all related to recording and alerting rules. Apart from my proposal, I also fixed some bugs in prometheus/tsdb during the GSoC period.
If you find yourself lost below, be sure to click on 'Learn More'.
Prometheus had one serious, long-standing issue: if the Prometheus server crashed, the state of its alerts was lost.
Consider an alert with a for duration of 24hrs, where Prometheus crashed after that alert had been active for 23hrs, i.e. 1hr before it would fire. When Prometheus is started again, the alert would have to wait the full 24hrs again before firing!
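The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration of the problem, not Prometheus code: the function `time_until_firing` and its parameters are hypothetical names, and the sketch assumes that a restarted server either remembers when the alert first became active or starts counting from zero.

```python
from datetime import datetime, timedelta


def time_until_firing(active_since, now, for_duration, state_restored):
    """Return how much longer the alert must stay active before it fires.

    Hypothetical helper for illustration only; not part of Prometheus.
    """
    if state_restored:
        # The server remembers when the alert first became active.
        elapsed = now - active_since
    else:
        # The state was lost in the crash, so the clock starts over.
        elapsed = timedelta(0)
    # Never return a negative wait.
    return max(for_duration - elapsed, timedelta(0))


for_duration = timedelta(hours=24)
active_since = datetime(2018, 8, 1, 0, 0)           # arbitrary example timestamp
restart_time = active_since + timedelta(hours=23)   # crash and restart 23hrs in

without_restore = time_until_firing(active_since, restart_time, for_duration, False)
with_restore = time_until_firing(active_since, restart_time, for_duration, True)

print(without_restore)  # 1 day, 0:00:00
print(with_restore)     # 1:00:00
```

With the state lost, the alert waits another full 24hrs; if the state survives the restart, only the remaining 1hr is needed.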
Alerting is an important part of monitoring when it comes to maintaining site reliability, and Prometheus is widely used for this. Many recording rules are also written so that their results can be visualised later. Hence it is very important to be able to check the correctness of these rules.
In this feature, I added support for unit testing both alerting and recording rules. Learn More
As noted above, alerting rules are central to monitoring, yet Prometheus also lacks a good, convenient way of visualising and testing alerting rules before they are put to use.
In this feature, I added a UI for entering your alerting rules, then testing and visualising them on the real data in your server. Learn More
I gave a lightning talk at PromCon 2018 regarding all that you read above. It was held in Munich, Germany.