Google Summer of Code 2018 with Prometheus

Ganesh Vernekar

August 22, 2018

Introduction

Hello! I am Ganesh Vernekar. I successfully completed Google Summer of Code with Prometheus in the summer of 2018. I was mentored by Goutham Veeramachaneni.

I made three independent additions and fixes as part of my GSoC, all related to recording and alerting rules. Apart from my proposal, I also fixed some bugs in prometheus/tsdb during the GSoC period.

If you find yourself lost below, be sure to click on 'Learn More'.

Fixes

1) Persist 'for' State of Alerts

Prometheus had one serious, long-standing issue: if the Prometheus server crashed, the 'for' state of its alerts was lost.

Consider an alert with a 'for' duration of 24 hours. Suppose Prometheus crashes when that alert has been active for 23 hours, i.e. 1 hour before it would fire. When Prometheus is started again, the alert has to wait the full 24 hours again before firing!

Learn More

Features

2) Unit Testing for Rules

Alerting is an important part of monitoring when it comes to maintaining site reliability, and Prometheus is widely used for it. Many rules are also recorded so that their results can be visualised later. Hence it is very important to be able to check the correctness of these rules.

With this feature, I added support for unit testing both alerting and recording rules.

Learn More

3) UI for Testing Alerting Rules

As you saw above, alerting rules are important for monitoring, yet Prometheus lacks a good and convenient way of visualising and testing alerting rules before they are put to use.

With this feature, I added a UI for entering your alerting rules, then testing and visualising them on the real data in your server.

Learn More

Epilogue

This work would not have been possible without the valuable inputs and reviews from Brian Brazil and Julius Volz. Thanks!

I gave a lightning talk about everything you read above at PromCon 2018, which was held in Munich, Germany.

Thank you!