Unit Testing for Rules

Ganesh Vernekar

GSoC 2018

August 22, 2018

Introduction

It is always good to do one last check of all the components of your code before you deploy it. We have seen how important alerting and recording rules are in the monitoring world. So why not test the alerting and recording rules as well?

This was proposed a long time ago in GitHub issue #1695, and I worked on it during my GSoC. The work can be found in PR #4350, which has been merged into Prometheus master.

Syntax

We use a separate file to specify unit tests for alerting rules and PromQL expressions (in place of recording rules). The syntax of the file is based on this design doc, which was reviewed continuously by Prometheus members.

Edit: This blog post will not be updated with any changes to unit testing. It may become outdated in the future, so please also have a look at the official documentation.

The File

# This is a list of rule files to consider for testing.
rule_files:
  [ - <file_name> ]

# optional, default = 1m
evaluation_interval: <duration>

# The order in which group names are listed below will be the order of evaluation of 
# rule groups (at a given evaluation time). The order is guaranteed only for the groups mentioned below. 
# All the groups need not be mentioned below.
group_eval_order:
  [ - <group_name> ]

# All the tests are listed here.
tests:
  [ - <test_group> ]

<test_group>

# Series data
interval: <duration>
input_series:
  [ - <series> ]

# Unit tests for the above data.

# Unit tests for alerting rules. We consider the alerting rules from the input file.
alert_rule_test:
  [ - <alert_test_case> ]

# Unit tests for PromQL expressions.
promql_expr_test:
  [ - <promql_test_case> ]

<series>

# This follows the series notation (x{a="b", c="d"}). You can see an example below.
series: <string>

# This uses expanding notation. Example below.
values: <string>
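
The expanding notation packs a run of evenly spaced samples into a single token: as I understand it, `a+bxn` stands for the n+1 values a, a+b, ..., a+n*b, and `a-bxn` counts down in the same way. The Python sketch below illustrates that expansion. `expand_values` is a hypothetical helper written for this post, not promtool's actual parser, and it handles only plain numbers and these two token forms.

```python
def expand_values(notation):
    """Expand a values string such as '10+10x2 30+20x5' into a flat list of samples.

    Illustrative only: supports plain numbers and 'a+bxn' / 'a-bxn' tokens.
    """
    samples = []
    for token in notation.split():
        if "x" not in token:
            # A plain number is a single sample.
            samples.append(float(token))
            continue
        head, count = token.rsplit("x", 1)
        # Locate the sign between start and step; searching from index 1
        # skips a leading sign on the start value (as in '-2+4x3').
        pos = max(head.rfind("+", 1), head.rfind("-", 1))
        if pos < 1:
            raise ValueError("malformed token: " + token)
        start = float(head[:pos])
        step = float(head[pos:])  # the sign stays attached to the step
        # 'x n' means n further steps after the start, so n + 1 samples in total.
        samples.extend(start + step * i for i in range(int(count) + 1))
    return samples


print(expand_values("10+10x2 30+20x5"))
```

Each space-separated token is expanded on its own and the results are concatenated, so a series can change its slope partway through.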

<alert_test_case>

Prometheus allows you to have the same alertname for different alerting rules. Hence, in this unit testing, you have to list the union of all the firing alerts for the alertname under a single <alert_test_case>.

# It's the time elapsed from time=0s when the alerts have to be checked.
eval_time: <duration>

# Name of the alert to be tested.
alertname: <string>

# List of expected alerts which are firing under the given alertname at 
# given evaluation time. If you want to test if an alerting rule should 
# not be firing, then you can mention the above fields and leave 'exp_alerts' empty.
exp_alerts:
  [ - <alert> ]

<alert>

Remember, this alert should be firing.

# These are the expanded labels and annotations of the expected alert. 
# Note: labels also include the labels of the sample associated with the 
# alert (same as what you see in `/alerts`, without series `__name__` and `alertname`)
exp_labels:
  [ <labelname>: <string> ]
exp_annotations:
  [ <labelname>: <string> ]

<promql_test_case>

# Expression to evaluate
expr: <string>

# It's the time elapsed from time=0s when the expression has to be evaluated.
eval_time: <duration>

# Expected samples at the given evaluation time.
exp_samples:
  [ - <sample> ]

<sample>

# Labels of the sample in series notation.
labels: <series_notation>

# The expected value of the promql expression.
value: <number>

Example

Here are example input files for unit testing that pass the test. alerts.yml contains the alerting rules, and test.yml follows the syntax above.

alerts.yml

# This is the rules file.

groups:
- name: example
  rules:

  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
        severity: page
    annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

  - alert: AnotherInstanceDown
    expr: up == 0
    for: 10m
    labels:
        severity: page
    annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 10 minutes."

test.yml

# This is the main input for unit testing. 
# Only this file is passed as command line argument.

rule_files:
    - alerts.yml

evaluation_interval: 1m

tests:
    # Test 1.
    - interval: 1m
      # Series data.
      input_series:
          - series: 'up{job="prometheus", instance="localhost:9090"}'
            values: '0 0 0 0 0 0 0 0 0 0 0 0 0 0 0'
          - series: 'up{job="node_exporter", instance="localhost:9100"}'
            values: '1 1 1 1 1 1 1 0 0 0 0 0 0 0 0'
          - series: 'go_goroutines{job="prometheus", instance="localhost:9090"}'
            values: '10+10x2 30+20x5'
          - series: 'go_goroutines{job="node_exporter", instance="localhost:9100"}'
            values: '10+10x7 10+30x4'

      # Unit tests for alerting rules.
      alert_rule_test:
          # Unit test 1.
          - eval_time: 10m
            alertname: InstanceDown
            exp_alerts:
                # Alert 1.
                - exp_labels:
                      severity: page
                      instance: localhost:9090
                      job: prometheus
                  exp_annotations:
                      summary: "Instance localhost:9090 down"
                      description: "localhost:9090 of job prometheus has been down for more than 5 minutes."

      # Unit tests for PromQL expressions.
      promql_expr_test:
          # Unit test 1.
          - expr: go_goroutines > 5
            eval_time: 4m
            exp_samples:
                # Sample 1.
                - labels: 'go_goroutines{job="prometheus",instance="localhost:9090"}'
                  value: 50
                # Sample 2.
                - labels: 'go_goroutines{job="node_exporter",instance="localhost:9100"}'
                  value: 50
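
To see why this test passes, it helps to work through the data by hand. The sketch below is a simplified Python model, not how promtool evaluates rules: it assumes one rule evaluation per sample (both intervals are 1m here), treats an alert as firing once `up == 0` has held continuously for at least the `for` duration, and uses sample lists that I expanded by hand from the `values` strings above. The function name `fires_at` is made up for this illustration.

```python
# Samples at t = 0m, 1m, 2m, ... expanded by hand from the 'values' strings.
up = {
    "localhost:9090": [0] * 15,
    "localhost:9100": [1] * 7 + [0] * 8,
}
go_goroutines = {
    "localhost:9090": [10, 20, 30, 30, 50, 70, 90, 110, 130],
    "localhost:9100": [10, 20, 30, 40, 50, 60, 70, 80, 10, 40, 70, 100, 130],
}


def fires_at(samples, for_minutes, eval_minute):
    """Simplified 'up == 0' alert: True if the condition has held for `for_minutes`."""
    active_since = None
    for minute in range(eval_minute + 1):
        if samples[minute] == 0:
            if active_since is None:
                active_since = minute  # condition first became true here
        else:
            active_since = None  # condition broke, so the alert resets
    return active_since is not None and eval_minute - active_since >= for_minutes


# InstanceDown (for: 5m) at eval_time 10m.
firing = [inst for inst, s in up.items() if fires_at(s, 5, 10)]
print(firing)

# go_goroutines > 5 at eval_time 4m: the sample at index 4 of each series.
print({inst: s[4] for inst, s in go_goroutines.items() if s[4] > 5})
```

Only localhost:9090 has been down for five minutes or more at the 10m mark (localhost:9100 went down at 7m), which is why exp_alerts lists a single alert, and both go_goroutines series read 50 at 4m.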

Usage

This feature will come embedded in promtool.
# For the above example.
./promtool test rules test.yml

# If you have multiple such test files, say test{1,2,3}.yml
./promtool test rules test1.yml test2.yml test3.yml

What is tested?

1. Syntax of the rule files included in the test.
2. Correctness of template variables. Note that if you have used $labels.something_wrong, it won't be caught at this stage.
3. Whether the alerts listed for the alertname are exactly the same as what we get after simulation over the data.
4. Exact match for the samples returned by PromQL expressions at the given time. Order doesn't matter.

During the matches in 3 and 4, usage of $labels.something_wrong will be caught, because it expands to an empty string and the result no longer matches the expected value.
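
The matching described in points 3 and 4 can be pictured with a small Python sketch. It is only a model under my own assumptions: `render` is a stand-in for template expansion that turns an unknown label into an empty string, and `same_samples` is a hypothetical helper that compares expected and actual samples exactly while ignoring their order. Neither is part of promtool.

```python
import re


def render(template, labels):
    """Stand-in for template expansion: an unknown label becomes an empty string."""
    return re.sub(
        r"\{\{\s*\$labels\.(\w+)\s*\}\}",
        lambda match: labels.get(match.group(1), ""),
        template,
    )


def same_samples(expected, actual):
    """Exact match on (labels, value) pairs, ignoring the order of the lists."""
    def canonical(samples):
        return sorted((sorted(labels.items()), value) for labels, value in samples)
    return canonical(expected) == canonical(actual)


labels = {"instance": "localhost:9090", "job": "prometheus"}

# Correct label name: the annotation matches what the test expects.
print(render("Instance {{ $labels.instance }} down", labels))
# Misspelled label name: it expands to nothing, so an exact match fails.
print(render("Instance {{ $labels.instace }} down", labels))

a = ({"job": "prometheus"}, 50)
b = ({"job": "node_exporter"}, 50)
print(same_samples([a, b], [b, a]))  # same samples in a different order
print(same_samples([a, b], [a]))     # a missing sample is a mismatch
```

Because the comparison is exact, a misspelled label shows up as a failed test even though the template itself is syntactically valid.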
