Ganesh Vernekar Blog

Prometheus TSDB (Part 7): Snapshot on Shutdown

Tue, 14 Sep 2021 00:00:00 GMT

Introduction

In part 2 we saw that TSDB uses a Write-Ahead Log (WAL) to provide durability against crashes. But the WAL also makes restarts of Prometheus slow once you hit a decent scale, because replaying the Checkpoint+WAL takes time.

In this post we will understand more about a new feature introduced in Prometheus v2.30.0: taking snapshots of in-memory data during the shutdown for faster restarts by entirely skipping the WAL replay.

About snapshot

A snapshot in TSDB is a read-only, static view of the in-memory data of TSDB at a given time.

The snapshot contains the following (in order):

All the time series and the in-memory chunk of each series present in the Head block. (Part 3 recap: except the last chunk, everything else is already on disk and m-mapped).
All the tombstones in the Head block.
All the exemplars in the Head block.

Taking inspiration from the checkpoints in part 2, we name these snapshots as chunk_snapshot.X.Y, where X is the last WAL segment number that was seen when taking the snapshot, and Y is the byte offset up to which the data was written in the X WAL segment.

data
├── 01EM6Q6A1YPX4G9TEB20J22B2R
|   ├── chunks
|   |   ├── 000001
|   |   └── 000002
|   ├── index
|   ├── meta.json
|   └── tombstones
├── chunk_snapshot.000005.12345
|   ├── 000001
|   └── 000002
├── chunks_head
|   ├── 000001
|   └── 000002
└── wal
    ├── checkpoint.000003
    |   ├── 000000
    |   └── 000001
    ├── 000004
    └── 000005
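
To make the naming concrete, here is a minimal Go sketch that splits a chunk_snapshot.X.Y directory name into the WAL segment number X and the byte offset Y. The helper name parseSnapshotName is hypothetical and is not the function Prometheus itself uses; it only illustrates the naming scheme described above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// snapshotPrefix is the directory-name prefix described in the post.
const snapshotPrefix = "chunk_snapshot."

// parseSnapshotName splits "chunk_snapshot.X.Y" into the WAL segment
// number X and the byte offset Y within that segment.
// Hypothetical helper for illustration only.
func parseSnapshotName(name string) (segment, offset int, err error) {
	if !strings.HasPrefix(name, snapshotPrefix) {
		return 0, 0, fmt.Errorf("not a chunk snapshot: %q", name)
	}
	parts := strings.Split(strings.TrimPrefix(name, snapshotPrefix), ".")
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("malformed chunk snapshot name: %q", name)
	}
	if segment, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, fmt.Errorf("bad segment in %q: %w", name, err)
	}
	if offset, err = strconv.Atoi(parts[1]); err != nil {
		return 0, 0, fmt.Errorf("bad offset in %q: %w", name, err)
	}
	return segment, offset, nil
}

func main() {
	seg, off, err := parseSnapshotName("chunk_snapshot.000005.12345")
	fmt.Println(seg, off, err)
}
```

For the directory in the listing above, this yields segment 5 and offset 12345, which is the point in the WAL up to which the snapshot is valid.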

We take this snapshot when shutting down the TSDB after stopping all writes. If the TSDB was stopped abruptly, then no new snapshot is taken and the snapshot from the last graceful shutdown remains.

This feature is disabled by default and can be enabled with --enable-feature=memory-snapshot-on-shutdown.

Snapshot format

Snapshot uses the generic WAL implementation of Prometheus TSDB and defines 3 new record formats for the snapshots.

The order of records in the snapshot is always:

Series record (>=0): This record is a snapshot of a single series. One record is written per series in an unsorted fashion. It includes the metadata of the series and the in-memory chunk data if it exists.
Tombstones record (0-1): After all series are done, we write a tombstone record containing all the tombstones of the Head block. A single record is written into the snapshot containing all the tombstones.
Exemplar record (>=0): At the end, we write one or more exemplar records while batching up the exemplars in each record. Exemplars are in the order they were written to the circular buffer.

The format of these records can be found here; we won't be discussing them in this blog post.

Restoring in-memory state

With chunk_snapshot.X.Y, we can ignore the WAL before Xth segment's Y offset and only replay the WAL after that because the snapshot along with m-mapped chunks represents the replayed state until that point in the WAL.

Hence with snapshots enabled, the replay of data to restore the Head goes as follows:

Iterate all the m-mapped chunks as described in part 3 and build the map.
Iterate the series records from the latest chunk_snapshot.X.Y one by one. For each series record, re-create the series in the memory with the labels and the in-memory chunk in the record.
Similar to handling the Series record in the WAL, we look for corresponding m-mapped chunks for this series reference and attach it to this series.
Read the tombstones record if any from the snapshot and restore it into the memory.
Iterate the exemplar records one by one if any and put it back into the circular buffer in the same order.
After replaying the m-mapped chunks and the snapshot, we continue the replay of the WAL from Xth segment's Y byte offset as usual. If there are WAL checkpoints numbered >=X, we also replay the last checkpoint before replaying the WAL.

In the majority of cases (i.e. graceful shutdowns), there will be no WAL to replay, since the snapshot is taken after stopping writes during the shutdown. There will be some WAL/Checkpoint to replay only if Prometheus happens to crash or shut down abruptly.

Faster restarts

When we talk about restarts, it is not only the time taken to replay the data on disk to restore the memory state, but also the time taken to shut down, because snapshotting now adds some delay to the shutdown.

Writing a snapshot takes time on the order of seconds, and is usually under a minute for 1 million series. Replaying the snapshot also takes time on the order of seconds, while the WAL replay can take multiple minutes for the same number of series.

By skipping the WAL replay entirely during a graceful restart, we have seen anywhere between a 50% and 80% reduction in restart time.

A few things to be aware of

Snapshots take additional disk space when enabled; they do not replace anything that already exists on disk (the WAL and checkpoints are still written).
Depending on how many series you have and the write speed of your disk, shutdown can take a little time. Therefore, set the pod termination grace period (or equivalent) for the Prometheus pod accordingly.

Pre-answering some questions

Why take snapshots only on shutdown?

When we look at the number of times a sample is written to disk (or re-written during compaction), it is only a handful. If we took snapshots at intervals while Prometheus is running, this could increase the number of times a sample is written to disk by a large percentage, causing unnecessary write amplification. So we chose to optimize for the majority case of a graceful shutdown, while after a crash we replay part of the WAL depending on the last snapshot present on disk.

Why do we still need WAL?

If Prometheus happens to crash due to various reasons, we need the WAL for durability since a snapshot cannot be taken. Additionally, remote-write depends on the WAL.

Code reference

The code for taking the snapshot and reading it is present in tsdb/head_wal.go.

Here is the entire Prometheus TSDB blog series

Prometheus TSDB (Part 6): Compaction and Retention

Tue, 27 Jul 2021 00:00:00 GMT

Introduction

When Prometheus has created a bunch of blocks, we need to regularly perform maintenance on those blocks to make efficient use of the disk and keep the queries performant.

In this blog post, we are going to look at 2 topics, compaction and retention, which happen in the background when Prometheus is running.

If you have not read the earlier parts of this blog post series, now is a good time to check out part 1 and part 4 to understand this blog post better.

Compaction

Compaction consists of writing a new block from one or more existing blocks (called the source blocks or parent blocks), and at the end, the source blocks are deleted and the new compacted block is used in place of those source blocks.

But why do we need compaction?

As we saw in part 4, any deletions to the data are stored as tombstones in a separate file while the data still stays on disk. So when the tombstones are touching more than some % of the series, we need to remove that data from the disk.
With low enough churn, most of the data in the index in adjacent blocks (w.r.t. time) is going to be the same. So by compacting (merging) those adjacent blocks, we can deduplicate a large part of the index and hence save disk space.
When a query hits >1 block, we have to merge the result we get from individual blocks and that can be a bit of overhead. By merging adjacent blocks, we prevent this overhead.
If there are overlapping blocks (overlapping w.r.t. time), querying them requires deduplication of samples between blocks, which is significantly more expensive than just concatenating chunks from different blocks. Merging these overlapping blocks avoids the need for deduplication.

Below are the two steps for a single compaction to take place. Every minute we initiate a compaction cycle where we run step-1 and only proceed to step-2 if the result of step-1 was not empty. The compaction cycle runs these steps in a loop and exits when step-1 returns an empty result.

Step 1: The "plan"

A "plan" is a list of blocks to be compacted together, picked based on the below conditions in order of priority (highest to lowest). The first condition that is satisfied generates a plan, hence only 1 condition per plan. When none of the conditions meet, the plan is empty.

Condition 1: Overlapping blocks

As we saw above, overlapping blocks can make queries slow. Moreover, Prometheus itself does not produce overlapping blocks; they are only possible if you backfill some data into Prometheus. So the highest priority goes to removing the overlap and getting the state back to what Prometheus would produce.

The plan can consist of more than 2 blocks. Take this example:

|---1---|
            |---2---|
      |---3---|
                  |---4---|

While there are only 2 blocks per overlap, if you look closely, when we compact one overlap, let's say 1 and 3, the result will in turn overlap with 2. So instead of going through multiple cycles to fix all the linked overlaps, the first pass will choose [1 2 3 4] as the plan and reduce the number of compactions.

Another example that produces a single plan [1 2 3]:

|-----1-----|
  |--2--|
     |----3----|  

Note that support for overlapping blocks is not enabled by default in Prometheus. It will error out on startup or at runtime if you have overlapping blocks, unless support is enabled via the --storage.tsdb.allow-overlapping-blocks flag.
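
The overlap condition can be sketched in Go as follows. This is a simplified illustration and not the planner code from Prometheus: the Block type, the half-open [MinT, MaxT) interpretation of the time range, and the function name overlappingPlan are all assumptions made for this example. The idea is to sort blocks by their min time and keep extending a group for as long as the next block starts before the group ends.

```go
package main

import (
	"fmt"
	"sort"
)

// Block is a simplified stand-in for a persistent block: a name plus
// the half-open time range [MinT, MaxT) it covers (an assumption here).
type Block struct {
	Name       string
	MinT, MaxT int64
}

// overlappingPlan returns the first group of transitively overlapping
// blocks, ordered by MinT, or nil when no blocks overlap.
func overlappingPlan(blocks []Block) []Block {
	sorted := append([]Block(nil), blocks...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].MinT < sorted[j].MinT })

	var group []Block
	var groupMaxT int64
	for _, b := range sorted {
		if len(group) > 0 && b.MinT < groupMaxT {
			// b starts before the group ends, so it joins the group.
			group = append(group, b)
			if b.MaxT > groupMaxT {
				groupMaxT = b.MaxT
			}
			continue
		}
		if len(group) > 1 {
			return group
		}
		// No overlap so far: start a fresh candidate group at b.
		group, groupMaxT = []Block{b}, b.MaxT
	}
	if len(group) > 1 {
		return group
	}
	return nil
}

func main() {
	// The first diagram above: 1-3, 3-2 and 2-4 overlap pairwise.
	plan := overlappingPlan([]Block{
		{"1", 0, 8}, {"2", 12, 20}, {"3", 6, 14}, {"4", 18, 26},
	})
	for _, b := range plan {
		fmt.Print(b.Name, " ")
	}
	fmt.Println()
}
```

For the first diagram, all four blocks end up in one plan (listed in min-time order), so a single compaction removes every linked overlap.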

Condition 2: Preset time ranges

In this, we pick >1 block to merge to fill some preset time ranges. In Prometheus, by default, time ranges are [2h 6h 18h 54h 162h 486h], i.e. starting at 2h with a multiple of 3.

Let's take an example of 6h range. We divide the Unix time into buckets as 0-6h, 6h-12h, 12h-18h ..., and if >1 block falls into any single bucket, that forms a plan and we compact them together to form a block up to 6h long.

We also take care to not compact the newest blocks that do not span the entire bucket together yet. For example, the latest 2 blocks of 2h range won't be compacted together since they are (1) new (2) do not span 6h combined. Since Prometheus produces 2h blocks, when we have >=3 blocks, the blocks falling into the same buckets are compacted together.

Similarly, we check all ranges to see if there is any time bucket that has >1 block falling in it. At the end of the compaction cycle, there will be no time bucket with >1 block for all ranges.

In Prometheus, the maximum size of a block can be either 31d (i.e. 744h), or 1/10th of the retention time, whichever is lower.

Condition 3: Tombstones covering some % of series

In the end, if any block has tombstones touching >5% of the total series in the block, we pick that for compaction where the data pointed out by tombstones is deleted from the disk (by creating a new block with no samples covered by the tombstones). This produces a plan with only 1 block.

Step 2: The compaction itself

As we saw in part 4, persistent blocks are immutable. To do any changes, we have to write a new block. Similarly, in compaction, we write an entirely new block, even if it is compaction of a single block. The compaction step only receives the list of blocks to compact together into a single block and is ignorant about the logic used to create this plan.

The compaction logic has been evolving with time, with various memory management techniques and faster merging of data. At a high level, compaction does an N-way merge of the series from the source blocks while iterating through the series one by one in sorted order (the same order in which they appear in the index).

While the series is deduplicated in the index, when the blocks are not overlapping, the chunks are concatenated together from source blocks. If blocks are overlapping, only the overlapping chunks are uncompressed, samples are deduped (i.e. only keep 1 sample for matching timestamp), and compressed back into >=1 chunk while keeping the max size of chunk to 120 samples.
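
Here is a rough Go sketch of the overlapping case: merge two timestamp-sorted runs of samples, keep one sample per timestamp, and cut the result back into chunks of at most 120 samples. The Sample type and function names are hypothetical, the samples are kept uncompressed for simplicity, and which side wins on a duplicate timestamp is an assumption of this sketch.

```go
package main

import "fmt"

// Sample is a simplified (timestamp, value) pair.
type Sample struct {
	T int64
	V float64
}

// maxSamplesPerChunk is the chunk size limit mentioned in the post.
const maxSamplesPerChunk = 120

// mergeDedupe merges two timestamp-sorted slices, keeping a single
// sample per timestamp. On a tie it keeps the sample from a.
func mergeDedupe(a, b []Sample) []Sample {
	out := make([]Sample, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i].T < b[j].T:
			out = append(out, a[i])
			i++
		case a[i].T > b[j].T:
			out = append(out, b[j])
			j++
		default: // Same timestamp: keep one sample, skip the other.
			out = append(out, a[i])
			i++
			j++
		}
	}
	out = append(out, a[i:]...)
	return append(out, b[j:]...)
}

// rechunk splits samples into chunks of at most maxSamplesPerChunk.
func rechunk(samples []Sample) [][]Sample {
	var chunks [][]Sample
	for len(samples) > maxSamplesPerChunk {
		chunks = append(chunks, samples[:maxSamplesPerChunk])
		samples = samples[maxSamplesPerChunk:]
	}
	if len(samples) > 0 {
		chunks = append(chunks, samples)
	}
	return chunks
}

func main() {
	var a, b []Sample
	for t := int64(0); t < 150; t++ {
		a = append(a, Sample{T: t, V: 1})
	}
	for t := int64(100); t < 250; t++ {
		b = append(b, Sample{T: t, V: 2})
	}
	merged := mergeDedupe(a, b)
	fmt.Println(len(merged), len(rechunk(merged)))
}
```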

If there are tombstones in any of the blocks, the chunks of those series are re-written to exclude the time ranges mentioned in the tombstones. The final block won't have any tombstones.

Every compacted block is given a compaction level, which tells the generation of the block, i.e. number of times blocks have been compacted to get this one. It is max(level of source blocks) + 1 for the new block.
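
As a tiny worked example of the level rule, here is a Go sketch. The function name is hypothetical; treating an empty source list as the Head case is an assumption that happens to produce level 1.

```go
package main

import "fmt"

// compactedLevel returns max(level of source blocks) + 1.
// With no source blocks (assumed here to model the Head), it returns 1.
func compactedLevel(sourceLevels []int) int {
	highest := 0
	for _, l := range sourceLevels {
		if l > highest {
			highest = l
		}
	}
	return highest + 1
}

func main() {
	fmt.Println(compactedLevel([]int{1, 1, 2}))
	fmt.Println(compactedLevel(nil))
}
```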

If all samples of a series are deleted, then the series is skipped from the new block entirely. If the block has 0 samples (i.e. empty block), then no block is written to the disk while the source blocks are deleted.

Note that compaction itself does not delete the source blocks, but only marks them as deletable (in their meta.json). The loading of new blocks and deletion of source blocks is handled by the TSDB separately after the compaction cycle has ended.

Head compaction

This is a special kind of compaction where the source is the Head block and the compaction persists part of the Head block into persistent blocks while removing any data pointed by tombstones.

Part 1 has an illustration and explanation of when the Head compaction is done. Head block implements the same interface as that of a persistent block reader, hence we use the same compaction code to also compact the Head block into a persistent block.

The block produced from the Head block has compaction level 1.

Retention

TSDB allows setting retention policies to limit how much data you store in it. There are 2 of them: time-based and size-based retention. You can set either one of them or both. When you set both, it is an OR between them, i.e. the first one to be satisfied will trigger the deletion of the relevant data.

Time based retention

In this, you specify how long the data should span in the TSDB. It is a relative time span calculated w.r.t. the max time of the newest persistent block (and not w.r.t. the Head block). A block is deleted when it goes completely beyond the time retention period, and not when only part of the block goes beyond it.

For example, if the retention period is 15d, as soon as the gap between the oldest block's max time and the newest block's max time goes beyond 15d, the oldest block is deleted.

Size based retention

In this, you specify the max size of the TSDB on disk. It includes the WAL, checkpoint, m-mapped chunks, and persistent blocks. Although we count all of them when deciding on any deletion, the WAL, checkpoint, and m-mapped chunks are required for the normal operation of TSDB. So even if they together go beyond the size retention, only blocks are ever deleted. This means TSDB may take more than the specified max size if you set it too low.

Size-based retention is stricter compared to time-based retention. As soon as the entire space taken is at least 1 byte more than the max size, the oldest block is deleted.
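
A minimal Go sketch of the size check follows. It assumes we walk the blocks from newest to oldest while keeping a running total that starts with the size of the WAL, checkpoint, and m-mapped chunks, and that every block from the point where the total exceeds the limit is dropped. The types and names are hypothetical, and the real bookkeeping in tsdb/db.go may differ in details.

```go
package main

import (
	"fmt"
	"sort"
)

// Block is a simplified persistent block with its on-disk size in bytes.
type Block struct {
	Name string
	MaxT int64
	Size int64
}

// beyondSizeRetention returns the blocks to delete. baseSize is the space
// taken by the WAL, checkpoint and m-mapped chunks, which is counted
// towards the limit but never deleted.
func beyondSizeRetention(blocks []Block, baseSize, maxBytes int64) []Block {
	sorted := append([]Block(nil), blocks...)
	// Newest block first, so that the oldest blocks are dropped first.
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].MaxT > sorted[j].MaxT })

	total := baseSize
	for i, b := range sorted {
		total += b.Size
		if total > maxBytes {
			return sorted[i:]
		}
	}
	return nil
}

func main() {
	blocks := []Block{{"new", 3, 300}, {"mid", 2, 300}, {"old", 1, 300}}
	for _, b := range beyondSizeRetention(blocks, 100, 800) {
		fmt.Println("delete", b.Name)
	}
}
```

Note that if baseSize alone is above maxBytes, every block is selected and the TSDB still exceeds the limit, which is the "set it too low" case described above.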

Code reference

tsdb/compact.go has the code for the creation of plan and compacting the blocks.

storage/merge.go has the code for concatenating/merging the chunks from different blocks (both for overlapping and non-overlapping chunks).

tsdb/db.go has the code for initiating the compaction cycle every minute and calling the step-1 & step-2 on blocks and compaction of the Head block. It also has the code for both types of retention.

Prometheus TSDB (Part 5): Queries

Mon, 04 Jan 2021 00:00:00 GMT

Introduction

In the last four blog posts we saw the internals of how data is stored in the TSDB. It's now time to know how to query it. In this blog post we will be looking at 3 types of query that we do on the persistent blocks and briefly about the Head block.

Part 4 is a prerequisite for this blog post which talks about how data is stored in persistent blocks. Here are part 1, 2 and 3 in case you missed them.

[Edit 2021-01-16]: Some details about how the negation matchers work were updated.

Prologue

Don't confuse this querying with PromQL queries. In this blog post we will see the low level TSDB queries used to get the raw data from the TSDB. PromQL engine performs these TSDB queries to get the raw data and execute PromQL logic on it. So we are working at a layer lower than PromQL engine.

Types of TSDB Queries

There are 3 types of queries that we run on persistent blocks at the time of writing this blog post.

LabelNames(): returns all unique label names present in the block.
LabelValues(name): returns all the possible label values for the label name name as seen in the index.
Select([]matcher): returns the samples for the given slice of matchers for the series. We will talk more about these matchers later.

Before we run any query on the block, we create something called a Querier on the block which has the min time (mint) and max time (maxt) for the query to be run. This mint and maxt is only applicable to the Select query while the other two always look at all the values in the block.

We will discuss how we combine results from multiple blocks after looking at all 3 query types.

`LabelNames()`

This returns all the unique label names present in the block. To recap, in the series {a="b", c="d"}, the label names are "a" and "c".

In Part 4 it was mentioned that the Label Offset Table was no longer used and is being written only for backward compatibility. Hence both LabelNames() and LabelValues() use Postings Offset Table.

When the index of the block is loaded on startup (or on block creation), we store a map, map[labelName][]postingOffset, from each label name to a list of the positions of some of its label values in the postings offset table (every 32nd at the moment, including the first and the last label value). Storing only some of the values helps in saving memory. This map is created by iterating through all the entries in the Postings Offset Table when loading the block.

You can now imagine how we can get the label names - just iterate this in-memory map for its keys and there you have the label names. They are sorted before returning. This is useful for query autocomplete suggestions on UI.

`LabelValues(name)`

We saw above that we store positions of the first and the last label value in the memory for all label names. Hence for LabelValues(name) query, we take the first and last label value position for the given name and iterate on the disk between those two positions to get all the label values for that label name. Another recap here: all the label values for a label name are stored together lexicographically in Postings Offset Table.

For example if the series in the block were {a="b1", c="d1"}, {a="b2", c="d2"} and {a="b3", c="d3"}, then LabelValues("a") would yield ["b1", "b2", "b3"], LabelValues("c") would yield ["d1", "d2", "d3"].

This again helps in query autocomplete suggestions.

`Select([]matcher)`

This query helps in getting the raw TSDB samples from the series described by the given matchers. Before we talk about this query, we need to know what are matchers.

Matcher

A matcher tells the label name-value combination that should match in a series. For example, a matcher a="b" says pick all the series which have the label pair a="b".

There are 4 types of matchers:

Equal labelName="<value>": the value of the label labelName should exactly match the given value.
Not Equal labelName!="<value>": the value of the label labelName should not exactly match the given value.
Regex Equal labelName=~"<regex>": the label value for the label name should satisfy the given regex.
Regex Not Equal labelName!~"<regex>": the label value for the label name should not satisfy the given regex.

The labelName is the full label name and no regex is allowed there. The regex matchers must match the entire label value and not just part of it, since the regex is anchored as ^(?:<regex>)$ before use.

Let's say the series are

s1 = {job="app1", status="404"}
s2 = {job="app2", status="501"}
s3 = {job="bar1", status="402"}
s4 = {job="bar2", status="501"}

Here are some matcher examples

status="501" -> (s2, s4)
status!="501" -> (s1, s3)
job=~"app.*" -> (s1, s2)
job!~"app.*" -> (s3, s4)

And when there are >1 matchers, it is an AND operation (i.e. intersection) between all the matchers.

job=~"app.*", status="501" -> (s1, s2) ∩ (s2, s4) -> (s2)
job=~"bar.*", status!~"5.." -> (s3, s4) ∩ (s1, s3) -> (s3)
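
The matcher semantics above can be tried out with a short Go sketch built on the standard regexp package. The Matcher type here is a simplified stand-in written for this post and not the type used inside Prometheus; a missing label is treated as an empty value, which is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"regexp"
)

type MatchType int

const (
	Equal MatchType = iota
	NotEqual
	RegexEqual
	RegexNotEqual
)

// Matcher is a simplified label matcher.
type Matcher struct {
	Type  MatchType
	Name  string
	Value string
	re    *regexp.Regexp
}

// NewMatcher builds a matcher; regex matchers are anchored so that they
// must match the entire label value.
func NewMatcher(t MatchType, name, value string) (*Matcher, error) {
	m := &Matcher{Type: t, Name: name, Value: value}
	if t == RegexEqual || t == RegexNotEqual {
		re, err := regexp.Compile("^(?:" + value + ")$")
		if err != nil {
			return nil, err
		}
		m.re = re
	}
	return m, nil
}

// Matches reports whether the label value v satisfies the matcher.
func (m *Matcher) Matches(v string) bool {
	switch m.Type {
	case Equal:
		return v == m.Value
	case NotEqual:
		return v != m.Value
	case RegexEqual:
		return m.re.MatchString(v)
	default: // RegexNotEqual
		return !m.re.MatchString(v)
	}
}

// selectSeries returns the indexes of the series satisfying ALL matchers.
func selectSeries(series []map[string]string, ms ...*Matcher) []int {
	var out []int
	for i, s := range series {
		ok := true
		for _, m := range ms {
			if !m.Matches(s[m.Name]) {
				ok = false
				break
			}
		}
		if ok {
			out = append(out, i)
		}
	}
	return out
}

func main() {
	series := []map[string]string{
		{"job": "app1", "status": "404"}, // s1
		{"job": "app2", "status": "501"}, // s2
		{"job": "bar1", "status": "402"}, // s3
		{"job": "bar2", "status": "501"}, // s4
	}
	job, _ := NewMatcher(RegexEqual, "job", "bar.*")
	status, _ := NewMatcher(RegexNotEqual, "status", "5..")
	fmt.Println(selectSeries(series, job, status)) // zero-based indexes
}
```

This brute-force version checks every series against every matcher. The following sections describe how the index avoids that by working on postings lists instead.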

Selecting samples

First step is to get the series that the matchers match. We need to get all the series for individual matchers and then finally intersect them.

We saw in part 4 that a "posting" is the series ID which tells us the position of series info in the index. Postings Offset Table and Postings i together give all the postings for a label-value pair.

Getting postings for a single matcher

If it is an Equal matcher, say a="b", we directly get the postings list position for that from the postings offset table. Since we store positions for only some of the label values for a name, we get the two values between which "b" falls for label name a and iterate the entries between them till we find "b". The a="b" entry in the offset table points to a postings list which is all the series ids that contain a="b". If there is no such entry in the offset table, then it's an empty list of postings for the matcher.

For Regex Equal a=~"<regex>", we have to iterate through all the label values of a in the Postings Offset Table and check for the matcher condition. We take the postings lists of all the matched entries and merge them (union) to get the sorted postings list for this matcher. Taking the example of job=~"app.*" from above, we find job="app1" -> (s1) and job="app2" -> (s2), and after merging we have job=~"app.*" -> (s1, s2).

With Not Equal a!="b" and Regex Not Equal a!~"<regex>", it is a little different in how we internally use them. We get the postings for the corresponding Equal and Regex Equal matchers instead (i.e. a!="b" becomes a="b" and a!~"<regex>" becomes a=~"<regex>"), since getting everything that does not match can be pretty huge in practice. Because of this, you cannot use a standalone negation matcher in a query; you need to have at least one Equal or Regex Equal matcher. We take these postings after the conversion and do a set subtraction instead. See below for an example.

Postings for multiple matchers

Using the above procedure we first get the postings list for all individual matchers. And, similar to what we discussed about matchers before, we intersect them to finally get the postings list (series) that satisfy all the matchers. Note the change in set operation when we have a negation matcher.

job=~"bar.*", status!~"5.*"

-> (job=~"bar.*") ∩ (status!~"5.*")

-> (job=~"bar.*") - (status=~"5.*")

-> ((job="bar1") ∪ (job="bar2")) - (status="501")

-> ((s3) ∪ (s4)) - (s2, s4)

-> (s3, s4) - (s2, s4) -> (s3)

Similarly, if the matchers were a="b", c!="d", e=~"f.*", g!~"h.*", then the set operations would be ((a="b") ∩ (e=~"f.*")) - (c="d") - (g=~"h.*").
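
The set operations used above work well on sorted postings lists, since each of them needs only a single pass. Below is a minimal Go sketch on plain sorted slices; the real implementation works on lazy iterators, so the eager []uint64 slices and the function names here are simplifications for illustration.

```go
package main

import "fmt"

// intersect returns the IDs present in both sorted lists.
func intersect(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

// union merges two sorted lists, dropping duplicates.
func union(a, b []uint64) []uint64 {
	out := make([]uint64, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			out = append(out, a[i])
			i++
		case a[i] > b[j]:
			out = append(out, b[j])
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	out = append(out, a[i:]...)
	return append(out, b[j:]...)
}

// subtract returns the IDs of a that are not present in b.
func subtract(a, b []uint64) []uint64 {
	var out []uint64
	j := 0
	for _, v := range a {
		for j < len(b) && b[j] < v {
			j++
		}
		if j < len(b) && b[j] == v {
			continue
		}
		out = append(out, v)
	}
	return out
}

func main() {
	bar1 := []uint64{3}         // job="bar1" -> (s3)
	bar2 := []uint64{4}         // job="bar2" -> (s4)
	status501 := []uint64{2, 4} // status="501" -> (s2, s4)

	// job=~"bar.*", status!~"5.*"
	fmt.Println(subtract(union(bar1, bar2), status501))
}
```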

Getting the samples finally

Once we have all the series ids (postings) for the matchers, we simply go through those one by one and do the following

Go to the series in the Series table represented by the series id.
Pick all the chunk references from that series which overlap with the time range mint through maxt specified by the querier.
Create an iterator to iterate over these chunks from the chunks directory for samples between mint and maxt.

Select([]matcher) finally returns sample iterators for all the series that match the matchers. The series are sorted w.r.t. their label pairs.

Some Implementation Details

When getting the postings for a matcher, the postings of all the matching entries are not loaded into memory at the same time. Since the index is memory-mapped from disk, the postings are lazily iterated and merged to get the final list.
Similarly, Select([]matcher) does not return the sample iterators for all the series upfront, since the result could contain hundreds of thousands of series. Instead, it returns an iterator which goes over the series one by one, giving the sample iterator for each. The sample iterator in turn loads the chunks lazily when asked for.

Querying multiple blocks

When you have multiple blocks overlapping with the mint through maxt of the querier, the querier is actually a merge querier which holds queriers for individual blocks. The 3 queries now effectively do the following:

LabelNames(): get the sorted label names from all the blocks and do an N-way merge.
LabelValues(name): get the label values from all the blocks and do an N-way merge.
Select([]matcher): get the series iterator from all the blocks using the Select method and do a lazy N-way merge, again in an iterator fashion. This is feasible since the individual series iterators return series in sorted order w.r.t. label pairs.
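
For intuition, here is a small Go sketch of an N-way merge of sorted label names coming from several blocks, using container/heap from the standard library. It is an eager illustration with made-up names; the merge querier itself works lazily on iterators.

```go
package main

import (
	"container/heap"
	"fmt"
)

// cursor points at the next unread element of one sorted input list.
type cursor struct {
	list []string
	pos  int
}

// cursorHeap is a min-heap of cursors ordered by their current element.
type cursorHeap []*cursor

func (h cursorHeap) Len() int            { return len(h) }
func (h cursorHeap) Less(i, j int) bool  { return h[i].list[h[i].pos] < h[j].list[h[j].pos] }
func (h cursorHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *cursorHeap) Push(x interface{}) { *h = append(*h, x.(*cursor)) }
func (h *cursorHeap) Pop() interface{} {
	old := *h
	c := old[len(old)-1]
	*h = old[:len(old)-1]
	return c
}

// mergeSorted does an N-way merge of sorted lists and drops duplicates.
func mergeSorted(lists ...[]string) []string {
	h := &cursorHeap{}
	for _, l := range lists {
		if len(l) > 0 {
			*h = append(*h, &cursor{list: l})
		}
	}
	heap.Init(h)

	var out []string
	for h.Len() > 0 {
		c := (*h)[0] // Cursor with the smallest current element.
		v := c.list[c.pos]
		if len(out) == 0 || out[len(out)-1] != v {
			out = append(out, v)
		}
		c.pos++
		if c.pos == len(c.list) {
			heap.Pop(h) // This input is exhausted.
		} else {
			heap.Fix(h, 0) // Re-order after advancing the cursor.
		}
	}
	return out
}

func main() {
	fmt.Println(mergeSorted(
		[]string{"instance", "job"},
		[]string{"__name__", "job"},
		[]string{"instance", "status"},
	))
}
```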

Querying Head block

The Head block stores the entire map of label-value pairs and all the postings list in the memory (an example Go representation map[labelName]map[labelValue]postingsList), hence there is no special care required in accessing them. The remaining procedure for performing the 3 queries remains the same with the map and the postings list.
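
A minimal Go sketch of such an in-memory structure and of the three lookups on it is shown below. The type and method names are invented for this example, and it assumes series are added in increasing order of their IDs so that each postings list stays sorted.

```go
package main

import (
	"fmt"
	"sort"
)

// memPostings maps label name -> label value -> sorted series IDs,
// following the map[labelName]map[labelValue]postingsList shape.
type memPostings map[string]map[string][]uint64

// add registers a series under every one of its label pairs.
func (p memPostings) add(id uint64, lbls map[string]string) {
	for name, value := range lbls {
		if p[name] == nil {
			p[name] = map[string][]uint64{}
		}
		p[name][value] = append(p[name][value], id)
	}
}

// labelNames returns all label names, sorted.
func (p memPostings) labelNames() []string {
	names := make([]string, 0, len(p))
	for n := range p {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// labelValues returns all values seen for a label name, sorted.
func (p memPostings) labelValues(name string) []string {
	values := make([]string, 0, len(p[name]))
	for v := range p[name] {
		values = append(values, v)
	}
	sort.Strings(values)
	return values
}

// get returns the postings list for a single label pair.
func (p memPostings) get(name, value string) []uint64 {
	return p[name][value]
}

func main() {
	p := memPostings{}
	p.add(1, map[string]string{"a": "b1", "c": "d1"})
	p.add(2, map[string]string{"a": "b2", "c": "d2"})
	p.add(3, map[string]string{"a": "b3", "c": "d3"})

	fmt.Println(p.labelNames())
	fmt.Println(p.labelValues("a"))
	fmt.Println(p.get("c", "d2"))
}
```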

Code reference

tsdb/index/index.go has the code for performing the LabelNames() and LabelValues(name) queries on the persistent block and also for getting the merged postings list for given label name and values (not the matcher itself).

tsdb/querier.go has the code for performing the Select([]matcher) query on the persistent block including filtering the label values for the matchers before asking the index for postings list. tsdb/chunks/chunks.go has the code for getting the chunks from the disk.

tsdb/head.go has the code for performing all 3 queries on the Head block.

tsdb/db.go and storage/merge.go have the code for the merged querier when there are multiple blocks involved in the query.

Prometheus TSDB (Part 4): Persistent Block and its Index

Sun, 18 Oct 2020 00:00:00 GMT

Introduction

In Part 1, Part 2, and Part 3, we have covered most of the things related to the Head (in-memory) block, at least as of the time of writing this post (there are more things to come in the Head). In this blog post, we will dive deeper into the persistent blocks which reside on disk.

There is a lot of information to digest here, so sit back and relax, and maybe grab a coffee.

What's a persistent block and when is it created

A block on disk is a collection of chunks for a fixed time range consisting of its own index. It is a directory with multiple files inside it. Every block has a unique ID, which is a Universally Unique Lexicographically Sortable Identifier (ULID).

A block has an interesting property that the samples in it are immutable. If you want to add more samples, or delete some, or update some, you have to rewrite the entire block with the required modifications and the new block has a new ID. There is no relationship between these 2 blocks. We have deletions on blocks via tombstones while not touching the samples, since re-writing a block on every delete request does not sound sane; we will discuss more about it in this blog post.

We saw in Part 1 that when the Head block fills up with data ranging chunkRange*3/2 in time, we take the first chunkRange of data and convert it into a persistent block.

Here we call that chunkRange the blockRange in the context of blocks, and the first block cut from the Head spans 2h by default in Prometheus.

Looking at the overall picture of TSDB below

When the blocks get old, multiple blocks are compacted (or merged) to form a new bigger block while the old ones are deleted. So we have 2 ways of creating a block, from the Head and from existing blocks. We will look into compaction in future blog posts.

Contents of a block

A block consists of 4 parts

meta.json (file): the metadata of the block.
chunks (directory): contains the raw chunks without any metadata about the chunks.
index (file): the index of this block.
tombstones (file): deletion markers to exclude samples when querying the block.

With 01EM6Q6A1YPX4G9TEB20J22B2R as an example of block ID, here is how the files look on disk

data
├── 01EM6Q6A1YPX4G9TEB20J22B2R
|   ├── chunks
|   |   ├── 000001
|   |   └── 000002
|   ├── index
|   ├── meta.json
|   └── tombstones
├── chunks_head
|   ├── 000001
|   └── 000002
└── wal
    ├── checkpoint.000003
    |   ├── 000000
    |   └── 000001
    ├── 000004
    └── 000005

Let's dive deeper into each one of them.

1. `meta.json`

This contains all the required metadata for the block as a whole. Here is an example:

{
    "ulid": "01EM6Q6A1YPX4G9TEB20J22B2R",
    "minTime": 1602237600000,
    "maxTime": 1602244800000,
    "stats": {
        "numSamples": 553673232,
        "numSeries": 1346066,
        "numChunks": 4440437
    },
    "compaction": {
        "level": 1,
        "sources": [
            "01EM65SHSX4VARXBBHBF0M0FDS",
            "01EM6GAJSYWSQQRDY782EA5ZPN"
        ]
    },
    "version": 1
}

version tells us how to parse the meta file.

Though the directory name is set to the ULID, only the one present in meta.json as ulid is the valid ID; the directory name can be anything.

minTime and maxTime are the absolute minimum and maximum timestamps among all the chunks present in the block.

stats tells the number of series, samples, and chunks present in the block.

compaction tells the history of the block. level tells how many generations this block has seen. sources tells which blocks this block was created from (i.e. the blocks which were merged to form this block). If it was created from the Head block, then sources is set to itself (01EM6Q6A1YPX4G9TEB20J22B2R in this case).
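
To see these fields in action, here is a small Go sketch that parses the example above with encoding/json. The BlockMeta struct is written for this example from the fields shown in the JSON; it is not copied from the Prometheus code base, and the version check is an assumption about how a reader might guard the parsing.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// BlockMeta mirrors the fields of the meta.json example above.
type BlockMeta struct {
	ULID    string `json:"ulid"`
	MinTime int64  `json:"minTime"` // Unix milliseconds
	MaxTime int64  `json:"maxTime"` // Unix milliseconds
	Stats   struct {
		NumSamples uint64 `json:"numSamples"`
		NumSeries  uint64 `json:"numSeries"`
		NumChunks  uint64 `json:"numChunks"`
	} `json:"stats"`
	Compaction struct {
		Level   int      `json:"level"`
		Sources []string `json:"sources"`
	} `json:"compaction"`
	Version int `json:"version"`
}

const exampleMeta = `{
    "ulid": "01EM6Q6A1YPX4G9TEB20J22B2R",
    "minTime": 1602237600000,
    "maxTime": 1602244800000,
    "stats": {"numSamples": 553673232, "numSeries": 1346066, "numChunks": 4440437},
    "compaction": {
        "level": 1,
        "sources": ["01EM65SHSX4VARXBBHBF0M0FDS", "01EM6GAJSYWSQQRDY782EA5ZPN"]
    },
    "version": 1
}`

// parseMeta decodes a meta.json payload and rejects unknown versions.
func parseMeta(data []byte) (*BlockMeta, error) {
	var m BlockMeta
	if err := json.Unmarshal(data, &m); err != nil {
		return nil, err
	}
	if m.Version != 1 {
		return nil, fmt.Errorf("unsupported meta version %d", m.Version)
	}
	return &m, nil
}

func main() {
	m, err := parseMeta([]byte(exampleMeta))
	if err != nil {
		panic(err)
	}
	span := time.Duration(m.MaxTime-m.MinTime) * time.Millisecond
	fmt.Println(m.ULID, span, m.Stats.NumSeries, m.Compaction.Level)
}
```

The difference between maxTime and minTime in the example is 7,200,000 ms, i.e. the 2h block range mentioned earlier.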

2. `chunks`

The chunks directory contains a sequence of numbered files similar to the WAL/checkpoint/head chunks. Each file is capped at 512MiB. This is the format of an individual file inside this directory:

┌──────────────────────────────┐
│  magic(0x85BD40DD) <4 byte>  │
├──────────────────────────────┤
│    version(1) <1 byte>       │
├──────────────────────────────┤
│    padding(0) <3 byte>       │
├──────────────────────────────┤
│ ┌──────────────────────────┐ │
│ │         Chunk 1          │ │
│ ├──────────────────────────┤ │
│ │          ...             │ │
│ ├──────────────────────────┤ │
│ │         Chunk N          │ │
│ └──────────────────────────┘ │
└──────────────────────────────┘

It looks very similar to the memory-mapped head chunks file. The magic number identifies this file as a chunks file. version tells us how to parse this file. padding is for any future headers. This is then followed by a list of chunks.

Here is the format of an individual chunk:

┌───────────────┬───────────────────┬──────────────┬────────────────┐
│ len <uvarint> │ encoding <1 byte> │ data <bytes> │ CRC32 <4 byte> │
└───────────────┴───────────────────┴──────────────┴────────────────┘

It again looks similar to the memory-mapped head chunks on disk, except that it is missing the series ref, mint and maxt. We needed this additional information for the Head chunks to recreate the in-memory index during startup. But in the case of blocks, we have this additional information in the index, because the index is the place where it finally belongs, hence we don't need it here.

To access these chunks, we again need the chunk reference that we talked about in Part 3. Repeating what I had said: the reference is 8 bytes long. The first 4 bytes tell the file number in which the chunk exists, and the last 4 bytes tell the offset in the file where the chunk starts (i.e. the first byte of the len). If the chunk was in the file 00093 and the len of the chunk starts at byte offset 1234 in the file, then the reference of that chunk would be (92 << 32) | 1234 (left shift bits and then bitwise OR). While the file names use 1-based indexing, the chunk references use 0-based indexing. Hence 00093 got converted to 92 when calculating the chunk reference.
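
The bit manipulation described above looks like this as a Go sketch. The function names are hypothetical; only the packing scheme itself comes from the text.

```go
package main

import "fmt"

// newChunkRef packs a 1-based file number and a byte offset into an
// 8-byte reference: upper 4 bytes = file number - 1, lower 4 = offset.
func newChunkRef(fileNumber int, offset uint32) uint64 {
	return uint64(fileNumber-1)<<32 | uint64(offset)
}

// unpackChunkRef reverses newChunkRef.
func unpackChunkRef(ref uint64) (fileNumber int, offset uint32) {
	return int(ref>>32) + 1, uint32(ref)
}

func main() {
	ref := newChunkRef(93, 1234)
	fmt.Println(ref == uint64(92)<<32|1234)

	file, offset := unpackChunkRef(ref)
	fmt.Printf("file %05d, offset %d\n", file, offset)
}
```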

Here is the link for the upstream docs on the chunks format.

3. `index`

Index contains all that you need to query the data of this block. It does not share any data with any other blocks or external entity which makes it possible to read/query the block without any dependencies.

The index is an "inverted index" which is also widely used in indexing documents. Fabian talks more about inverted index in his blog post, hence I am skipping that topic here since this post is too long already.

Here is the high level view of the index which we will dive into shortly.

┌────────────────────────────┬─────────────────────┐
│ magic(0xBAAAD700) <4b>     │ version(1) <1 byte> │
├────────────────────────────┴─────────────────────┤
│ ┌──────────────────────────────────────────────┐ │
│ │                 Symbol Table                 │ │
│ ├──────────────────────────────────────────────┤ │
│ │                    Series                    │ │
│ ├──────────────────────────────────────────────┤ │
│ │                 Label Index 1                │ │
│ ├──────────────────────────────────────────────┤ │
│ │                      ...                     │ │
│ ├──────────────────────────────────────────────┤ │
│ │                 Label Index N                │ │
│ ├──────────────────────────────────────────────┤ │
│ │                   Postings 1                 │ │
│ ├──────────────────────────────────────────────┤ │
│ │                      ...                     │ │
│ ├──────────────────────────────────────────────┤ │
│ │                   Postings N                 │ │
│ ├──────────────────────────────────────────────┤ │
│ │              Label Offset Table              │ │
│ ├──────────────────────────────────────────────┤ │
│ │             Postings Offset Table            │ │
│ ├──────────────────────────────────────────────┤ │
│ │                      TOC                     │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘

Same as other files, the magic number identifies this file as an index file. version tells us how to parse this file. The entry point to this index is the TOC, which stands for Table Of Contents. So we will first start from TOC and learn about other parts of the index.

A. `TOC`

┌─────────────────────────────────────────┐
│ ref(symbols) <8b>                       │ -> Symbol Table
├─────────────────────────────────────────┤
│ ref(series) <8b>                        │ -> Series
├─────────────────────────────────────────┤
│ ref(label indices start) <8b>           │ -> Label Index 1
├─────────────────────────────────────────┤
│ ref(label offset table) <8b>            │ -> Label Offset Table
├─────────────────────────────────────────┤
│ ref(postings start) <8b>                │ -> Postings 1
├─────────────────────────────────────────┤
│ ref(postings offset table) <8b>         │ -> Postings Offset Table
├─────────────────────────────────────────┤
│ CRC32 <4b>                              │
└─────────────────────────────────────────┘

It tells us where exactly (the byte offset in the file) the individual components of the index start. I have marked what each reference points to in the index format above. The starting point of the next component also tells us where an individual component ends. If any of the references is 0, it indicates that the corresponding section does not exist in the index, and hence should be skipped while reading.

Since TOC is fixed size, the last 52 bytes of the file can be taken as the TOC.
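Reading the TOC from the tail of the file can be sketched as follows. This is only an illustration under a few assumptions: the `toc` type and `parseTOC` function are hypothetical names, the integers are assumed to be big-endian, and the trailing CRC32 is not verified here.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// tocLen is six 8-byte references plus a 4-byte CRC32, i.e. 52 bytes.
const tocLen = 6*8 + 4

// toc mirrors the six references in the TOC diagram (illustrative type).
type toc struct {
	Symbols, Series, LabelIndices, LabelOffsetTable, Postings, PostingsOffsetTable uint64
}

// parseTOC decodes the last 52 bytes of an index held in memory.
// Assumption: big-endian integers. The CRC32 is skipped in this sketch.
func parseTOC(index []byte) (toc, error) {
	if len(index) < tocLen {
		return toc{}, errors.New("index too short to contain a TOC")
	}
	b := index[len(index)-tocLen:]
	return toc{
		Symbols:             binary.BigEndian.Uint64(b[0:8]),
		Series:              binary.BigEndian.Uint64(b[8:16]),
		LabelIndices:        binary.BigEndian.Uint64(b[16:24]),
		LabelOffsetTable:    binary.BigEndian.Uint64(b[24:32]),
		Postings:            binary.BigEndian.Uint64(b[32:40]),
		PostingsOffsetTable: binary.BigEndian.Uint64(b[40:48]),
	}, nil
}

func main() {
	// A fake "index": 10 bytes of other data followed by a 52-byte TOC.
	file := make([]byte, 10+tocLen)
	binary.BigEndian.PutUint64(file[10:], 5)   // ref(symbols)
	binary.BigEndian.PutUint64(file[18:], 128) // ref(series)

	parsed, err := parseTOC(file)
	// LabelIndices is 0 here, which a reader would treat as "section absent".
	fmt.Println(parsed.Symbols, parsed.Series, parsed.LabelIndices, err)
}
```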

As you will notice in the coming sections, each component will have its own checksum, i.e. CRC32 to check for the integrity of the underlying data.

B. `Symbol Table`

This section holds a sorted list of deduplicated strings which are found in label pairs of all the series in this block. For example if the series is {a="y", x="b"}, then the symbols would be "a", "b", "x", "y".

┌────────────────────┬─────────────────────┐
│ len <4b>           │ #symbols <4b>       │
├────────────────────┴─────────────────────┤
│ ┌──────────────────────┬───────────────┐ │
│ │ len(str_1) <uvarint> │ str_1 <bytes> │ │
│ ├──────────────────────┴───────────────┤ │
│ │                . . .                 │ │
│ ├──────────────────────┬───────────────┤ │
│ │ len(str_n) <uvarint> │ str_n <bytes> │ │
│ └──────────────────────┴───────────────┘ │
├──────────────────────────────────────────┤
│ CRC32 <4b>                               │
└──────────────────────────────────────────┘

The len <4b> is the number of bytes in this section and #symbols is the number of symbols. It is followed by #symbols UTF-8 encoded strings, where each string is prefixed with its length and followed by its raw bytes. The section ends with a checksum (CRC32) for integrity.

The other sections in the index can refer to this symbol table for any strings and hence significantly reduce the index size. The byte offset at which the symbol starts in the file (i.e. the start of len(str_i)) forms the reference for the corresponding symbol which can be used in other places instead of the actual string. When you want the actual string, you can use the offset to get it from this table.
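As a small illustration of what goes into the symbol table, the sketch below collects the sorted, deduplicated strings from a set of series. It models label sets as plain maps, which is my simplification and not how Prometheus represents labels.

```go
package main

import (
	"fmt"
	"sort"
)

// symbolsOf collects every label name and label value across the given
// series, deduplicates them, and returns them in sorted order.
func symbolsOf(series []map[string]string) []string {
	seen := map[string]struct{}{}
	for _, lset := range series {
		for name, value := range lset {
			seen[name] = struct{}{}
			seen[value] = struct{}{}
		}
	}
	out := make([]string, 0, len(seen))
	for s := range seen {
		out = append(out, s)
	}
	sort.Strings(out)
	return out
}

func main() {
	fmt.Println(symbolsOf([]map[string]string{{"a": "y", "x": "b"}})) // [a b x y]
}
```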

C. `Series`

This section contains a list of all the series information present in this block. The series are sorted lexicographically by their label sets.

┌───────────────────────────────────────┐
│ ┌───────────────────────────────────┐ │
│ │   series_1                        │ │
│ ├───────────────────────────────────┤ │
│ │                 . . .             │ │
│ ├───────────────────────────────────┤ │
│ │   series_n                        │ │
│ └───────────────────────────────────┘ │
└───────────────────────────────────────┘

Each series entry is 16 byte aligned, which means the byte offset at which the series starts is divisible by 16. Hence we set the ID of the series to be offset/16 where offset points to the start of the series entry. This ID is used to reference this series and whenever you want to access the series, you can get the location in the index by doing ID*16.

Since the series are lexicographically sorted by their label sets, a sorted list of series IDs implies a sorted list of series label sets.
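The relationship between the aligned offset and the series ID is simple arithmetic. Here is a sketch with hypothetical helper names:

```go
package main

import "fmt"

const seriesAlign = 16

// alignUp returns the smallest multiple of 16 that is >= offset, i.e. where
// the next series entry may start after padding.
func alignUp(offset uint64) uint64 {
	return (offset + seriesAlign - 1) / seriesAlign * seriesAlign
}

// seriesID converts the (aligned) start offset of a series entry to its ID.
func seriesID(offset uint64) uint64 { return offset / seriesAlign }

// seriesOffset converts a series ID back to the byte offset in the index.
func seriesOffset(id uint64) uint64 { return id * seriesAlign }

func main() {
	start := alignUp(1930) // previous entry ended at byte 1930
	id := seriesID(start)
	fmt.Println(start, id, seriesOffset(id)) // 1936 121 1936
}
```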

Here comes a confusing part for many in the index: what is a posting? The above series ID is a posting. So whenever we say posting in the context of Prometheus TSDB, it refers to a series ID. But why posting? Here is my best guess: in the world of indexing the documents and its words with an inverted index, the document IDs are usually called a "posting" in the index. Here you can consider a series to be a document and a label-value pair of a series to be words in the document. Series ID -> document ID, document ID -> posting, series ID -> posting.

Each entry holds the label set of the series and references to all the chunks belonging to this series (the reference is the one from the chunks directory).

┌──────────────────────────────────────────────────────┐
│ len <uvarint>                                        │
├──────────────────────────────────────────────────────┤
│ ┌──────────────────────────────────────────────────┐ │
│ │            labels count <uvarint64>              │ │
│ ├──────────────────────────────────────────────────┤ │
│ │  ┌────────────────────────────────────────────┐  │ │
│ │  │ ref(l_i.name) <uvarint32>                  │  │ │
│ │  ├────────────────────────────────────────────┤  │ │
│ │  │ ref(l_i.value) <uvarint32>                 │  │ │
│ │  └────────────────────────────────────────────┘  │ │
│ │                       ...                        │ │
│ ├──────────────────────────────────────────────────┤ │
│ │            chunks count <uvarint64>              │ │
│ ├──────────────────────────────────────────────────┤ │
│ │  ┌────────────────────────────────────────────┐  │ │
│ │  │ c_0.mint <varint64>                        │  │ │
│ │  ├────────────────────────────────────────────┤  │ │
│ │  │ c_0.maxt - c_0.mint <uvarint64>            │  │ │
│ │  ├────────────────────────────────────────────┤  │ │
│ │  │ ref(c_0.data) <uvarint64>                  │  │ │
│ │  └────────────────────────────────────────────┘  │ │
│ │  ┌────────────────────────────────────────────┐  │ │
│ │  │ c_i.mint - c_i-1.maxt <uvarint64>          │  │ │
│ │  ├────────────────────────────────────────────┤  │ │
│ │  │ c_i.maxt - c_i.mint <uvarint64>            │  │ │
│ │  ├────────────────────────────────────────────┤  │ │
│ │  │ ref(c_i.data) - ref(c_i-1.data) <varint64> │  │ │
│ │  └────────────────────────────────────────────┘  │ │
│ │                       ...                        │ │
│ └──────────────────────────────────────────────────┘ │
├──────────────────────────────────────────────────────┤
│ CRC32 <4b>                                           │
└──────────────────────────────────────────────────────┘

The starting len and ending CRC32 are the same as before. The series entry starts with the number of label-value pairs present in the series, as labels count, followed by the label-value pairs ordered lexicographically by label name. Instead of storing the actual strings, we use the symbol references from the symbol table. If the series was {a="y", x="b"}, the series entry for it would include symbol references for "a", "y", "x", "b" in that order.

Next is the number of chunks (chunks count) that belong to this series in the chunks directory. This is followed by a sequence of metadata about the indexed chunks, containing the min time (timestamp of first sample) and max time (timestamp of last sample) of each chunk and its reference in the chunks directory. These are sorted by the mint of the chunks. As you may have noticed in the above format, we store mint and maxt as the difference from the previous timestamp (mint of the same chunk or maxt of the previous chunk). This reduces the size of the chunk metadata, which matters because chunk metadata forms a huge part of the index by size.
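To see why the deltas help, here is a rough sketch of encoding and decoding the chunk metadata with variable-length integers from Go's `encoding/binary` package (the `Append*` helpers need Go 1.19 or newer). The `chunkMeta` type and function names are illustrative, there is no error handling for malformed input, and it assumes chunks are sorted and non-overlapping so the unsigned deltas are never negative.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// chunkMeta is an illustrative stand-in for the per-chunk metadata.
type chunkMeta struct {
	MinTime, MaxTime int64
	Ref              uint64
}

// encodeChunkMetas writes the chunks count, then the first chunk in full and
// every following chunk as deltas against the previous values.
func encodeChunkMetas(metas []chunkMeta) []byte {
	buf := binary.AppendUvarint(nil, uint64(len(metas)))
	for i, m := range metas {
		if i == 0 {
			buf = binary.AppendVarint(buf, m.MinTime)
			buf = binary.AppendUvarint(buf, uint64(m.MaxTime-m.MinTime))
			buf = binary.AppendUvarint(buf, m.Ref)
			continue
		}
		prev := metas[i-1]
		buf = binary.AppendUvarint(buf, uint64(m.MinTime-prev.MaxTime))
		buf = binary.AppendUvarint(buf, uint64(m.MaxTime-m.MinTime))
		buf = binary.AppendVarint(buf, int64(m.Ref)-int64(prev.Ref))
	}
	return buf
}

// decodeChunkMetas reverses encodeChunkMetas. It trusts its input.
func decodeChunkMetas(buf []byte) []chunkMeta {
	uvar := func() uint64 {
		v, n := binary.Uvarint(buf)
		buf = buf[n:]
		return v
	}
	svar := func() int64 {
		v, n := binary.Varint(buf)
		buf = buf[n:]
		return v
	}
	count := uvar()
	metas := make([]chunkMeta, 0, count)
	for i := uint64(0); i < count; i++ {
		var m chunkMeta
		if i == 0 {
			m.MinTime = svar()
			m.MaxTime = m.MinTime + int64(uvar())
			m.Ref = uvar()
		} else {
			prev := metas[i-1]
			m.MinTime = prev.MaxTime + int64(uvar())
			m.MaxTime = m.MinTime + int64(uvar())
			m.Ref = uint64(int64(prev.Ref) + svar())
		}
		metas = append(metas, m)
	}
	return metas
}

func main() {
	metas := []chunkMeta{
		{MinTime: 1600000000000, MaxTime: 1600007200000, Ref: 8},
		{MinTime: 1600007215000, MaxTime: 1600014400000, Ref: 600},
	}
	enc := encodeChunkMetas(metas)
	dec := decodeChunkMetas(enc)
	fmt.Println(len(enc), "bytes encoded vs", len(metas)*24, "bytes as fixed 8-byte fields")
	fmt.Println(dec[0] == metas[0] && dec[1] == metas[1])
}
```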

Holding the mint and maxt in the index allows queries to skip the chunks which are not required for the queried time range. This is different from the m-mapped Head chunks from disk where mint and maxt are with the chunks to restore them in the in-memory index of Head during startup.

D. `Label Offset Table` and `Label Index i`

Both of these are coupled, so we will discuss both together. Label Index i refers to any of Label Index 1 ... Label Index N in the index; we will talk about a single entry Label Index i.

These two are not used anymore; they are written for backward compatibility but not read in the latest Prometheus version. However, it is useful to understand what these parts were for, and we will see in the next section what they are replaced with.

The aim of these sections is to index the possible values for a label name. For example if we have two series {a="b1", x="y1"} and {a="b2", x="y2"}, this section allows us to identify that the possible values for label name a are [b1, b2] and for x they are [y1, y2]. The format also allows indexing something like the label names (a, x) have the possible values [(b1, y1), (b2, y2)], but we don't use this in Prometheus.

Label Index i

This is the format of a single Label Index i entry; multiple of these are stored in sequence, in no particular order:

┌───────────────┬────────────────┬────────────────┐
│ len <4b>      │ #names <4b>    │ #entries <4b>  │
├───────────────┴────────────────┴────────────────┤
│ ┌─────────────────────────────────────────────┐ │
│ │ ref(value_0) <4b>                           │ │
│ ├─────────────────────────────────────────────┤ │
│ │ ...                                         │ │
│ ├─────────────────────────────────────────────┤ │
│ │ ref(value_n) <4b>                           │ │
│ └─────────────────────────────────────────────┘ │
│                      . . .                      │
├─────────────────────────────────────────────────┤
│ CRC32 <4b>                                      │
└─────────────────────────────────────────────────┘

From the above examples, this helps us store the lists [b1, b2], [y1, y2], and [(b1, y1), (b2, y2)], with each list getting its own entry in the index. len and CRC32 are the same as before.

#names is the number of label names the values are for. For example if we are indexing for a or x, #names would be 1. If we are indexing for (a, x), i.e. 2 label names, then #names would be 2.

#entries is the number of possible values for the label names. If the names are a or x or even (a, x), #entries is 2 because they have 2 possible values each.

It is followed by #names * #entries number of references to the value symbols.

Example for [b1, b2]

┌────┬───┬───┬─────────┬─────────┬───────┐
│ 16 │ 1 │ 2 │ ref(b1) │ ref(b2) │ CRC32 │
└────┴───┴───┴─────────┴─────────┴───────┘

Example for [(b1, y1), (b2, y2)]

┌────┬───┬───┬─────────┬─────────┬─────────┬─────────┬───────┐
│ 24 │ 2 │ 2 │ ref(b1) │ ref(y1) │ ref(b2) │ ref(y2) │ CRC32 │
└────┴───┴───┴─────────┴─────────┴─────────┴─────────┴───────┘

Label Offset Table

While the Label Index i stores the list of possible values, Label Offset Table brings together the labels names and completes the label name-value index.

Here is the format of Label Offset Table

┌─────────────────────┬──────────────────────┐
│ len <4b>            │ #entries <4b>        │
├─────────────────────┴──────────────────────┤
│ ┌────────────────────────────────────────┐ │
│ │  n = 1 <1b>                            │ │
│ ├──────────────────────┬─────────────────┤ │
│ │ len(name) <uvarint>  │ name <bytes>    │ │
│ ├──────────────────────┴─────────────────┤ │
│ │  offset <uvarint64>                    │ │
│ └────────────────────────────────────────┘ │
│                    . . .                   │
├────────────────────────────────────────────┤
│  CRC32 <4b>                                │
└────────────────────────────────────────────┘

This stores a sequence of entries that point a label name to its possible values, for example, pointing a to the above Label Index i containing [b1, b2].

The above table has len and CRC32 like other parts. #entries is the number of entries in this table, and it is followed by the actual entries.

Each entry starts with n, which is the number of label names, followed by n actual label names (the strings themselves, not symbol references). As you may have noticed, the layout len(name) │ name is the same as how strings are stored in the symbol table. In Prometheus, we only have n=1, which means we only index possible label values for a single label name, and not for tuples like (a, x), because the number of such combinations would be huge and impractical to store.

Since we index single label names, we can afford to store the string directly because the number of label names is usually small. This avoids loading a disk page from the symbol table for the label name lookup.

The entry ends with an offset in the file which points to the start of relevant Label Index i. For example, for label name a, the offset will point to the Label Index i storing [b1, b2]. Label name x will point to the Label Index i storing [y1, y2].

Since we are only indexing individual label names, we also don't store the Label Index i for tuples like (a, x) though we saw an example above that it is possible to do. It was once considered to have such composite label value index, but it was dropped as there were not many use cases for it.

E. `Postings Offset Table` and `Postings i`

These two are linked in a similar way as above where Postings i stores a list of postings and Postings Offset Table refers to those entries with the offset. If you can recall, a posting is a series ID, which in the context of this index is the offset at which the series entry starts in the file divided by 16 since it's 16 byte aligned.

Postings i

A single Postings i represents a "postings list", which is basically a sorted list of postings. Let us see the format of an individual such list and we will work with an example.

┌────────────────────┬────────────────────┐
│ len <4b>           │ #entries <4b>      │
├────────────────────┴────────────────────┤
│ ┌─────────────────────────────────────┐ │
│ │ ref(series_1) <4b>                  │ │
│ ├─────────────────────────────────────┤ │
│ │ ...                                 │ │
│ ├─────────────────────────────────────┤ │
│ │ ref(series_n) <4b>                  │ │
│ └─────────────────────────────────────┘ │
├─────────────────────────────────────────┤
│ CRC32 <4b>                              │
└─────────────────────────────────────────┘

This format cannot get much simpler. It has len and CRC32 as usual. len is followed by #entries, which is the number of postings in this list, and then a sorted list of #entries postings (series IDs, which are also the references).

You might be wondering which postings do we store in this list. Let's take an example of these two series: {a="b", x="y1"} with series ID 120, {a="b", x="y2"} with series ID 145. Similar to how we looked at possible label values for a label name above, here we look at the possible series for a label-value pair. From the above example, a="b" is present in both the series, so we have to store a list [120, 145]. For x="y1" and x="y2", they appear in only one of the series, so we have to store [120] and [145] for them respectively.

We only store the lists for the label pairs that we see in the series. So in the above example, we don't store postings list for something like a="y1" or x="b", because they never appear in any series.
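Building these lists from a set of series can be sketched like this. The types are hypothetical and much simpler than what Prometheus uses; the point is only to show which list each label pair ends up with.

```go
package main

import (
	"fmt"
	"sort"
)

// labelPair and series are illustrative types for this sketch.
type labelPair struct{ Name, Value string }

type series struct {
	ID     uint64
	Labels []labelPair
}

// buildPostings maps every label pair seen in any series to the sorted list
// of IDs of the series that contain it.
func buildPostings(all []series) map[labelPair][]uint64 {
	postings := map[labelPair][]uint64{}
	for _, s := range all {
		for _, l := range s.Labels {
			postings[l] = append(postings[l], s.ID)
		}
	}
	for _, ids := range postings {
		sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	}
	return postings
}

func main() {
	p := buildPostings([]series{
		{ID: 120, Labels: []labelPair{{"a", "b"}, {"x", "y1"}}},
		{ID: 145, Labels: []labelPair{{"a", "b"}, {"x", "y2"}}},
	})
	fmt.Println(p[labelPair{"a", "b"}])  // [120 145]
	fmt.Println(p[labelPair{"x", "y1"}]) // [120]
	fmt.Println(p[labelPair{"x", "y2"}]) // [145]

	_, ok := p[labelPair{"a", "y1"}]
	fmt.Println(ok) // false: this pair never appears in any series
}
```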

Postings Offset Table

Like how the Label Offset Table points a label name to possible values in Label Index i, similarly Postings Offset Table points a label-pair to possible postings in Postings i.

┌─────────────────────┬──────────────────────┐
│ len <4b>            │ #entries <4b>        │
├─────────────────────┴──────────────────────┤
│ ┌────────────────────────────────────────┐ │
│ │  n = 2 <1b>                            │ │
│ ├──────────────────────┬─────────────────┤ │
│ │ len(name) <uvarint>  │ name <bytes>    │ │
│ ├──────────────────────┼─────────────────┤ │
│ │ len(value) <uvarint> │ value <bytes>   │ │
│ ├──────────────────────┴─────────────────┤ │
│ │  offset <uvarint64>                    │ │
│ └────────────────────────────────────────┘ │
│                    . . .                   │
├────────────────────────────────────────────┤
│  CRC32 <4b>                                │
└────────────────────────────────────────────┘

This looks very similar to the Label Offset Table, but with an addition of the label value. len and CRC32 is as usual.

#entries is the number of entries in this table. n is always 2, which tells number of string elements that follow (i.e. a label name and a label value). Since we have n here, the table could possibly index composite label pairs like (a="b", x="y1"), but we don't do it as the use cases for that are very limited and don't have a good trade-off.

n is followed by the actual strings for the label name and the label value. Again, there are generally not a lot of individual label pairs, hence we can afford to store the raw strings here and avoid an indirection to the symbol table, as this table will be accessed many times. The main saving from the symbol table comes in the Series section, where the same symbol is repeated many times.

A single entry ends with an offset to the start of a postings list Postings i. From above example, an entry for name="a", value="b" will point to the postings list [120, 145], entry for name="x", value="y1" will point to the postings list [120].

The entries are sorted by label name first, and pairs with the same name are sorted by value. This allows us to run a binary search for the required label pair. Additionally, to get the possible values for a given label name, we can go to the first label pair that matches the name and iterate from there to get all the values. Hence this table replaces the Label Offset Table and Label Index i. This is another reason to store the actual strings here: faster access to label values.
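Both lookups described above can be sketched over an in-memory copy of the table. The `offsetEntry` type and the offsets in the example are made up for illustration; the real table is read from the index file.

```go
package main

import (
	"fmt"
	"sort"
)

// offsetEntry is an illustrative, decoded Postings Offset Table entry.
type offsetEntry struct {
	Name, Value string
	Offset      uint64 // where the matching Postings i starts in the file
}

// lookup binary-searches the sorted table for an exact label pair.
func lookup(table []offsetEntry, name, value string) (uint64, bool) {
	i := sort.Search(len(table), func(i int) bool {
		e := table[i]
		return e.Name > name || (e.Name == name && e.Value >= value)
	})
	if i < len(table) && table[i].Name == name && table[i].Value == value {
		return table[i].Offset, true
	}
	return 0, false
}

// labelValues finds the first entry with the given name and iterates from
// there, collecting every value for that name.
func labelValues(table []offsetEntry, name string) []string {
	i := sort.Search(len(table), func(i int) bool { return table[i].Name >= name })
	var values []string
	for ; i < len(table) && table[i].Name == name; i++ {
		values = append(values, table[i].Value)
	}
	return values
}

func main() {
	// Sorted by name, then value. Offsets are hypothetical.
	table := []offsetEntry{
		{"a", "b", 1000},
		{"x", "y1", 1020},
		{"x", "y2", 1032},
	}
	fmt.Println(lookup(table, "x", "y1")) // 1020 true
	fmt.Println(lookup(table, "a", "y1")) // 0 false
	fmt.Println(labelValues(table, "x"))  // [y1 y2]
}
```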

This postings list and postings offset table form the inverted index. For indexing documents using an inverted index, for every word, we store a list of documents that it appears in. Similarly here, for every label-value pair, we store the list of series that it appears in.

This marks the end of the giant index section. Here is the link for the upstream docs on the index format.

4. `tombstones`

Tombstones are deletion markers, i.e., they tell us what time ranges of which series to ignore during reads. This is the only file in the block which is created and modified after writing a block to store the delete requests.

This is how the file looks:

┌────────────────────────────┬─────────────────────┐
│ magic(0x0130BA30) <4b>     │ version(1) <1 byte> │
├────────────────────────────┴─────────────────────┤
│ ┌──────────────────────────────────────────────┐ │
│ │                Tombstone 1                   │ │
│ ├──────────────────────────────────────────────┤ │
│ │                      ...                     │ │
│ ├──────────────────────────────────────────────┤ │
│ │                Tombstone N                   │ │
│ ├──────────────────────────────────────────────┤ │
│ │                  CRC<4b>                     │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘

The magic number tells that this is a tombstones file (guess whose birthday is this number? hint: a Prometheus maintainer who implemented deletions in TSDB). The version tells us how to parse the file. It is followed by a sequence of tombstones which we will look at in just a second. The file ends with a checksum (CRC32) over all the tombstones.

Each individual tombstone looks like this

┌────────────────────────┬─────────────────┬─────────────────┐
│ series ref <uvarint64> │ mint <varint64> │ maxt <varint64> │
└────────────────────────┴─────────────────┴─────────────────┘

The first field is the series reference (aka series ID, aka a posting) to which this tombstone belongs. The range mint through maxt is the time range that the deletion refers to, hence we should skip that time range for the series mentioned by the series ref while reading the chunks. When a single series has multiple non-overlapping deleted time ranges, they result in more than one tombstone.
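How a reader might honour tombstones can be sketched as below. This is a simplified illustration with made-up types, assuming that both mint and maxt are inclusive; the real read path works on chunk iterators rather than on slices of samples.

```go
package main

import "fmt"

// interval is one deleted time range; both ends are treated as inclusive.
type interval struct{ Mint, Maxt int64 }

// sample is an illustrative timestamp/value pair.
type sample struct {
	T int64
	V float64
}

// deleted reports whether timestamp t falls in any of the intervals.
func deleted(t int64, ivs []interval) bool {
	for _, iv := range ivs {
		if t >= iv.Mint && t <= iv.Maxt {
			return true
		}
	}
	return false
}

// applyTombstones drops the samples of one series that fall inside the
// deleted ranges recorded for that series reference.
func applyTombstones(seriesRef uint64, samples []sample, tombstones map[uint64][]interval) []sample {
	ivs := tombstones[seriesRef]
	var kept []sample
	for _, s := range samples {
		if !deleted(s.T, ivs) {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	// Two non-overlapping deleted ranges for series 120, i.e. two tombstones.
	tombstones := map[uint64][]interval{
		120: {{Mint: 20, Maxt: 30}, {Mint: 50, Maxt: 60}},
	}
	samples := []sample{{10, 1}, {25, 2}, {40, 3}, {55, 4}}

	fmt.Println(applyTombstones(120, samples, tombstones))      // [{10 1} {40 3}]
	fmt.Println(len(applyTombstones(145, samples, tombstones))) // 4
}
```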

Here is the link for the upstream docs on the tombstones format.

Epilogue

In case of Head block, we have the inverted index in the memory along with the label name to possible values mapping efficiently stored in the memory.

In this blog post we have seen how a block looks on disk, especially the index, whose details form the majority of this post. You might have many questions: what is the use of those sections in the index, what role do they play during a query, what kind of queries are generally run on a block or the index, and so on.

Since this blog post is already too long, we will be looking at all of them in the next blog post where we will talk about queries.

Code reference

tsdb/block.go has the code for reading and writing the meta file. In general, this is the hub for all things persistent block.

tsdb/chunks/chunks.go has the code for reading and writing the files in the chunks directory.

tsdb/index/index.go has the code for reading and writing the index file.

tsdb/tombstones/tombstones.go has the code for reading and writing the tombstones file.

All these files point to the implementation of individual components of the block. We will see the code which brings all this together during the reading and writing of a block in the queries and compaction blog posts respectively.

Here is the entire Prometheus TSDB blog series

Prometheus TSDB (Part 3): Memory Mapping of Head Chunks from Disk

Fri, 02 Oct 2020 00:00:00 GMT

Introduction

In the Part 1 of the TSDB blog series I mentioned that once a chunk is "full", it is flushed to the disk and memory mapped. This helps in reducing the memory footprint of the Head block and also helps speed up the WAL replay that we discussed in Part 2. We will be diving deeper into how this is designed in Prometheus in this blog post.

As this is a part of the Prometheus TSDB blog series that I am writing, you are recommended to read the Part 1 to know where these memory mapped chunks fit into TSDB (or the Head block) and Part 2 to understand the WAL replay.

I have also given a KubeCon talk on this which explains this at a little higher level.

Writing these chunks

Recapping from Part 1, when a chunk is full, we cut a new chunk and the older chunks become immutable and can only be read from (the yellow block below).

And instead of storing it in memory, we flush it to disk and store a reference to access it later.

This flushed chunk is the memory-mapped chunk from disk. The immutability is the most important factor here else rewriting compressed chunks would have been too inefficient for every sample.

Format on disk

The format can also be found on GitHub.

The File

These chunks stay in their own directory called chunks_head and have a file sequence similar to the WAL (except that it starts with 1). For example:

data
├── chunks_head
|   ├── 000001
|   └── 000002
└── wal
    ├── checkpoint.000003
    |   ├── 000000
    |   └── 000001
    ├── 000004
    └── 000005

The max size of the file is kept at 128MiB. Now diving deeper into a single file, the file contains a header of 8B.

┌──────────────────────────────┐
│  magic(0x0130BC91) <4 byte>  │
├──────────────────────────────┤
│    version(1) <1 byte>       │
├──────────────────────────────┤
│    padding(0) <3 byte>       │
├──────────────────────────────┤
│ ┌──────────────────────────┐ │
│ │         Chunk 1          │ │
│ ├──────────────────────────┤ │
│ │          ...             │ │
│ ├──────────────────────────┤ │
│ │         Chunk N          │ │
│ └──────────────────────────┘ │
└──────────────────────────────┘

Magic Number is any number that can uniquely identify the file as a memory-mapped head chunks file. As I implemented this feature, I set it to my birth date :). The version tells us how to decode the chunks in the file. The extra padding is to allow any future header options that we might require.

After the file header, what follows is chunks.

Chunks

A single chunk looks like this

┌─────────────────────┬───────────────────────┬───────────────────────┬───────────────────┬───────────────┬──────────────┬────────────────┐
│ series ref <8 byte> │ mint <8 byte, uint64> │ maxt <8 byte, uint64> │ encoding <1 byte> │ len <uvarint> │ data <bytes> │ CRC32 <4 byte> │
└─────────────────────┴───────────────────────┴───────────────────────┴───────────────────┴───────────────┴──────────────┴────────────────┘

The series ref is the same series reference that we talked about in Part 2, it is the series id used to access the series in the memory. The mint and maxt are the minimum and maximum timestamp seen in the samples of the chunk. encoding is the encoding used to compress the chunks. len is the number of bytes that follow from here and data are the actual bytes of the compressed chunk.

CRC32 is the checksum of the above content of the chunk used to check the integrity of the data.

Reading these chunks

For every chunk, the Head block stores the mint and maxt of that chunk along with a reference in the memory to access it.

The reference is 8 bytes long. The first 4 bytes tell the file number in which the chunk exists, and the last 4 bytes tell the offset in the file where the chunk starts (i.e. the first byte of the series ref). If the chunk was in the file 00093 and the series ref starts at byte offset 1234 in the file, then the reference of that chunk would be (93 << 32) | 1234 (left shift bits and then bitwise OR).

We store the mint and maxt in Head so that we can select the chunk without having to look at the disk. When we do have to access the chunk, we only access the encoding and the chunk data using the reference.

In the code, each file looks like yet another byte slice (one slice per file), and we access the slice at some index to get the chunk data, while the OS maps the slice in memory to the disk under the hood. Memory-mapping from disk is an OS feature which fetches into memory only the part of the file that is being accessed, not the entire file.

Replaying on startup

In Part 2 we talked about WAL replay where we replay each individual sample to re-create the compressed chunk. Now that we have the compressed full chunks on disk, we don't need to recreate these chunks, although we still need to create chunks from the WAL for those that were not full. With these memory-mapped chunks from disk, the replay happens as follows.

At startup, first we iterate through all the chunks in the chunks_head directory and build a map of series ref -> [list of chunk references along with mint and maxt belonging to this series ref] in the memory.

We then continue with the WAL replay as described in Part 2 but with a few modifications:

When we come across the Series record, after creation of the series, we look for the series reference in the above map and if any memory-mapped chunks exist, we attach that list to this series.
When we come across the Samples record, if the corresponding series for the sample has any memory-mapped chunks and if the sample falls into the time ranges that it covers, then we skip the sample. If it does not, then we ingest that sample into the Head block.
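The skip decision for a Samples record can be sketched as follows. This is a literal, simplified reading of the rule above with hypothetical types; it is not the actual replay code in Prometheus.

```go
package main

import "fmt"

// mmappedChunk is what we remember per chunk after iterating chunks_head:
// its reference and the time range it covers (illustrative type).
type mmappedChunk struct {
	Ref              uint64
	MinTime, MaxTime int64
}

// coveredByMmappedChunks reports whether a WAL sample with timestamp t
// already lives in one of the series' m-mapped chunks and can be skipped.
func coveredByMmappedChunks(chunks []mmappedChunk, t int64) bool {
	for _, c := range chunks {
		if t >= c.MinTime && t <= c.MaxTime {
			return true
		}
	}
	return false
}

func main() {
	// series ref -> m-mapped chunks, built while iterating chunks_head.
	bySeries := map[uint64][]mmappedChunk{
		7: {
			{Ref: 1<<32 | 8, MinTime: 1000, MaxTime: 1999},
			{Ref: 1<<32 | 200, MinTime: 2000, MaxTime: 2999},
		},
	}
	for _, t := range []int64{1500, 2999, 3000} {
		fmt.Println(t, "skip:", coveredByMmappedChunks(bySeries[7], t))
	}
	// A series with no m-mapped chunks never skips samples.
	fmt.Println("skip:", coveredByMmappedChunks(bySeries[9], 1500))
}
```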

Enhancements that this brings in

What's the use of this additional complexity while we could get away with storing chunks in the memory and the WAL? This feature was added recently in 2020, so let's see what this brings in. (You can see some benchmark graphs in this Grafana Labs blog post)

Memory savings

If you had to store the chunk in the memory, it can take anywhere between 120 to 200 bytes (or even more depending on compressibility of the samples). Now this is replaced with 24 bytes - 8 bytes each of chunks reference, min time, and max time of the chunk.

While this may sound like 80-90% reduction in memory, the reality is different. There are more things that the Head needs to store, like the in-memory index, all the symbols (label values), etc, and other parts of TSDB that take some memory.

In the real world, we can see a 15-50% reduction in the memory footprint depending on the rate at which samples are being scraped and the rate at which new series are being created (called "churn"). Another thing to note is that, if you are running some queries which touch a lot of these chunks on disk, then they need to be loaded into the memory to be processed. So it's not an absolute reduction in peak memory usage.

Faster startup

The WAL replay is the slowest part of startup. Mainly, (1) decoding of WAL records from disk and (2) rebuilding the compressed chunks from individual samples, are the slow parts in the replay. The iteration of memory-mapped chunks is relatively fast.

We cannot avoid decoding of records as we need to check all the records. As you saw above in the replay, we are skipping the samples which are in the memory-mapped chunks range. Here we avoid re-creating those full compressed chunks, hence save some time in the replay. It has been seen to reduce the startup time by 15-30%.

Garbage collection

The garbage collection in memory happens during the Head truncation, where it just drops the references to the chunks which are older than the truncation time T. But the files are still present on the disk. As with WAL segments, we also need to delete old m-mapped files regularly.

For every memory-mapped chunk file present (which means also open in TSDB), we store in the memory the absolute maximum time among all the chunks present in the file. For the live file (the one in which we are currently writing the chunks), we update this maximum time in the memory as and when we are adding new chunks. During a restart, as we iterate all the memory-mapped chunks, we restore the maximum time of the files in the memory there.

So when the Head truncation is happening for data before time T, we call truncation on these files for time T. The files whose maximum time is below T (except the live file) are deleted at this point while preserving the sequence (if the files were 5, 6, 7, 8 and only files 5 and 7 had a maximum time below T, only 5 is deleted and the remaining sequence would be 6, 7, 8).
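The file selection can be sketched like this, using a hypothetical `chunkFile` type that only tracks the sequence number and the maximum time of each file. The last file in the list stands for the live file.

```go
package main

import "fmt"

// chunkFile is an illustrative record of one file in chunks_head.
type chunkFile struct {
	Seq     int   // file sequence number
	MaxTime int64 // maximum time among all chunks in the file
}

// filesToDelete returns the sequence numbers that can be removed for a
// truncation at time t. files must be ordered by Seq, with the live file last.
// Deletion stops at the first file that must be kept, so the remaining
// sequence has no gaps, and the live file is never deleted.
func filesToDelete(files []chunkFile, t int64) []int {
	var del []int
	for i, f := range files {
		if i == len(files)-1 || f.MaxTime >= t {
			break
		}
		del = append(del, f.Seq)
	}
	return del
}

func main() {
	files := []chunkFile{
		{Seq: 5, MaxTime: 90},
		{Seq: 6, MaxTime: 150},
		{Seq: 7, MaxTime: 95}, // below T, but kept because 6 is kept
		{Seq: 8, MaxTime: 200},
	}
	fmt.Println(filesToDelete(files, 100)) // [5]
}
```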

After truncation, we close the live file and start a new one because in low volume and small setups, it might take a lot of time to reach the max size of the file. So rotating the files here will help deletion of old chunks during the next truncation.

Code reference

tsdb/chunks/head_chunks.go has the entire implementation of writing chunks to disk, accessing them using a reference, truncation, handling the files, and a way to iterate over the chunks.

tsdb/head.go uses the above as a black box to memory-map its chunks from disk.

Here is the entire Prometheus TSDB blog series

Prometheus TSDB (Part 2): WAL and Checkpoint

Sat, 26 Sep 2020 00:00:00 GMT

Introduction

In Part 1 of the TSDB blog series I mentioned that we write the incoming samples into the Write-Ahead-Log (WAL) first for durability and that when this WAL is truncated, a checkpoint is created. In this blog post, we will briefly discuss the basics of WAL and then dive into how WAL and checkpoints are designed in Prometheus' TSDB.

As this is a part of the Prometheus TSDB blog series that I am writing, I recommend reading Part 1 to know where the WAL fits into the TSDB.

WAL Basics

WAL is a sequential log of events that occur in a database. Before writing/modifying/deleting the data in the database, the event is first recorded (appended) into the WAL and then the necessary operations are performed in the database.

For whatever reason if the machine or the program decides to crash, you have the events recorded in this WAL which you can replay back in the same order to restore the data. This is particularly useful for in-memory databases where if the database crashes, the entire data in the memory is lost if not for WAL.

This is widely used in relational databases to provide durability (D from ACID) for the database. Similarly, Prometheus has a WAL to provide durability for its Head block. Prometheus also uses WAL for graceful restarts to restore the in-memory state.

In the context of Prometheus, WAL is only used to record the events and restore the in-memory state when starting up. It is not involved in read or write operations in any other way.

Writing to WAL in Prometheus TSDB

Types of records

The write request in TSDB consists of label values of the series and their corresponding samples. This gives us two types of records, Series and Samples.

The Series record consists of the label values of all the series in the write request. The creation of a series yields a unique reference which can be used to look up the series. Hence the Samples record contains the reference of the corresponding series and the list of samples that belong to that series in the write request.

The last type of record is Tombstones used for delete requests. It contains the deleted series reference with time ranges to delete.

The format of these records can be found here; we won't be discussing them in this blog post.

Writing them

The Samples record is written for all write requests that contain a sample. The Series record is written only once for a series when we see it for the first time (hence "create" it in the Head).

If a write request contains a new series, the Series record is always written before the Samples record, else during replay the series reference in the Samples record won't point to any series if the Samples record is placed before Series.

The Series record is written after creation of the series in the Head to also store the reference in the record, while Samples record is written before adding samples to the Head.

Only one Series and Samples record is written per write request by grouping all the different time series (and samples of different time series) in the same record. If the series for all the samples in the request already exist in the Head, only a Samples record is written into the WAL.

When we receive a delete request, we don't immediately delete it from the memory. We store something called "tombstones" which indicates the deleted series and time range of deletion. We write a Tombstones record into the WAL before processing the delete request.
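To make the ordering rules for write requests concrete, here is a small sketch. The function `recordsFor` and the string-based series identity are hypothetical and only illustrate which records are written and in what order; the real logic lives in tsdb/head.go.

```go
package main

import "fmt"

// recordsFor lists, in order, the WAL record types written for one write
// request. known holds the series that already exist in the Head and is
// updated when the request creates new series.
func recordsFor(requestSeries []string, known map[string]bool) []string {
	hasNew := false
	for _, s := range requestSeries {
		if !known[s] {
			known[s] = true
			hasNew = true
		}
	}

	var recs []string
	if hasNew {
		// A single Series record groups all the new series of the request
		// and always comes before the Samples record.
		recs = append(recs, "Series")
	}
	if len(requestSeries) > 0 {
		recs = append(recs, "Samples")
	}
	return recs
}

func main() {
	known := map[string]bool{}
	fmt.Println(recordsFor([]string{"a", "b"}, known)) // [Series Samples]
	fmt.Println(recordsFor([]string{"a"}, known))      // [Samples]
}
```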

How it looks on disk

The WAL is stored as a sequence of numbered files with 128MiB each by default. A WAL file here is called a "segment".

data
└── wal
    ├── 000000
    ├── 000001
    └── 000002

The size of a file is bounded to make garbage collection of old files simpler. As you can guess, the sequence number always increases.

WAL truncation and Checkpointing

We need to regularly delete the old WAL segments, else the disk will eventually fill up and the startup of TSDB will take a lot of time as it has to replay all the events in this WAL (where most of it will be discarded because it’s old). In general, you want to get rid of any data that is no longer needed.

WAL truncation

The WAL truncation is done just after the Head block is truncated (see Part 1 for Head truncation). The files cannot be deleted at random; the deletion happens for the first N files so that no gap is created in the sequence.

Because the write requests can be random, it is not easy or efficient to determine the time range of the samples in a WAL segment without going through all the records. So we delete the first 2/3rd of the segments.

data
└── wal
    ├── 000000
    ├── 000001
    ├── 000002
    ├── 000003
    ├── 000004
    └── 000005

In the above example, the files 000000 000001 000002 000003 will be deleted.
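As a rough sketch of the arithmetic, assuming plain integer division over the number of segments (the exact rounding in Prometheus may differ, and `segmentsToTruncate` is a name made up for this example):

```go
package main

import "fmt"

// segmentsToTruncate returns the leading two-thirds of the given WAL
// segment numbers, which are assumed to be sorted and contiguous.
func segmentsToTruncate(segments []int) []int {
	n := len(segments) * 2 / 3
	return segments[:n]
}

func main() {
	// Six segments: the first four are picked, as in the example above.
	fmt.Println(segmentsToTruncate([]int{0, 1, 2, 3, 4, 5})) // [0 1 2 3]
}
```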

There is one catch here: the series records are written only once, so if you blindly delete the WAL segments, you will lose those records and hence can't restore those series on startup. Also, there might be samples in those first 2/3rd segments which are not truncated from the Head yet, hence you lose them too. This is where checkpoints come into picture.

Checkpointing

Before truncating the WAL, we create a "checkpoint" from the WAL segments that are to be deleted. You can consider a checkpoint as a filtered WAL. Suppose the truncation of the Head is happening for data before time T. Taking the above example of the WAL layout, the checkpointing operation will go through all the records in 000000 000001 000002 000003 in order and:

Drop all the series records for series which are no longer in the Head.
Drop all the samples which are before time T.
Drop all the tombstone records for time ranges before T.
Retain the remaining series, samples and tombstone records in the same order as they appear in the WAL.

The drop operation can also be a re-write operation while dropping the unnecessary items from the record (because a single record can contain more than one series, sample or tombstone).
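The filtering can be sketched like this. The types below are simplified, hypothetical stand-ins for the real records in tsdb/record, and the three helpers only show the "re-write while dropping" idea; this is not the code from tsdb/wal/checkpoint.go.

```go
package main

import "fmt"

// Simplified, hypothetical record contents.
type Sample struct {
	Ref uint64
	T   int64
}

type Interval struct{ Mint, Maxt int64 }

type Tombstone struct {
	Ref       uint64
	Intervals []Interval
}

// filterSeries keeps only the series references that are still in the Head.
func filterSeries(refs []uint64, inHead map[uint64]bool) []uint64 {
	var kept []uint64
	for _, r := range refs {
		if inHead[r] {
			kept = append(kept, r)
		}
	}
	return kept
}

// filterSamples drops the samples that are before time t.
func filterSamples(samples []Sample, t int64) []Sample {
	var kept []Sample
	for _, s := range samples {
		if s.T >= t {
			kept = append(kept, s)
		}
	}
	return kept
}

// filterTombstones drops the time ranges that end before time t, and
// drops a tombstone entirely when none of its ranges remain.
func filterTombstones(stones []Tombstone, t int64) []Tombstone {
	var kept []Tombstone
	for _, ts := range stones {
		var ivs []Interval
		for _, iv := range ts.Intervals {
			if iv.Maxt >= t {
				ivs = append(ivs, iv)
			}
		}
		if len(ivs) > 0 {
			kept = append(kept, Tombstone{Ref: ts.Ref, Intervals: ivs})
		}
	}
	return kept
}

func main() {
	samples := []Sample{{Ref: 1, T: 5}, {Ref: 1, T: 10}, {Ref: 2, T: 20}}
	fmt.Println(len(filterSamples(samples, 10))) // 2
}
```

A record that becomes empty after filtering would not be written to the checkpoint at all; the order of the surviving records stays untouched.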

This way you won't lose the series, samples and tombstones which are still in the Head. The checkpoint is named checkpoint.X, where X is the last segment number on which the checkpoint was created (000003 here; you will see why we do it like this in the next section).

After the WAL truncation and checkpointing, the files on disk look something like this (checkpoint looks like yet another WAL):

data
└── wal
    ├── checkpoint.000003
    |   ├── 000000
    |   └── 000001
    ├── 000004
    └── 000005

If there were any older checkpoints, they are deleted at this point.

Replaying the WAL

We first iterate over the records in order from the last checkpoint (the checkpoint with the biggest number associated with it is the last). For checkpoint.X, X tells us from which WAL segment we need to continue the replay, and that is X+1. So in the above example, after replaying checkpoint.000003, we continue the replay from WAL segment 000004.
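Picking the last checkpoint and the segment to continue from can be sketched as below. The function `lastCheckpoint` and working on a plain list of directory entry names are assumptions for illustration; the real code reads the wal directory itself.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// lastCheckpoint returns the checkpoint with the biggest number among the
// given directory entries, and the WAL segment from which the replay
// continues after it (X+1 for checkpoint.X).
func lastCheckpoint(names []string) (name string, startSegment int, ok bool) {
	const prefix = "checkpoint."
	best := -1
	for _, n := range names {
		if !strings.HasPrefix(n, prefix) {
			continue
		}
		idx, err := strconv.Atoi(strings.TrimPrefix(n, prefix))
		if err != nil {
			continue // not a well-formed checkpoint name
		}
		if idx > best {
			best, name = idx, n
		}
	}
	if best < 0 {
		return "", 0, false
	}
	return name, best + 1, true
}

func main() {
	name, start, _ := lastCheckpoint([]string{"checkpoint.000003", "000004", "000005"})
	fmt.Println(name, start) // checkpoint.000003 4
}
```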

You might be wondering why we need to track the segment number in the checkpoint when we delete the WAL segments before it anyway. The thing is, creation of a checkpoint and deletion of WAL segments are not atomic. Anything can happen in between and prevent the deletion of the WAL segments. Without the segment number, we would then have to replay the additional 2/3rd of the WAL segments which should have been deleted, making the replay slower.

Talking about individual records, the following actions are taken on them:

Series: Create the series in the Head with the same reference as mentioned in the record (so that we can match the samples later). There could be multiple series records for the same series which is handled by Prometheus by mapping the references.
Samples: Add samples from this record to the Head. The reference in the record indicates which series to add to. If no series is found for the reference, the sample is skipped.
Tombstones: Store those tombstones back in Head by using the reference to identify the series.
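Here is a minimal sketch of the Series and Samples handling during replay. The `head` type, its maps and the string label sets are hypothetical simplifications; the actual replay code in tsdb/head.go is more involved.

```go
package main

import "fmt"

// head is a hypothetical, stripped-down stand-in for the Head block.
type head struct {
	byLabels map[string]uint64  // label set -> reference of the series in the Head
	samples  map[uint64][]int64 // series reference -> replayed sample timestamps
	refMap   map[uint64]uint64  // reference from a duplicate Series record -> reference in the Head
}

func newHead() *head {
	return &head{
		byLabels: map[string]uint64{},
		samples:  map[uint64][]int64{},
		refMap:   map[uint64]uint64{},
	}
}

// replaySeries creates the series with the reference from the record, or
// remembers a mapping if the series already exists under another reference.
func (h *head) replaySeries(ref uint64, labels string) {
	if existing, ok := h.byLabels[labels]; ok {
		if existing != ref {
			h.refMap[ref] = existing
		}
		return
	}
	h.byLabels[labels] = ref
	h.samples[ref] = nil
}

// replaySample adds the sample to its series and reports false if the
// sample was skipped because no series matches the reference.
func (h *head) replaySample(ref uint64, t int64) bool {
	if mapped, ok := h.refMap[ref]; ok {
		ref = mapped
	}
	if _, ok := h.samples[ref]; !ok {
		return false
	}
	h.samples[ref] = append(h.samples[ref], t)
	return true
}

func main() {
	h := newHead()
	h.replaySeries(1, `up{job="a"}`)
	h.replaySeries(7, `up{job="a"}`) // duplicate record for the same series
	fmt.Println(h.replaySample(7, 100), h.replaySample(9, 100)) // true false
}
```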

Low level details of writing to and reading from WAL

When the write requests are coming in at a high volume, you want to avoid writing to disk randomly, as that causes write amplification. Additionally, when you are reading a record, you want to be sure that it is not corrupted (easily possible on an abrupt shutdown or with a faulty disk).

Prometheus has a general implementation of WAL where a record is just a slice of bytes and the caller has to take care of encoding the record. To solve the above two problems, the WAL package does the following:

Data is written to the disk one page at a time. One page is 32KiB long. If the record is bigger than 32KiB, it is broken down into smaller pieces with each piece receiving a WAL record header for some bookkeeping to know if the piece is the end of record, or the start, or in the middle (A record receives a WAL record header even if it fits in the page).
A checksum of the record is appended at the end to detect any corruption while reading.

The WAL package takes care of seamlessly joining the pieces of records and checks the checksum of the record while iterating through the records for replay.
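The way a record is broken into pieces across pages can be sketched like this. It only tracks sizes and piece types; the `piece` type, the `split` function and the 7-byte header size are assumptions for this sketch, and checksums are left out entirely.

```go
package main

import "fmt"

const (
	pageSize   = 32 * 1024 // one page is 32KiB
	headerSize = 7         // assumed size of the per-piece WAL record header
)

// piece is one part of a record as it would be laid out inside a page.
type piece struct {
	kind string // "full", "first", "middle" or "last"
	size int    // number of data bytes in this piece
}

// split breaks a record of recLen bytes into pieces, given the number of
// bytes that are still free in the current page.
func split(recLen, pageLeft int) []piece {
	var pieces []piece
	for first := true; recLen > 0; first = false {
		if pageLeft <= headerSize {
			pageLeft = pageSize // no room for data: continue on a fresh page
		}
		n := pageLeft - headerSize
		if n > recLen {
			n = recLen
		}
		recLen -= n
		pageLeft -= headerSize + n

		kind := "middle"
		switch {
		case first && recLen == 0:
			kind = "full"
		case first:
			kind = "first"
		case recLen == 0:
			kind = "last"
		}
		pieces = append(pieces, piece{kind: kind, size: n})
	}
	return pieces
}

func main() {
	for _, p := range split(70000, pageSize) {
		fmt.Println(p.kind, p.size)
	}
	// first 32761
	// middle 32761
	// last 4478
}
```

While reading, the pieces are joined back in the opposite direction: a "first" piece is followed by zero or more "middle" pieces until a "last" piece completes the record.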

The WAL records by themselves are not heavily compressed (or compressed at all). So the WAL package gives an option to compress the records using Snappy (enabled by default now). Whether a record is compressed is stored in the WAL record header, so compressed and uncompressed records can live together if you plan to enable or disable compression.

Code reference

The WAL implementation which takes a record as a slice of bytes and does the low level disk interactions is present in tsdb/wal/wal.go. This file has the implementation for both writing the byte records and iterating over the records (again as slices of bytes).

tsdb/record/record.go contains the various records with their encoding and decoding logic.

The checkpointing logic is present in tsdb/wal/checkpoint.go.

tsdb/head.go contains the remaining:

Creating and encoding the records and calling the WAL write.
Calling the checkpointing and WAL truncation.
Replaying the WAL records, decoding them and restoring the in-memory state.

Here is the entire Prometheus TSDB blog series

Prometheus TSDB (Part 1): The Head Block

Sat, 19 Sep 2020 00:00:00 GMT

Introduction

Though Prometheus 2.0 was launched about 3 years ago, there are not many resources to understand its TSDB other than Fabian's blog post, which is very high level, and the docs on formats, which are more like a developer reference.

Prometheus' TSDB has been attracting lots of new contributors lately, and understanding it has been one of the pain points due to the lack of resources. So, I plan to discuss the working of TSDB in detail in a series of blog posts, along with some references to the code for the contributors.

In this blog post, I mainly talk about the in-memory part of the TSDB, the Head block. In future blog posts I will dive deeper into other components like the WAL and its checkpointing, how the memory-mapping of chunks is designed, compaction, the persistent blocks and their index, and the upcoming snapshotting of chunks.

Prologue

Fabian's blog post is a good read to understand the data model, core concepts, and the high level picture of how the TSDB is designed. He also gave a talk at PromCon 2017 on this. I recommend reading the blog post or watching the talk before you dive into this one to set a good base.

All of what I explain in this blog post about the lifecycle of a sample in Head is also explained in my KubeCon talk if you prefer that.

Small Overview of TSDB

In the figure above, the Head block is the in-memory part of the database and the grey blocks are persistent blocks on disk which are immutable. We have a Write-Ahead-Log (WAL) for durable writes. An incoming sample (the pink box) first goes into the Head block and stays in memory for a while; it is then flushed to disk and memory-mapped (the blue box). When these memory-mapped chunks or the in-memory chunks get old to a certain point, they are flushed to disk as persistent blocks. Further, multiple blocks are merged as they get old, and they are finally deleted after they go beyond the retention period.

Life of a Sample in the Head

All the discussions here are about a single time series and the same applies to all the series.

The samples are stored in compressed units called a "chunk". When a sample is incoming, it is ingested into the "active chunk" (the red block). It is the only unit where we can actively write data.

While committing the sample into the chunk, we also record it in the Write-Ahead-Log (WAL) on disk (the brown block) for durability (which means we can recover the in-memory data from that even if the machine crashes abruptly). I will write a separate blog post about how WAL is handled in Prometheus.

Once the chunk fills up to 120 samples or spans up to the chunk/block range (let's call it chunkRange), which is 2h by default, a new chunk is cut and the old chunk is said to be "full". For this blog post, we will consider the scrape interval to be 15s, so 120 samples (a full chunk) would span 30m.
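A simplified sketch of the cut condition and of the 30m arithmetic follows. The function `shouldCut` is hypothetical and measures the span from the first sample of the chunk, which is a simplification of what the Head actually does.

```go
package main

import (
	"fmt"
	"time"
)

const samplesPerChunk = 120

// shouldCut reports whether a new chunk should be cut before appending a
// sample with timestamp t. Timestamps and chunkRange are in milliseconds.
func shouldCut(numSamples int, chunkMinT, t, chunkRange int64) bool {
	return numSamples >= samplesPerChunk || t-chunkMinT >= chunkRange
}

func main() {
	scrapeInterval := 15 * time.Second
	fmt.Println(time.Duration(samplesPerChunk) * scrapeInterval) // 30m0s

	chunkRange := (2 * time.Hour).Milliseconds()
	fmt.Println(shouldCut(120, 0, 1_800_000, chunkRange)) // true
	fmt.Println(shouldCut(40, 0, 600_000, chunkRange))    // false
}
```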

The yellow block with number 1 on it is the full chunk which just got filled while the red chunk is the new chunk that was created.

Since Prometheus v2.19.0, we are not storing all the chunks in the memory. As soon as a new chunk is cut, the full chunk is flushed to the disk and memory-mapped from the disk while only storing a reference in the memory. With memory-mapping, we can dynamically load the chunk into the memory with that reference when needed; it's a feature provided by the Operating System.

Similarly, as new samples keep coming in, new chunks are cut.

And they are flushed to the disk and memory-mapped.

After some time the Head block would look like above. If we consider the red chunk to be almost full, then we have 3h of data in Head (6 chunks spanning 30m each). That is chunkRange*3/2.

When the data in the Head spans chunkRange*3/2, the first chunkRange of data (2h here) is compacted into a persistent block. If you noticed above, the WAL is truncated at this point and a "checkpoint" is created (not shown in the diagram). I will be going into the details of this checkpointing, WAL truncation, compaction, and the persistent block and its index in future blog posts.
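The trigger for Head compaction boils down to one comparison. Below is a minimal sketch with a hypothetical function name; timestamps are in milliseconds, and the alignment of block boundaries is ignored.

```go
package main

import (
	"fmt"
	"time"
)

// headCompactable reports whether the data in the Head spans more than
// chunkRange*3/2, which is when the first chunkRange of data is compacted.
func headCompactable(minT, maxT, chunkRange int64) bool {
	return maxT-minT > chunkRange/2*3
}

func main() {
	chunkRange := (2 * time.Hour).Milliseconds()
	threeHours := (3 * time.Hour).Milliseconds()

	fmt.Println(headCompactable(0, chunkRange, chunkRange))   // false
	fmt.Println(headCompactable(0, threeHours+1, chunkRange)) // true
}
```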

This cycle of ingestion of samples, memory-mapping, compaction to form a persistent block, continues. And this forms the basic functionality of the Head block.

Few more things to note/understand

Where is the index?

It is in the memory and stored as an inverted index. More about the overall idea of this index is in Fabian's blog post. When the compaction of Head block occurs creating a persistent block, Head block is truncated to remove old chunks and garbage collection is done on this index to remove any series entries that do not exist anymore in the Head.

Handling Restarts

In case the TSDB has to restart (gracefully or abruptly), it uses the on-disk memory-mapped chunks and the WAL to replay the data and events and reconstruct the in-memory index and chunks.

Code reference

tsdb/db.go coordinates the overall functioning of the TSDB.

For the parts relevant in the blog post, the core logic of ingestion for the in-memory chunks is all in tsdb/head.go which uses WAL and memory mapping as a black box.

Here is the entire Prometheus TSDB blog series

“Optimisations” to Avoid

Sun, 06 Sep 2020 00:00:00 GMT

Introduction

During the transition from writing code for University assignments to writing real-world software, I had to unlearn many things and learn not to overuse low-level optimisations where they were not required. I had a big exposure to compiler optimisations at my University.

I am going to share a couple of “optimisations” to avoid that I often see newbies do, and I won't deny that I have done them myself (and still do). I have that in quotes because they are not really optimisations when all factors are considered. I will follow up with more blog posts when I have more patterns to share.

As I work mostly on Go, some terms that I use will be Go specific (for example package).

The Optimisations

"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." — Donald Knuth

1. Not breaking a function into smaller logical functions

...to avoid more function calls and to maximise the reuse of variables; at least that’s what I thought when I used to keep a big function.

While calling many different methods has a performance overhead in terms of copying function arguments and machine instruction indirection, which might lead to cache misses, it is often too tiny to even notice. Smaller isolated logic is easier to understand, saves time in reviews and debugging, and is easier to change.

2. Using globals when not necessary

I still remember the time when I used globals to “efficiently” pass data between functions during one of my internships and my manager was visibly annoyed. The fix was to pass those globals as function arguments from the origin.

While it might seem that globals simplify your code, when used incorrectly they can often bring a lot of problems (especially when the package is used in more than one place) and make it hard to understand the behaviour. Use globals only for constants and, in rare cases, for config variables which cannot be passed in other ways.

If the function argument to be passed is large, then pass it as a reference (with proper care to not change the data where not intended).
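A small Go sketch of what that looks like, with a made-up `Config` type: the large value is passed explicitly as a pointer instead of living in a global, and the function only reads from it.

```go
package main

import "fmt"

// Config is a hypothetical large struct that we do not want to copy around.
type Config struct {
	Name    string
	Targets [1024]string
}

// describe takes a pointer so that the large struct is not copied on every
// call. It only reads from cfg and never modifies it.
func describe(cfg *Config) string {
	return fmt.Sprintf("%s has %d target slots", cfg.Name, len(cfg.Targets))
}

func main() {
	cfg := &Config{Name: "scraper"}
	fmt.Println(describe(cfg)) // scraper has 1024 target slots
}
```

The dependency of `describe` on the config is visible in its signature, which is exactly what a global would hide.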

3. Don’t spend a lot of time optimising

This one is repeating the above quote which I came across midway writing this blog post: don’t spend a lot of time optimising the code which is not on the hot path and that is not run often. Focus on readability and evolvability. The time spent on such optimisations often negates the benefits (if any).

Epilogue

Keep it simple. It is usually the case that someone else is going to touch (and also maintain) the code that you wrote. That's one of the biggest lessons I've learnt since I started "real world" coding where you often collaborate.

Google Summer of Code 2018 with Prometheus

Wed, 22 Aug 2018 00:00:00 GMT

Introduction

I successfully completed Google Summer of Code with Prometheus in the summer of 2018. I was mentored by Goutham Veeramachaneni.

I did 3 independent additions/fixes as a part of my GSoC, all related to rules/alerting rules. Apart from my proposal, I also fixed some bugs in prometheus/tsdb during the GSoC period.

Fixes

1) Persist `for` State of Alerts

Prometheus had 1 serious long-standing issue: if the Prometheus server crashes, the state of the alerts is lost.

Consider that you have an alert with a `for` duration of 24hrs, and Prometheus crashed while that alert has been active for 23hrs, i.e. 1hr before it would fire. Now when Prometheus is started again, it would have to wait for 24hrs again before firing!


Features

2) Unit Testing for Rules

Alerting is an important feature in monitoring when it comes to maintaining site reliability, and Prometheus is being used widely for this. We also record many rules to visualise later. Hence it becomes very important to be able to check the correctness of the rules.

In this feature, I added the support of unit testing for both alerting and recording rules.


3) UI for Testing Alerting Rules

As you saw above, alerting rules are important for monitoring, yet Prometheus lacks any good and convenient way of visualising and testing alerting rules before they are used.

In this feature I add a UI for entering your alerting rules and testing+visualising them on the real data that is there in your server.


Epilogue

This work would not have been possible without the valuable inputs and reviews by Brian Brazil and Julius Volz.

I gave a lightning talk at PromCon 2018 regarding all that you read above. It was held in Munich, Germany.

1) Persist `for` State of Alerts

Introduction

This happens to be the first issue I fixed during my GSoC. You can find the PR#4061 here, which is already merged into Prometheus master and is available from v2.4.0 onwards.

This post assumes that you have a basic understanding of what monitoring is and how alerting is related to it. If you are new to this world, this post should help you get started.

Issue

To talk about alerting in Prometheus in layman's terms, an alerting rule consists of a condition, a `for` duration, and a blackbox to handle the alert. So the simple trick here is: if the condition is true for the `for` duration, we trigger an alert (called 'firing' of the alert) and give it to the blackbox to handle in the way it wants, which can be sending a mail, a message in Slack, etc.

As discussed here, consider that you have an alert with a `for` duration of 24hrs, and Prometheus crashed while that alert has been active (condition is true) for 23hrs, i.e. 1hr before it would fire. Now when Prometheus is started again, it would have to wait for 24hrs again before firing!

You can find the GitHub issue #422 here.

The Fix

Use time series to store the state! The procedure is something like this:

During every evaluation of alerting rules, we record the ActiveAt (when the condition became true for the first time) of every alert in a time series with the name ALERTS_FOR_STATE, with all the labels of that alert. This is like any other time series, but it is only stored in local storage.
When Prometheus is restarted, a job runs for restoring the state of alerts after the second evaluation. We wait till the second evaluation so that we have enough data scraped to know the current active alerts.
For each alert which is active right now, the job looks for its corresponding ALERTS_FOR_STATE time series. The timestamp of the last sample of the series tells us when Prometheus went down, and its value tells us the ActiveAt of the alert.
So if the `for` duration was, say, D, the alert became active at X, and Prometheus crashed at Y, then the alert has to wait for D-(Y-X) more (Why? Think!). So the variables of the alert are adjusted to make it wait for D-(Y-X) more before firing, and not D.

Things to keep in mind

rules.alert.for-outage-tolerance | default=1h

This flag specifies how long a downtime Prometheus will tolerate. If Prometheus has been down longer than the time set in this flag, then the state of the alerts is not restored. So make sure to either change the value of the flag depending on your need or get Prometheus up soon!

rules.alert.for-grace-period | default=10m

We would not like to fire an alert just after Prometheus is up. So we introduce something called "grace period", where if D-(Y-X) happens to be less than rules.alert.for-grace-period, then we wait for the grace period duration before firing the alert.

Note: We follow this logic only if the for duration was itself ≥ rules.alert.for-grace-period.
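Putting the formula and the two flags together, the waiting time after a restart can be sketched as below. The function names are made up for this example, the sketch assumes the alert was still pending when Prometheus went down (Y-X < D), and it is not the code from the PR.

```go
package main

import (
	"fmt"
	"time"
)

// restorable reports whether the state is restored at all: Prometheus must
// not have been down for longer than the outage tolerance.
func restorable(downFor, outageTolerance time.Duration) bool {
	return downFor <= outageTolerance
}

// remainingPending returns how long a restored alert keeps waiting before
// firing. forDur is D and pendingBeforeDown is Y-X from the text above.
func remainingPending(forDur, pendingBeforeDown, gracePeriod time.Duration) time.Duration {
	remaining := forDur - pendingBeforeDown // D-(Y-X)
	// The grace period applies only if the for duration itself is at least
	// as long as the grace period.
	if forDur >= gracePeriod && remaining < gracePeriod {
		return gracePeriod
	}
	return remaining
}

func main() {
	grace := 10 * time.Minute
	fmt.Println(remainingPending(24*time.Hour, 23*time.Hour, grace))                // 1h0m0s
	fmt.Println(remainingPending(24*time.Hour, 23*time.Hour+55*time.Minute, grace)) // 10m0s
	fmt.Println(restorable(2*time.Hour, time.Hour))                                 // false
}
```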

Gotchas

As the ALERTS_FOR_STATE series is stored in local storage, if you happen to lose the local TSDB data while Prometheus is down, then you lose the state of the alert permanently.

2) Unit Testing for Rules

Introduction

It is always good to do 1 last check of all the components of your code before you deploy it. We have seen how important alerting and recording is in the monitoring world. So why not test even the alerting and recording rules?

This was proposed long back in this GitHub issue #1695, and I worked on this during my GSoC. The work can be found in this PR#4350, which has been merged with Prometheus master.

Syntax

We use a separate file for specifying unit tests for alerting rules and PromQL expressions (in place of recording rules). This syntax of the file is based on this design doc which was constantly reviewed by Prometheus members.

Edit: This blog post will not be updated with any changes to unit testing. It might get outdated in future, hence also have a look at official documentation here.

The File

# This is a list of rule files to consider for testing.
rule_files:
  [ - <file_name> ]

# optional, default = 1m
evaluation_interval: <duration>

# The order in which group names are listed below will be the order of evaluation of 
# rule groups (at a given evaluation time). The order is guaranteed only for the groups mentioned below. 
# All the groups need not be mentioned below.
group_eval_order:
  [ - <group_name> ]

# All the tests are listed here.
tests:
  [ - <test_group> ]

`<test_group>`

# Series data
interval: <duration>
input_series:
  [ - <series> ]

# Unit tests for the above data.

# Unit tests for alerting rules. We consider the alerting rules from the input file.
alert_rule_test:
  [ - <alert_test_case> ]

# Unit tests for PromQL expressions.
promql_expr_test:
  [ - <promql_test_case> ]

`<series>`

# This follows the series notation (x{a="b", c="d"}). You can see an example below.
series: <string>

# This uses expanding notation. Example below.
values: <string>

`<alert_test_case>`

Prometheus allows you to have the same alertname for different alerting rules. Hence in this unit testing, you have to list the union of all the firing alerts for the alertname under a single `<alert_test_case>`.

# It's the time elapsed from time=0s when the alerts have to be checked.
eval_time: <duration>

# Name of the alert to be tested.
alertname: <string>

# List of expected alerts which are firing under the given alertname at 
# given evaluation time. If you want to test if an alerting rule should 
# not be firing, then you can mention the above fields and leave 'exp_alerts' empty.
exp_alerts:
  [ - <alert> ]

`<alert>`

Remember, this alert should be firing.

# These are the expanded labels and annotations of the expected alert. 
# Note: labels also include the labels of the sample associated with the 
# alert (same as what you see in `/alerts`, without series `__name__` and `alertname`)
exp_labels:
  [ <labelname>: <string> ]
exp_annotations:
  [ <labelname>: <string> ]

`<promql_test_case>`

# Expression to evaluate
expr: <string>

# It's the time elapsed from time=0s when the expression has to be evaluated.
eval_time: <duration>

# Expected samples at the given evaluation time.
exp_samples:
  [ - <sample> ]

`<sample>`

# Labels of the sample in series notation.
labels: <string>

# The expected value of the promql expression.
value: <number>

Example

These are example input files for unit testing which pass the test. alerts.yml contains the alerting rules, and test.yml follows the syntax above.

`alerts.yml`

# This is the rules file.

groups:
- name: example
  rules:

  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
        severity: page
    annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

  - alert: AnotherInstanceDown
    expr: up == 0
    for: 10m
    labels:
        severity: page
    annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

`test.yml`

# This is the main input for unit testing. 
# Only this file is passed as command line argument.

rule_files:
    - alerts.yml

evaluation_interval: 1m

tests:
    # Test 1.
    - interval: 1m
      # Series data.
      input_series:
          - series: 'up{job="prometheus", instance="localhost:9090"}'
            values: '0 0 0 0 0 0 0 0 0 0 0 0 0 0 0'
          - series: 'up{job="node_exporter", instance="localhost:9100"}'
            values: '1 1 1 1 1 1 1 0 0 0 0 0 0 0 0'
          - series: 'go_goroutines{job="prometheus", instance="localhost:9090"}'
            values: '10+10x2 30+20x5'
          - series: 'go_goroutines{job="node_exporter", instance="localhost:9100"}'
            values: '10+10x7 10+30x4'

      # Unit tests for alerting rules.
      alert_rule_test:
          # Unit test 1.
          - eval_time: 10m
            alertname: InstanceDown
            exp_alerts:
                # Alert 1.
                - exp_labels:
                      severity: page
                      instance: localhost:9090
                      job: prometheus
                  exp_annotations:
                      summary: "Instance localhost:9090 down"
                      description: "localhost:9090 of job prometheus has been down for more than 5 minutes."

      # Unit tests for promql expressions.
      promql_expr_test:
          # Unit test 1.
          - expr: go_goroutines > 5
            eval_time: 4m
            exp_samples:
                # Sample 1.
                - labels: 'go_goroutines{job="prometheus",instance="localhost:9090"}'
                  value: 50
                # Sample 2.
                - labels: 'go_goroutines{job="node_exporter",instance="localhost:9100"}'
                  value: 50
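To see why both go_goroutines samples are expected to be 50 at 4m, here is a minimal sketch of the expanding notation used in values. It handles only bare numbers and the 'a+bxn' form, where 'a+bxn' expands to a, a+b, ..., a+n*b; the function `expand` is written for this example and is not the parser used by promtool.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// expand turns a values string such as "10+10x2 30+20x5" into samples.
// "a+bxn" yields a, a+b, ..., a+n*b (n+1 values); a bare number yields itself.
func expand(values string) ([]float64, error) {
	var out []float64
	for _, tok := range strings.Fields(values) {
		plus := strings.Index(tok, "+")
		x := strings.Index(tok, "x")
		if plus < 0 || x < plus {
			// Not of the form a+bxn: treat the token as a single number.
			v, err := strconv.ParseFloat(tok, 64)
			if err != nil {
				return nil, err
			}
			out = append(out, v)
			continue
		}
		start, err := strconv.ParseFloat(tok[:plus], 64)
		if err != nil {
			return nil, err
		}
		step, err := strconv.ParseFloat(tok[plus+1:x], 64)
		if err != nil {
			return nil, err
		}
		n, err := strconv.Atoi(tok[x+1:])
		if err != nil {
			return nil, err
		}
		for i := 0; i <= n; i++ {
			out = append(out, start+float64(i)*step)
		}
	}
	return out, nil
}

func main() {
	vals, err := expand("10+10x2 30+20x5")
	if err != nil {
		panic(err)
	}
	// With interval 1m, the sample at eval_time 4m is the one at index 4.
	fmt.Println(vals[4]) // 50
}
```

The second series, '10+10x7 10+30x4', also has 50 at index 4, which is why the test expects the same value for both.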

Usage

This feature will come embedded in promtool.

# For the above example.
./promtool test rules test.yml

# If you have multiple such test files, say test{1,2,3}.yml
./promtool test rules test1.yml test2.yml test3.yml

What is tested?

Syntax of the rule files included in the test.
Correctness of template variables. Note that if you have used $labels.something_wrong, it won't be caught at this stage.
If the alerts listed for the alertname are exactly same as what we get after simulation over the data.
Exact match for the samples returned by PromQL expressions at given time. Order doesn't matter.

While we do the matches in 3 and 4, usage of $labels.something_wrong will be caught as it will result in an empty string.

3) UI for Testing Alerting Rules

Introduction

Before this work, Prometheus lacked any good and convenient way of visualising and testing alert rules before they are used. Requests for the same were made long ago in issues #1154 and #1220, long standing!

It will be added to Prometheus with this PR#4277. Now let's learn more about this.

The UI

You will be able to access this tool at /alert-rule-testing.

open images in new tab for a better view

This is what you will see when you first open.

You will enter your rules here in the same format as you would write your rule file.

After you press Execute, you will see success/error messages here.

If it was a success, you will see the graphs for the alert expression and ALERT series simulated over the existing data. Graphs are plotted only for the active alerts.

Example

A simple alerting rule, and hit Execute!

There is 1 active alert; you can see its info here.

Graph of the expression and the corresponding ALERT graph. You can see that the alerting rule would have switched between the pending and firing states twice in the current data.

Example for errors

Error in expr.

Error in template variables.

Stay tuned, it will be added to Prometheus soon!

WAL truncation

Checkpointing

Replaying the WAL

Low level details of writing to and reading from WAL

Code reference

Here is the entire Prometheus TSDB blog series

Introduction