<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Ganesh Vernekar Blog</title>
        <link>https://ganeshvernekar.com/blog</link>
        <description>Ganesh Vernekar Blog</description>
        <lastBuildDate>Tue, 14 Sep 2021 00:00:00 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <copyright>Copyright © 2026 Ganesh Vernekar.</copyright>
        <item>
            <title><![CDATA[Prometheus TSDB (Part 7): Snapshot on Shutdown]]></title>
            <link>https://ganeshvernekar.com/blog/prometheus-tsdb-snapshot-on-shutdown</link>
            <guid>https://ganeshvernekar.com/blog/prometheus-tsdb-snapshot-on-shutdown</guid>
            <pubDate>Tue, 14 Sep 2021 00:00:00 GMT</pubDate>
            <description><![CDATA[Taking a snapshot of in-memory data during the shutdown for faster restarts]]></description>
            <content:encoded><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="introduction">Introduction<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-snapshot-on-shutdown#introduction" class="hash-link" aria-label="Direct link to Introduction" title="Direct link to Introduction">​</a></h2>
<p>In <a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint/">part 2</a> we saw that TSDB uses Write-Ahead-Log (WAL) to provide durability against crashes. But it also makes restarts of Prometheus slow when you hit a decent scale because replaying Checkpoint+WAL takes time.</p>
<p>In this post we will understand more about a new feature introduced in Prometheus v2.30.0: taking snapshots of in-memory data during the shutdown for faster restarts by entirely skipping the WAL replay.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="about-snapshot">About snapshot<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-snapshot-on-shutdown#about-snapshot" class="hash-link" aria-label="Direct link to About snapshot" title="Direct link to About snapshot">​</a></h2>
<p>A snapshot in TSDB is a read-only, static view of the TSDB's in-memory data at a given time.</p>
<p>The snapshot contains the following (in order):</p>
<ol>
<li>All the time series and the in-memory chunk of each series present in the Head block. (<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk/">Part 3</a> recap: except the last chunk, everything else is already on disk and m-mapped).</li>
<li>All the tombstones in the Head block.</li>
<li>All the exemplars in the Head block.</li>
</ol>
<p>Taking inspiration from the checkpoints in part 2, we name these snapshots as <code>chunk_snapshot.X.Y</code>, where <code>X</code> is the last WAL segment number that was seen when taking the snapshot, and <code>Y</code> is the byte offset up to which the data was written in the <code>X</code> WAL segment.</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">data</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├── 01EM6Q6A1YPX4G9TEB20J22B2R</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">|   ├── chunks</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">|   |   ├── 000001</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">|   |   └── 000002</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">|   ├── index</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">|   ├── meta.json</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">|   └── tombstones</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├── chunk_snapshot.000005.12345</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">|   ├── 000001</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">|   └── 000002</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├── chunks_head</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">|   ├── 000001</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">|   └── 000002</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">└── wal</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token 
plain">   ├── checkpoint.000003</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   |   ├── 000000</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   |   └── 000001</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   ├── 000004</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   └── 000005</span><br></span></code></pre></div></div>
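<p>As a rough illustration of the naming scheme above, the following Python sketch parses a <code>chunk_snapshot.X.Y</code> directory name and picks the most recent one from a directory listing. The helper names <code>parse_snapshot_dir</code> and <code>latest_snapshot</code> are hypothetical, and treating the highest <code>(X, Y)</code> pair as the latest snapshot is an assumption made for this example, not a description of the Prometheus code.</p>

```python
import re

# Pattern for the chunk_snapshot.X.Y naming described in the post.
SNAPSHOT_DIR = re.compile(r"^chunk_snapshot\.(\d+)\.(\d+)$")


def parse_snapshot_dir(name):
    """Return (wal_segment, byte_offset) for a snapshot dir name, else None."""
    match = SNAPSHOT_DIR.match(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def latest_snapshot(names):
    """Pick the snapshot dir with the highest (segment, offset) pair.

    Assumption for this sketch: "latest" means the largest (X, Y).
    """
    candidates = []
    for name in names:
        parsed = parse_snapshot_dir(name)
        if parsed is not None:
            candidates.append((parsed, name))
    if not candidates:
        return None
    return max(candidates)[1]
```

<p>For the directory listing shown above, <code>parse_snapshot_dir("chunk_snapshot.000005.12345")</code> yields segment <code>5</code> and offset <code>12345</code>, while names such as <code>wal</code> or <code>chunks_head</code> are ignored.</p>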
<p>We take this snapshot when shutting down the TSDB after stopping all writes. If the TSDB was stopped abruptly, then no new snapshot is taken and the snapshot from the last graceful shutdown remains.</p>
<p>This feature is disabled by default and can be enabled with <code>--enable-feature=memory-snapshot-on-shutdown</code>.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="snapshot-format">Snapshot format<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-snapshot-on-shutdown#snapshot-format" class="hash-link" aria-label="Direct link to Snapshot format" title="Direct link to Snapshot format">​</a></h2>
<p>Snapshot uses the <a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint#low-level-details-of-writing-to-and-reading-from-wal">generic WAL implementation of Prometheus TSDB</a> and defines 3 new record formats for the snapshots.</p>
<p>The order of records in the snapshot is always:</p>
<ol>
<li>Series record (&gt;=0): This record is a snapshot of a single series. One record is written per series in an unsorted fashion. It includes the metadata of the series and the in-memory chunk data if it exists.</li>
<li>Tombstones record (0-1): After all series are done, we write a tombstone record containing all the tombstones of the Head block. A single record is written into the snapshot containing all the tombstones.</li>
<li>Exemplar record (&gt;=0): At the end, we write one or more exemplar records while batching up the exemplars in each record. Exemplars are in the order they were written to the circular buffer.</li>
</ol>
<p>The format of these records can be found <a href="https://github.com/prometheus/prometheus/blob/main/tsdb/docs/format/memory_snapshot.md" target="_blank" rel="noopener noreferrer">here</a>; we won't be discussing them in this blog post.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="restoring-in-memory-state">Restoring in-memory state<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-snapshot-on-shutdown#restoring-in-memory-state" class="hash-link" aria-label="Direct link to Restoring in-memory state" title="Direct link to Restoring in-memory state">​</a></h2>
<p>With <code>chunk_snapshot.X.Y</code>, we can ignore the WAL before <code>Xth</code> segment's <code>Y</code> offset and only replay the WAL after that because the snapshot along with m-mapped chunks represents the replayed state until that point in the WAL.</p>
<p>Hence with snapshots enabled, the replay of data to restore the Head goes as follows:</p>
<ol>
<li>Iterate all the m-mapped chunks as described in <a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk#replaying-on-startup">part 3</a> and build the map.</li>
<li>Iterate the series records from the latest <code>chunk_snapshot.X.Y</code> one by one. For each series record, re-create the series in memory with the labels and the in-memory chunk in the record. Similar to handling the <code>Series</code> record in the WAL, we look for the corresponding m-mapped chunks for this series reference and attach them to this series.</li>
<li>Read the tombstones record, if any, from the snapshot and restore it into memory.</li>
<li>Iterate the exemplar records one by one, if any, and put them back into the circular buffer in the same order.</li>
<li>After replaying the m-mapped chunks and the snapshot, we continue the replay of the WAL from the <code>Xth</code> segment's <code>Y</code> byte offset as usual. If there are WAL checkpoints numbered <code>&gt;=X</code>, we also replay the last checkpoint before replaying the WAL.</li>
</ol>
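<p>The last replay step above can be sketched as follows. This is a minimal Python illustration, assuming WAL segments are identified by plain integers; the function name <code>wal_replay_plan</code> is hypothetical, and the sketch deliberately ignores WAL checkpoints.</p>

```python
def wal_replay_plan(segments, snap_segment, snap_offset):
    """Return (segment, start_offset) pairs to replay after a snapshot.

    Segments before the snapshot's segment X are skipped entirely, segment X
    is read from byte offset Y, and later segments are read from the start.
    Checkpoints are not modelled in this sketch.
    """
    plan = []
    for segment in sorted(segments):
        if segment >= snap_segment:
            start = snap_offset if segment == snap_segment else 0
            plan.append((segment, start))
    return plan
```

<p>With <code>chunk_snapshot.000005.12345</code> and WAL segments <code>4</code> and <code>5</code>, the plan contains only segment <code>5</code> starting at offset <code>12345</code>. After a graceful shutdown that offset is the end of the segment, so there is nothing left to read.</p>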
<p>In the majority of cases (i.e. graceful shutdowns), there will be no WAL to replay since the snapshot is taken after stopping writes during the shutdown. There will be WAL/Checkpoint data to replay if Prometheus crashes or is shut down abruptly.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="faster-restarts">Faster restarts<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-snapshot-on-shutdown#faster-restarts" class="hash-link" aria-label="Direct link to Faster restarts" title="Direct link to Faster restarts">​</a></h2>
<p>When we talk about restarts, it is not only the time taken to replay the data on disk to restore the memory state, but also the time taken to shut down, because snapshotting now adds some delay to the shutdown.</p>
<p>Writing snapshots takes time on the order of seconds, and is usually under a minute for 1 million series. Replaying the checkpoint also takes time on the order of seconds, while the WAL replay can take multiple minutes for the same number of series.</p>
<p>By skipping the WAL replay entirely during graceful restart, we have seen anywhere between 50-80% reduction in <em>restart</em> time.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="few-things-to-be-aware-of">Few things to be aware of<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-snapshot-on-shutdown#few-things-to-be-aware-of" class="hash-link" aria-label="Direct link to Few things to be aware of" title="Direct link to Few things to be aware of">​</a></h2>
<ul>
<li>When enabled, the snapshot takes additional disk space; it does not replace anything that already exists on disk.</li>
<li>Depending on how many series you have and the write speed of your disk, shutdown can take a little time. Therefore, set the <a href="https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination-forced" target="_blank" rel="noopener noreferrer">pod termination grace period</a> (or equivalent) for the Prometheus pod accordingly.</li>
</ul>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="pre-answering-some-questions">Pre-answering some questions<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-snapshot-on-shutdown#pre-answering-some-questions" class="hash-link" aria-label="Direct link to Pre-answering some questions" title="Direct link to Pre-answering some questions">​</a></h2>
<p><em>Why take snapshots only on shutdown?</em></p>
<p>When we look at the number of times a sample is written to the disk (or re-written during compaction), it is only a handful. If we take snapshots at intervals while Prometheus is running, this can increase the number of times a sample is written to disk by a large percentage, hence causing unnecessary write amplification. So we chose to go with the majority case of a graceful shutdown, while after a crash we read part of the WAL depending on the last snapshot present on the disk.</p>
<p><em>Why do we still need WAL?</em></p>
<p>If Prometheus happens to crash due to various reasons, we need the WAL for durability since a snapshot cannot be taken. Additionally, <a href="https://prometheus.io/docs/practices/remote_write/" target="_blank" rel="noopener noreferrer">remote-write</a> depends on the WAL.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="code-reference">Code reference<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-snapshot-on-shutdown#code-reference" class="hash-link" aria-label="Direct link to Code reference" title="Direct link to Code reference">​</a></h2>
<p>The code for taking the snapshot and reading it is present in <a href="https://github.com/prometheus/prometheus/blob/main/tsdb/head_wal.go" target="_blank" rel="noopener noreferrer"><code>tsdb/head_wal.go</code></a>.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="here-is-the-entire-prometheus-tsdb-blog-series">Here is the entire Prometheus TSDB blog series<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-snapshot-on-shutdown#here-is-the-entire-prometheus-tsdb-blog-series" class="hash-link" aria-label="Direct link to Here is the entire Prometheus TSDB blog series" title="Direct link to Here is the entire Prometheus TSDB blog series">​</a></h2>
<ol>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/">Prometheus TSDB (Part 1): The Head Block</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint/">Prometheus TSDB (Part 2): WAL and Checkpoint</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk/">Prometheus TSDB (Part 3): Memory Mapping of Head Chunks from Disk</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index/">Prometheus TSDB (Part 4): Persistent Block and its Index</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-queries/">Prometheus TSDB (Part 5): Queries</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention/">Prometheus TSDB (Part 6): Compaction and Retention</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-snapshot-on-shutdown/">Prometheus TSDB (Part 7): Snapshot on Shutdown</a></li>
</ol>]]></content:encoded>
            <category>Prometheus</category>
            <category>TSDB</category>
        </item>
        <item>
            <title><![CDATA[Prometheus TSDB (Part 6): Compaction and Retention]]></title>
            <link>https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention</link>
            <guid>https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention</guid>
            <pubDate>Tue, 27 Jul 2021 00:00:00 GMT</pubDate>
            <description><![CDATA[How do we do maintenance of data blocks in Prometheus TSDB and how do we handle retention of data on disk]]></description>
            <content:encoded><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="introduction">Introduction<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention#introduction" class="hash-link" aria-label="Direct link to Introduction" title="Direct link to Introduction">​</a></h2>
<p>When Prometheus has created a bunch of blocks, we need to regularly perform maintenance on those blocks to make efficient use of the disk and keep the queries performant.</p>
<p>In this blog post, we are going to look at 2 topics, compaction and retention, which happen in the background when Prometheus is running.</p>
<p>If you have not read the earlier parts of this blog post series, now is a good time to check out <a href="https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/">part 1</a> and <a href="https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index/">part 4</a> to understand this blog post better.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="compaction">Compaction<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention#compaction" class="hash-link" aria-label="Direct link to Compaction" title="Direct link to Compaction">​</a></h2>
<p>Compaction consists of writing a new block from one or more existing blocks (called the source blocks or parent blocks), and at the end, the source blocks are deleted and the new compacted block is used in place of those source blocks.</p>
<p>But why do we need compaction?</p>
<ol>
<li>As we saw in part 4, any deletions to the data are stored as tombstones in a separate file while the data still stays on disk. So when the tombstones are touching more than some % of the series, we need to remove that data from the disk.</li>
<li>With low enough churn, most of the data in the index in adjacent blocks (w.r.t. time) is going to be the same. So by compacting (merging) those adjacent blocks, we can deduplicate a large part of the index and hence save disk space.</li>
<li>When a query hits &gt;1 block, we have to merge the result we get from individual blocks and that can be a bit of overhead. By merging adjacent blocks, we prevent this overhead.</li>
<li>If there are overlapping blocks (overlapping w.r.t. time), querying them requires deduplication of samples between blocks, which is significantly more expensive than just concatenating chunks from different blocks. Merging these overlapping blocks avoids the need for deduplication.</li>
</ol>
<p>Below are the two steps for a single compaction to take place. Every minute we initiate a compaction cycle where we check for step-1 and only proceed to step-2 if step-1 was not empty. The compaction cycle runs these steps in a loop and exits when step-1 is empty.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="step-1-the-plan">Step 1: The "plan"<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention#step-1-the-plan" class="hash-link" aria-label="Direct link to Step 1: The &quot;plan&quot;" title="Direct link to Step 1: The &quot;plan&quot;">​</a></h3>
<p>A "plan" is a list of blocks to be compacted together, picked based on the below conditions in order of priority (highest to lowest). The first condition that is satisfied generates a plan, hence only 1 condition per plan. When none of the conditions are met, the plan is empty.</p>
<h4 class="anchor anchorWithStickyNavbar_LWe7" id="condition-1-overlapping-blocks">Condition 1: Overlapping blocks<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention#condition-1-overlapping-blocks" class="hash-link" aria-label="Direct link to Condition 1: Overlapping blocks" title="Direct link to Condition 1: Overlapping blocks">​</a></h4>
<p>As we saw above, overlapping blocks can make queries slow. Moreover, Prometheus itself does not produce overlapping blocks; they are only possible if you backfill some data into Prometheus. So the highest priority goes to removing the overlap and getting the state back to what Prometheus would produce.</p>
<p>The plan can consist of &gt;2 blocks. Take this example:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">|---1---|</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">            |---2---|</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      |---3---|</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">                  |---4---|</span><br></span></code></pre></div></div>
<p>While there are only 2 blocks per overlap, if you look closely, when we compact one overlap, let's say 1 and 3, they together will eventually overlap with 2. So instead of going through multiple cycles to fix all the linked overlaps, the first pass will choose <code>[1 2 3 4]</code> as the plan and reduce the number of compactions.</p>
<p>Another example that produces a single plan <code>[1 2 3]</code></p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">|-----1-----|</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  |--2--|</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     |----3----|  </span><br></span></code></pre></div></div>
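<p>The chaining of linked overlaps in the two examples above can be sketched in Python as follows. This is only an illustration under stated assumptions: blocks are hypothetical <code>(name, min_time, max_time)</code> tuples with half-open time ranges, the function name <code>overlapping_plan</code> is made up, and the timestamps used in the usage note are invented to match the shape of the diagrams.</p>

```python
def overlapping_plan(blocks):
    """Return the names of the first chain of transitively overlapping blocks.

    blocks: list of (name, min_time, max_time), half-open [min_time, max_time).
    The result is ordered by min_time and is empty when nothing overlaps.
    """
    ordered = sorted(blocks, key=lambda b: b[1])
    plan = []
    if not ordered:
        return plan
    group_max = ordered[0][2]  # highest max_time seen so far
    for prev, cur in zip(ordered, ordered[1:]):
        if group_max > cur[1]:
            # cur starts before the running group ends: it joins the chain.
            if not plan:
                plan.append(prev[0])
            plan.append(cur[0])
        elif plan:
            # The chain has ended; later overlaps wait for another pass.
            break
        group_max = max(group_max, cur[2])
    return plan
```

<p>With blocks <code>1=[0,8)</code>, <code>3=[6,14)</code>, <code>2=[12,20)</code> and <code>4=[18,26)</code>, the sketch returns all four blocks as a single plan (ordered by start time), and for <code>1=[0,12)</code>, <code>2=[2,8)</code>, <code>3=[5,15)</code> it returns <code>1</code>, <code>2</code> and <code>3</code>.</p>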
<p>Note that support for overlapping blocks is not enabled by default in Prometheus. It will error out at startup or at runtime if you have overlapping blocks, unless support is enabled via the <code>--storage.tsdb.allow-overlapping-blocks</code> flag.</p>
<h4 class="anchor anchorWithStickyNavbar_LWe7" id="condition-2-preset-time-ranges">Condition 2: Preset time ranges<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention#condition-2-preset-time-ranges" class="hash-link" aria-label="Direct link to Condition 2: Preset time ranges" title="Direct link to Condition 2: Preset time ranges">​</a></h4>
<p>In this, we pick &gt;1 block to merge to fill some preset time ranges. In Prometheus, by default, time ranges are <code>[2h 6h 18h 54h 162h 486h]</code>, i.e. starting at 2h with each range being 3 times the previous one.</p>
<p>Let's take an example of <code>6h</code> range. We divide the Unix time into buckets as <code>0-6h, 6h-12h, 12h-18h ...</code>, and if &gt;1 block falls into any single bucket, that forms a plan and we compact them together to form a block up to 6h long.</p>
<p>We also take care to not compact the newest blocks that do not span the entire bucket together yet. For example, the latest 2 blocks of 2h range won't be compacted together since they (1) are new and (2) do not span 6h combined. Since Prometheus produces 2h blocks, when we have &gt;=3 blocks, the blocks falling into the same buckets are compacted together.</p>
<p>Similarly, we check all ranges to see if there is any time bucket that has &gt;1 block falling in it. At the end of the compaction cycle, there will be no time bucket with &gt;1 block for all ranges.</p>
<p>In Prometheus, the maximum size of a block can be either <code>31d</code> (i.e. <code>744h</code>), or 1/10th of the retention time, whichever is lower.</p>
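<p>The bucketing idea for preset time ranges can be sketched as follows. This is a simplified Python illustration, not the Prometheus planner: the names <code>preset_ranges</code> and <code>bucket_plans</code> are hypothetical, blocks are assumed to be <code>(name, min_time_ms, max_time_ms)</code> tuples, and the rule about leaving the newest, not yet complete bucket alone is intentionally omitted.</p>

```python
from collections import defaultdict

HOUR_MS = 60 * 60 * 1000


def preset_ranges(base_hours=2, factor=3, count=6):
    """Time ranges in hours: start at 2h and multiply by 3 each step."""
    return [base_hours * factor ** i for i in range(count)]


def bucket_plans(blocks, range_hours):
    """Group blocks by the Unix-time bucket of the given width they fall in.

    Returns {bucket_start_ms: [names]} for buckets holding more than 1 block.
    Blocks that cross a bucket boundary are skipped for this range.
    """
    width = range_hours * HOUR_MS
    buckets = defaultdict(list)
    for name, min_t, max_t in sorted(blocks, key=lambda b: b[1]):
        start = (min_t // width) * width
        if max_t > start + width:
            continue  # does not fit inside a single bucket of this width
        buckets[start].append(name)
    return {start: names for start, names in buckets.items() if len(names) > 1}
```

<p>Here <code>preset_ranges()</code> reproduces the default list of ranges in hours, and three consecutive 2h blocks covering <code>0-6h</code> form one plan for the <code>6h</code> range, while a lone block in the <code>6h-12h</code> bucket does not.</p>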
<h4 class="anchor anchorWithStickyNavbar_LWe7" id="condition-3-tombstones-covering-some--of-series">Condition 3: Tombstones covering some % of series<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention#condition-3-tombstones-covering-some--of-series" class="hash-link" aria-label="Direct link to Condition 3: Tombstones covering some % of series" title="Direct link to Condition 3: Tombstones covering some % of series">​</a></h4>
<p>In the end, if any block has tombstones touching &gt;5% of the total series in the block, we pick that for compaction where the data pointed out by tombstones is deleted from the disk (by creating a new block with no samples covered by the tombstones). This produces a plan with only 1 block.</p>
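<p>The tombstone condition boils down to a ratio check. A minimal Python sketch, assuming we already know how many series in a block are touched by tombstones (the function name and parameters are hypothetical):</p>

```python
def needs_tombstone_compaction(tombstoned_series, total_series, percent=5):
    """True when tombstones touch more than `percent` % of the block's series.

    Integer arithmetic avoids floating point surprises at the boundary.
    """
    if total_series == 0:
        return False
    return tombstoned_series * 100 > percent * total_series
```

<p>A block with 6 out of 100 series touched by tombstones qualifies, while a block with exactly 5 out of 100 does not, since the condition is strictly greater than 5%.</p>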
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="step-2-the-compaction-itself">Step 2: The compaction itself<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention#step-2-the-compaction-itself" class="hash-link" aria-label="Direct link to Step 2: The compaction itself" title="Direct link to Step 2: The compaction itself">​</a></h3>
<p>As we saw in part 4, persistent blocks are immutable. To do any changes, we have to write a new block. Similarly, in compaction, we write an entirely new block, even if it is compaction of a single block. The compaction step only receives the list of blocks to compact together into a single block and is ignorant about the logic used to create this plan.</p>
<p>The compaction logic has been evolving with time with various memory management techniques and faster merging of data. At a higher level, compaction does an N-way merge of the series from the source blocks while iterating through the series one by one in sorted order (the same order in which they appear in the index).</p>
<p>While the series is deduplicated in the index, when the blocks are not overlapping, the chunks are concatenated together from source blocks. If blocks are overlapping, only the overlapping chunks are uncompressed, samples are deduped (i.e. only keep 1 sample for matching timestamp), and compressed back into &gt;=1 chunk while keeping the max size of chunk to 120 samples.</p>
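<p>The handling of overlapping chunks described above can be sketched as follows. This Python illustration works on already-decoded samples, represented as <code>(timestamp, value)</code> tuples, and leaves out chunk compression entirely. The function name is hypothetical, and keeping the first sample seen for a duplicate timestamp is an assumption of this sketch.</p>

```python
import heapq

MAX_SAMPLES_PER_CHUNK = 120


def merge_overlapping(chunks):
    """Merge sorted sample lists, dedupe by timestamp, and re-split.

    chunks: list of lists of (timestamp, value), each sorted by timestamp.
    Returns a list of chunks with at most 120 samples each.
    """
    merged = []
    for ts, value in heapq.merge(*chunks, key=lambda sample: sample[0]):
        if merged and merged[-1][0] == ts:
            continue  # keep only 1 sample per timestamp
        merged.append((ts, value))
    return [
        merged[i:i + MAX_SAMPLES_PER_CHUNK]
        for i in range(0, len(merged), MAX_SAMPLES_PER_CHUNK)
    ]
```

<p>Two overlapping chunks that share the timestamp <code>2</code> come out as one chunk with a single sample at that timestamp, and 250 merged samples are split into chunks of 120, 120 and 10 samples.</p>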
<p>If there are tombstones in any of the blocks, the chunks of those series are re-written to exclude the time ranges mentioned in the tombstones. The final block won't have any tombstones.</p>
<p>Every compacted block is given a compaction level, which tells the generation of the block, i.e. number of times blocks have been compacted to get this one. It is <code>max(level of source blocks) + 1</code> for the new block.</p>
<p>If all samples of a series are deleted, then the series is skipped from the new block entirely. If the block has 0 samples (i.e. empty block), then no block is written to the disk while the source blocks are deleted.</p>
<p>Note that compaction itself does not delete the source blocks, but only marks them as deletable (in their <code>meta.json</code>). The loading of new blocks and deletion of source blocks is handled by the TSDB separately after the compaction cycle has ended.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="head-compaction">Head compaction<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention#head-compaction" class="hash-link" aria-label="Direct link to Head compaction" title="Direct link to Head compaction">​</a></h2>
<p>This is a special kind of compaction where the source is the Head block and the compaction persists part of the Head block into persistent blocks while removing any data pointed by tombstones.</p>
<p>Part 1 has an illustration and explanation of when the Head compaction is done. Head block implements the same interface as that of a persistent block reader, hence we use the same compaction code to also compact the Head block into a persistent block.</p>
<p>The block produced from the Head block has compaction level 1.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="retention">Retention<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention#retention" class="hash-link" aria-label="Direct link to Retention" title="Direct link to Retention">​</a></h2>
<p>TSDB allows setting retention policies to limit how much data you store in it. There are 2 of them, time-based and size-based retention. You can either set one of them or both of them. When you set both of them, it is an <code>OR</code> between them, i.e. the first one to be satisfied will trigger the deletion of relevant data.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="time-based-retention">Time based retention<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention#time-based-retention" class="hash-link" aria-label="Direct link to Time based retention" title="Direct link to Time based retention">​</a></h3>
<p>In this, you specify how long the data should span in the TSDB. It is a relative time span calculated w.r.t. the max time of the newest persistent block (and not w.r.t. the Head block). A block is deleted when it goes completely beyond the time retention period and not when part of the block goes beyond the time retention.</p>
<p>For example, if the retention period is <code>15d</code>, as soon as the gap between the oldest block's max time and the newest block's max time goes beyond <code>15d</code>, the oldest block is deleted.</p>
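<p>The time-based rule can be sketched as follows, assuming blocks are hypothetical <code>(name, min_time_ms, max_time_ms)</code> tuples. The strict "greater than" comparison follows the wording "goes beyond" above and is an assumption of this sketch.</p>

```python
DAY_MS = 24 * 60 * 60 * 1000


def blocks_beyond_time_retention(blocks, retention_ms):
    """Return names of blocks that lie completely beyond the retention period.

    The reference point is the max time of the newest persistent block,
    not the Head block and not the wall clock.
    """
    if not blocks:
        return []
    newest_max = max(max_t for _name, _min_t, max_t in blocks)
    return [
        name
        for name, _min_t, max_t in blocks
        if newest_max - max_t > retention_ms
    ]
```

<p>With a <code>15d</code> retention, a block whose max time is 16 days behind the newest block's max time is selected for deletion, while a block that is only partly older than 15 days is kept.</p>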
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="size-based-retention">Size based retention<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention#size-based-retention" class="hash-link" aria-label="Direct link to Size based retention" title="Direct link to Size based retention">​</a></h3>
<p>In this, you mention the max size of the TSDB on disk. It includes the WAL, checkpoint, m-mapped chunks, and persistent blocks. Although we count all of them to decide any deletion, WAL, checkpoint, and m-mapped chunks are required for the normal operation of TSDB. So even if they together go beyond the size retention, only the blocks are the ones that are deleted. So TSDB may take more than the specified max size if you set it too low.</p>
<p>Size-based retention is stricter compared to time-based retention. As soon as the entire space taken is at least 1 byte more than the max size, the oldest block is deleted.</p>
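<p>Size-based retention can be sketched in the same spirit. In this Python illustration, <code>other_bytes</code> stands for the combined size of the WAL, checkpoint and m-mapped chunks, which count towards the total but are never deleted. The function name and the tuple layout <code>(name, max_time, size_bytes)</code> are hypothetical.</p>

```python
def blocks_beyond_size_retention(blocks, other_bytes, max_bytes):
    """Return names of the oldest blocks to delete to get under max_bytes.

    blocks: list of (name, max_time, size_bytes).
    If other_bytes alone exceeds max_bytes, every block is selected and the
    TSDB still ends up larger than the configured limit.
    """
    total = other_bytes + sum(size for _name, _max_t, size in blocks)
    deletable = []
    for name, _max_t, size in sorted(blocks, key=lambda b: b[1]):  # oldest first
        if max_bytes >= total:
            break
        deletable.append(name)
        total -= size
    return deletable
```

<p>For three 400-byte blocks plus 100 bytes of WAL and a 1000-byte limit, only the oldest block is selected. With a limit of 50 bytes all blocks are selected, and the remaining 100 bytes still exceed the limit, which mirrors the caveat above about setting the max size too low.</p>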
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="code-reference">Code reference<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention#code-reference" class="hash-link" aria-label="Direct link to Code reference" title="Direct link to Code reference">​</a></h2>
<p><a href="https://github.com/prometheus/prometheus/blob/master/tsdb/compact.go" target="_blank" rel="noopener noreferrer"><code>tsdb/compact.go</code></a> has the code for the creation of plan and compacting the blocks.</p>
<p><a href="https://github.com/prometheus/prometheus/blob/main/storage/merge.go" target="_blank" rel="noopener noreferrer"><code>storage/merge.go</code></a> has the code for concatenating/merging the chunks from different blocks (both for overlapping and non-overlapping chunks).</p>
<p><a href="https://github.com/prometheus/prometheus/blob/master/tsdb/db.go" target="_blank" rel="noopener noreferrer"><code>tsdb/db.go</code></a> has the code for initiating the compaction cycle every minute and calling the step-1 &amp; step-2 on blocks and compaction of the Head block. It also has the code for both types of retention.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="here-is-the-entire-prometheus-tsdb-blog-series">Here is the entire Prometheus TSDB blog series<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention#here-is-the-entire-prometheus-tsdb-blog-series" class="hash-link" aria-label="Direct link to Here is the entire Prometheus TSDB blog series" title="Direct link to Here is the entire Prometheus TSDB blog series">​</a></h2>
<ol>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/">Prometheus TSDB (Part 1): The Head Block</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint/">Prometheus TSDB (Part 2): WAL and Checkpoint</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk/">Prometheus TSDB (Part 3): Memory Mapping of Head Chunks from Disk</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index/">Prometheus TSDB (Part 4): Persistent Block and its Index</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-queries/">Prometheus TSDB (Part 5): Queries</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention/">Prometheus TSDB (Part 6): Compaction and Retention</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-snapshot-on-shutdown/">Prometheus TSDB (Part 7): Snapshot on Shutdown</a></li>
</ol>]]></content:encoded>
            <category>Prometheus</category>
            <category>TSDB</category>
            <category>Compaction</category>
        </item>
        <item>
            <title><![CDATA[Prometheus TSDB (Part 5): Queries]]></title>
            <link>https://ganeshvernekar.com/blog/prometheus-tsdb-queries</link>
            <guid>https://ganeshvernekar.com/blog/prometheus-tsdb-queries</guid>
            <pubDate>Mon, 04 Jan 2021 00:00:00 GMT</pubDate>
            <description><![CDATA[Internals of querying Prometheus TSDB.]]></description>
            <content:encoded><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="introduction">Introduction<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-queries#introduction" class="hash-link" aria-label="Direct link to Introduction" title="Direct link to Introduction">​</a></h2>
<p>In the last four blog posts we saw the internals of how data is stored in the TSDB. It's now time to see how to query it. In this blog post we will look at the 3 types of queries that we run on the persistent blocks, and briefly at how they work on the Head block.</p>
<p><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index/">Part 4</a>, which talks about how data is stored in persistent blocks, is a prerequisite for this blog post. Here are part <a href="https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/">1</a>, <a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint/">2</a> and <a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk/">3</a> in case you missed them.</p>
<p>[Edit 2021-01-16]: Some details about how the negation matchers work were updated.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="prologue">Prologue<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-queries#prologue" class="hash-link" aria-label="Direct link to Prologue" title="Direct link to Prologue">​</a></h2>
<p>Don't confuse this querying with PromQL queries. In this blog post we will see the low-level TSDB queries used to get the raw data from the TSDB. The PromQL engine performs these TSDB queries to get the raw data and then executes PromQL logic on it. So we are working at a layer lower than the PromQL engine.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="types-of-tsdb-queries">Types of TSDB Queries<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-queries#types-of-tsdb-queries" class="hash-link" aria-label="Direct link to Types of TSDB Queries" title="Direct link to Types of TSDB Queries">​</a></h2>
<p>There are 3 types of queries that we run on persistent blocks at the time of writing this blog post.</p>
<ol>
<li><code>LabelNames()</code>: returns all unique label names present in the block.</li>
<li><code>LabelValues(name)</code>: returns all the possible label values for the label name <code>name</code> as seen in the index.</li>
<li><code>Select([]matcher)</code>: returns the samples of all the series that match the given slice of matchers. We will talk more about these matchers later.</li>
</ol>
<p>Before we run any query on the block, we create something called a <code>Querier</code> on the block which has the min time (<code>mint</code>) and max time (<code>maxt</code>) for the query to be run. The <code>mint</code> and <code>maxt</code> are only applicable to the <code>Select</code> query, while the other two always look at all the values in the block.</p>
<p>We will discuss how we combine results from multiple blocks after looking at all 3 query types.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="labelnames"><code>LabelNames()</code><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-queries#labelnames" class="hash-link" aria-label="Direct link to labelnames" title="Direct link to labelnames">​</a></h2>
<p>This returns all the unique label names present in the block. To recap, in the series <code>{a="b", c="d"}</code>, the label names are <code>"a"</code> and <code>"c"</code>.</p>
<p>In Part 4 it was mentioned that the <code>Label Offset Table</code> was no longer used and is being written only for backward compatibility. Hence both <code>LabelNames()</code> and <code>LabelValues()</code> use <code>Postings Offset Table</code>.</p>
<p>When the index of the block is loaded on startup (or on block creation), we store a map <em><code>map[labelName][]postingOffset</code></em> from each label name to the positions of <em>some</em> of its label values in the postings offset table (every 32nd at the moment, plus the first and the last label value). Storing only some of the values helps in saving memory. This map is created by iterating through all the entries in <code>Postings Offset Table</code> when loading the block.</p>
<p>You can now imagine how we can get the label names - just iterate this in-memory map for its keys and there you have the label names. They are sorted before returning. This is useful for query autocomplete suggestions on UI.</p>
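<p>Here is a minimal Go sketch of the idea above, not the actual Prometheus code. The types <code>entry</code> and <code>postingOffset</code> and the functions <code>buildSampledOffsets</code> and <code>labelNames</code> are hypothetical names, and the postings offset table is modelled as a slice of entries that is already sorted by label name and then by label value.</p>

```go
package main

import (
	"fmt"
	"sort"
)

// entry models one row of the postings offset table (hypothetical type).
type entry struct {
	name, value string
	off         int // position of the row in the table
}

// postingOffset is a sampled label value together with its table position.
type postingOffset struct {
	value string
	off   int
}

const sampleEvery = 32 // sampling rate mentioned in the post

// buildSampledOffsets keeps the first, the last and every 32nd value per label name.
// It assumes entries are sorted by name and then by value.
func buildSampledOffsets(entries []entry) map[string][]postingOffset {
	out := map[string][]postingOffset{}
	count := map[string]int{}
	for i, e := range entries {
		isLast := i == len(entries)-1 || entries[i+1].name != e.name
		if count[e.name]%sampleEvery == 0 || isLast {
			out[e.name] = append(out[e.name], postingOffset{e.value, e.off})
		}
		count[e.name]++
	}
	return out
}

// labelNames returns the sorted keys of the in-memory map.
func labelNames(m map[string][]postingOffset) []string {
	names := make([]string, 0, len(m))
	for n := range m {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

func main() {
	var entries []entry
	for i := 0; i < 40; i++ {
		entries = append(entries, entry{"a", fmt.Sprintf("v%02d", i), i})
	}
	entries = append(entries, entry{"c", "d", 40})
	m := buildSampledOffsets(entries)
	// 40 values for "a" keep 3 sampled positions (first, 33rd, last); "c" keeps 1.
	fmt.Println(labelNames(m), len(m["a"]), len(m["c"]))
}
```

<p>The map holds only a fraction of the label values, yet its keys are always the complete set of label names, which is all that <code>LabelNames()</code> needs.</p>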
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="labelvaluesname"><code>LabelValues(name)</code><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-queries#labelvaluesname" class="hash-link" aria-label="Direct link to labelvaluesname" title="Direct link to labelvaluesname">​</a></h2>
<p>We saw above that we store positions of the first and the last label value in the memory for all label names. Hence for <code>LabelValues(name)</code> query, we take the first and last label value position for the given <code>name</code> and iterate on the disk between those two positions to get all the label values for that label name. Another recap here: all the label values for a label name are stored together lexicographically in <code>Postings Offset Table</code>.</p>
<p>For example if the series in the block were <code>{a="b1", c="d1"}</code>, <code>{a="b2", c="d2"}</code> and <code>{a="b3", c="d3"}</code>, then <code>LabelValues("a")</code> would yield <code>["b1", "b2", "b3"]</code>, <code>LabelValues("c")</code> would yield <code>["d1", "d2", "d3"]</code>.</p>
<p>This again helps in query autocomplete suggestions.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="selectmatcher"><code>Select([]matcher)</code><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-queries#selectmatcher" class="hash-link" aria-label="Direct link to selectmatcher" title="Direct link to selectmatcher">​</a></h2>
<p>This query helps in getting the raw TSDB samples from the series described by the given matchers. Before we talk about this query, we need to know what matchers are.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="matcher">Matcher<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-queries#matcher" class="hash-link" aria-label="Direct link to Matcher" title="Direct link to Matcher">​</a></h3>
<p>A matcher describes a label name and value combination that a series should match. For example, a matcher <code>a="b"</code> says pick all the series which have the label pair <code>a="b"</code>.</p>
<p>There are 4 types of matchers</p>
<ol>
<li>Equal <code>labelName="&lt;value&gt;"</code>: the label value for the label name should exactly match the given value.</li>
<li>Not Equal <code>labelName!="&lt;value&gt;"</code>: the label value for the label name should not be equal to the given value.</li>
<li>Regex Equal <code>labelName=~"&lt;regex&gt;"</code>: the label value for the label name should satisfy the given regex.</li>
<li>Regex Not Equal <code>labelName!~"&lt;regex&gt;"</code>: the label value for the label name should not satisfy the given regex.</li>
</ol>
<p>The <code>labelName</code> is the full label name and no regex is allowed there. A regex matcher must match the entire label value, not just a part of it, since the regex is anchored as <code>^(?:&lt;regex&gt;)$</code> before use.</p>
<p>Let's say the series are</p>
<ul>
<li>s1 = <code>{job="app1", status="404"}</code></li>
<li>s2 = <code>{job="app2", status="501"}</code></li>
<li>s3 = <code>{job="bar1", status="402"}</code></li>
<li>s4 = <code>{job="bar2", status="501"}</code></li>
</ul>
<p>Here are some matcher examples</p>
<ul>
<li><code>status="501"</code> -&gt; (s2, s4)</li>
<li><code>status!="501"</code> -&gt; (s1, s3)</li>
<li><code>job=~"app.*"</code> -&gt; (s1, s2)</li>
<li><code>job!~"app.*"</code> -&gt; (s3, s4)</li>
</ul>
<p>When there is more than one matcher, it is an AND operation (i.e. intersection) between all the matchers.</p>
<ul>
<li><code>job=~"app.*", status="501"</code> -&gt; (s1, s2) ∩ (s2, s4) -&gt; (s2)</li>
<li><code>job=~"bar.*", status!~"5.."</code> -&gt; (s3, s4) ∩ (s1, s3) -&gt; (s3)</li>
</ul>
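<p>To make the semantics concrete, here is a small Go sketch of the 4 matcher types applied to the example series. It is a simplification under stated assumptions: the <code>matcher</code> type and the <code>newMatcher</code> and <code>selectSeries</code> helpers are hypothetical, a missing label is treated as an empty value, and every series is checked one by one, which is not how the index-based lookup described in the next section works.</p>

```go
package main

import (
	"fmt"
	"regexp"
)

type matchType int

const (
	equal matchType = iota
	notEqual
	regexEqual
	regexNotEqual
)

// matcher is a hypothetical, simplified stand-in for a TSDB label matcher.
type matcher struct {
	typ   matchType
	name  string
	value string
	re    *regexp.Regexp // set only for the regex types
}

func newMatcher(t matchType, name, value string) matcher {
	m := matcher{typ: t, name: name, value: value}
	if t == regexEqual || t == regexNotEqual {
		// Anchor the regex so that it must match the entire label value.
		m.re = regexp.MustCompile("^(?:" + value + ")$")
	}
	return m
}

// matches reports whether a series (a label name to value map) satisfies m.
func (m matcher) matches(series map[string]string) bool {
	v := series[m.name] // assumption: a missing label reads as the empty string
	switch m.typ {
	case equal:
		return v == m.value
	case notEqual:
		return v != m.value
	case regexEqual:
		return m.re.MatchString(v)
	default:
		return !m.re.MatchString(v)
	}
}

// selectSeries returns the indexes of the series that satisfy every matcher (AND).
func selectSeries(all []map[string]string, ms ...matcher) []int {
	var out []int
	for i, s := range all {
		ok := true
		for _, m := range ms {
			ok = ok && m.matches(s)
		}
		if ok {
			out = append(out, i)
		}
	}
	return out
}

func main() {
	series := []map[string]string{
		{"job": "app1", "status": "404"}, // s1, index 0
		{"job": "app2", "status": "501"}, // s2, index 1
		{"job": "bar1", "status": "402"}, // s3, index 2
		{"job": "bar2", "status": "501"}, // s4, index 3
	}
	// job=~"app.*", status="501" selects s2.
	fmt.Println(selectSeries(series, newMatcher(regexEqual, "job", "app.*"), newMatcher(equal, "status", "501")))
	// job=~"bar.*", status!~"5.." selects s3.
	fmt.Println(selectSeries(series, newMatcher(regexEqual, "job", "bar.*"), newMatcher(regexNotEqual, "status", "5..")))
}
```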
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="selecting-samples">Selecting samples<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-queries#selecting-samples" class="hash-link" aria-label="Direct link to Selecting samples" title="Direct link to Selecting samples">​</a></h3>
<p>First step is to get the series that the matchers match. We need to get all the series for individual matchers and then finally intersect them.</p>
<p>We saw in part 4 that a "posting" is the series ID which tells us the position of series info in the index. <code>Postings Offset Table</code> and <code>Postings i</code> together give all the postings for a label-value pair.</p>
<h4 class="anchor anchorWithStickyNavbar_LWe7" id="getting-postings-for-a-single-matcher">Getting postings for a single matcher<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-queries#getting-postings-for-a-single-matcher" class="hash-link" aria-label="Direct link to Getting postings for a single matcher" title="Direct link to Getting postings for a single matcher">​</a></h4>
<p>If it is an Equal matcher, say <code>a="b"</code>, we directly get the postings list position for that from the postings offset table. Since we store positions for only some of the label values for a name, we get the two values between which <code>"b"</code> falls for label name <code>a</code> and iterate the entries between them till we find <code>"b"</code>. The <code>a="b"</code> entry in the offset table points to a postings list which is all the series ids that contain <code>a="b"</code>. If there is no such entry in the offset table, then it's an empty list of postings for the matcher.</p>
<p>For Regex Equal <code>a=~"&lt;rgx&gt;"</code>, we have to iterate through all the label values of <code>a</code> in the <code>Postings Offset Table</code> and check for the matcher condition. We take the postings list of all the matched entries and merge it (union) to get the sorted postings list for this matcher. Taking an example of <code>job=~"app.*"</code> from above, we find <code>job="app1" -&gt; (s1)</code> and <code>job="app2" -&gt; (s2)</code>, and after merging we have <code>job=~"app.*" -&gt; (s1, s2)</code>.</p>
<p>Not Equal <code>a!="b"</code> and Regex Not Equal <code>a!~"&lt;rgx&gt;"</code> are handled a little differently internally. Since the set of everything that does <em>not</em> match can be pretty huge in practice, we convert them to the corresponding Equal and Regex Equal matchers (i.e. <code>a!="b"</code> becomes <code>a="b"</code> and <code>a!~"&lt;rgx&gt;"</code> becomes <code>a=~"&lt;rgx&gt;"</code>), get the postings for those, and do a set subtraction instead of an intersection. Because of this, you cannot use a standalone negation matcher in a query, <em>you need to have at least one Equal or Regex Equal matcher</em>. See below for an example.</p>
<h4 class="anchor anchorWithStickyNavbar_LWe7" id="postings-for-multiple-matchers">Postings for multiple matchers<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-queries#postings-for-multiple-matchers" class="hash-link" aria-label="Direct link to Postings for multiple matchers" title="Direct link to Postings for multiple matchers">​</a></h4>
<p>Using the above procedure we first get the postings list for all individual matchers. And, similar to what we discussed about matchers before, we intersect them to finally get the postings list (series) that satisfy all the matchers. Note the change in set operation when we have a negation matcher.</p>
<p><code>job=~"bar.*", status!~"5.*"</code></p>
<p>-&gt; <code>(job=~"bar.*") ∩ (status!~"5.*")</code></p>
<p>-&gt; <code>(job=~"bar.*") - (status=~"5.*")</code></p>
<p>-&gt; <code>((job="bar1") ∪ (job="bar2")) - (status="501")</code></p>
<p>-&gt; <code>((s3) ∪ (s4)) - (s2, s4)</code></p>
<p>-&gt; <code>(s3, s4) - (s2, s4)</code> -&gt; <code>(s3)</code></p>
<p>Similarly, if the matchers were <code>a="b", c!="d", e=~"f.*", g!~"h.*"</code>, then the set operations would be <code>((a="b") ∩ (e=~"f.*")) - (c="d") - (g=~"h.*")</code>.</p>
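<p>The set operations above can be sketched in Go on sorted postings lists. This is an illustration and not the Prometheus implementation: the function names <code>intersect</code>, <code>merge</code> and <code>without</code> are hypothetical, the lists are plain in-memory slices, and the series s1 to s4 are given the ids 1 to 4.</p>

```go
package main

import "fmt"

// intersect returns the ids present in both sorted lists.
func intersect(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			out = append(out, a[i])
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
	}
	return out
}

// merge returns the sorted union of two sorted lists without duplicates.
func merge(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) || j < len(b) {
		switch {
		case j == len(b) || (i < len(a) && a[i] < b[j]):
			out = append(out, a[i])
			i++
		case i == len(a) || b[j] < a[i]:
			out = append(out, b[j])
			j++
		default: // both lists point at the same id
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

// without returns the ids in a that are not in b; both lists are sorted.
func without(a, b []uint64) []uint64 {
	var out []uint64
	j := 0
	for _, id := range a {
		for j < len(b) && b[j] < id {
			j++
		}
		if j == len(b) || b[j] != id {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	jobApp := []uint64{1, 2}                     // job=~"app.*"
	jobBar1, jobBar2 := []uint64{3}, []uint64{4} // job="bar1", job="bar2"
	status501 := []uint64{2, 4}                  // status="501"

	// job=~"app.*", status="501"
	fmt.Println(intersect(jobApp, status501)) // [2]
	// job=~"bar.*", status!~"5.*"
	fmt.Println(without(merge(jobBar1, jobBar2), status501)) // [3]
}
```

<p>Because every list is sorted by series id, each operation is a single pass over its inputs.</p>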
<h4 class="anchor anchorWithStickyNavbar_LWe7" id="getting-the-samples-finally">Getting the samples finally<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-queries#getting-the-samples-finally" class="hash-link" aria-label="Direct link to Getting the samples finally" title="Direct link to Getting the samples finally">​</a></h4>
<p>Once we have all the series ids (postings) for the matchers, we simply go through those one by one and do the following</p>
<ol>
<li>Go to the series in the <code>Series</code> table represented by the series id.</li>
<li>Pick all the chunk references from that series which overlap with the time range <code>mint</code> through <code>maxt</code> specified by the querier.</li>
<li>Create an iterator to iterate over these chunks from the <code>chunks</code> directory for samples between <code>mint</code> and <code>maxt</code>.</li>
</ol>
<p><code>Select([]matcher)</code> finally returns sample iterators for all the series that match the matchers. The series are sorted w.r.t. their label pairs.</p>
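<p>Step 2 of the list above, picking the chunks that overlap with the query range, can be sketched as follows. The <code>chunkMeta</code> type and the <code>overlapping</code> function are hypothetical, and treating both ends of the ranges as inclusive is an assumption of this sketch.</p>

```go
package main

import "fmt"

// chunkMeta is a hypothetical, trimmed-down chunk reference as stored with a series.
type chunkMeta struct {
	ref        uint64 // where to find the chunk in the chunks directory
	mint, maxt int64  // time range covered by the chunk (assumed inclusive)
}

// overlapping returns the chunks whose time range intersects [mint, maxt].
func overlapping(chks []chunkMeta, mint, maxt int64) []chunkMeta {
	var out []chunkMeta
	for _, c := range chks {
		if maxt >= c.mint && c.maxt >= mint {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	chks := []chunkMeta{{1, 0, 99}, {2, 100, 199}, {3, 200, 299}}
	// A query for 150 through 250 touches the second and third chunk.
	for _, c := range overlapping(chks, 150, 250) {
		fmt.Println(c.ref)
	}
}
```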
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="some-implementation-details">Some Implementation Details<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-queries#some-implementation-details" class="hash-link" aria-label="Direct link to Some Implementation Details" title="Direct link to Some Implementation Details">​</a></h2>
<ul>
<li>When getting the postings for a matcher, the postings of all the matching entries are not loaded into memory at the same time. Since the index is memory-mapped from disk, the postings are lazily iterated and merged to get the final list.</li>
<li>The sample iterators for all the series are not returned upfront by <code>Select([]matcher)</code> either, since the result could contain 100s of thousands of series. Instead, an iterator is returned which iterates over the series one by one, giving the sample iterator of each. The sample iterator in turn lazily loads the chunks when asked for.</li>
</ul>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="querying-multiple-blocks">Querying multiple blocks<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-queries#querying-multiple-blocks" class="hash-link" aria-label="Direct link to Querying multiple blocks" title="Direct link to Querying multiple blocks">​</a></h2>
<p>When you have multiple blocks overlapping with the <code>mint</code> through <code>maxt</code> of the querier, the querier is actually a merge querier which holds queriers for individual blocks. The 3 queries now effectively do the following:</p>
<ol>
<li><code>LabelNames()</code>: get the sorted label names from all blocks and do an N-way merge.</li>
<li><code>LabelValues(name)</code>: get the label values from all the blocks and do an N-way merge.</li>
<li><code>Select([]matcher)</code>: get the series iterator from all the blocks using the Select method and do a lazy N-way merge again in an iterator fashion. This is feasible since the individual series iterators return series in sorted order w.r.t. label pairs.</li>
</ol>
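<p>As an illustration of the N-way merge for <code>LabelNames()</code> and <code>LabelValues(name)</code>, here is a heap-based Go sketch that merges the sorted results of several blocks and drops duplicates. The <code>cursor</code> type and the <code>mergeSorted</code> function are hypothetical names, and using a heap is only one way to do such a merge.</p>

```go
package main

import (
	"container/heap"
	"fmt"
)

// cursor points at the next unread element of one block's sorted result.
type cursor struct {
	vals []string
	pos  int
}

// cursorHeap orders cursors by the string they currently point at.
type cursorHeap []*cursor

func (h cursorHeap) Len() int           { return len(h) }
func (h cursorHeap) Less(i, j int) bool { return h[i].vals[h[i].pos] < h[j].vals[h[j].pos] }
func (h cursorHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *cursorHeap) Push(x any)        { *h = append(*h, x.(*cursor)) }
func (h *cursorHeap) Pop() any {
	old := *h
	c := old[len(old)-1]
	*h = old[:len(old)-1]
	return c
}

// mergeSorted does an N-way merge of sorted string slices and drops duplicates.
func mergeSorted(lists ...[]string) []string {
	h := &cursorHeap{}
	for _, l := range lists {
		if len(l) > 0 {
			heap.Push(h, &cursor{vals: l})
		}
	}
	var out []string
	for h.Len() > 0 {
		c := (*h)[0] // cursor with the smallest current value
		v := c.vals[c.pos]
		if len(out) == 0 || out[len(out)-1] != v {
			out = append(out, v)
		}
		c.pos++
		if c.pos == len(c.vals) {
			heap.Pop(h) // this block's result is exhausted
		} else {
			heap.Fix(h, 0) // restore heap order after advancing
		}
	}
	return out
}

func main() {
	block1 := []string{"__name__", "job", "status"}
	block2 := []string{"__name__", "instance", "job"}
	fmt.Println(mergeSorted(block1, block2))
}
```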
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="querying-head-block">Querying Head block<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-queries#querying-head-block" class="hash-link" aria-label="Direct link to Querying Head block" title="Direct link to Querying Head block">​</a></h2>
<p>The Head block stores the entire map of label-value pairs and all the postings lists in memory (an example Go representation is <code>map[labelName]map[labelValue]postingsList</code>), hence no special care is required in accessing them. The rest of the procedure for performing the 3 queries remains the same, using this map and the postings lists.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="code-reference">Code reference<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-queries#code-reference" class="hash-link" aria-label="Direct link to Code reference" title="Direct link to Code reference">​</a></h2>
<p><a href="https://github.com/prometheus/prometheus/blob/master/tsdb/index/index.go" target="_blank" rel="noopener noreferrer"><code>tsdb/index/index.go</code></a> has the code for performing the <code>LabelNames()</code> and <code>LabelValues(name)</code> queries on the persistent block and also for getting the merged postings list for given label name and values (not the matcher itself).</p>
<p><a href="https://github.com/prometheus/prometheus/blob/master/tsdb/querier.go" target="_blank" rel="noopener noreferrer"><code>tsdb/querier.go</code></a> has the code for performing the <code>Select([]matcher)</code> query on the persistent block including filtering the label values for the matchers before asking the index for postings list. <a href="https://github.com/prometheus/prometheus/blob/master/tsdb/chunks/chunks.go" target="_blank" rel="noopener noreferrer"><code>tsdb/chunks/chunks.go</code></a> has the code for getting the chunks from the disk.</p>
<p><a href="https://github.com/prometheus/prometheus/blob/master/tsdb/head.go" target="_blank" rel="noopener noreferrer"><code>tsdb/head.go</code></a> has the code for performing all 3 queries on the Head block.</p>
<p><a href="https://github.com/prometheus/prometheus/blob/master/tsdb/db.go" target="_blank" rel="noopener noreferrer"><code>tsdb/db.go</code></a> and <a href="https://github.com/prometheus/prometheus/blob/master/storage/merge.go" target="_blank" rel="noopener noreferrer"><code>storage/merge.go</code></a> have the code for the merged querier when there are multiple blocks involved in the query.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="here-is-the-entire-prometheus-tsdb-blog-series">Here is the entire Prometheus TSDB blog series<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-queries#here-is-the-entire-prometheus-tsdb-blog-series" class="hash-link" aria-label="Direct link to Here is the entire Prometheus TSDB blog series" title="Direct link to Here is the entire Prometheus TSDB blog series">​</a></h2>
<ol>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/">Prometheus TSDB (Part 1): The Head Block</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint/">Prometheus TSDB (Part 2): WAL and Checkpoint</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk/">Prometheus TSDB (Part 3): Memory Mapping of Head Chunks from Disk</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index/">Prometheus TSDB (Part 4): Persistent Block and its Index</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-queries/">Prometheus TSDB (Part 5): Queries</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention/">Prometheus TSDB (Part 6): Compaction and Retention</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-snapshot-on-shutdown/">Prometheus TSDB (Part 7): Snapshot on Shutdown</a></li>
</ol>]]></content:encoded>
            <category>Prometheus</category>
            <category>TSDB</category>
        </item>
        <item>
            <title><![CDATA[Prometheus TSDB (Part 4): Persistent Block and its Index]]></title>
            <link>https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index</link>
            <guid>https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index</guid>
            <pubDate>Sun, 18 Oct 2020 00:00:00 GMT</pubDate>
            <description><![CDATA[What is a persistent block in TSDB, when is it created, what does it contain and details about its index.]]></description>
            <content:encoded><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="introduction">Introduction<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index#introduction" class="hash-link" aria-label="Direct link to Introduction" title="Direct link to Introduction">​</a></h2>
<p>In <a href="https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/">Part 1</a>, <a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint/">Part 2</a>, and <a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk/">Part 3</a>, we have covered most of the things related to the Head (in-memory) block as of the time of writing this post; there are more things to come in the Head. In this blog post, we will dive deeper into the persistent blocks which reside on disk.</p>
<p>There is a lot of information to digest here, so sit back and relax, and maybe grab a coffee.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="whats-a-persistent-block-and-when-is-it-created">What's a persistent block and when is it created<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index#whats-a-persistent-block-and-when-is-it-created" class="hash-link" aria-label="Direct link to What's a persistent block and when is it created" title="Direct link to What's a persistent block and when is it created">​</a></h2>
<p>A block on disk is a collection of chunks for a fixed time range, along with its own index. It is a directory with multiple files inside it. Every block has a unique ID, which is a <a href="https://github.com/oklog/ulid" target="_blank" rel="noopener noreferrer">Universally Unique Lexicographically Sortable Identifier (ULID)</a>.</p>
<p>A block has an interesting property that the samples in it are immutable. If you want to add more samples, or delete some, or update some, you have to rewrite the entire block with the required modifications and the new block has a new ID. There is no relationship between these 2 blocks. We have deletions on blocks via tombstones while not touching the samples, since re-writing a block on every delete request does not sound sane; we will discuss more about it in this blog post.</p>
<p>We saw in <a href="https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/">Part 1</a> that when the Head block fills up with data ranging <code>chunkRange*3/2</code> in time, we take the first <code>chunkRange</code> of data and convert into a persistent block.</p>
<p><img decoding="async" loading="lazy" alt="image" src="https://ganeshvernekar.com/assets/images/tsdb8-2143f3ae9296366a5998fb78ee2320d1.svg" width="871" height="397" class="img_ev3q"></p>
<p><img decoding="async" loading="lazy" alt="image" src="https://ganeshvernekar.com/assets/images/tsdb9-73e001cb1662df81b619a2bafc33351d.svg" width="871" height="397" class="img_ev3q"></p>
<p>In the context of blocks, we call that <code>chunkRange</code> the <code>blockRange</code>, and the first block cut from the Head spans <code>2h</code> by default in Prometheus.</p>
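<p>The arithmetic of that condition, as a tiny Go sketch. The function name <code>headCompactable</code> is hypothetical, timestamps are assumed to be in milliseconds, and using a strict "greater than" comparison is an assumption of this sketch.</p>

```go
package main

import "fmt"

// headCompactable reports whether the Head spans more than 1.5x chunkRange.
func headCompactable(mint, maxt, chunkRange int64) bool {
	return maxt-mint > chunkRange*3/2
}

func main() {
	const twoHours = int64(2 * 60 * 60 * 1000) // default block range in milliseconds
	fmt.Println(headCompactable(0, twoHours, twoHours))       // false: only 2h of data
	fmt.Println(headCompactable(0, 3*twoHours/2+1, twoHours)) // true: just over 3h of data
}
```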
<p>Looking at the overall picture of TSDB below</p>
<p><img decoding="async" loading="lazy" alt="image" src="https://ganeshvernekar.com/assets/images/tsdb1-9dce57fbe455a6163a84d68c9c73c7dd.svg" width="680" height="149" class="img_ev3q"></p>
<p>When the blocks get old, multiple blocks are compacted (or merged) to form a new bigger block while the old ones are deleted. So we have 2 ways of creating a block, from the Head and from existing blocks. We will look into compaction in future blog posts.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="contents-of-a-block">Contents of a block<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index#contents-of-a-block" class="hash-link" aria-label="Direct link to Contents of a block" title="Direct link to Contents of a block">​</a></h2>
<p>A block consists of 4 parts</p>
<ol>
<li><code>meta.json</code> (file): the metadata of the block.</li>
<li><code>chunks</code> (directory): contains the raw chunks without any metadata about the chunks.</li>
<li><code>index</code> (file): the index of this block.</li>
<li><code>tombstones</code> (file): deletion markers to exclude samples when querying the block.</li>
</ol>
<p>With <code>01EM6Q6A1YPX4G9TEB20J22B2R</code> as an example of block ID, here is how the files look on disk</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">data</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├── 01EM6Q6A1YPX4G9TEB20J22B2R</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">|   ├── chunks</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">|   |   ├── 000001</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">|   |   └── 000002</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">|   ├── index</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">|   ├── meta.json</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">|   └── tombstones</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├── chunks_head</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">|   ├── 000001</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">|   └── 000002</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">└── wal</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    ├── checkpoint.000003</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    |   ├── 000000</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    |   └── 000001</span><br></span><span class="token-line" style="color:#bfc7d5"><span 
class="token plain">    ├── 000004</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    └── 000005</span><br></span></code></pre></div></div>
<p>Let's dive deeper into each one of them.</p>
<hr>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="1-metajson">1. <code>meta.json</code><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index#1-metajson" class="hash-link" aria-label="Direct link to 1-metajson" title="Direct link to 1-metajson">​</a></h3>
<p>This contains all the required metadata for the block as a whole. Here is an example:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">	"ulid": "01EM6Q6A1YPX4G9TEB20J22B2R",</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">	"minTime": 1602237600000,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">	"maxTime": 1602244800000,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">	"stats": {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">		"numSamples": 553673232,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">		"numSeries": 1346066,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">		"numChunks": 4440437</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">	},</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">	"compaction": {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">		"level": 1,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">		"sources": [</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">			"01EM65SHSX4VARXBBHBF0M0FDS",</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">			"01EM6GAJSYWSQQRDY782EA5ZPN"</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">		]</span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token plain">	},</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">	"version": 1</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<p><code>version</code> tells us how to parse the meta file.</p>
<p>Though the directory name is set to the ULID, only the one present in the <code>meta.json</code> as <code>ulid</code> is the valid ID; the directory name can be anything.</p>
<p><code>minTime</code> and <code>maxTime</code> are the absolute minimum and maximum timestamps among all the chunks present in the block.</p>
<p><code>stats</code> tells the number of series, samples, and chunks present in the block.</p>
<p><code>compaction</code> tells the history of the block. <code>level</code> tells how many generations this block has seen. <code>sources</code> tells which blocks this block was created from (i.e. the blocks which were merged to form this block). If it was created from the Head block, then <code>sources</code> is set to itself (<code>01EM6Q6A1YPX4G9TEB20J22B2R</code> in this case).</p>
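<p>For reference, the example <code>meta.json</code> above can be decoded in Go roughly like this. The <code>blockMeta</code> type, its field names and the <code>parseMeta</code> function are hypothetical and only mirror the JSON keys shown above. Rejecting any version other than 1 is an assumption of this sketch.</p>

```go
package main

import (
	"encoding/json"
	"fmt"
)

// blockMeta mirrors the fields of the example meta.json above.
// The Go type and field names are hypothetical.
type blockMeta struct {
	ULID    string `json:"ulid"`
	MinTime int64  `json:"minTime"`
	MaxTime int64  `json:"maxTime"`
	Stats   struct {
		NumSamples uint64 `json:"numSamples"`
		NumSeries  uint64 `json:"numSeries"`
		NumChunks  uint64 `json:"numChunks"`
	} `json:"stats"`
	Compaction struct {
		Level   int      `json:"level"`
		Sources []string `json:"sources"`
	} `json:"compaction"`
	Version int `json:"version"`
}

// parseMeta decodes the contents of a meta.json file.
func parseMeta(data []byte) (blockMeta, error) {
	var m blockMeta
	if err := json.Unmarshal(data, &m); err != nil {
		return blockMeta{}, err
	}
	if m.Version != 1 { // assumption: only version 1 is understood here
		return blockMeta{}, fmt.Errorf("unexpected meta version %d", m.Version)
	}
	return m, nil
}

func main() {
	raw := []byte(`{"ulid":"01EM6Q6A1YPX4G9TEB20J22B2R","minTime":1602237600000,"maxTime":1602244800000,
		"stats":{"numSamples":553673232,"numSeries":1346066,"numChunks":4440437},
		"compaction":{"level":1,"sources":["01EM65SHSX4VARXBBHBF0M0FDS","01EM6GAJSYWSQQRDY782EA5ZPN"]},"version":1}`)
	m, err := parseMeta(raw)
	if err != nil {
		panic(err)
	}
	// maxTime-minTime is 7200000 milliseconds, i.e. a 2h block.
	fmt.Println(m.ULID, m.MaxTime-m.MinTime, m.Compaction.Level)
}
```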
<hr>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="2-chunks">2. <code>chunks</code><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index#2-chunks" class="hash-link" aria-label="Direct link to 2-chunks" title="Direct link to 2-chunks">​</a></h3>
<p>The <code>chunks</code> directory contains a sequence of numbered files similar to the WAL/checkpoint/head chunks. Each file is capped at 512MiB. This is the format of an individual file inside this directory:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">┌──────────────────────────────┐</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│  magic(0x85BD40DD) &lt;4 byte&gt;  │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├──────────────────────────────┤</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│    version(1) &lt;1 byte&gt;       │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├──────────────────────────────┤</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│    padding(0) &lt;3 byte&gt;       │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├──────────────────────────────┤</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ┌──────────────────────────┐ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │         Chunk 1          │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │          ...             │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │         Chunk N          │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ └──────────────────────────┘ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">└──────────────────────────────┘</span><br></span></code></pre></div></div>
<p>It looks very similar to <a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk/#the-file">the memory-mapped head chunks file</a>. The <code>magic</code> number identifies this file as a chunks file. <code>version</code> tells us how to parse this file. <code>padding</code> is for any future headers. This is then followed by a list of chunks.</p>
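<p>As a rough illustration, the 8-byte header described above could be validated like this. This is a minimal Python sketch and not Prometheus code: the function name <code>parse_chunks_header</code> is hypothetical, and it assumes the magic number is stored in big-endian byte order and that the padding bytes are zero.</p>

```python
import struct

CHUNKS_MAGIC = 0x85BD40DD  # magic number from the layout above
HEADER_SIZE = 8            # 4 (magic) + 1 (version) + 3 (padding)


def parse_chunks_header(buf: bytes) -> int:
    """Validate the 8-byte header and return the format version."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("file too short for a chunks header")
    # Assumption: big-endian byte order for the magic number.
    magic, version = struct.unpack(">IB", buf[:5])
    if magic != CHUNKS_MAGIC:
        raise ValueError(f"unexpected magic number {magic:#x}")
    # Assumption: the 3 padding bytes are written as zeros.
    if buf[5:HEADER_SIZE] != b"\x00\x00\x00":
        raise ValueError("padding bytes are expected to be zero")
    return version


# Build a header by hand and parse it back.
header = struct.pack(">IB", CHUNKS_MAGIC, 1) + b"\x00" * 3
print(parse_chunks_header(header))
```

<p>The chunks themselves start right after these 8 bytes.</p>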
<p>Here is the format of an individual chunk:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">┌───────────────┬───────────────────┬──────────────┬────────────────┐</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ len &lt;uvarint&gt; │ encoding &lt;1 byte&gt; │ data &lt;bytes&gt; │ CRC32 &lt;4 byte&gt; │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">└───────────────┴───────────────────┴──────────────┴────────────────┘</span><br></span></code></pre></div></div>
<p>It again looks similar to <a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk/#chunks">the memory-mapped head chunks on disk</a> except that it is missing the <code>series ref</code>, <code>mint</code> and <code>maxt</code>. We needed this additional information for the Head chunks to recreate the in-memory index during startup. But in the case of blocks, we have this additional information in the <code>index</code>, because the index is where it finally belongs, hence we don't need it here.</p>
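<p>To make the chunk layout concrete, here is a small Python sketch that reads one chunk from a byte buffer. The names <code>Chunk</code>, <code>read_uvarint</code> and <code>read_chunk</code> are hypothetical. The sketch assumes that <code>len</code> counts only the <code>data</code> bytes, and it returns the checksum bytes without verifying them.</p>

```python
from typing import NamedTuple, Tuple


class Chunk(NamedTuple):
    encoding: int
    data: bytes
    crc32: bytes  # raw checksum bytes; not verified in this sketch


def read_uvarint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode an unsigned varint (7 bits per byte, low-order group first)."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:  # high bit clear marks the last byte
            return value, pos
        shift += 7


def read_chunk(buf: bytes, pos: int) -> Tuple[Chunk, int]:
    """Read one chunk at pos; return it with the offset just past it."""
    data_len, pos = read_uvarint(buf, pos)  # assumption: len covers only data
    encoding = buf[pos]
    pos += 1
    data = buf[pos:pos + data_len]
    pos += data_len
    crc = buf[pos:pos + 4]
    pos += 4
    return Chunk(encoding, data, crc), pos


# A hand-made chunk: len=3, encoding=1, data=b"abc", 4 dummy checksum bytes.
raw = bytes([3, 1]) + b"abc" + b"\x00" * 4
chunk, next_pos = read_chunk(raw, 0)
print(chunk.encoding, chunk.data, next_pos)
```

<p>Because every chunk carries its own length, chunks can be read one after another by feeding the returned offset back into <code>read_chunk</code>.</p>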
<p>To access these chunks, we again need the chunk reference that we talked about in <a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk/">Part 3</a>. Repeating what I had said: the reference is 8 bytes long. The first 4 bytes tell the file number in which the chunk exists, and the last 4 bytes tell the offset in the file where the chunk starts (i.e. the first byte of the <code>len</code>). If the chunk was in the file <code>00093</code> and the <code>len</code> of the chunk starts at byte offset <code>1234</code> in the file, then the reference of that chunk would be <code>(92 &lt;&lt; 32) | 1234</code> (left shift bits and then bitwise OR). While the file names use 1-based indexing, the chunk references use 0-based indexing. Hence <code>00093</code> got converted to <code>92</code> when calculating the chunk reference.</p>
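<p>The reference arithmetic can be sketched in a few lines of Python. The helper names <code>chunk_ref</code> and <code>split_chunk_ref</code> are hypothetical and only illustrate the bit layout described above.</p>

```python
from typing import Tuple


def chunk_ref(file_name: str, offset: int) -> int:
    """Pack a 1-based file name and a byte offset into an 8-byte reference."""
    file_index = int(file_name) - 1  # file names are 1-based, references 0-based
    return (file_index << 32) | offset


def split_chunk_ref(ref: int) -> Tuple[int, int]:
    """Return (1-based file number, byte offset) for a reference."""
    return (ref >> 32) + 1, ref & 0xFFFFFFFF


ref = chunk_ref("00093", 1234)
print(ref == (92 << 32) | 1234)
print(split_chunk_ref(ref))
```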
<p>Here is the <a href="https://github.com/prometheus/prometheus/blob/master/tsdb/docs/format/chunks.md" target="_blank" rel="noopener noreferrer">link</a> for the upstream docs on the <code>chunks</code> format.</p>
<hr>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="3-index">3. <code>index</code><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index#3-index" class="hash-link" aria-label="Direct link to 3-index" title="Direct link to 3-index">​</a></h3>
<p>The index contains all that you need to query the data of this block. It does not share any data with any other block or external entity, which makes it possible to read/query the block without any dependencies.</p>
<p>The index is an "inverted index" which is also widely used in indexing documents. Fabian talks more about inverted index in <a href="https://web.archive.org/web/20220205173824/https://fabxc.org/tsdb/" target="_blank" rel="noopener noreferrer">his blog post</a>, hence I am skipping that topic here since this post is too long already.</p>
<p>Here is the high level view of the index which we will dive into shortly.</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">┌────────────────────────────┬─────────────────────┐</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ magic(0xBAAAD700) &lt;4b&gt;     │ version(1) &lt;1 byte&gt; │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├────────────────────────────┴─────────────────────┤</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ┌──────────────────────────────────────────────┐ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │                 Symbol Table                 │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────────────────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │                    Series                    │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────────────────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │                 Label Index 1                │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────────────────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │                      ...                     
│ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────────────────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │                 Label Index N                │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────────────────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │                   Postings 1                 │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────────────────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │                      ...                     │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────────────────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │                   Postings N                 │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────────────────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │              Label Offset Table              │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────────────────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │             Postings Offset Table            │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────────────────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │                      TOC                   
  │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ └──────────────────────────────────────────────┘ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">└──────────────────────────────────────────────────┘</span><br></span></code></pre></div></div>
<p>Same as other files, the <code>magic</code> number identifies this file as an index file. <code>version</code> tells us how to parse this file. The entry point to this index is the <code>TOC</code>, which stands for Table Of Contents. So we will first start from <code>TOC</code> and learn about other parts of the index.</p>
<hr>
<h4 class="anchor anchorWithStickyNavbar_LWe7" id="a-toc">A. <code>TOC</code><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index#a-toc" class="hash-link" aria-label="Direct link to a-toc" title="Direct link to a-toc">​</a></h4>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">┌─────────────────────────────────────────┐</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ref(symbols) &lt;8b&gt;                       │ -&gt; Symbol Table</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├─────────────────────────────────────────┤</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ref(series) &lt;8b&gt;                        │ -&gt; Series</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├─────────────────────────────────────────┤</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ref(label indices start) &lt;8b&gt;           │ -&gt; Label Index 1</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├─────────────────────────────────────────┤</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ref(label offset table) &lt;8b&gt;            │ -&gt; Label Offset Table</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├─────────────────────────────────────────┤</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ref(postings start) &lt;8b&gt;                │ -&gt; Postings 1</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├─────────────────────────────────────────┤</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token 
plain">│ ref(postings offset table) &lt;8b&gt;         │ -&gt; Postings Offset Table</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├─────────────────────────────────────────┤</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ CRC32 &lt;4b&gt;                              │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">└─────────────────────────────────────────┘</span><br></span></code></pre></div></div>
<p>It tells us exactly where (the byte offset in the file) the individual components of the index start. I have marked what each reference points to in the index format above. The starting point of the next component also tells us where an individual component ends. If any of the references is <code>0</code>, it indicates that the corresponding section does not exist in the index, and hence should be skipped while reading.</p>
<p>Since <code>TOC</code> is fixed size, the last 52 bytes of the file can be taken as the <code>TOC</code>.</p>
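<p>Reading the <code>TOC</code> therefore comes down to slicing the last 52 bytes of the file. Below is a minimal Python sketch of that idea. The names <code>TOC</code>, <code>read_toc</code> and <code>present_sections</code> are hypothetical, the fields are assumed to be big-endian unsigned integers, and the checksum is read but not verified.</p>

```python
import struct
from typing import List, NamedTuple

TOC_SIZE = 6 * 8 + 4  # six 8-byte references plus a 4-byte CRC32 = 52 bytes


class TOC(NamedTuple):
    symbols: int
    series: int
    label_indices_start: int
    label_offset_table: int
    postings_start: int
    postings_offset_table: int


def read_toc(index_file: bytes) -> TOC:
    """Parse the TOC from the tail of an index file held in memory."""
    if len(index_file) < TOC_SIZE:
        raise ValueError("file too short to contain a TOC")
    raw = index_file[-TOC_SIZE:]
    # Assumption: big-endian; the trailing CRC32 is ignored here.
    *refs, _crc = struct.unpack(">6QI", raw)
    return TOC(*refs)


def present_sections(toc: TOC) -> List[str]:
    """A reference of 0 means the section does not exist."""
    return [name for name, ref in toc._asdict().items() if ref != 0]


# Fake index: some leading bytes followed by a hand-built TOC.
fake_index = b"\xff" * 100 + struct.pack(">6QI", 5, 100, 0, 0, 400, 500, 0)
toc = read_toc(fake_index)
print(toc)
print(present_sections(toc))
```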
<p>As you will notice in the coming sections, each component will have its own checksum, i.e. <code>CRC32</code> to check for the integrity of the underlying data.</p>
<hr>
<h4 class="anchor anchorWithStickyNavbar_LWe7" id="b-symbol-table">B. <code>Symbol Table</code><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index#b-symbol-table" class="hash-link" aria-label="Direct link to b-symbol-table" title="Direct link to b-symbol-table">​</a></h4>
<p>This section holds a sorted list of deduplicated strings which are found in label pairs of all the series in this block. For example if the series is <code>{a="y", x="b"}</code>, then the symbols would be <code>"a", "b", "x", "y"</code>.</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">┌────────────────────┬─────────────────────┐</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ len &lt;4b&gt;           │ #symbols &lt;4b&gt;       │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├────────────────────┴─────────────────────┤</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ┌──────────────────────┬───────────────┐ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │ len(str_1) &lt;uvarint&gt; │ str_1 &lt;bytes&gt; │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────┴───────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │                . . .                 
│ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────┬───────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │ len(str_n) &lt;uvarint&gt; │ str_n &lt;bytes&gt; │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ └──────────────────────┴───────────────┘ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├──────────────────────────────────────────┤</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ CRC32 &lt;4b&gt;                               │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">└──────────────────────────────────────────┘</span><br></span></code></pre></div></div>
<p>The <code>len &lt;4b&gt;</code> is the number of bytes in this section and <code>#symbols</code> is the number of symbols. It is followed by <code>#symbols</code> number of UTF-8 encoded strings, where each string has its length prefixed followed by the raw bytes of the string. The section ends with a checksum (<code>CRC32</code>) for integrity.</p>
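<p>As a sketch of how such a list could be built and serialised, the Python below collects the symbols from a set of series and round-trips them through the length-prefixed string encoding. All function names are hypothetical, and the <code>len</code>, <code>#symbols</code> and <code>CRC32</code> fields are left out to keep the example short.</p>

```python
from typing import Dict, Iterable, List


def put_uvarint(value: int) -> bytes:
    """Encode an unsigned varint (7 bits per byte, low-order group first)."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def collect_symbols(series: Iterable[Dict[str, str]]) -> List[str]:
    """Sorted, deduplicated label names and values of all series."""
    symbols = set()
    for labels in series:
        for name, value in labels.items():
            symbols.add(name)
            symbols.add(value)
    return sorted(symbols)


def encode_symbols(symbols: List[str]) -> bytes:
    """Write each string as uvarint length followed by its UTF-8 bytes."""
    out = bytearray()
    for symbol in symbols:
        raw = symbol.encode("utf-8")
        out += put_uvarint(len(raw)) + raw
    return bytes(out)


def decode_symbols(buf: bytes, count: int) -> List[str]:
    """Read back `count` length-prefixed strings."""
    symbols, pos = [], 0
    for _ in range(count):
        length, shift = 0, 0
        while True:  # inline uvarint decode
            byte = buf[pos]
            pos += 1
            length |= (byte & 0x7F) << shift
            if byte < 0x80:
                break
            shift += 7
        symbols.append(buf[pos:pos + length].decode("utf-8"))
        pos += length
    return symbols


symbols = collect_symbols([{"a": "y", "x": "b"}])
print(symbols)
print(decode_symbols(encode_symbols(symbols), len(symbols)))
```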
<p>The other sections in the index can refer to this symbol table for any strings and hence significantly reduce the index size. The byte offset at which the symbol starts in the file (i.e. the start of <code>len(str_i)</code>) forms the reference for the corresponding symbol which can be used in other places instead of the actual string. When you want the actual string, you can use the offset to get it from this table.</p>
<hr>
<h4 class="anchor anchorWithStickyNavbar_LWe7" id="c-series">C. <code>Series</code><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index#c-series" class="hash-link" aria-label="Direct link to c-series" title="Direct link to c-series">​</a></h4>
<p>This section contains a list of all the series information present in this block. The series are sorted lexicographically by their label sets.</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">┌───────────────────────────────────────┐</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ┌───────────────────────────────────┐ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │   series_1                        │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├───────────────────────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │                 . . .             │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├───────────────────────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │   series_n                        │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ └───────────────────────────────────┘ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">└───────────────────────────────────────┘</span><br></span></code></pre></div></div>
<p>Each series entry is 16 byte aligned, which means the byte offset at which the series starts is divisible by 16. Hence we set the ID of the series to be <code>offset/16</code> where offset points to the start of the series entry. This ID is used to reference this series and whenever you want to access the series, you can get the location in the index by doing <code>ID*16</code>.</p>
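<p>The relationship between offsets and series IDs is simple arithmetic. Here is a short Python sketch with hypothetical helper names, including the rounding a writer would need so that the next series entry starts on a 16-byte boundary.</p>

```python
SERIES_ALIGNMENT = 16


def align_up(pos: int) -> int:
    """Round pos up to the next multiple of 16 (unchanged if already aligned)."""
    return (pos + SERIES_ALIGNMENT - 1) // SERIES_ALIGNMENT * SERIES_ALIGNMENT


def series_id(offset: int) -> int:
    """Series ID for an entry starting at the given byte offset."""
    if offset % SERIES_ALIGNMENT != 0:
        raise ValueError("series entries must start at a 16-byte boundary")
    return offset // SERIES_ALIGNMENT


def series_offset(sid: int) -> int:
    """Byte offset in the index for a series ID."""
    return sid * SERIES_ALIGNMENT


# If the previous entry ended at byte 100, the next entry starts at byte 112.
start = align_up(100)
print(start, series_id(start), series_offset(series_id(start)))
```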
<p>Since the series are lexicographically sorted by their label sets, a sorted list of series IDs implies a sorted list of series label sets.</p>
<p>Here comes a confusing part for many in the index: what is a <strong><em>posting</em></strong>? The above series ID <em>is a posting</em>. So whenever we say posting in the context of Prometheus TSDB, it refers to a series ID. But why posting? Here is my best guess: in the world of indexing documents and their words with an inverted index, a document ID is usually called a "posting" in the index. Here you can consider a series to be a document and the label-value pairs of a series to be the words in the document. Series ID -&gt; document ID, document ID -&gt; posting, series ID -&gt; posting.</p>
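<p>To make the document analogy concrete, here is a toy inverted index in Python. It is only an illustration of the idea and not how Prometheus stores postings: the series IDs are made up, and the names <code>build_postings</code> and <code>intersect</code> are hypothetical.</p>

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Labels = Dict[str, str]


def build_postings(series_by_id: Dict[int, Labels]) -> Dict[Tuple[str, str], List[int]]:
    """Map every label-value pair to the sorted list of series IDs having it."""
    postings = defaultdict(list)
    for sid in sorted(series_by_id):  # ascending IDs keep each list sorted
        for name, value in series_by_id[sid].items():
            postings[(name, value)].append(sid)
    return dict(postings)


def intersect(a: List[int], b: List[int]) -> List[int]:
    """Intersect two sorted posting lists with a two-pointer walk."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


# Made-up series IDs for illustration.
series = {
    7: {"a": "b1", "x": "y1"},
    9: {"a": "b2", "x": "y1"},
}
postings = build_postings(series)
print(postings[("x", "y1")])
print(intersect(postings[("a", "b1")], postings[("x", "y1")]))
```

<p>Keeping each posting list sorted is what makes the cheap two-pointer intersection possible.</p>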
<p>Each entry holds the label set of the series and references to all the chunks belonging to this series (the reference is the one from the <code>chunks</code> directory).</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">┌──────────────────────────────────────────────────────┐</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ len &lt;uvarint&gt;                                        │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├──────────────────────────────────────────────────────┤</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ┌──────────────────────────────────────────────────┐ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │            labels count &lt;uvarint64&gt;              │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────────────────────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │  ┌────────────────────────────────────────────┐  │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │  │ ref(l_i.name) &lt;uvarint32&gt;                  │  │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │  ├────────────────────────────────────────────┤  │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │  │ ref(l_i.value) &lt;uvarint32&gt;                 │  │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │  └────────────────────────────────────────────┘  │ │</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain">│ │                       ...                        │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────────────────────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │            chunks count &lt;uvarint64&gt;              │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────────────────────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │  ┌────────────────────────────────────────────┐  │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │  │ c_0.mint &lt;varint64&gt;                        │  │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │  ├────────────────────────────────────────────┤  │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │  │ c_0.maxt - c_0.mint &lt;uvarint64&gt;            │  │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │  ├────────────────────────────────────────────┤  │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │  │ ref(c_0.data) &lt;uvarint64&gt;                  │  │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │  └────────────────────────────────────────────┘  │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │  ┌────────────────────────────────────────────┐  │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │  │ c_i.mint - c_i-1.maxt &lt;uvarint64&gt;          │  │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │  
├────────────────────────────────────────────┤  │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │  │ c_i.maxt - c_i.mint &lt;uvarint64&gt;            │  │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │  ├────────────────────────────────────────────┤  │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │  │ ref(c_i.data) - ref(c_i-1.data) &lt;varint64&gt; │  │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │  └────────────────────────────────────────────┘  │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │                       ...                        │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ └──────────────────────────────────────────────────┘ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├──────────────────────────────────────────────────────┤</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ CRC32 &lt;4b&gt;                                           │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">└──────────────────────────────────────────────────────┘</span><br></span></code></pre></div></div>
<p>The starting <code>len</code> and ending <code>CRC32</code> are the same as before. The series entry starts with the number of label-value pairs present in the series, as <code>labels count</code>, followed by lexicographically ordered (w.r.t. label name) label-value pairs. Instead of storing the actual string itself, we use the symbol reference here from the symbol table. If the series was <code>{a="y", x="b"}</code>, the series entry for it would include symbol references for <code>"a", "y", "x", "b"</code> in the same order.</p>
<p>Next is the number of chunks (<code>chunks count</code>) that belong to this series in the <code>chunks</code> directory. This is followed by a sequence of metadata about the indexed chunks containing the min time (timestamp of first sample) and max time (timestamp of last sample) of the chunk and its reference in the <code>chunks</code> directory. These are sorted by the <code>mint</code> of the chunks. If you noticed the above format, we are actually storing <code>mint</code> and <code>maxt</code> by taking the difference from the previous timestamp (mint of the same chunk or maxt of previous chunk). This reduces the size of the chunk metadata since these form a huge part of the index by size.</p>
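<p>The delta scheme from the format above can be sketched in Python as follows. The functions <code>encode_chunk_metas</code> and <code>decode_chunk_metas</code> are hypothetical and work on plain integers. The varint encoding of each field is left out, so the output is the list of numbers that would be written, not the bytes.</p>

```python
from typing import List, Tuple

ChunkMeta = Tuple[int, int, int]  # (mint, maxt, chunk reference)


def encode_chunk_metas(chunks: List[ChunkMeta]) -> List[int]:
    """Turn chunk metadata (sorted by mint) into the delta fields to be written."""
    fields = [len(chunks)]  # chunks count
    prev_maxt = prev_ref = 0
    for i, (mint, maxt, ref) in enumerate(chunks):
        if i == 0:
            fields += [mint, maxt - mint, ref]
        else:
            fields += [mint - prev_maxt, maxt - mint, ref - prev_ref]
        prev_maxt, prev_ref = maxt, ref
    return fields


def decode_chunk_metas(fields: List[int]) -> List[ChunkMeta]:
    """Rebuild absolute (mint, maxt, ref) tuples from the delta fields."""
    count = fields[0]
    chunks: List[ChunkMeta] = []
    pos = 1
    for i in range(count):
        first, duration, ref_field = fields[pos:pos + 3]
        pos += 3
        if i == 0:
            mint, ref = first, ref_field
        else:
            _, prev_maxt, prev_ref = chunks[-1]
            mint, ref = prev_maxt + first, prev_ref + ref_field
        chunks.append((mint, mint + duration, ref))
    return chunks


metas = [(1000, 1999, 8), (2000, 2999, 520)]
fields = encode_chunk_metas(metas)
print(fields)
print(decode_chunk_metas(fields) == metas)
```

<p>Only the first chunk stores an absolute <code>mint</code>; every later field is a small number, which is what keeps the varint encoding short.</p>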
<p>Holding the <code>mint</code> and <code>maxt</code> in the index allows queries to skip the chunks which are not required for the queried time range. This is different from the m-mapped Head chunks from disk where <code>mint</code> and <code>maxt</code> are with the chunks to restore them in the in-memory index of Head during startup.</p>
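<p>For example, selecting only the chunks that matter for a query could look like the Python sketch below. The function name is hypothetical, and it assumes both the chunk range and the query range are inclusive on both ends.</p>

```python
from typing import List, Tuple

ChunkMeta = Tuple[int, int, int]  # (mint, maxt, chunk reference)


def overlapping_chunks(chunks: List[ChunkMeta], query_mint: int, query_maxt: int) -> List[ChunkMeta]:
    """Keep the chunks whose [mint, maxt] overlaps the queried time range."""
    return [
        (mint, maxt, ref)
        for mint, maxt, ref in chunks
        if mint <= query_maxt and maxt >= query_mint
    ]


chunks = [(0, 99, 8), (100, 199, 300), (200, 299, 600)]
# Only the last two chunks overlap the range [150, 250].
print(overlapping_chunks(chunks, 150, 250))
```

<p>The chunk data for the skipped entries never has to be read from the <code>chunks</code> directory.</p>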
<hr>
<h4 class="anchor anchorWithStickyNavbar_LWe7" id="d-label-offset-table-and-label-index-i">D. <code>Label Offset Table</code> and <code>Label Index i</code><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index#d-label-offset-table-and-label-index-i" class="hash-link" aria-label="Direct link to d-label-offset-table-and-label-index-i" title="Direct link to d-label-offset-table-and-label-index-i">​</a></h4>
<p>Both of these are coupled, so we will discuss both together. <code>Label Index i</code> refers to any of <code>Label Index 1 ... Label Index N</code> in the index; we will talk about a single entry <code>Label Index i</code>.</p>
<p>These two are <em>not used anymore</em>; they are <em>written</em> for backward compatibility but <em>not read from</em> in the latest Prometheus version. However, it is useful to understand the use of these parts, and we will see in the next section what they are replaced with.</p>
<p>The aim of these sections is to index the possible values for a label name. For example if we have two series <code>{a="b1", x="y1"}</code> and <code>{a="b2", x="y2"}</code>, this section allows us to identify that the possible values for label name <code>a</code> are <code>[b1, b2]</code> and for <code>x</code> they are <code>[y1, y2]</code>. The format also allows indexing something like the label names <code>(a, x)</code> have the possible values <code>[(b1, y1), (b2, y2)]</code>, but we don't use this in Prometheus.</p>
<p><strong><code>Label Index i</code></strong></p>
<p>We have multiple of these entries in sequence, in no particular order. This is the format of a single <code>Label Index i</code> entry:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">┌───────────────┬────────────────┬────────────────┐</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ len &lt;4b&gt;      │ #names &lt;4b&gt;    │ #entries &lt;4b&gt;  │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├───────────────┴────────────────┴────────────────┤</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ┌─────────────────────────────────────────────┐ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │ ref(value_0) &lt;4b&gt;                           │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├─────────────────────────────────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │ ...                                         │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├─────────────────────────────────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │ ref(value_n) &lt;4b&gt;                           │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ └─────────────────────────────────────────────┘ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│                      . . .                      
│</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├─────────────────────────────────────────────────┤</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ CRC32 &lt;4b&gt;                                      │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">└─────────────────────────────────────────────────┘</span><br></span></code></pre></div></div>
<p>From the above examples, this helps us store the lists <code>[b1, b2]</code>, <code>[y1, y2]</code>, <code>[(b1, y1), (b2, y2)]</code>, with each list getting its own entry in the index. <code>len</code> and <code>CRC32</code> are the same as before.</p>
<p><code>#names</code> is the number of label names the values are for. For example if we are indexing for <code>a</code> or <code>x</code>, <code>#names</code> would be 1. If we are indexing for <code>(a, x)</code>, i.e. 2 label names, then <code>#names</code> would be 2.</p>
<p><code>#entries</code> is the number of possible values for the label names. If the names are <code>a</code> or <code>x</code> or even <code>(a, x)</code>, <code>#entries</code> is 2 because they have 2 possible values each.</p>
<p>It is followed by <code>#names * #entries</code> number of references to the value symbols.</p>
<p>Example for <code>[b1, b2]</code></p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">┌────┬───┬───┬─────────┬─────────┬───────┐</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ 16 │ 1 │ 2 │ ref(b1) │ ref(b2) │ CRC32 │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">└────┴───┴───┴─────────┴─────────┴───────┘</span><br></span></code></pre></div></div>
<p>Example for <code>[(b1, y1), (b2, y2)]</code></p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">┌────┬───┬───┬─────────┬─────────┬─────────┬─────────┬───────┐</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ 24 │ 2 │ 2 │ ref(b1) | ref(y1) │ ref(b2) | ref(y2) | CRC32 |</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">└────┴───┴───┴─────────┴─────────┴─────────┴─────────┴───────┘</span><br></span></code></pre></div></div>
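<p>To make the layout concrete, here is a minimal Go sketch that builds the two example entries above. The function name, the symbol reference values, big-endian byte order, and the Castagnoli CRC32 polynomial are assumptions made for illustration; the 4-byte width of each reference follows from the <code>len</code> values 16 and 24 in the examples.</p>

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// castagnoli is an assumption: the text above does not name the CRC32 polynomial.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// encodeLabelIndex lays out one Label Index entry as
// len | #names | #entries | refs... | CRC32, each a 4-byte big-endian value.
// refs must hold #names * #entries symbol references.
func encodeLabelIndex(numNames uint32, refs []uint32) []byte {
	numEntries := uint32(len(refs)) / numNames

	// The body is everything that len counts: #names, #entries and the references.
	body := binary.BigEndian.AppendUint32(nil, numNames)
	body = binary.BigEndian.AppendUint32(body, numEntries)
	for _, ref := range refs {
		body = binary.BigEndian.AppendUint32(body, ref)
	}

	out := binary.BigEndian.AppendUint32(nil, uint32(len(body)))
	out = append(out, body...)
	// Assumption: the checksum covers the body only.
	return binary.BigEndian.AppendUint32(out, crc32.Checksum(body, castagnoli))
}

func main() {
	// Hypothetical symbol references for b1, b2, y1 and y2.
	const b1, b2, y1, y2 = 100, 104, 108, 112

	single := encodeLabelIndex(1, []uint32{b1, b2})
	tuple := encodeLabelIndex(2, []uint32{b1, y1, b2, y2})

	fmt.Println(binary.BigEndian.Uint32(single[:4])) // len field of [b1, b2]: 16
	fmt.Println(binary.BigEndian.Uint32(tuple[:4]))  // len field of [(b1, y1), (b2, y2)]: 24
}
```

<p>The two printed <code>len</code> values match the first column of the example diagrams: 8 bytes for <code>#names</code> and <code>#entries</code>, plus 4 bytes per reference.</p>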
<p><strong><code>Label Offset Table</code></strong></p>
<p>While the <code>Label Index i</code> stores the list of possible values, <code>Label Offset Table</code> brings together the label names and completes the label name-value index.</p>
<p>Here is the format of <code>Label Offset Table</code></p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">┌─────────────────────┬──────────────────────┐</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ len &lt;4b&gt;            │ #entries &lt;4b&gt;        │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├─────────────────────┴──────────────────────┤</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ┌────────────────────────────────────────┐ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │  n = 1 &lt;1b&gt;                            │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────┬─────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │ len(name) &lt;uvarint&gt;  │ name &lt;bytes&gt;    │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────┴─────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │  offset &lt;uvarint64&gt;                    │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ └────────────────────────────────────────┘ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│                    . . .                   
│</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├────────────────────────────────────────────┤</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│  CRC32 &lt;4b&gt;                                │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">└────────────────────────────────────────────┘</span><br></span></code></pre></div></div>
<p>This stores a sequence of entries that point a label name to its possible values, for example, pointing <code>a</code> to the above <code>Label Index i</code> containing <code>[b1, b2]</code>.</p>
<p>The above table has <code>len</code> and <code>CRC32</code> like other parts. <code>#entries</code> is the number of entries in this table, and it is followed by the actual entries.</p>
<p>Each entry starts with <code>n</code>, which is the number of label names, followed by <code>n</code> actual label names (the strings themselves, not symbol references). As you may have noticed, the layout <code>len(name) &lt;uvarint&gt;  │ name &lt;bytes&gt;</code> is the same as how strings are stored in the symbol table. In Prometheus, we only have <code>n=1</code>, which means we only index possible label values for a single label name, and not for tuples like <code>(a, x)</code>, because the number of such combinations would be huge and storing them all would not be practical.</p>
<p>Since we index single label names, and the number of label names is usually small, we can afford to store the string directly. This avoids loading a disk page from the symbol table for the label name lookup.</p>
<p>The entry ends with an offset in the file which points to the start of relevant <code>Label Index i</code>. For example, for label name <code>a</code>, the offset will point to the <code>Label Index i</code> storing <code>[b1, b2]</code>. Label name <code>x</code> will point to the <code>Label Index i</code> storing <code>[y1, y2]</code>.</p>
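<p>As a rough illustration, a single entry of this table can be written and read back with Go's uvarint helpers. The function names and the offset <code>4096</code> are made up for this sketch, and the decoder assumes well-formed input.</p>

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeLabelOffsetEntry lays out one entry: n (always 1), the raw label name
// prefixed with its uvarint length, and the uvarint offset of its Label Index i.
func encodeLabelOffsetEntry(name string, offset uint64) []byte {
	buf := []byte{1} // n = 1, a single label name
	buf = binary.AppendUvarint(buf, uint64(len(name)))
	buf = append(buf, name...)
	return binary.AppendUvarint(buf, offset)
}

// decodeLabelOffsetEntry reads the entry back; it assumes well-formed input.
func decodeLabelOffsetEntry(buf []byte) (name string, offset uint64) {
	buf = buf[1:] // skip n
	nameLen, read := binary.Uvarint(buf)
	buf = buf[read:]
	name = string(buf[:nameLen])
	offset, _ = binary.Uvarint(buf[nameLen:])
	return name, offset
}

func main() {
	// 4096 is a made-up file offset for the Label Index i that stores [b1, b2].
	entry := encodeLabelOffsetEntry("a", 4096)
	name, offset := decodeLabelOffsetEntry(entry)
	fmt.Println(len(entry), name, offset) // 5 a 4096
}
```

<p>The whole entry for <code>a</code> takes only 5 bytes here, which shows why storing the raw name instead of a symbol reference costs very little.</p>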
<p>Since we are only indexing individual label names, we also don't store the <code>Label Index i</code> for tuples like <code>(a, x)</code>, even though the example above shows that the format allows it. It was <a href="https://github.com/prometheus-junkyard/tsdb/issues/26" target="_blank" rel="noopener noreferrer">once considered to have such a composite label value index</a>, but it was dropped as there were not many use cases for it.</p>
<hr>
<h4 class="anchor anchorWithStickyNavbar_LWe7" id="e-postings-offset-table-and-postings-i">E. <code>Postings Offset Table</code> and <code>Postings i</code><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index#e-postings-offset-table-and-postings-i" class="hash-link" aria-label="Direct link to e-postings-offset-table-and-postings-i" title="Direct link to e-postings-offset-table-and-postings-i">​</a></h4>
<p>These two are linked in a similar way as above where <code>Postings i</code> stores a list of postings and <code>Postings Offset Table</code> refers to those entries with the offset. If you can recall, a posting is a series ID, which in the context of this index is the offset at which the series entry starts in the file divided by 16 since it's 16 byte aligned.</p>
<p><strong><code>Postings i</code></strong></p>
<p>A single <code>Postings i</code> represents a "postings list", which is basically a sorted list of postings. Let us see the format of an individual such list and we will work with an example.</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">┌────────────────────┬────────────────────┐</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ len &lt;4b&gt;           │ #entries &lt;4b&gt;      │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├────────────────────┴────────────────────┤</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ┌─────────────────────────────────────┐ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │ ref(series_1) &lt;4b&gt;                  │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├─────────────────────────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │ ...                                 
│ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├─────────────────────────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │ ref(series_n) &lt;4b&gt;                  │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ └─────────────────────────────────────┘ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├─────────────────────────────────────────┤</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ CRC32 &lt;4b&gt;                              │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">└─────────────────────────────────────────┘</span><br></span></code></pre></div></div>
<p>This format cannot get much simpler. It has <code>len</code> and <code>CRC32</code> as usual. <code>len</code> is followed by <code>#entries</code>, which is the number of postings in this list, and then by a sorted list of <code>#entries</code> postings (series IDs, which also serve as the references).</p>
<p>You might be wondering which postings we store in this list. Let's take an example of these two series: <code>{a="b", x="y1"}</code> with series ID <code>120</code>, and <code>{a="b", x="y2"}</code> with series ID <code>145</code>. Similar to how we looked at possible label values for a label name above, here we look at the possible series for a label-value pair. From the above example, <code>a="b"</code> is present in both the series, so we have to store a list <code>[120, 145]</code>. <code>x="y1"</code> and <code>x="y2"</code> each appear in only one of the series, so we have to store <code>[120]</code> and <code>[145]</code> for them respectively.</p>
<p>We only store the lists for the label pairs that we see in the series. So in the above example, we don't store postings list for something like <code>a="y1"</code> or <code>x="b"</code>, because they never appear in any series.</p>
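<p>The example above can be sketched in a few lines of Go. This is only an in-memory illustration of which lists get stored, not how Prometheus builds them; the <code>label</code> type and <code>buildPostings</code> function are hypothetical names.</p>

```go
package main

import (
	"fmt"
	"slices"
)

type label struct{ name, value string }

// buildPostings maps every label-value pair seen in some series to the
// sorted list of series IDs that contain it.
func buildPostings(series map[uint64][]label) map[label][]uint64 {
	postings := map[label][]uint64{}
	for id, lbls := range series {
		for _, l := range lbls {
			postings[l] = append(postings[l], id)
		}
	}
	// Map iteration order is random, so sort each list afterwards.
	for _, ids := range postings {
		slices.Sort(ids)
	}
	return postings
}

func main() {
	series := map[uint64][]label{
		120: {{"a", "b"}, {"x", "y1"}},
		145: {{"a", "b"}, {"x", "y2"}},
	}
	postings := buildPostings(series)

	fmt.Println(postings[label{"a", "b"}])       // [120 145]
	fmt.Println(postings[label{"x", "y1"}])      // [120]
	fmt.Println(len(postings[label{"a", "y1"}])) // 0, this pair never appears
}
```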
<p><strong><code>Postings Offset Table</code></strong></p>
<p>Just as the <code>Label Offset Table</code> points a label name to its possible values in <code>Label Index i</code>, the <code>Postings Offset Table</code> points a label pair to its postings in <code>Postings i</code>.</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">┌─────────────────────┬──────────────────────┐</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ len &lt;4b&gt;            │ #entries &lt;4b&gt;        │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├─────────────────────┴──────────────────────┤</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ┌────────────────────────────────────────┐ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │  n = 2 &lt;1b&gt;                            │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────┬─────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │ len(name) &lt;uvarint&gt;  │ name &lt;bytes&gt;    │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────┼─────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │ len(value) &lt;uvarint&gt; │ value &lt;bytes&gt;   │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────┴─────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │  offset &lt;uvarint64&gt;                    │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ 
└────────────────────────────────────────┘ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│                    . . .                   │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├────────────────────────────────────────────┤</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│  CRC32 &lt;4b&gt;                                │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">└────────────────────────────────────────────┘</span><br></span></code></pre></div></div>
<p>This looks very similar to the <code>Label Offset Table</code>, but with the addition of the label value. <code>len</code> and <code>CRC32</code> are as usual.</p>
<p><code>#entries</code> is the number of entries in this table. <code>n</code> is always 2, which is the number of string elements that follow (i.e. a label name and a label value). Since we have <code>n</code> here, the table could in principle index composite label pairs like <code>(a="b", x="y1")</code>, but we don't do that because the use cases are very limited and the trade-off is not worth it.</p>
<p><code>n</code> is followed by the actual strings for the label name and the label value. Again, there are generally not many distinct label pairs, so we can afford to store the raw strings here and avoid an indirection to the symbol table, as this table is accessed many times. The main saving from the symbol table comes in the <code>Series</code> sections where the same symbol is repeated many times.</p>
<p>A single entry ends with an offset to the start of a postings list <code>Postings i</code>. From the above example, the entry for <code>name="a", value="b"</code> will point to the postings list <code>[120, 145]</code>, and the entry for <code>name="x", value="y1"</code> will point to the postings list <code>[120]</code>.</p>
<p>The entries are sorted by label name first, and entries with the same name are sorted by value. This allows us to run a binary search for the required label pair. Additionally, to get the possible values for a given label name, we can go to the first label pair that matches the name and iterate from there to get all the values. Hence this table replaces the <code>Label Offset Table</code> and <code>Label Index i</code>. This is another reason to store the actual strings here for faster access to label values.</p>
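<p>Both lookups described above can be sketched with <code>sort.Search</code> from the Go standard library. The <code>entry</code> type, the function names, and the offsets are assumptions for this example, and the entries are held in a plain slice rather than read from the file.</p>

```go
package main

import (
	"fmt"
	"sort"
)

// entry is one decoded row of the table; the offsets used below are made up.
type entry struct {
	name, value string
	offset      uint64
}

// findPostingsOffset binary-searches entries sorted by (name, value).
func findPostingsOffset(entries []entry, name, value string) (uint64, bool) {
	i := sort.Search(len(entries), func(i int) bool {
		e := entries[i]
		if e.name != name {
			return e.name > name
		}
		return e.value >= value
	})
	if i == len(entries) || entries[i].name != name || entries[i].value != value {
		return 0, false
	}
	return entries[i].offset, true
}

// labelValues jumps to the first entry for name and iterates while the name matches.
func labelValues(entries []entry, name string) []string {
	i := sort.Search(len(entries), func(i int) bool { return entries[i].name >= name })
	var values []string
	for _, e := range entries[i:] {
		if e.name != name {
			break
		}
		values = append(values, e.value)
	}
	return values
}

func main() {
	entries := []entry{
		{"a", "b", 1000},
		{"x", "y1", 1040},
		{"x", "y2", 1060},
	}
	fmt.Println(findPostingsOffset(entries, "x", "y1")) // 1040 true
	fmt.Println(findPostingsOffset(entries, "x", "b"))  // 0 false
	fmt.Println(labelValues(entries, "x"))              // [y1 y2]
}
```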
<p>This postings list and postings offset table form the inverted index. For indexing documents using an inverted index, for every word, we store a list of documents that it appears in. Similarly here, for every label-value pair, we store the list of series that it appears in.</p>
<p>This marks the end of the giant <code>index</code> section. Here is the <a href="https://github.com/prometheus/prometheus/blob/master/tsdb/docs/format/index.md" target="_blank" rel="noopener noreferrer">link</a> for the upstream docs on the <code>index</code> format.</p>
<hr>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="4-tombstones">4. <code>tombstones</code><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index#4-tombstones" class="hash-link" aria-label="Direct link to 4-tombstones" title="Direct link to 4-tombstones">​</a></h3>
<p>Tombstones are deletion markers, i.e., they tell us what time ranges of which series to ignore during reads. This is the only file in the block which is created and modified after writing a block to store the delete requests.</p>
<p>This is how the file looks</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">┌────────────────────────────┬─────────────────────┐</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ magic(0x0130BA30) &lt;4b&gt;     │ version(1) &lt;1 byte&gt; │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├────────────────────────────┴─────────────────────┤</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ┌──────────────────────────────────────────────┐ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │                Tombstone 1                   │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────────────────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │                      ...                     
│ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────────────────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │                Tombstone N                   │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────────────────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │                  CRC&lt;4b&gt;                     │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ └──────────────────────────────────────────────┘ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">└──────────────────────────────────────────────────┘</span><br></span></code></pre></div></div>
<p>The <code>magic</code> number tells us that this is a tombstones file (guess whose birthday this number is? hint: a Prometheus maintainer who implemented deletions in TSDB). The <code>version</code> tells us how to parse the file. It is followed by a sequence of tombstones which we will look at in just a second. The file ends with a checksum (<code>CRC32</code>) over all the tombstones.</p>
<p>Each individual tombstone looks like this</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">┌────────────────────────┬─────────────────┬─────────────────┐</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ series ref &lt;uvarint64&gt; │ mint &lt;varint64&gt; │ maxt &lt;varint64&gt; │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">└────────────────────────┴─────────────────┴─────────────────┘</span><br></span></code></pre></div></div>
<p>The first field is the series reference (aka series ID, aka a posting) that this tombstone belongs to. <code>mint</code> through <code>maxt</code> is the time range that the deletion refers to, hence we should skip that time range for the series identified by <code>series ref</code> while reading the chunks. When a single series has multiple non-overlapping deleted time ranges, they result in more than one tombstone.</p>
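<p>Here is a small Go sketch of one tombstone being encoded, decoded, and used to decide whether a sample should be skipped. The type and function names are hypothetical, and treating <code>mint</code> and <code>maxt</code> as both inclusive is an assumption of this example.</p>

```go
package main

import (
	"encoding/binary"
	"fmt"
)

type tombstone struct {
	ref        uint64
	mint, maxt int64
}

// encodeTombstone writes series ref as a uvarint and mint/maxt as varints.
func encodeTombstone(t tombstone) []byte {
	buf := binary.AppendUvarint(nil, t.ref)
	buf = binary.AppendVarint(buf, t.mint)
	return binary.AppendVarint(buf, t.maxt)
}

// decodeTombstone reads the three fields back; it assumes well-formed input.
func decodeTombstone(buf []byte) tombstone {
	ref, n := binary.Uvarint(buf)
	buf = buf[n:]
	mint, n := binary.Varint(buf)
	maxt, _ := binary.Varint(buf[n:])
	return tombstone{ref, mint, maxt}
}

// isDeleted reports whether timestamp t of series ref falls inside any
// tombstone range (assumed inclusive on both ends).
func isDeleted(tombstones []tombstone, ref uint64, t int64) bool {
	for _, ts := range tombstones {
		if ts.ref != ref {
			continue
		}
		if t >= ts.mint {
			if ts.maxt >= t {
				return true
			}
		}
	}
	return false
}

func main() {
	// Two non-overlapping deleted ranges for series 120 give two tombstones.
	tombstones := []tombstone{
		{120, 1000, 2000},
		{120, 5000, 6000},
	}
	fmt.Println(decodeTombstone(encodeTombstone(tombstones[0])) == tombstones[0]) // true
	fmt.Println(isDeleted(tombstones, 120, 1500))                               // true
	fmt.Println(isDeleted(tombstones, 120, 3000))                               // false
	fmt.Println(isDeleted(tombstones, 145, 1500))                               // false
}
```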
<p>Here is the <a href="https://github.com/prometheus/prometheus/blob/master/tsdb/docs/format/tombstones.md" target="_blank" rel="noopener noreferrer">link</a> for the upstream docs on the <code>tombstones</code> format.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="epilouge">Epilogue<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index#epilouge" class="hash-link" aria-label="Direct link to Epilogue" title="Direct link to Epilogue">​</a></h2>
<p>In the case of the Head block, the inverted index is kept in memory, along with an efficiently stored mapping from label names to their possible values.</p>
<p>In this blog post we have seen how the block looks on disk, and especially the index, which forms the majority of this post. You might have many questions, such as what the use of those sections in the index is, what role they play during a query, and what kind of queries are generally run on a block or the index.</p>
<p>Since this blog post is already too long, we will be looking at all of them in the next blog post where we will talk about queries.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="code-reference">Code reference<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index#code-reference" class="hash-link" aria-label="Direct link to Code reference" title="Direct link to Code reference">​</a></h2>
<p><a href="https://github.com/prometheus/prometheus/blob/master/tsdb/block.go" target="_blank" rel="noopener noreferrer"><code>tsdb/block.go</code></a> has the code for reading and writing the meta file. In general, this is the hub for all things persistent block.</p>
<p><a href="https://github.com/prometheus/prometheus/blob/master/tsdb/chunks/chunks.go" target="_blank" rel="noopener noreferrer"><code>tsdb/chunks/chunks.go</code></a> has the code for reading and writing the files in the <code>chunks</code> directory.</p>
<p><a href="https://github.com/prometheus/prometheus/blob/master/tsdb/index/index.go" target="_blank" rel="noopener noreferrer"><code>tsdb/index/index.go</code></a> has the code for reading and writing the index file.</p>
<p><a href="https://github.com/prometheus/prometheus/blob/master/tsdb/tombstones/tombstones.go" target="_blank" rel="noopener noreferrer"><code>tsdb/tombstones/tombstones.go</code></a> has the code for reading and writing the tombstones file.</p>
<p>All these files point to the implementation of individual components of the block. We will see the code which brings all this together during the reading and writing of a block in the queries and compaction blog posts respectively.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="here-is-the-entire-prometheus-tsdb-blog-series">Here is the entire Prometheus TSDB blog series<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index#here-is-the-entire-prometheus-tsdb-blog-series" class="hash-link" aria-label="Direct link to Here is the entire Prometheus TSDB blog series" title="Direct link to Here is the entire Prometheus TSDB blog series">​</a></h2>
<ol>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/">Prometheus TSDB (Part 1): The Head Block</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint/">Prometheus TSDB (Part 2): WAL and Checkpoint</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk/">Prometheus TSDB (Part 3): Memory Mapping of Head Chunks from Disk</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index/">Prometheus TSDB (Part 4): Persistent Block and its Index</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-queries/">Prometheus TSDB (Part 5): Queries</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention/">Prometheus TSDB (Part 6): Compaction and Retention</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-snapshot-on-shutdown/">Prometheus TSDB (Part 7): Snapshot on Shutdown</a></li>
</ol>]]></content:encoded>
            <category>Prometheus</category>
            <category>TSDB</category>
        </item>
        <item>
            <title><![CDATA[Prometheus TSDB (Part 3): Memory Mapping of Head Chunks from Disk]]></title>
            <link>https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk</link>
            <guid>https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk</guid>
            <pubDate>Fri, 02 Oct 2020 00:00:00 GMT</pubDate>
            <description><![CDATA[Dive into how memory mapping of in-memory chunks from disk is done.]]></description>
            <content:encoded><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="introduction">Introduction<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk#introduction" class="hash-link" aria-label="Direct link to Introduction" title="Direct link to Introduction">​</a></h2>
<p>In <a href="https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/">Part 1</a> of the TSDB blog series I mentioned that once a chunk is "full", it is flushed to the disk and memory mapped. This helps in reducing the memory footprint of the Head block and also helps speed up the WAL replay that we discussed in <a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint/">Part 2</a>. We will be diving deeper into how this is designed in Prometheus in this blog post.</p>
<p>As this is a part of the Prometheus TSDB blog series that I am writing, I recommend reading <a href="https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/">Part 1</a> to know where these memory mapped chunks fit into TSDB (or the Head block) and <a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint/">Part 2</a> to understand the WAL replay.</p>
<p>I have also given a <a href="https://www.youtube.com/watch?v=suMhZfg9Cuk" target="_blank" rel="noopener noreferrer">KubeCon talk</a> on this topic, which explains it at a slightly higher level.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="writing-these-chunks">Writing these chunks<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk#writing-these-chunks" class="hash-link" aria-label="Direct link to Writing these chunks" title="Direct link to Writing these chunks">​</a></h2>
<p>Recapping from <a href="https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/">Part 1</a>, when a chunk is full, we cut a new chunk and the older chunks become immutable and can only be read from (the yellow block below).</p>
<p><img decoding="async" loading="lazy" alt="image" src="https://ganeshvernekar.com/assets/images/tsdb3-fcc2a659bb9dc466f2ad51278b9ef940.svg" width="875" height="364" class="img_ev3q"></p>
<p>And instead of storing it in memory, we flush it to disk and store a reference to access it later.</p>
<p><img decoding="async" loading="lazy" alt="image" src="https://ganeshvernekar.com/assets/images/tsdb4-5db3bd1d5402bab9a0804723ad2c79aa.svg" width="871" height="397" class="img_ev3q"></p>
<p>This flushed chunk is the memory-mapped chunk from disk. Immutability is the most important factor here; otherwise, rewriting the compressed chunk on disk for every new sample would have been too inefficient.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="format-on-disk">Format on disk<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk#format-on-disk" class="hash-link" aria-label="Direct link to Format on disk" title="Direct link to Format on disk">​</a></h2>
<p>The format <a href="https://github.com/prometheus/prometheus/blob/master/tsdb/docs/format/head_chunks.md" target="_blank" rel="noopener noreferrer">can also be found on GitHub</a>.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="the-file">The File<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk#the-file" class="hash-link" aria-label="Direct link to The File" title="Direct link to The File">​</a></h3>
<p>These chunks stay in their own directory called <code>chunks_head</code> and have a file sequence similar to the WAL (except that it starts with 1). For example:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">data</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├── chunks_head</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">|   ├── 000001</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">|   └── 000002</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">└── wal</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    ├── checkpoint.000003</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    |   ├── 000000</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    |   └── 000001</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    ├── 000004</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    └── 000005</span><br></span></code></pre></div></div>
<p>The max size of the file is kept at 128MiB. Looking inside a single file, it starts with a header of 8B.</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">┌──────────────────────────────┐</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│  magic(0x0130BC91) &lt;4 byte&gt;  │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├──────────────────────────────┤</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│    version(1) &lt;1 byte&gt;       │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├──────────────────────────────┤</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│    padding(0) &lt;3 byte&gt;       │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">├──────────────────────────────┤</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ┌──────────────────────────┐ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │         Chunk 1          │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │          ...             
│ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ ├──────────────────────────┤ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ │         Chunk N          │ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">│ └──────────────────────────┘ │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">└──────────────────────────────┘</span><br></span></code></pre></div></div>
<p><code>Magic Number</code> is any number that can uniquely identify the file as a memory-mapped head chunks file. As I implemented this feature, I set it to my birth date :). <code>Chunk Format</code> tells us how to decode the chunks in the file. The extra padding is to allow any future header options that we might require.</p>
<p>After the file header, what follows is chunks.</p>
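<p>To make the header layout concrete, here is a minimal Go sketch that validates the first 8 bytes of such a file. The function and constant names are made up for illustration, and the big-endian byte order is an assumption that the diagram does not state; this is not the actual Prometheus code.</p>

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	magicHeadChunks = 0x0130BC91 // magic number from the diagram above
	headerSize      = 8          // 4 (magic) + 1 (version) + 3 (padding)
)

// parseHeader checks the magic number and returns the version byte.
// Assumption: the magic number is stored in big-endian byte order.
func parseHeader(b []byte) (version byte, err error) {
	if len(b) < headerSize {
		return 0, errors.New("file too short for header")
	}
	if m := binary.BigEndian.Uint32(b[0:4]); m != magicHeadChunks {
		return 0, fmt.Errorf("invalid magic number %#x", m)
	}
	// b[5:8] is the padding; it is ignored here.
	return b[4], nil
}

func main() {
	header := []byte{0x01, 0x30, 0xBC, 0x91, 1, 0, 0, 0}
	version, err := parseHeader(header)
	fmt.Println(version, err)
}
```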
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="chunks">Chunks<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk#chunks" class="hash-link" aria-label="Direct link to Chunks" title="Direct link to Chunks">​</a></h3>
<p>A single chunk looks like this:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">┌─────────────────────┬───────────────────────┬───────────────────────┬───────────────────┬───────────────┬──────────────┬────────────────┐</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">| series ref &lt;8 byte&gt; | mint &lt;8 byte, uint64&gt; | maxt &lt;8 byte, uint64&gt; | encoding &lt;1 byte&gt; | len &lt;uvarint&gt; | data &lt;bytes&gt; │ CRC32 &lt;4 byte&gt; │</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">└─────────────────────┴───────────────────────┴───────────────────────┴───────────────────┴───────────────┴──────────────┴────────────────┘</span><br></span></code></pre></div></div>
<p>The <code>series ref</code> is the same series reference that we talked about in Part 2; it is the series id used to access the series in the memory. The <code>mint</code> and <code>maxt</code> are the minimum and maximum timestamps seen in the samples of the chunk. <code>encoding</code> is the encoding used to compress the chunk. <code>len</code> is the number of bytes of <code>data</code> that follow, and <code>data</code> holds the actual bytes of the compressed chunk.</p>
<p><code>CRC32</code> is the checksum of the above content of the chunk used to check the integrity of the data.</p>
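<p>The chunk layout above can be sketched in Go as an encode/decode pair. This is only an illustration: the type and function names are hypothetical, and both the big-endian byte order and the Castagnoli CRC32 polynomial are assumptions that are not stated in this post.</p>

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// Assumption: the checksum uses the Castagnoli polynomial.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// fixedLen covers series ref (8) + mint (8) + maxt (8) + encoding (1).
const fixedLen = 25

type chunk struct {
	seriesRef  uint64
	mint, maxt int64
	encoding   byte
	data       []byte
}

// encodeChunk lays the chunk out as in the diagram and appends the CRC32
// of everything written before it.
func encodeChunk(c chunk) []byte {
	buf := make([]byte, fixedLen, fixedLen+binary.MaxVarintLen64+len(c.data)+4)
	binary.BigEndian.PutUint64(buf[0:], c.seriesRef)
	binary.BigEndian.PutUint64(buf[8:], uint64(c.mint))
	binary.BigEndian.PutUint64(buf[16:], uint64(c.maxt))
	buf[24] = c.encoding
	var lenBuf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(lenBuf[:], uint64(len(c.data)))
	buf = append(buf, lenBuf[:n]...)
	buf = append(buf, c.data...)
	var sum [4]byte
	binary.BigEndian.PutUint32(sum[:], crc32.Checksum(buf, castagnoli))
	return append(buf, sum[:]...)
}

// decodeChunk reads one chunk from b and verifies its checksum.
func decodeChunk(b []byte) (chunk, error) {
	if len(b) < fixedLen {
		return chunk{}, errors.New("chunk too short")
	}
	c := chunk{
		seriesRef: binary.BigEndian.Uint64(b[0:]),
		mint:      int64(binary.BigEndian.Uint64(b[8:])),
		maxt:      int64(binary.BigEndian.Uint64(b[16:])),
		encoding:  b[24],
	}
	dataLen, n := binary.Uvarint(b[fixedLen:])
	if n <= 0 || dataLen > uint64(len(b)) {
		return chunk{}, errors.New("invalid data length")
	}
	dataStart := fixedLen + n
	dataEnd := dataStart + int(dataLen)
	if dataEnd+4 > len(b) {
		return chunk{}, errors.New("chunk truncated")
	}
	want := binary.BigEndian.Uint32(b[dataEnd:])
	if got := crc32.Checksum(b[:dataEnd], castagnoli); got != want {
		return chunk{}, fmt.Errorf("checksum mismatch: got %#x, want %#x", got, want)
	}
	c.data = b[dataStart:dataEnd]
	return c, nil
}

func main() {
	enc := encodeChunk(chunk{seriesRef: 42, mint: 1000, maxt: 2000, encoding: 1, data: []byte("compressed samples")})
	c, err := decodeChunk(enc)
	fmt.Println(c.seriesRef, c.mint, c.maxt, len(c.data), err)

	enc[len(enc)-5] ^= 0xFF // flip the bits of the last data byte
	_, err = decodeChunk(enc)
	fmt.Println(err)
}
```

<p>The second call in <code>main</code> shows the purpose of the checksum: a single corrupted data byte makes the decode fail instead of silently returning bad samples.</p>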
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="reading-these-chunks">Reading these chunks<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk#reading-these-chunks" class="hash-link" aria-label="Direct link to Reading these chunks" title="Direct link to Reading these chunks">​</a></h2>
<p>For every chunk, the Head block stores the mint and maxt of that chunk along with a reference in the memory to access it.</p>
<p>The reference is 8 bytes long. The first 4 bytes tell the file number in which the chunk exists, and the last 4 bytes tell the offset in the file where the chunk starts (i.e. the first byte of the <code>series ref</code>). If the chunk was in the file <code>00093</code> and the <code>series ref</code> starts at byte offset <code>1234</code> in the file, then the reference of that chunk would be <code>(93 &lt;&lt; 32) | 1234</code> (left shift bits and then bitwise OR).</p>
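<p>As a small illustration of the reference layout described above, the packing and unpacking can be written like this in Go (the function names are hypothetical, not the ones used in Prometheus):</p>

```go
package main

import "fmt"

// newChunkRef packs the file number into the upper 4 bytes and the byte
// offset within that file into the lower 4 bytes.
func newChunkRef(fileNum, offset uint32) uint64 {
	return uint64(fileNum)<<32 | uint64(offset)
}

// unpackChunkRef reverses newChunkRef.
func unpackChunkRef(ref uint64) (fileNum, offset uint32) {
	return uint32(ref >> 32), uint32(ref)
}

func main() {
	ref := newChunkRef(93, 1234)
	fileNum, offset := unpackChunkRef(ref)
	fmt.Println(ref, fileNum, offset)
}
```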
<p>We store the <code>mint</code> and <code>maxt</code> in Head so that we can select the chunk without having to look at the disk. When we do have to access the chunk, we only access the encoding and the chunk data using the reference.</p>
<p>In the code, each file looks like yet another byte slice (one slice per file), and we access the slice at some index to get the chunk data while the OS maps the slice in the memory to the disk under the hood. <a href="https://en.wikipedia.org/wiki/Memory-mapped_file" target="_blank" rel="noopener noreferrer">Memory-mapping from disk</a> is an OS feature which fetches into memory only the part of the file that is being accessed, not the entire file.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="replaying-on-startup">Replaying on startup<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk#replaying-on-startup" class="hash-link" aria-label="Direct link to Replaying on startup" title="Direct link to Replaying on startup">​</a></h2>
<p>In <a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint/">Part 2</a> we talked about WAL replay where we replay each individual sample to re-create the compressed chunk. Now that we have the full compressed chunks on disk, we don't need to recreate them, but we still need to create from the WAL the chunks which were not full. With these memory-mapped chunks from disk, the replay happens as follows.</p>
<p>At startup, first we iterate through all the chunks in the <code>chunks_head</code> directory and build a map of <code>series ref -&gt; [list of chunk references along with mint and maxt belonging to this series ref]</code> in the memory.</p>
<p>We then continue with the WAL replay as described in <a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint/">Part 2</a> but with a few modifications:</p>
<ul>
<li>When we come across the <code>Series</code> record, after creation of the series, we look for the series reference in the above map and if any memory-mapped chunks exist, we attach that list to this series.</li>
<li>When we come across the <code>Samples</code> record, if the corresponding series for the sample has any memory-mapped chunks and if the sample falls into the time range that they cover, then we skip the sample. If it does not, then we ingest that sample into the Head block.</li>
</ul>
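<p>The handling of the <code>Samples</code> record described above can be sketched as follows. This is a simplified illustration with made-up types and names; it checks every memory-mapped chunk of the series, which is not necessarily how the real replay code performs the check.</p>

```go
package main

import "fmt"

// mmappedChunk is what we keep in memory for a chunk that lives on disk.
type mmappedChunk struct {
	ref        uint64
	mint, maxt int64
}

type sample struct {
	seriesRef uint64
	t         int64
	v         float64
}

// covered reports whether timestamp t falls inside the time range of any
// of the given memory-mapped chunks.
func covered(chunks []mmappedChunk, t int64) bool {
	for _, c := range chunks {
		if t >= c.mint && t <= c.maxt {
			return true
		}
	}
	return false
}

// samplesToIngest drops the samples that are already present in the
// memory-mapped chunks of their series and returns the rest.
func samplesToIngest(mmapped map[uint64][]mmappedChunk, samples []sample) []sample {
	var out []sample
	for _, s := range samples {
		if covered(mmapped[s.seriesRef], s.t) {
			continue // already on disk in a full chunk, skip it
		}
		out = append(out, s)
	}
	return out
}

func main() {
	// Map built while iterating the chunks_head directory at startup.
	mmapped := map[uint64][]mmappedChunk{
		1: {{ref: 1<<32 | 8, mint: 0, maxt: 100}},
	}
	samples := []sample{
		{seriesRef: 1, t: 50, v: 1},  // inside the mmapped range
		{seriesRef: 1, t: 150, v: 2}, // newer than the mmapped chunk
		{seriesRef: 2, t: 50, v: 3},  // series without mmapped chunks
	}
	fmt.Println(samplesToIngest(mmapped, samples))
}
```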
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="enhancements-that-this-brings-in">Enhancements that this brings in<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk#enhancements-that-this-brings-in" class="hash-link" aria-label="Direct link to Enhancements that this brings in" title="Direct link to Enhancements that this brings in">​</a></h2>
<p>What's the use of this additional complexity when we could get away with storing chunks in the memory and the WAL? This feature was added recently in 2020, so let's see what this brings in. (You can see some benchmark graphs in <a href="https://grafana.com/blog/2020/06/10/new-in-prometheus-v2.19.0-memory-mapping-of-full-chunks-of-the-head-block-reduces-memory-usage-by-as-much-as-40/" target="_blank" rel="noopener noreferrer">this Grafana Labs blog post</a>.)</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="memory-savings">Memory savings<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk#memory-savings" class="hash-link" aria-label="Direct link to Memory savings" title="Direct link to Memory savings">​</a></h3>
<p>If you had to store the chunk in the memory, it can take anywhere between 120 and 200 bytes (or even more depending on compressibility of the samples). Now this is replaced with 24 bytes - 8 bytes each for the chunk reference, min time, and max time of the chunk.</p>
<p>While this may sound like 80-90% reduction in memory, the reality is different. There are more things that the Head needs to store, like the in-memory index, all the symbols (label values), etc, and other parts of TSDB that take some memory.</p>
<p>In the real world, we can see a 15-50% reduction in the memory footprint depending on the rate at which samples are being scraped and the rate at which new series are being created (called "churn"). Another thing to note is that, if you are running some queries which touch a lot of these chunks on disk, then they need to be loaded into the memory to be processed. So it's not an absolute reduction in peak memory usage.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="faster-startup">Faster startup<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk#faster-startup" class="hash-link" aria-label="Direct link to Faster startup" title="Direct link to Faster startup">​</a></h3>
<p>The WAL replay is the slowest part of startup. Mainly, (1) decoding of WAL records from disk and (2) rebuilding the compressed chunks from individual samples, are the slow parts in the replay. The iteration of memory-mapped chunks is relatively fast.</p>
<p>We cannot avoid decoding of records as we need to check all the records. As you saw above in the replay, we are skipping the samples which are in the memory-mapped chunks range. Here we avoid re-creating those full compressed chunks, hence save some time in the replay. It has been seen to reduce the startup time by 15-30%.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="garbage-collection">Garbage collection<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk#garbage-collection" class="hash-link" aria-label="Direct link to Garbage collection" title="Direct link to Garbage collection">​</a></h2>
<p>The garbage collection in memory happens during the Head truncation where it just drops the references to the chunks which are older than the truncation time <code>T</code>. But the files are still present on the disk. As with WAL segments, we also need to delete old m-mapped files regularly.</p>
<p>For every memory-mapped chunk file present (which means also open in TSDB), we store in the memory the absolute maximum time among all the chunks present in the file. For the live file (the one in which we are currently writing the chunks), we update this maximum time in the memory as and when we are adding new chunks. During a restart, as we iterate all the memory-mapped chunks, we restore the maximum time of the files in the memory there.</p>
<p>So when the Head truncation is happening for data before time <code>T</code>, we call truncation on these files for time <code>T</code>. The files whose maximum time is below <code>T</code> (except the live file) are deleted at this point while preserving the sequence (if the files were <code>5, 6, 7, 8</code> and only files <code>5</code> and <code>7</code> had a maximum time below <code>T</code>, only <code>5</code> is deleted and the remaining sequence would be <code>6, 7, 8</code>).</p>
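<p>Here is a minimal sketch of that selection rule, assuming we hold the files sorted by sequence number with the live file last. The types and names are hypothetical; the point is only that deletion stops at the first file that must be kept, so no gap is created in the sequence.</p>

```go
package main

import "fmt"

// chunkFile is the per-file information kept in memory.
type chunkFile struct {
	seq  int   // file sequence number
	maxt int64 // maximum time among all the chunks in the file
}

// filesToDelete returns the sequence numbers of the files to remove when
// truncating for time t. files must be sorted by seq and the last entry
// is the live file, which is never deleted.
func filesToDelete(files []chunkFile, t int64) []int {
	var del []int
	for i, f := range files {
		if i == len(files)-1 || f.maxt >= t {
			break // live file, or a file that still holds data at or after t
		}
		del = append(del, f.seq)
	}
	return del
}

func main() {
	files := []chunkFile{
		{seq: 5, maxt: 10},
		{seq: 6, maxt: 50},
		{seq: 7, maxt: 20},
		{seq: 8, maxt: 60}, // live file
	}
	// Files 5 and 7 are below T=30, but deleting 7 would leave a gap.
	fmt.Println(filesToDelete(files, 30))
}
```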
<p>After truncation, we close the live file and start a new one because in low volume and small setups, it might take a lot of time to reach the max size of the file. So rotating the files here will help deletion of old chunks during the next truncation.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="code-reference">Code reference<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk#code-reference" class="hash-link" aria-label="Direct link to Code reference" title="Direct link to Code reference">​</a></h2>
<p><a href="https://github.com/prometheus/prometheus/blob/master/tsdb/chunks/head_chunks.go" target="_blank" rel="noopener noreferrer"><code>tsdb/chunks/head_chunks.go</code></a> has all the implementation of writing chunks to disk, accessing them using a reference, truncation, handling the files, and a way to iterate over the chunks.</p>
<p><a href="https://github.com/prometheus/prometheus/blob/master/tsdb/head.go" target="_blank" rel="noopener noreferrer"><code>tsdb/head.go</code></a> uses the above as a black box to memory-map its chunks from disk.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="here-is-the-entire-prometheus-tsdb-blog-series">Here is the entire Prometheus TSDB blog series<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk#here-is-the-entire-prometheus-tsdb-blog-series" class="hash-link" aria-label="Direct link to Here is the entire Prometheus TSDB blog series" title="Direct link to Here is the entire Prometheus TSDB blog series">​</a></h2>
<ol>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/">Prometheus TSDB (Part 1): The Head Block</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint/">Prometheus TSDB (Part 2): WAL and Checkpoint</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk/">Prometheus TSDB (Part 3): Memory Mapping of Head Chunks from Disk</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index/">Prometheus TSDB (Part 4): Persistent Block and its Index</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-queries/">Prometheus TSDB (Part 5): Queries</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention/">Prometheus TSDB (Part 6): Compaction and Retention</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-snapshot-on-shutdown/">Prometheus TSDB (Part 7): Snapshot on Shutdown</a></li>
</ol>]]></content:encoded>
            <category>Prometheus</category>
            <category>TSDB</category>
        </item>
        <item>
            <title><![CDATA[Prometheus TSDB (Part 2): WAL and Checkpoint]]></title>
            <link>https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint</link>
            <guid>https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint</guid>
            <pubDate>Sat, 26 Sep 2020 00:00:00 GMT</pubDate>
            <description><![CDATA[Deep dive into how the Write-Ahead-Log (WAL) and checkpointing is designed in Prometheus TSDB.]]></description>
            <content:encoded><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="introduction">Introduction<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint#introduction" class="hash-link" aria-label="Direct link to Introduction" title="Direct link to Introduction">​</a></h2>
<p>In the <a href="https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/">Part 1</a> of the TSDB blog series I mentioned that we write the incoming <a href="https://prometheus.io/docs/concepts/data_model/#samples" target="_blank" rel="noopener noreferrer">samples</a> into <a href="https://en.wikipedia.org/wiki/Write-ahead_logging" target="_blank" rel="noopener noreferrer">Write-Ahead-Log (WAL)</a> first for durability and that when this WAL is truncated, a checkpoint is created. In this blog post, we will briefly discuss the basics of WAL and then dive into how WAL and checkpoints are designed in Prometheus' TSDB.</p>
<p>As this is a part of the Prometheus TSDB blog series that I am writing, you are recommended to read the <a href="https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/">Part 1</a> to know where the WAL fits into the TSDB.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="wal-basics">WAL Basics<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint#wal-basics" class="hash-link" aria-label="Direct link to WAL Basics" title="Direct link to WAL Basics">​</a></h2>
<p>WAL is a <em>sequential</em> log of events that occur in a database. Before writing/modifying/deleting the data in the database, the event is first recorded (appended) into the WAL and then the necessary operations are performed in the database.</p>
<p>For whatever reason if the machine or the program decides to crash, you have the events recorded in this WAL which you can replay back in the same order to restore the data. This is particularly useful for in-memory databases where if the database crashes, the entire data in the memory is lost if not for WAL.</p>
<p>This is widely used in relational databases to provide <a href="https://en.wikipedia.org/wiki/Durability_(database_systems)" target="_blank" rel="noopener noreferrer">durability</a> (D from <a href="https://en.wikipedia.org/wiki/ACID" target="_blank" rel="noopener noreferrer">ACID</a>) for the database. Similarly, Prometheus has a WAL to provide durability for its Head block. Prometheus also uses WAL for graceful restarts to restore the in-memory state.</p>
<p>In the context of Prometheus, WAL is only used to record the events and restore the in-memory state when starting up. It is not involved in read or write operations in any other way.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="writing-to-wal-in-prometheus-tsdb">Writing to WAL in Prometheus TSDB<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint#writing-to-wal-in-prometheus-tsdb" class="hash-link" aria-label="Direct link to Writing to WAL in Prometheus TSDB" title="Direct link to Writing to WAL in Prometheus TSDB">​</a></h2>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="types-of-records">Types of records<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint#types-of-records" class="hash-link" aria-label="Direct link to Types of records" title="Direct link to Types of records">​</a></h3>
<p>The write request in TSDB consists of label values of the <a href="https://prometheus.io/docs/concepts/data_model/" target="_blank" rel="noopener noreferrer">series</a> and their corresponding <a href="https://prometheus.io/docs/concepts/data_model/#samples" target="_blank" rel="noopener noreferrer">samples</a>. This gives us two types of records, <code>Series</code> and <code>Samples</code>.</p>
<p>The <code>Series</code> record consists of the label values of all the series in the write request. The creation of series yields a unique reference which can be used to look up the series. Hence the <code>Samples</code> record contains the reference of the corresponding series and the list of samples that belong to that series in the write request.</p>
<p>The last type of record is <code>Tombstones</code> used for delete requests. It contains the deleted series reference with time ranges to delete.</p>
<p>The format of these records can be found <a href="https://github.com/prometheus/prometheus/blob/master/tsdb/docs/format/wal.md" target="_blank" rel="noopener noreferrer">here</a>; we won't be discussing them in this blog post.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="writing-them">Writing them<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint#writing-them" class="hash-link" aria-label="Direct link to Writing them" title="Direct link to Writing them">​</a></h3>
<p>The <code>Samples</code> record is written for all write requests that contain a sample. The <code>Series</code> record is written only once for a series when we see it for the first time (hence "create" it in the Head).</p>
<p>If a write request contains a new series, the <code>Series</code> record is always written before the <code>Samples</code> record, else during replay the series reference in the <code>Samples</code> record won't point to any series if the <code>Samples</code> record is placed before <code>Series</code>.</p>
<p>The <code>Series</code> record is written <em>after</em> creation of the series in the Head to also store the reference in the record, while <code>Samples</code> record is written <em>before</em> adding samples to the Head.</p>
<p>Only one <code>Series</code> and <code>Samples</code> record is written per write request by grouping all the different time series (and samples of different time series) in the same record. If the series for all the samples in the request already exist in the Head, only a <code>Samples</code> record is written into the WAL.</p>
<p>When we receive a delete request, we don't immediately delete it from the memory. We store something called "tombstones" which indicates the deleted series and time range of deletion. We write a <code>Tombstones</code> record into the WAL before processing the delete request.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="how-it-looks-on-disk">How it looks on disk<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint#how-it-looks-on-disk" class="hash-link" aria-label="Direct link to How it looks on disk" title="Direct link to How it looks on disk">​</a></h3>
<p>The WAL is stored as a sequence of numbered files of 128MiB each by default. A WAL file here is called a "segment".</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">data</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">└── wal</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    ├── 000000</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    ├── 000001</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    └── 000002</span><br></span></code></pre></div></div>
<p>The size of a file is bounded to make garbage collection of old files simpler. As you can guess, the sequence number <em>always</em> increases.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="wal-truncation-and-checkpointing">WAL truncation and Checkpointing<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint#wal-truncation-and-checkpointing" class="hash-link" aria-label="Direct link to WAL truncation and Checkpointing" title="Direct link to WAL truncation and Checkpointing">​</a></h2>
<p>We need to regularly delete the old WAL segments, else, the disk will eventually fill up and the startup of TSDB will take a lot of time as it has to replay all the events in this WAL (where most of it will be discarded because it’s old). In general, any data that is no longer needed, you want to get rid of it.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="wal-truncation">WAL truncation<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint#wal-truncation" class="hash-link" aria-label="Direct link to WAL truncation" title="Direct link to WAL truncation">​</a></h3>
<p>The WAL truncation is done just <em>after</em> the Head block is truncated (see <a href="https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/">Part 1</a> for Head truncation). The files cannot be deleted at random; the deletion happens for the first N files so that no gap is created in the sequence.</p>
<p>Because the write requests can be random, it is not easy or efficient to determine the time range of the samples in a WAL segment without going through all the records. So we delete the first <code>2/3rd</code> of the segments.</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">data</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">└── wal</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    ├── 000000</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    ├── 000001</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    ├── 000002</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    ├── 000003</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    ├── 000004</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    └── 000005</span><br></span></code></pre></div></div>
<p>In the above example, the files <code>000000</code> <code>000001</code> <code>000002</code> <code>000003</code> will be deleted.</p>
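<p>The arithmetic behind this example is a simple integer calculation. The sketch below follows the "first <code>2/3rd</code>" rule exactly as worded in this post; the helper name is made up, and the real implementation may compute the cut-off point differently (for example, by treating the segment currently being written specially).</p>

```go
package main

import "fmt"

// segmentsToTruncate returns the first two thirds of the given segments
// (rounded down), following the rule as described in the text.
func segmentsToTruncate(segments []int) []int {
	return segments[:len(segments)*2/3]
}

func main() {
	fmt.Println(segmentsToTruncate([]int{0, 1, 2, 3, 4, 5})) // prints [0 1 2 3]
}
```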
<p>There is one catch here: the series records are written only once, so if you blindly delete the WAL segments, you will lose those records and hence can't restore those series on startup. Also, there might be samples in those first <code>2/3rd</code> segments which are not truncated from the Head yet, hence you lose them too. This is where checkpoints come into the picture.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="checkpointing">Checkpointing<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint#checkpointing" class="hash-link" aria-label="Direct link to Checkpointing" title="Direct link to Checkpointing">​</a></h3>
<p>Before truncating the WAL, we create a "checkpoint" from those WAL segments to be deleted. You can consider a checkpoint as a filtered WAL. Suppose the truncation of the Head is happening for data before time <code>T</code>. Taking the above example of WAL layout, the checkpointing operation will go through all the records in <code>000000</code> <code>000001</code> <code>000002</code> <code>000003</code> in order and:</p>
<ol>
<li>Drops all the series records for series which are no longer in the Head.</li>
<li>Drops all the samples which are before time <code>T</code>.</li>
<li>Drops all the tombstone records for time ranges before <code>T</code>.</li>
<li>Retains the remaining series, samples and tombstone records in the same way as you find them in the WAL (in the same order as they appear in the WAL).</li>
</ol>
<p>The drop operation can also be a re-write operation while dropping the unnecessary items from the record (because a single record can contain more than one series, sample or tombstone).</p>
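<p>The filtering steps above, including the re-write of partially dropped records, can be sketched like this. The record types are heavily simplified and all names are hypothetical; the real records and checkpointing code carry more detail than this.</p>

```go
package main

import "fmt"

type series struct {
	ref    uint64
	labels string
}

type sample struct {
	ref uint64
	t   int64
	v   float64
}

type interval struct{ mint, maxt int64 }

type tombstone struct {
	ref       uint64
	intervals []interval
}

// record holds the items of one WAL record; only one of the slices is
// populated, depending on the record type.
type record struct {
	series     []series
	samples    []sample
	tombstones []tombstone
}

// checkpoint filters the records of the segments that are about to be
// deleted. inHead tells whether a series is still in the Head, and t is
// the Head truncation time.
func checkpoint(records []record, inHead func(ref uint64) bool, t int64) []record {
	var out []record
	for _, r := range records {
		var kept record
		for _, s := range r.series {
			if inHead(s.ref) {
				kept.series = append(kept.series, s)
			}
		}
		for _, s := range r.samples {
			if s.t >= t {
				kept.samples = append(kept.samples, s)
			}
		}
		for _, ts := range r.tombstones {
			var ivs []interval
			for _, iv := range ts.intervals {
				if iv.maxt >= t {
					ivs = append(ivs, iv)
				}
			}
			if len(ivs) > 0 {
				kept.tombstones = append(kept.tombstones, tombstone{ref: ts.ref, intervals: ivs})
			}
		}
		// A record that lost only some items is re-written with the rest;
		// a record that lost everything is dropped. Order is preserved.
		if len(kept.series)+len(kept.samples)+len(kept.tombstones) > 0 {
			out = append(out, kept)
		}
	}
	return out
}

func main() {
	records := []record{
		{series: []series{{ref: 1, labels: "a"}, {ref: 2, labels: "b"}}},
		{samples: []sample{{ref: 1, t: 5, v: 1}, {ref: 1, t: 15, v: 2}}},
		{tombstones: []tombstone{{ref: 1, intervals: []interval{{mint: 0, maxt: 5}}}}},
	}
	onlySeries1 := func(ref uint64) bool { return ref == 1 }
	fmt.Printf("%+v\n", checkpoint(records, onlySeries1, 10))
}
```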
<p>This way you won't lose the series, samples and tombstones which are still in the Head. The checkpoint is named as <code>checkpoint.X</code> where <code>X</code> is the last segment number on which the checkpoint was created (<code>000003</code> here; you will see why we do it this way in the next section).</p>
<p>After the WAL truncation and checkpointing, the files on disk look something like this (checkpoint looks like yet another WAL):</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">data</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">└── wal</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    ├── checkpoint.000003</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    |   ├── 000000</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    |   └── 000001</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    ├── 000004</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    └── 000005</span><br></span></code></pre></div></div>
<p>If there were any older checkpoints, they are deleted at this point.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="replaying-the-wal">Replaying the WAL<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint#replaying-the-wal" class="hash-link" aria-label="Direct link to Replaying the WAL" title="Direct link to Replaying the WAL">​</a></h2>
<p>We first iterate over the records in order from the last checkpoint (the checkpoint with the biggest number associated with it is the last). For <code>checkpoint.X</code>, <code>X</code> tells us from which WAL segment we need to continue the replay, and that is <code>X+1</code>. So in the above example, after replaying <code>checkpoint.000003</code>, we continue the replay from WAL segment <code>000004</code>.</p>
<p>You might be wondering why we need to track the segment number in the checkpoint when we delete the WAL segments before it anyway. The thing is, creation of a checkpoint and deletion of WAL segments are not atomic. Anything can happen in between and prevent deletion of WAL segments. Without the segment number, we would then have to replay the additional <code>2/3rd</code> of the WAL segments which should have been deleted, making replay slower.</p>
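<p>As an illustration of how the <code>X</code> in <code>checkpoint.X</code> decides where the replay continues, here is a small sketch that works on the list of names found in the <code>wal</code> directory. The function names are hypothetical and this is not the actual Prometheus code.</p>

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const checkpointPrefix = "checkpoint."

// lastCheckpoint returns the name and segment number of the checkpoint
// with the biggest number among the given directory entries.
func lastCheckpoint(names []string) (name string, idx int, ok bool) {
	for _, n := range names {
		if !strings.HasPrefix(n, checkpointPrefix) {
			continue
		}
		x, err := strconv.Atoi(strings.TrimPrefix(n, checkpointPrefix))
		if err != nil {
			continue // not a well-formed checkpoint name
		}
		if !ok || x > idx {
			name, idx, ok = n, x, true
		}
	}
	return name, idx, ok
}

// replayStart returns the WAL segment from which to continue the replay
// after the last checkpoint. ok is false if there is no checkpoint.
func replayStart(names []string) (segment int, ok bool) {
	_, x, ok := lastCheckpoint(names)
	if !ok {
		return 0, false
	}
	return x + 1, true
}

func main() {
	// Segments 000002 and 000003 were left behind because the deletion
	// did not complete; they are skipped thanks to the checkpoint number.
	names := []string{"checkpoint.000003", "000002", "000003", "000004", "000005"}
	fmt.Println(replayStart(names))
}
```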
<p>Talking about individual records, the following actions are taken on them:</p>
<ol>
<li><code>Series</code>: Create the series in the Head with the same reference as mentioned in the record (so that we can match the samples later). There could be multiple series records for the same series which is handled by Prometheus by mapping the references.</li>
<li><code>Samples</code>: Add samples from this record to the Head. The reference in the record indicates which series to add to. If no series is found for the reference, the sample is skipped.</li>
<li><code>Tombstones</code>: Store those tombstones back in Head by using the reference to identify the series.</li>
</ol>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="low-level-details-of-writing-to-and-reading-from-wal">Low level details of writing to and reading from WAL<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint#low-level-details-of-writing-to-and-reading-from-wal" class="hash-link" aria-label="Direct link to Low level details of writing to and reading from WAL" title="Direct link to Low level details of writing to and reading from WAL">​</a></h2>
<p>When the write requests are coming at a high volume, you want to avoid writing to disk randomly to avoid <a href="https://en.wikipedia.org/wiki/Write_amplification" target="_blank" rel="noopener noreferrer">write amplification</a>. Additionally, when you are reading the record, you want to be sure that it is not corrupted (easily possible on abrupt shutdown or faulty disk).</p>
<p>Prometheus has a general implementation of WAL where a record is just a slice of bytes and the caller has to take care of encoding the record. To solve the above two problems, the WAL package does the following:</p>
<ol>
<li>Data is written to the disk one page at a time. One page is 32KiB long. If the record is bigger than 32KiB, it is broken down into smaller pieces with each piece receiving a WAL record header for some bookkeeping to know if the piece is the end of record, or the start, or in the middle (A record receives a WAL record header even if it fits in the page).</li>
<li>A checksum of the record is appended at the end to detect any corruption while reading.</li>
</ol>
<p>The WAL package takes care of seamlessly joining the pieces of records and checks the checksum of the record while iterating through the records for replay.</p>
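<p>To get a feel for how a record is broken into pieces across pages, here is a rough sketch. The 7-byte header size is an assumption for illustration (this post does not state it), the names are made up, and only sizes are tracked rather than real bytes.</p>

```go
package main

import "fmt"

const (
	pageSize   = 32 * 1024
	headerSize = 7 // assumed size of the WAL record header on each piece
)

type fragType string

const (
	recFull   fragType = "full"   // the whole record fits in one piece
	recFirst  fragType = "first"  // first piece of a split record
	recMiddle fragType = "middle" // neither first nor last piece
	recLast   fragType = "last"   // final piece of a split record
)

type fragment struct {
	typ  fragType
	size int // number of record bytes carried by this piece
}

// splitRecord splits a record of recLen bytes into pieces, given that
// pageUsed bytes of the current page are already occupied.
func splitRecord(recLen, pageUsed int) []fragment {
	var frags []fragment
	for remaining := recLen; remaining > 0; {
		free := pageSize - pageUsed - headerSize
		if free <= 0 {
			pageUsed = 0 // no room for a header plus data: start a new page
			continue
		}
		n := remaining
		if n > free {
			n = free
		}
		first, last := len(frags) == 0, n == remaining
		typ := recMiddle
		switch {
		case first && last:
			typ = recFull
		case first:
			typ = recFirst
		case last:
			typ = recLast
		}
		frags = append(frags, fragment{typ: typ, size: n})
		remaining -= n
		pageUsed += headerSize + n
	}
	return frags
}

func main() {
	fmt.Println(splitRecord(100, 0))   // small record in an empty page
	fmt.Println(splitRecord(70000, 0)) // record bigger than two pages
}
```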
<p>The encoded WAL records are not heavily compressed on their own (or compressed at all). So the WAL package gives an option to compress the records using <a href="https://en.wikipedia.org/wiki/Snappy_(compression)" target="_blank" rel="noopener noreferrer">Snappy</a> (enabled by default now). This information is stored in the WAL record header, so the compressed and uncompressed records can live together if you plan to enable or disable compression.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="code-reference">Code reference<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint#code-reference" class="hash-link" aria-label="Direct link to Code reference" title="Direct link to Code reference">​</a></h2>
<p>The WAL implementation which takes a record as a slice of bytes and does the low level disk interactions is present in <a href="https://github.com/prometheus/prometheus/blob/master/tsdb/wal/wal.go" target="_blank" rel="noopener noreferrer"><code>tsdb/wal/wal.go</code></a>. This file has the implementation for both writing the byte records and also iterating the records (again as a slice of bytes).</p>
<p><a href="https://github.com/prometheus/prometheus/blob/master/tsdb/record/record.go" target="_blank" rel="noopener noreferrer"><code>tsdb/record/record.go</code></a> contains the various records with their encoding and decoding logic.</p>
<p>The checkpointing logic is present in <a href="https://github.com/prometheus/prometheus/blob/master/tsdb/wal/checkpoint.go" target="_blank" rel="noopener noreferrer"><code>tsdb/wal/checkpoint.go</code></a>.</p>
<p><a href="https://github.com/prometheus/prometheus/blob/master/tsdb/head.go" target="_blank" rel="noopener noreferrer"><code>tsdb/head.go</code></a> contains the remaining:</p>
<ol>
<li>Creating and encoding the records and calling the WAL write.</li>
<li>Calling the checkpointing and WAL truncation.</li>
<li>Replaying the WAL records, decoding them and restoring the in-memory state.</li>
</ol>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="here-is-the-entire-prometheus-tsdb-blog-series">Here is the entire Prometheus TSDB blog series<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint#here-is-the-entire-prometheus-tsdb-blog-series" class="hash-link" aria-label="Direct link to Here is the entire Prometheus TSDB blog series" title="Direct link to Here is the entire Prometheus TSDB blog series">​</a></h2>
<ol>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/">Prometheus TSDB (Part 1): The Head Block</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint/">Prometheus TSDB (Part 2): WAL and Checkpoint</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk/">Prometheus TSDB (Part 3): Memory Mapping of Head Chunks from Disk</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index/">Prometheus TSDB (Part 4): Persistent Block and its Index</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-queries/">Prometheus TSDB (Part 5): Queries</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention/">Prometheus TSDB (Part 6): Compaction and Retention</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-snapshot-on-shutdown/">Prometheus TSDB (Part 7): Snapshot on Shutdown</a></li>
</ol>]]></content:encoded>
            <category>Prometheus</category>
            <category>TSDB</category>
            <category>WAL</category>
        </item>
        <item>
            <title><![CDATA[Prometheus TSDB (Part 1): The Head Block]]></title>
            <link>https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block</link>
            <guid>https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block</guid>
            <pubDate>Sat, 19 Sep 2020 00:00:00 GMT</pubDate>
            <description><![CDATA[Walk-through on how the in-memory part of Prometheus TSDB works]]></description>
            <content:encoded><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="introduction">Introduction<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block#introduction" class="hash-link" aria-label="Direct link to Introduction" title="Direct link to Introduction">​</a></h2>
<p>Though Prometheus 2.0 was launched about 3 years ago, there are not many resources for understanding its TSDB other than <a href="https://web.archive.org/web/20220205173824/https://fabxc.org/tsdb/" target="_blank" rel="noopener noreferrer">Fabian's blog post</a>, which is very high level, and the <a href="https://github.com/prometheus/prometheus/tree/master/tsdb/docs/format" target="_blank" rel="noopener noreferrer">docs on formats</a>, which are more like a developer reference.</p>
<p>Prometheus' TSDB has been attracting lots of new contributors lately, and understanding it has been one of the pain points due to the lack of resources. So, I plan to discuss the working of the TSDB in detail in a series of blog posts, along with some references to the code for contributors.</p>
<p>In this blog post, I mainly talk about the in-memory part of the TSDB (the Head block). In future blog posts I will dive deeper into other components: the WAL and its checkpointing, how the memory-mapping of chunks is designed, compaction, the persistent blocks and their index, and the upcoming snapshotting of chunks.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="prologue">Prologue<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block#prologue" class="hash-link" aria-label="Direct link to Prologue" title="Direct link to Prologue">​</a></h2>
<p><a href="https://web.archive.org/web/20220205173824/https://fabxc.org/tsdb/" target="_blank" rel="noopener noreferrer">Fabian's blog post</a> is a good read to understand the data model, core concepts, and the high level picture of how the TSDB is designed. He also <a href="https://www.youtube.com/watch?v=b_pEevMAC3I" target="_blank" rel="noopener noreferrer">gave a talk at PromCon 2017</a> on this. I recommend reading the blog post or watching the talk before you dive into this one to set a good base.</p>
<p>All of what I explain in <em>this</em> blog post about the lifecycle of a <a href="https://prometheus.io/docs/concepts/data_model/#samples" target="_blank" rel="noopener noreferrer">sample</a> in Head is also explained in my <a href="https://www.youtube.com/watch?v=suMhZfg9Cuk" target="_blank" rel="noopener noreferrer">KubeCon talk</a> if you prefer that.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="small-overview-of-tsdb">Small Overview of TSDB<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block#small-overview-of-tsdb" class="hash-link" aria-label="Direct link to Small Overview of TSDB" title="Direct link to Small Overview of TSDB">​</a></h2>
<p><img decoding="async" loading="lazy" alt="image" src="https://ganeshvernekar.com/assets/images/tsdb1-9dce57fbe455a6163a84d68c9c73c7dd.svg" width="680" height="149" class="img_ev3q"></p>
<p>In the figure above, the Head block is the in-memory part of the database and the grey blocks are persistent blocks on disk which are immutable. We have a Write-Ahead-Log (WAL) for durable writes. An incoming sample (the pink box) first goes into the Head block and stays in memory for a while, after which it is flushed to the disk and memory-mapped (the blue box). When these memory-mapped chunks or the in-memory chunks get old to a certain point, they are flushed to the disk as persistent blocks. Further, multiple blocks are merged as they get old, and they are finally deleted once they go beyond the retention period.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="life-of-a-sample-in-the-head">Life of a Sample in the Head<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block#life-of-a-sample-in-the-head" class="hash-link" aria-label="Direct link to Life of a Sample in the Head" title="Direct link to Life of a Sample in the Head">​</a></h2>
<p>All the discussions here are about a single <a href="https://prometheus.io/docs/concepts/data_model/" target="_blank" rel="noopener noreferrer">time series</a> and the same applies to all the series.</p>
<p><img decoding="async" loading="lazy" alt="image" src="https://ganeshvernekar.com/assets/images/tsdb2-3e96b764cc0a7e28988714462be15b02.svg" width="875" height="364" class="img_ev3q"></p>
<p>The samples are stored in compressed units called a "chunk". When a sample is incoming, it is ingested into the "active chunk" (the red block). It is the only unit where we can actively write data.</p>
<p>While committing the sample into the chunk, we also record it in the Write-Ahead-Log (WAL) on disk (the brown block) for durability (which means we can recover the in-memory data from that even if the machine crashes abruptly). I will write a separate blog post about how WAL is handled in Prometheus.</p>
<p><img decoding="async" loading="lazy" alt="image" src="https://ganeshvernekar.com/assets/images/tsdb3-fcc2a659bb9dc466f2ad51278b9ef940.svg" width="875" height="364" class="img_ev3q"></p>
<p>Once the chunk fills up to 120 samples or spans up to the chunk/block range (let's call it <code>chunkRange</code>), which is 2h by default, a new chunk is cut and the old chunk is said to be "full". For this blog post, we will consider the scrape interval to be 15s, so 120 samples (a full chunk) would span 30m.</p>
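<p>The arithmetic above can be checked with a minimal Go sketch. The names <code>samplesPerChunk</code> and <code>chunkSpan</code> are illustrative assumptions, not identifiers from the Prometheus code base.</p>
<pre><code class="language-go">package main

import (
	"fmt"
	"time"
)

// samplesPerChunk is the sample count at which a chunk is considered full,
// as stated in the text. The name is hypothetical.
const samplesPerChunk = 120

// chunkSpan returns how much time a full chunk covers for a given scrape interval.
func chunkSpan(scrapeInterval time.Duration) time.Duration {
	return samplesPerChunk * scrapeInterval
}

func main() {
	fmt.Println(chunkSpan(15 * time.Second)) // 30m0s
}
</code></pre>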
<p>The yellow block with number 1 on it is the full chunk which just got filled while the red chunk is the new chunk that was created.</p>
<p><img decoding="async" loading="lazy" alt="image" src="https://ganeshvernekar.com/assets/images/tsdb4-5db3bd1d5402bab9a0804723ad2c79aa.svg" width="871" height="397" class="img_ev3q"></p>
<p>Since Prometheus v2.19.0, we are not storing all the chunks in the memory. As soon as a new chunk is cut, the full chunk is flushed to the disk and memory-mapped from the disk while only storing a reference in the memory. With memory-mapping, we can dynamically load the chunk into the memory with that reference when needed; it's a feature provided by the Operating System.</p>
<p><img decoding="async" loading="lazy" alt="image" src="https://ganeshvernekar.com/assets/images/tsdb5-1d622e6852dde75dd1dbf97fa930dacf.svg" width="871" height="397" class="img_ev3q"></p>
<p>Similarly, as new samples keep coming in, new chunks are cut.</p>
<p><img decoding="async" loading="lazy" alt="image" src="https://ganeshvernekar.com/assets/images/tsdb6-a57d88b200d5914f19f376d8c0603d52.svg" width="871" height="397" class="img_ev3q"></p>
<p>And they are flushed to the disk and memory-mapped.</p>
<p><img decoding="async" loading="lazy" alt="image" src="https://ganeshvernekar.com/assets/images/tsdb7-0ce6f9b57faf5450373e12db03c2bab7.svg" width="871" height="397" class="img_ev3q"></p>
<p>After some time the Head block would look like above. If we consider the red chunk to be almost full, then we have 3h of data in Head (6 chunks spanning 30m each). That is <code>chunkRange*3/2</code>.</p>
<p><img decoding="async" loading="lazy" alt="image" src="https://ganeshvernekar.com/assets/images/tsdb8-2143f3ae9296366a5998fb78ee2320d1.svg" width="871" height="397" class="img_ev3q"></p>
<p><img decoding="async" loading="lazy" alt="image" src="https://ganeshvernekar.com/assets/images/tsdb9-73e001cb1662df81b619a2bafc33351d.svg" width="871" height="397" class="img_ev3q"></p>
<p>When the data in the Head spans <code>chunkRange*3/2</code>, the first <code>chunkRange</code> of data (2h here) is compacted into a persistent block. As you may have noticed above, the WAL is truncated at this point and a "checkpoint" is created (not shown in the diagram). I will be going into the details of this checkpointing, WAL truncation, compaction, and the persistent block and its index in future blog posts.</p>
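<p>As a rough illustration of the trigger described above, here is a small Go sketch. The function <code>shouldCompact</code> is a hypothetical helper written for this post, and using "at least" as the comparison is an assumption; the real check lives in the TSDB code and may differ in detail.</p>
<pre><code class="language-go">package main

import (
	"fmt"
	"time"
)

// shouldCompact reports whether the Head spans at least chunkRange*3/2.
// Hypothetical helper: it only mirrors the rule described in the text.
func shouldCompact(headSpan, chunkRange time.Duration) bool {
	return headSpan >= chunkRange*3/2
}

func main() {
	chunkRange := 2 * time.Hour
	fmt.Println(shouldCompact(150*time.Minute, chunkRange)) // false
	fmt.Println(shouldCompact(3*time.Hour, chunkRange))     // true
}
</code></pre>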
<p>This cycle of ingestion of samples, memory-mapping, compaction to form a persistent block, continues. And this forms the basic functionality of the Head block.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="few-more-things-to-noteunderstand">Few more things to note/understand<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block#few-more-things-to-noteunderstand" class="hash-link" aria-label="Direct link to Few more things to note/understand" title="Direct link to Few more things to note/understand">​</a></h2>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="where-is-the-index">Where is the index?<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block#where-is-the-index" class="hash-link" aria-label="Direct link to Where is the index?" title="Direct link to Where is the index?">​</a></h3>
<p>It is in the memory and stored as an inverted index. More about the overall idea of this index is in Fabian's blog post. When the compaction of Head block occurs creating a persistent block, Head block is truncated to remove old chunks and garbage collection is done on this index to remove any series entries that do not exist anymore in the Head.</p>
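<p>To give a feel for what an inverted index is, here is a toy Go sketch that maps each label pair to the IDs of the series carrying it. The types and names are assumptions made for illustration only and are far simpler than the index in the Head.</p>
<pre><code class="language-go">package main

import "fmt"

// label is a single name/value pair attached to a series.
type label struct{ name, value string }

// invertedIndex maps each label pair to the IDs of the series that carry it.
type invertedIndex map[label][]uint64

// add registers a series ID under every one of its labels.
func (ix invertedIndex) add(id uint64, labels ...label) {
	for _, l := range labels {
		ix[l] = append(ix[l], id)
	}
}

func main() {
	ix := invertedIndex{}
	ix.add(1, label{"job", "api"}, label{"instance", "a"})
	ix.add(2, label{"job", "api"}, label{"instance", "b"})
	fmt.Println(ix[label{"job", "api"}]) // [1 2]
}
</code></pre>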
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="handling-restarts">Handling Restarts<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block#handling-restarts" class="hash-link" aria-label="Direct link to Handling Restarts" title="Direct link to Handling Restarts">​</a></h3>
<p>In case the TSDB has to restart (gracefully or abruptly), it uses the on-disk memory-mapped chunks and the WAL to replay the data and events and reconstruct the in-memory index and chunks.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="code-reference">Code reference<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block#code-reference" class="hash-link" aria-label="Direct link to Code reference" title="Direct link to Code reference">​</a></h2>
<p><a href="https://github.com/prometheus/prometheus/blob/master/tsdb/db.go" target="_blank" rel="noopener noreferrer"><code>tsdb/db.go</code></a> coordinates the overall functioning of the TSDB.</p>
<p>For the parts relevant in the blog post, the core logic of ingestion for the in-memory chunks is all in <a href="https://github.com/prometheus/prometheus/blob/master/tsdb/head.go" target="_blank" rel="noopener noreferrer"><code>tsdb/head.go</code></a> which uses WAL and memory mapping as a black box.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="here-is-the-entire-prometheus-tsdb-blog-series">Here is the entire Prometheus TSDB blog series<a href="https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block#here-is-the-entire-prometheus-tsdb-blog-series" class="hash-link" aria-label="Direct link to Here is the entire Prometheus TSDB blog series" title="Direct link to Here is the entire Prometheus TSDB blog series">​</a></h2>
<ol>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/">Prometheus TSDB (Part 1): The Head Block</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint/">Prometheus TSDB (Part 2): WAL and Checkpoint</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk/">Prometheus TSDB (Part 3): Memory Mapping of Head Chunks from Disk</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index/">Prometheus TSDB (Part 4): Persistent Block and its Index</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-queries/">Prometheus TSDB (Part 5): Queries</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention/">Prometheus TSDB (Part 6): Compaction and Retention</a></li>
<li><a href="https://ganeshvernekar.com/blog/prometheus-tsdb-snapshot-on-shutdown/">Prometheus TSDB (Part 7): Snapshot on Shutdown</a></li>
</ol>]]></content:encoded>
            <category>Prometheus</category>
            <category>TSDB</category>
            <category>In-Memory Database</category>
        </item>
        <item>
            <title><![CDATA[“Optimisations” to Avoid]]></title>
            <link>https://ganeshvernekar.com/blog/optimisations-to-avoid</link>
            <guid>https://ganeshvernekar.com/blog/optimisations-to-avoid</guid>
            <pubDate>Sun, 06 Sep 2020 00:00:00 GMT</pubDate>
            <description><![CDATA[Collection of stuff that I learnt on my journey as a Software Engineer]]></description>
            <content:encoded><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="introduction">Introduction<a href="https://ganeshvernekar.com/blog/optimisations-to-avoid#introduction" class="hash-link" aria-label="Direct link to Introduction" title="Direct link to Introduction">​</a></h2>
<p>During the transition from writing code for University assignments to writing real-world software, I had to unlearn many things and learn not to overuse low-level optimisations where they were not required. I had a lot of exposure to compiler optimisations at my University.</p>
<p>I am going to share a couple of “optimisations” to avoid that I often see newbies do, and I won't deny that I have done them myself (and still do). I have that in quotes because they are not really optimisations when all factors are considered. I will follow up with more blog posts when I have more patterns to share.</p>
<p>As I work mostly on Go, some terms that I use will be Go specific (for example <code>package</code>).</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-optimisations">The Optimisations<a href="https://ganeshvernekar.com/blog/optimisations-to-avoid#the-optimisations" class="hash-link" aria-label="Direct link to The Optimisations" title="Direct link to The Optimisations">​</a></h2>
<blockquote>
<p>"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We <em>should</em> forget about small efficiencies, say about 97% of the time: <strong>premature optimization is the root of all evil.</strong> Yet we should not pass up our opportunities in that critical 3%." — Donald Knuth</p>
</blockquote>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="1-not-breaking-a-function-into-smaller-logical-functions">1. Not breaking a function into smaller logical functions<a href="https://ganeshvernekar.com/blog/optimisations-to-avoid#1-not-breaking-a-function-into-smaller-logical-functions" class="hash-link" aria-label="Direct link to 1. Not breaking a function into smaller logical functions" title="Direct link to 1. Not breaking a function into smaller logical functions">​</a></h3>
<p>...to avoid more function calls and to maximise the reuse of variables; at least that’s what I thought when I used to keep one big function.</p>
<p>While calling many different methods is a performance overhead in terms of copying of function arguments and machine instruction indirection which might lead to cache misses, it is often too tiny to even notice. Smaller isolated logic is easier to understand, saves time in reviews and debugging, and is easier to change.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="2-using-globals-when-not-necessary">2. Using globals when not necessary<a href="https://ganeshvernekar.com/blog/optimisations-to-avoid#2-using-globals-when-not-necessary" class="hash-link" aria-label="Direct link to 2. Using globals when not necessary" title="Direct link to 2. Using globals when not necessary">​</a></h3>
<p>I still remember the time when I used globals to “efficiently” pass data between functions during one of my internships and my manager was visibly annoyed. The fix was to pass those globals as function arguments from the origin.</p>
<p>While it might seem that globals simplify your code, when used incorrectly they can often bring a lot of problems (especially when the package is used in more than one place) and make the behaviour hard to understand. Use globals only for constants and, in rare cases, for config variables which cannot be passed in other ways.</p>
<p>If the function argument to be passed is large, then pass it as a reference (with proper care to not change the data where not intended).</p>
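<p>A minimal Go sketch of that last point, assuming a hypothetical <code>report</code> type that is large enough for copying to matter:</p>
<pre><code class="language-go">package main

import "fmt"

// report stands in for a large value that would be costly to copy on every call.
type report struct {
	samples [1024]float64
}

// sum only reads through the pointer; it never modifies the caller's data.
func sum(r *report) float64 {
	total := 0.0
	for i := range r.samples {
		total += r.samples[i]
	}
	return total
}

func main() {
	r := new(report) // new returns a pointer, so calling sum does not copy the array
	r.samples[0] = 1.5
	r.samples[1] = 2.5
	fmt.Println(sum(r)) // 4
}
</code></pre>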
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="3-dont-spend-a-lot-of-time-optimising">3. Don’t spend a lot of time optimising<a href="https://ganeshvernekar.com/blog/optimisations-to-avoid#3-dont-spend-a-lot-of-time-optimising" class="hash-link" aria-label="Direct link to 3. Don’t spend a lot of time optimising" title="Direct link to 3. Don’t spend a lot of time optimising">​</a></h3>
<p>This one is repeating the above quote which I came across midway writing this blog post: don’t spend a lot of time optimising the code which is not on the hot path and that is not run often. Focus on readability and evolvability. The time spent on such optimisations often negates the benefits (if any).</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="epilogue">Epilogue<a href="https://ganeshvernekar.com/blog/optimisations-to-avoid#epilogue" class="hash-link" aria-label="Direct link to Epilogue" title="Direct link to Epilogue">​</a></h2>
<p>Keep it simple. It is usually the case that someone else is going to touch (and also maintain) the code that you wrote. That's one of the biggest lessons I've learnt since I started "real world" coding where you often collaborate.</p>]]></content:encoded>
            <category>Software Development</category>
            <category>Beginner</category>
        </item>
        <item>
            <title><![CDATA[Google Summer of Code 2018 with Prometheus]]></title>
            <link>https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus</link>
            <guid>https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus</guid>
            <pubDate>Wed, 22 Aug 2018 00:00:00 GMT</pubDate>
            <description><![CDATA[Summary of my GSoC 2018 with Prometheus]]></description>
            <content:encoded><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="introduction">Introduction<a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#introduction" class="hash-link" aria-label="Direct link to Introduction" title="Direct link to Introduction">​</a></h2>
<p>I successfully completed Google Summer of Code with <a href="http://prometheus.io/" target="_blank" rel="noopener noreferrer">Prometheus</a> in the summer of 2018. I was mentored by <a href="https://github.com/gouthamve" target="_blank" rel="noopener noreferrer">Goutham Veeramachaneni</a>.</p>
<p>I did 3 independent additions/fixes as a part of my GSoC, all related to rules/alerting rules. Apart from my proposal, I also fixed some bugs in <a href="https://github.com/prometheus/tsdb" target="_blank" rel="noopener noreferrer">prometheus/tsdb</a> during the GSoC period.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="fixes">Fixes<a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#fixes" class="hash-link" aria-label="Direct link to Fixes" title="Direct link to Fixes">​</a></h2>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="1-persist-for-state-of-alerts">1) Persist <code>for</code> State of Alerts<a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#1-persist-for-state-of-alerts" class="hash-link" aria-label="Direct link to 1-persist-for-state-of-alerts" title="Direct link to 1-persist-for-state-of-alerts">​</a></h3>
<p>Prometheus had one serious long-standing issue: if the Prometheus server crashes, the state of the alert is lost.</p>
<p>Consider that you have an alert with <code>for</code> duration as <code>24hrs</code>, and Prometheus crashed while that alert has been active for <code>23hrs</code>, i.e. 1hr before it would fire. Now when Prometheus is started again, it would have to wait for <code>24hrs</code> again before firing!</p>
<p><a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#1-persist-for-state-of-alerts-1">Jump to this section</a></p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="features">Features<a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#features" class="hash-link" aria-label="Direct link to Features" title="Direct link to Features">​</a></h2>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="2-unit-testing-for-rules">2) Unit Testing for Rules<a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#2-unit-testing-for-rules" class="hash-link" aria-label="Direct link to 2) Unit Testing for Rules" title="Direct link to 2) Unit Testing for Rules">​</a></h3>
<p>Alerting is an important feature in monitoring when it comes to maintaining site reliability, and Prometheus is being used widely for this. We also record many rules to visualise later. Hence it becomes very important to be able to check the correctness of the rules.</p>
<p>In this feature, I added the support of unit testing for both alerting and recording rules.</p>
<p><a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#2-unit-testing-for-rules-1">Jump to this section</a></p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="3-ui-for-testing-alerting-rules">3) UI for Testing Alerting Rules<a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#3-ui-for-testing-alerting-rules" class="hash-link" aria-label="Direct link to 3) UI for Testing Alerting Rules" title="Direct link to 3) UI for Testing Alerting Rules">​</a></h3>
<p>As you saw above, alerting rules are important for monitoring, yet Prometheus lacks a good and convenient way of visualising and testing alerting rules before they are used.</p>
<p>In this feature I add a UI for entering your alerting rules and testing+visualising it on the real data that is there in your server.</p>
<p><a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#3-ui-for-testing-alerting-rules-1">Jump to this section</a></p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="epilogue">Epilogue<a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#epilogue" class="hash-link" aria-label="Direct link to Epilogue" title="Direct link to Epilogue">​</a></h2>
<p>This work would not have been possible without valuable inputs and reviews by <a href="https://github.com/brian-brazil" target="_blank" rel="noopener noreferrer">Brian Brazil</a> and <a href="https://github.com/juliusv" target="_blank" rel="noopener noreferrer">Julius Volz</a>.</p>
<p>I gave a lightning talk at <a href="https://promcon.io/2018-munich/" target="_blank">PromCon 2018</a> regarding all that you read above. It was held in Munich, Germany.</p>
<hr>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="1-persist-for-state-of-alerts-1">1) Persist <code>for</code> State of Alerts<a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#1-persist-for-state-of-alerts-1" class="hash-link" aria-label="Direct link to 1-persist-for-state-of-alerts-1" title="Direct link to 1-persist-for-state-of-alerts-1">​</a></h2>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="introduction-1">Introduction<a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#introduction-1" class="hash-link" aria-label="Direct link to Introduction" title="Direct link to Introduction">​</a></h3>
<p>This happens to be the first issue I fixed during my GSoC. You can find the <a href="https://github.com/prometheus/prometheus/pull/4061" target="_blank" rel="noopener noreferrer">PR#4061 here</a>, which is already merged into Prometheus master and is available from <code>v2.4.0</code> onwards.</p>
<p>This post assumes that you have a basic understanding of what monitoring is and how alerting is related to it.
If you are new to this world, <a href="https://www.digitalocean.com/community/tutorials/an-introduction-to-metrics-monitoring-and-alerting" target="_blank" rel="noopener noreferrer">this post</a> should help you get started.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="issue">Issue<a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#issue" class="hash-link" aria-label="Direct link to Issue" title="Direct link to Issue">​</a></h3>
<p>To talk about alerting in Prometheus in layman's terms, an alerting rule consists of a <code>condition</code>, a <code>for</code> duration, and a <code>blackbox</code> to handle the alert.
So the simple trick here is, if the <code>condition</code> is <code>true</code> for the <code>for</code> duration, we trigger an alert (called 'firing' the alert) and give it to the <code>blackbox</code> to handle in whatever way it wants, which can be sending a mail, a message in Slack, etc.</p>
<p>As discussed <a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#1-persist-for-state-of-alerts" target="_blank">here</a>, consider that you have an alert with <code>for</code> duration as <code>24hrs</code>, and Prometheus crashed while that alert has been active (<code>condition</code> is <code>true</code>) for <code>23hrs</code>, i.e. 1hr before it would fire. Now when Prometheus is started again, it would have to wait for <code>24hrs</code> again before firing!</p>
<p>You can find the <a href="https://github.com/prometheus/prometheus/issues/422" target="_blank">GitHub issue #422 here</a>.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="the-fix">The Fix<a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#the-fix" class="hash-link" aria-label="Direct link to The Fix" title="Direct link to The Fix">​</a></h3>
<p>Use time series to store the state! The procedure is something like this:</p>
<ol>
<li>During every evaluation of alerting rules, we record the <code>ActiveAt</code> (when the <code>condition</code> became <code>true</code> for the first time) of every alert in a time series with the name <code>ALERTS_FOR_STATE</code>, with all the labels of that alert. This is like any other time series, but it is only stored locally.</li>
<li>When Prometheus is restarted, a job runs for restoring the state of alerts after the second evaluation. We wait till the second evaluation so that we have enough data scraped to know the current active alerts.</li>
<li>For each alert which is active right now, the job looks for its corresponding <code>ALERTS_FOR_STATE</code> time series. The timestamp and the value of the last sample of the series tell us when Prometheus went down and when the alert became active.</li>
<li>So if the <code>for</code> duration was say <code>D</code>, the alert became active at <code>X</code> and Prometheus crashed at <code>Y</code>, then the alert has to wait for a further <code>D-(Y-X)</code> (Why? Think!). So the variables of the alert are adjusted to make it wait for <code>D-(Y-X)</code> more time before firing, and not <code>D</code>.</li>
</ol>
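<p>The <code>D-(Y-X)</code> calculation from the last step can be worked through with the 24hrs/23hrs example used earlier. This is only a sketch: <code>remainingWait</code> is a hypothetical helper, not the function used in Prometheus, and the date is arbitrary.</p>
<pre><code class="language-go">package main

import (
	"fmt"
	"time"
)

// remainingWait returns D-(Y-X): the for duration minus the time the alert
// had already been active before Prometheus went down.
func remainingWait(d time.Duration, activeAt, downAt time.Time) time.Duration {
	return d - downAt.Sub(activeAt)
}

func main() {
	x := time.Date(2018, 8, 1, 0, 0, 0, 0, time.UTC) // alert became active (X)
	y := x.Add(23 * time.Hour)                       // Prometheus crashed (Y)
	fmt.Println(remainingWait(24*time.Hour, x, y))   // 1h0m0s
}
</code></pre>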
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="things-to-keep-in-mind">Things to keep in mind<a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#things-to-keep-in-mind" class="hash-link" aria-label="Direct link to Things to keep in mind" title="Direct link to Things to keep in mind">​</a></h3>
<p><code>rules.alert.for-outage-tolerance | default=1h</code></p>
<p>This flag specifies how much downtime Prometheus will tolerate. So if Prometheus has been down longer than the time set in this flag, then the state of the alerts is not restored. So make sure to either change the value of the flag depending on your needs or get Prometheus up soon!</p>
<p><code>rules.alert.for-grace-period | default=10m</code></p>
<p>We would not like to fire an alert just after Prometheus is up. So we introduce something called "grace period", where if <code>D-(Y-X)</code> happens to be less than <code>rules.alert.for-grace-period</code>, then we wait for the grace period duration before firing the alert.</p>
<p>Note: We follow this logic only if the <code>for</code> duration was itself <code>&gt;= rules.alert.for-grace-period</code>.</p>
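<p>Putting the two flags together, the decision can be sketched roughly as below. This is one reading of the rules above, not the Prometheus implementation: <code>waitAfterRestart</code> and its parameters are hypothetical, the constants are the defaults quoted above, and treating the note as applying only to the grace period is an assumption.</p>
<pre><code class="language-go">package main

import (
	"fmt"
	"time"
)

// Defaults quoted in the text above.
const (
	outageTolerance = time.Hour
	gracePeriod     = 10 * time.Minute
)

// waitAfterRestart returns how long an alert still has to wait after a restart
// and whether its state was restored. remaining is D-(Y-X).
func waitAfterRestart(forDuration, remaining, downtime time.Duration) (time.Duration, bool) {
	if downtime > outageTolerance {
		return forDuration, false // down for too long: state is not restored
	}
	if forDuration >= gracePeriod {
		if gracePeriod > remaining {
			return gracePeriod, true // do not fire right after coming up
		}
	}
	return remaining, true
}

func main() {
	w, ok := waitAfterRestart(24*time.Hour, 5*time.Minute, 20*time.Minute)
	fmt.Println(w, ok) // 10m0s true
	w, ok = waitAfterRestart(24*time.Hour, time.Hour, 2*time.Hour)
	fmt.Println(w, ok) // 24h0m0s false
}
</code></pre>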
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="gotchas">Gotchas<a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#gotchas" class="hash-link" aria-label="Direct link to Gotchas" title="Direct link to Gotchas">​</a></h3>
<p>As the <code>ALERTS_FOR_STATE</code> series is stored in local storage, if you happen to lose the local TSDB data while Prometheus is down, then you lose the state of the alert permanently.</p>
<hr>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="2-unit-testing-for-rules-1">2) Unit Testing for Rules<a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#2-unit-testing-for-rules-1" class="hash-link" aria-label="Direct link to 2) Unit Testing for Rules" title="Direct link to 2) Unit Testing for Rules">​</a></h2>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="introduction-2">Introduction<a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#introduction-2" class="hash-link" aria-label="Direct link to Introduction" title="Direct link to Introduction">​</a></h3>
<p>It is always good to do one last check of all the components of your code before you deploy it. We have seen how important alerting and recording are in the monitoring world. So why not test even the alerting and recording rules?</p>
<p>This was proposed long back in this <a href="https://github.com/prometheus/prometheus/issues/1695" target="_blank" rel="noopener noreferrer">GitHub issue #1695</a>, and I worked on this during my GSoC. The work can be found in <a href="https://github.com/prometheus/prometheus/pull/4350" target="_blank" rel="noopener noreferrer">this PR#4350</a>, which has been merged with Prometheus master.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="syntax">Syntax<a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#syntax" class="hash-link" aria-label="Direct link to Syntax" title="Direct link to Syntax">​</a></h3>
<p>We use a separate file for specifying unit tests for alerting rules and PromQL expressions (in place of recording rules).
The syntax of the file is based on <a href="https://docs.google.com/document/d/1vH5nxFcdVCXlTuUfXwPoRghiE79Ce9_eq0Je3a5g2Vs/edit?usp=sharing" target="_blank" rel="noopener noreferrer">this design doc</a>, which was continuously reviewed by Prometheus members.</p>
<p><em>Edit:</em> This blog post will not be updated with changes to unit testing and may become outdated, so also have a look at the official documentation <a href="https://github.com/prometheus/prometheus/blob/master/docs/configuration/unit_testing_rules.md" target="_blank" rel="noopener noreferrer">here</a>.</p>
<h4 class="anchor anchorWithStickyNavbar_LWe7" id="the-file">The File<a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#the-file" class="hash-link" aria-label="Direct link to The File" title="Direct link to The File">​</a></h4>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># This is a list of rule files to consider for testing.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">rule_files</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> &lt;file_name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># optional, default = 1m</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">evaluation_interval</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> &lt;duration</span><span class="token punctuation" 
style="color:rgb(199, 146, 234)">&gt;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># The order in which group names are listed below will be the order of evaluation of </span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># rule groups (at a given evaluation time). The order is guaranteed only for the groups mentioned below. </span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># All the groups need not be mentioned below.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">group_eval_order</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> &lt;group_name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># All the tests are listed here.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">tests</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> &lt;test_group</span><span class="token punctuation" style="color:rgb(199, 146, 234)">&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><br></span></code></pre></div></div>
<h4 class="anchor anchorWithStickyNavbar_LWe7" id="test_group"><code>&lt;test_group&gt;</code><a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#test_group" class="hash-link" aria-label="Direct link to test_group" title="Direct link to test_group">​</a></h4>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Series data</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">interval</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> &lt;duration</span><span class="token punctuation" style="color:rgb(199, 146, 234)">&gt;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">input_series</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> &lt;series</span><span class="token punctuation" style="color:rgb(199, 146, 234)">&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Unit tests for the above data.</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Unit tests for alerting rules. We consider the alerting rules from the input file.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">alert_rule_test</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> &lt;alert_test_case</span><span class="token punctuation" style="color:rgb(199, 146, 234)">&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Unit tests for PromQL expressions.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">promql_expr_test</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(199, 146, 
234)">[</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> &lt;promql_test_case</span><span class="token punctuation" style="color:rgb(199, 146, 234)">&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span></code></pre></div></div>
<h4 class="anchor anchorWithStickyNavbar_LWe7" id="series"><code>&lt;series&gt;</code><a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#series" class="hash-link" aria-label="Direct link to series" title="Direct link to series">​</a></h4>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># This follows the series notation (x{a="b", c="d"}). You can see an example below.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">series</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> &lt;string</span><span class="token punctuation" style="color:rgb(199, 146, 234)">&gt;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># This uses expanding notation. Example below.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">values</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> &lt;string</span><span class="token punctuation" style="color:rgb(199, 146, 234)">&gt;</span><br></span></code></pre></div></div>
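<p>As a rough illustration of these two fields (the series name and labels are made up for this sketch): to my understanding, the expanding notation <code>a+bxn</code> is shorthand for <code>a a+b a+(2*b) ... a+(n*b)</code>, with one value per <code>interval</code>.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Hypothetical series, used only for illustration.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">series: 'http_requests_total{job="api", code="200"}'</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Expands to: 0 10 20 30 (samples at 0m, 1m, 2m and 3m if the interval is 1m).</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">values: '0+10x3'</span><br></span></code></pre></div></div>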
<h4 class="anchor anchorWithStickyNavbar_LWe7" id="alert_test_case"><code>&lt;alert_test_case&gt;</code><a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#alert_test_case" class="hash-link" aria-label="Direct link to alert_test_case" title="Direct link to alert_test_case">​</a></h4>
<p>Prometheus allows you to have the same alertname for different alerting rules. Hence, in unit testing, you have to list the union of all the <strong>firing alerts</strong> for the alertname under a single <code>&lt;alert_test_case&gt;</code>.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># It's the time elapsed from time=0s when the alerts have to be checked.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">eval_time</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> &lt;duration</span><span class="token punctuation" style="color:rgb(199, 146, 234)">&gt;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Name of the alert to be tested.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">alertname</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> &lt;string</span><span class="token punctuation" style="color:rgb(199, 146, 234)">&gt;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># List of expected alerts which are firing under 
the given alertname at </span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># given evaluation time. If you want to test if an alerting rule should </span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># not be firing, then you can mention the above fields and leave 'exp_alerts' empty.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">exp_alerts</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> &lt;alert</span><span class="token punctuation" style="color:rgb(199, 146, 234)">&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><br></span></code></pre></div></div>
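<p>For instance, a negative test could look roughly like the sketch below. The alert name and timings are assumptions for illustration: a rule named <code>InstanceDown</code> with <code>for: 5m</code>, whose condition became true at time 0, should still be pending (not firing) at 2m.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Check at 2m, before the assumed 5m 'for' duration has passed.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">eval_time: 2m</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">alertname: InstanceDown</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Empty list: no alert with this alertname is expected to be firing.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">exp_alerts: []</span><br></span></code></pre></div></div>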
<h4 class="anchor anchorWithStickyNavbar_LWe7" id="alert"><code>&lt;alert&gt;</code><a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#alert" class="hash-link" aria-label="Direct link to alert" title="Direct link to alert">​</a></h4>
<p>Remember, this alert should be firing.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># These are the expanded labels and annotations of the expected alert. </span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Note: labels also include the labels of the sample associated with the </span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># alert (same as what you see in `/alerts`, without series `__name__` and `alertname`)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">exp_labels</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token plain"> </span><span class="token key atrule">&lt;labelname&gt;</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> &lt;string</span><span class="token punctuation" style="color:rgb(199, 146, 234)">&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token plain"></span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">exp_annotations</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token plain"> </span><span class="token key atrule">&lt;labelname&gt;</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> &lt;string</span><span class="token punctuation" style="color:rgb(199, 146, 234)">&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><br></span></code></pre></div></div>
<h4 class="anchor anchorWithStickyNavbar_LWe7" id="promql_test_case"><code>&lt;promql_test_case&gt;</code><a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#promql_test_case" class="hash-link" aria-label="Direct link to promql_test_case" title="Direct link to promql_test_case">​</a></h4>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Expression to evaluate</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">expr</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> &lt;string</span><span class="token punctuation" style="color:rgb(199, 146, 234)">&gt;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># It's the time elapsed from time=0s when the expression has to be evaluated.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">eval_time</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> &lt;duration</span><span class="token punctuation" style="color:rgb(199, 146, 234)">&gt;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Expected samples at the given evaluation time.</span><span 
class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">exp_samples</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> &lt;sample</span><span class="token punctuation" style="color:rgb(199, 146, 234)">&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><br></span></code></pre></div></div>
<h4 class="anchor anchorWithStickyNavbar_LWe7" id="sample"><code>&lt;sample&gt;</code><a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#sample" class="hash-link" aria-label="Direct link to sample" title="Direct link to sample">​</a></h4>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Labels of the sample in series notation.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">labels</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> &lt;series_notation</span><span class="token punctuation" style="color:rgb(199, 146, 234)">&gt;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># The expected value of the promql expression.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">value</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> &lt;number</span><span class="token punctuation" style="color:rgb(199, 146, 234)">&gt;</span><br></span></code></pre></div></div>
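<p>Putting <code>&lt;promql_test_case&gt;</code> and <code>&lt;sample&gt;</code> together, a test case might look like the sketch below. It assumes a hypothetical input series <code>http_requests_total{job="api"}</code> with values <code>0+10x5</code> at a 1m interval, so the sample at 3m should have the value 30.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Assumes the input series http_requests_total{job="api"} with values '0+10x5'.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">expr: http_requests_total</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">eval_time: 3m</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">exp_samples:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  - labels: 'http_requests_total{job="api"}'</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    value: 30</span><br></span></code></pre></div></div>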
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="example">Example<a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#example" class="hash-link" aria-label="Direct link to Example" title="Direct link to Example">​</a></h3>
<p>These are example input files for unit testing, and the test passes with them. <code>alerts.yml</code> contains the alerting rules and <code>test.yml</code> follows the syntax above.</p>
<h4 class="anchor anchorWithStickyNavbar_LWe7" id="alertsyml"><code>alerts.yml</code><a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#alertsyml" class="hash-link" aria-label="Direct link to alertsyml" title="Direct link to alertsyml">​</a></h4>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># This is the rules file.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">groups</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule">name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> example</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">rules</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule">alert</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> InstanceDown</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">expr</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> up == 0</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">for</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> 5m</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">labels</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">        </span><span class="token key atrule">severity</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> page</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">annotations</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">        </span><span class="token key atrule">summary</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">"Instance {{ $labels.instance }} down"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">        </span><span class="token key atrule">description</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">"{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 
minutes."</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule">alert</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> AnotherInstanceDown</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">expr</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> up == 0</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">for</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> 10m</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">labels</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">        </span><span class="token key atrule">severity</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> page</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">annotations</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">        </span><span class="token key atrule">summary</span><span class="token punctuation" style="color:rgb(199, 146, 
234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">"Instance {{ $labels.instance }} down"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">        </span><span class="token key atrule">description</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">"{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."</span><span class="token plain"></span><br></span></code></pre></div></div>
<h4 class="anchor anchorWithStickyNavbar_LWe7" id="testyml"><code>test.yml</code><a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#testyml" class="hash-link" aria-label="Direct link to testyml" title="Direct link to testyml">​</a></h4>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># This is the main input for unit testing. </span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Only this file is passed as command line argument.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">rule_files</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> alerts.yml</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">evaluation_interval</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> 1m</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key 
atrule">tests</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> </span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Test 1.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule">interval</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> 1m</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Series data.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token key atrule">input_series</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">          </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule">series</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">'up{job="prometheus", instance="localhost:9090"}'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">            </span><span class="token key atrule">values</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 
141)">'0 0 0 0 0 0 0 0 0 0 0 0 0 0 0'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">          </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule">series</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">'up{job="node_exporter", instance="localhost:9100"}'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">            </span><span class="token key atrule">values</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">'1 1 1 1 1 1 1 0 0 0 0 0 0 0 0'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">          </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule">series</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">'go_goroutines{job="prometheus", instance="localhost:9090"}'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">            </span><span class="token key atrule">values</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">'10+10x2 30+20x5'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">          </span><span class="token punctuation" 
style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule">series</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">'go_goroutines{job="node_exporter", instance="localhost:9100"}'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">            </span><span class="token key atrule">values</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">'10+10x7 10+30x4'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> </span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Unit test for alerting rules.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token key atrule">alert_rule_test</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Unit test 1.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">        </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule">eval_time</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> 10m</span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token plain">          </span><span class="token key atrule">alertname</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> InstanceDown</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">          </span><span class="token key atrule">exp_alerts</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Alert 1.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">              </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule">exp_labels</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">                    </span><span class="token key atrule">severity</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> page</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">                    </span><span class="token key atrule">instance</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> localhost</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token number" style="color:rgb(247, 140, 108)">9090</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">                    </span><span class="token key atrule">job</span><span class="token punctuation" 
style="color:rgb(199, 146, 234)">:</span><span class="token plain"> prometheus</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">                </span><span class="token key atrule">exp_annotations</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">                    </span><span class="token key atrule">summary</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">"Instance localhost:9090 down"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">                    </span><span class="token key atrule">description</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">"localhost:9090 of job prometheus has been down for more than 5 minutes."</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> </span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Unit tests for promql expressions.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token key atrule">promql_expr_test</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Unit test 1.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">        </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule">expr</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> go_goroutines </span><span class="token punctuation" style="color:rgb(199, 146, 234)">&gt;</span><span class="token plain"> 5</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">          </span><span class="token key atrule">eval_time</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> 4m</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">          </span><span class="token key atrule">exp_samples</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Sample 1.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">              </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule">labels</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">'go_goroutines{job="prometheus",instance="localhost:9090"}'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">                </span><span class="token key atrule">value</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">50</span><span 
class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Sample 2.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">              </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule">labels</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">'go_goroutines{job="node_exporter",instance="localhost:9100"}'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">                </span><span class="token key atrule">value</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">50</span><br></span></code></pre></div></div>
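<p>The expected values in <code>promql_expr_test</code> follow from how the <code>values</code> strings expand. As a rough sketch, assuming the expanding notation where <code>a+bxn</code> stands for <code>a, a+b, ..., a+n*b</code> and consecutive samples are one <code>interval</code> (1m) apart, the two <code>go_goroutines</code> series unfold as written in the comments below. The extra test entry at <code>2m</code> is hypothetical and only illustrates how an expectation is derived from that expansion.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># '10+10x2 30+20x5' (job="prometheus"), samples at 0m, 1m, ..., 8m:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic">#   10 20 30 30 50 70 90 110 130</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># '10+10x7 10+30x4' (job="node_exporter"), samples at 0m, 1m, ..., 12m:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic">#   10 20 30 40 50 60 70 80 10 40 70 100 130</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Hypothetical extra entry for the promql_expr_test list, read off the expansion above.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">- expr: go_goroutines &gt; 5</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  eval_time: 2m</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  exp_samples:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      - labels: 'go_goroutines{job="prometheus",instance="localhost:9090"}'</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">        value: 30</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      - labels: 'go_goroutines{job="node_exporter",instance="localhost:9100"}'</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">        value: 30</span><br></span></code></pre></div></div>
<p>Read the same way, both series are at 50 at <code>4m</code>, which is what the two samples in <code>test.yml</code> expect.</p>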
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="usage">Usage<a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#usage" class="hash-link" aria-label="Direct link to Usage" title="Direct link to Usage">​</a></h3>
<p>This feature will come embedded in <code>promtool</code>.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain"># For the above example.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">./promtool test rules test.yml</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"># If you have multiple such test files, say test{1,2,3}.yml</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">./promtool test rules test1.yml test2.yml test3.yml</span><br></span></code></pre></div></div>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="what-is-tested">What is tested?<a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#what-is-tested" class="hash-link" aria-label="Direct link to What is tested?" title="Direct link to What is tested?">​</a></h3>
<ol>
<li>Syntax of the rule files included in the test.</li>
<li>Correctness of template variables. Note that if you have used <code>$labels.something_wrong</code>, it won't be caught at this stage.</li>
<li>Whether the alerts listed for the alertname exactly match what we get after simulating the rules over the data.</li>
<li>Exact match for the samples returned by PromQL expressions at the given time. Order doesn't matter.</li>
</ol>
<p>During the matches in steps 3 and 4, any use of <code>$labels.something_wrong</code> will be caught, because it renders as an empty string and the result no longer matches the expectation.</p>
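<p>To make this concrete, here is a small sketch with a made-up typo (<code>instnace</code> instead of <code>instance</code>) in the rule's annotation. The label name and the two fragments are only illustrative; they are not taken from the files above.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Fragment of alerts.yml with a hypothetical typo in the label name.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">annotations:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    summary: "Instance {{ $labels.instnace }} down"</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># Fragment of test.yml stating what the alert should carry.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">exp_annotations:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    summary: "Instance localhost:9090 down"</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># The unknown label renders as an empty string, giving "Instance  down",</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic"># so the expected annotation does not match and the test fails.</span><br></span></code></pre></div></div>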
<hr>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="3-ui-for-testing-alerting-rules-1">3) UI for Testing Alerting Rules<a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#3-ui-for-testing-alerting-rules-1" class="hash-link" aria-label="Direct link to 3) UI for Testing Alerting Rules" title="Direct link to 3) UI for Testing Alerting Rules">​</a></h2>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="introduction-3">Introduction<a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#introduction-3" class="hash-link" aria-label="Direct link to Introduction" title="Direct link to Introduction">​</a></h3>
<p>Before this work, Prometheus lacked a good and convenient way of visualising and testing alert rules before they are put to use.
Requests for this were made long ago in issues <a href="https://github.com/prometheus/prometheus/issues/1154" target="_blank" rel="noopener noreferrer">#1154</a> and <a href="https://github.com/prometheus/prometheus/issues/1220" target="_blank" rel="noopener noreferrer">#1220</a>, both long-standing!</p>
<p>It will be added to Prometheus with <a href="https://github.com/prometheus/prometheus/pull/4277" target="_blank" rel="noopener noreferrer">PR #4277</a>. Now let's learn more about it.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="the-ui">The UI<a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#the-ui" class="hash-link" aria-label="Direct link to The UI" title="Direct link to The UI">​</a></h3>
<p>You will be able to access this tool at <code>/alert-rule-testing</code>.</p>
<p><em>open images in new tab for a better view</em></p>
<p>This is what you will see when you first open it.</p>
<p><img decoding="async" loading="lazy" alt="image" src="https://ganeshvernekar.com/assets/images/uipic1-6302f49ca69cdbffc69a0ad4654369cd.png" width="1920" height="956" class="img_ev3q"></p>
<p>You will enter your rules here in the same format as you would write your rule file.
<img decoding="async" loading="lazy" alt="image" src="https://ganeshvernekar.com/assets/images/uipic2-c589fa77aa316ca2c2853ef8d07dbaeb.png" width="1920" height="956" class="img_ev3q"></p>
<p>After you press <code>Execute</code>, you will see success/error messages here.
<img decoding="async" loading="lazy" alt="image" src="https://ganeshvernekar.com/assets/images/uipic3-4cc572ccd68d44b72b149ec710905af2.png" width="1920" height="956" class="img_ev3q"></p>
<p>If it was a success, you will see the graphs for the alert expression and <code>ALERT</code> series simulated over the existing data. Graphs are plotted only for the active alerts.
<img decoding="async" loading="lazy" alt="image" src="https://ganeshvernekar.com/assets/images/uipic4-01b898e04f6d7d70058268734edfeeca.png" width="1920" height="956" class="img_ev3q"></p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="example-1">Example<a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#example-1" class="hash-link" aria-label="Direct link to Example" title="Direct link to Example">​</a></h3>
<p>Enter a simple alerting rule and hit Execute!
<img decoding="async" loading="lazy" alt="image" src="https://ganeshvernekar.com/assets/images/uipic5-28c338480e6dc912324021f63ff6bd13.png" width="1920" height="960" class="img_ev3q"></p>
<p>There is 1 active alert; you can see its info here.
<img decoding="async" loading="lazy" alt="image" src="https://ganeshvernekar.com/assets/images/uipic6-3fdd00f0256df6e28a2591418c1e7ddc.png" width="1920" height="958" class="img_ev3q"></p>
<p>Graph of the expression and the corresponding <code>ALERT</code> graph. You can see that the alerting rule would have switched between the <code>pending</code> and <code>firing</code> states twice in the current data.
<img decoding="async" loading="lazy" alt="image" src="https://ganeshvernekar.com/assets/images/uipic7-1c380062821bdd02cae7ec3f01422e11.png" width="1920" height="959" class="img_ev3q"></p>
<h4 class="anchor anchorWithStickyNavbar_LWe7" id="example-for-errors">Example for errors<a href="https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#example-for-errors" class="hash-link" aria-label="Direct link to Example for errors" title="Direct link to Example for errors">​</a></h4>
<p>Error in <code>expr</code>.
<img decoding="async" loading="lazy" alt="image" src="https://ganeshvernekar.com/assets/images/uipic8-afbe45457339a7b8c378f1bb1633a163.png" width="1920" height="589" class="img_ev3q"></p>
<p>Error in template variables.
<img decoding="async" loading="lazy" alt="image" src="https://ganeshvernekar.com/assets/images/uipic9-9ec92482c0758be7bf7e264f4438a550.png" width="1920" height="595" class="img_ev3q"></p>
<p>Stay tuned, it will be added to Prometheus soon!</p>]]></content:encoded>
            <category>GSoC</category>
            <category>Prometheus</category>
        </item>
    </channel>
</rss>