When Prometheus emerged from SoundCloud in 2012, it challenged the dominant paradigm of metrics collection. Rather than having applications push metrics to a central collector, Prometheus pulls metrics from applications. This seemingly simple architectural choice has profound implications for reliability, discovery, and operational simplicity.
Prometheus has since become the de facto standard for cloud-native metrics collection, with adoption spanning from startups to the largest technology companies. Its influence is so significant that the Cloud Native Computing Foundation (CNCF) graduated Prometheus as only its second project after Kubernetes.
Understanding Prometheus architecture isn't just about learning one tool—it's about understanding the design principles that govern modern observability systems.
By the end of this page, you will understand Prometheus's core architecture: the pull model, service discovery, the time-series database (TSDB), PromQL query language, and how these components work together. You'll also learn about federation, high availability patterns, and when Prometheus is (or isn't) the right choice.
The most distinctive architectural decision in Prometheus is its pull-based collection model. Traditional monitoring systems like Graphite use a push model—applications send metrics to the monitoring system. Prometheus inverts this: the server periodically scrapes metrics from targets.
Why Pull Works Better for Service Monitoring:
The pull model provides several operational advantages that may not be immediately obvious:
| Dimension | Pull Model (Prometheus) | Push Model (Traditional) |
|---|---|---|
| Target Health | Instantly detect down targets (scrape fails) | Silent failure if target stops pushing |
| Discovery | Central discovery of all targets | Targets must know where to push |
| Rate Control | Server controls scrape rate | Clients can overwhelm server |
| Development | Easy local testing (just expose endpoint) | Need running collector to test |
| Configuration | Centralized at Prometheus server | Distributed across all clients |
| Network | Server initiates connections (easier firewall) | All clients need outbound access |
The Scrape Model:
At its core, Prometheus performs this loop for each target:
1. Discover the target (from static configuration or service discovery)
2. Connect to the target's /metrics endpoint via HTTP(S)
3. Parse the text exposition format into samples
4. Store the samples in the local time-series database, attaching target labels

This model means that your application only needs to expose an HTTP endpoint that returns current metric values. No client library configuration for endpoints, no network issues reaching collectors, no concerns about buffering metrics during outages.
Pull isn't universally better. For short-lived jobs (batch processes, serverless functions), Prometheus provides the Pushgateway—an intermediary that accepts pushed metrics and holds them for Prometheus to scrape. Push also works better across network boundaries where Prometheus can't reach targets. However, for long-running services in modern orchestration environments, pull dominates.
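The Pushgateway itself is then just another scrape target. A minimal configuration sketch (the Pushgateway hostname is an assumption); honor_labels preserves the job and instance labels set by the pushing jobs:

```yaml
# Sketch: scrape the Pushgateway so pushed batch-job metrics reach Prometheus.
# The pushgateway hostname below is illustrative.
scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true   # keep job/instance labels set by the pushing batch jobs
    static_configs:
      - targets: ['pushgateway:9091']
```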
Prometheus's architecture consists of several interconnected components, each with distinct responsibilities:

Core Components:

- Retrieval (scraper): discovers targets and pulls metrics on the configured interval
- TSDB: stores samples in a custom time-series database on local disk
- HTTP server: serves the PromQL API, the expression browser, and the /federate endpoint

External Components:

- Alertmanager: receives alerts from Prometheus and handles deduplication, grouping, and routing to receivers
- Pushgateway: accepts pushed metrics from short-lived jobs for Prometheus to scrape
- Exporters: expose metrics on behalf of third-party systems (node_exporter, blackbox_exporter, database exporters)
- Client libraries: instrument application code and expose the /metrics endpoint
- Visualization tools such as Grafana: query Prometheus through its HTTP API and PromQL
In dynamic environments like Kubernetes, service instances come and go constantly. Static configuration doesn't scale. Prometheus addresses this with powerful service discovery integrations that automatically find and track targets.
Service Discovery Mechanisms:
Prometheus supports numerous discovery mechanisms out of the box:
| Type | Use Case | Key Information Discovered |
|---|---|---|
| kubernetes_sd | K8s pods, services, endpoints | Pod name, namespace, labels, annotations |
| consul_sd | Consul service catalog | Service name, tags, node, datacenter |
| ec2_sd | AWS EC2 instances | Instance ID, tags, availability zone |
| gce_sd | Google Compute Engine | Instance name, labels, zone, project |
| dns_sd | DNS SRV/A records | Hostnames, ports from DNS |
| file_sd | File-based (JSON/YAML) | Manually managed target lists |
| azure_sd | Azure VMs | VM name, resource group, tags |
| openstack_sd | OpenStack instances | Instance name, flavor, metadata |
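Where no discovery API exists, file_sd is the escape hatch: Prometheus watches target files that you generate by any means (cron job, CI pipeline, configuration management). A minimal sketch with hypothetical paths and hostnames; the fuller Kubernetes-based configuration follows below:

```yaml
# prometheus.yml excerpt: watch target files maintained by an external process
scrape_configs:
  - job_name: 'file-discovered'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.yml'
        refresh_interval: 5m
```

```yaml
# /etc/prometheus/targets/workers.yml: the target list itself (picked up on change)
- targets: ['worker-01.example.com:9100', 'worker-02.example.com:9100']
  labels:
    env: 'staging'
    team: 'data'
```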
```yaml
# Prometheus configuration with service discovery
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Rule files
rule_files:
  - '/etc/prometheus/rules/*.yml'

# Scrape configurations
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Kubernetes pods with prometheus.io annotations
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use custom path if specified
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Use custom port if specified
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Add pod labels as metric labels
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      # Add namespace and pod name labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod

  # Kubernetes services for endpoint discovery
  - job_name: 'kubernetes-services'
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_name]
        target_label: service
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
```

Relabeling: The Secret Weapon
Relabeling is Prometheus's powerful mechanism for transforming discovered metadata. It enables:

- Filtering which discovered targets actually get scraped (action: keep or drop)
- Rewriting the scrape address, metrics path, or scheme from discovery metadata
- Mapping discovery metadata (pod labels, instance tags) onto metric labels
- Dropping noisy or high-cardinality series before they are stored

Relabeling happens in three phases:

1. Target relabeling (relabel_configs): applied to discovered targets before each scrape
2. Metric relabeling (metric_relabel_configs): applied to scraped samples before they are written to storage
3. Remote-write relabeling (write_relabel_configs): applied to samples before they are sent to remote storage
Use annotations on your Kubernetes pods to control scraping: prometheus.io/scrape: 'true', prometheus.io/port: '8080', prometheus.io/path: '/metrics'. This makes applications self-describing—Prometheus discovers everything automatically without central configuration changes.
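As an illustration, a workload that opts in might look like the sketch below; the application name, image, and port are hypothetical, and the annotations only take effect because the kubernetes-pods relabel rules shown earlier look for them:

```yaml
# Hypothetical Deployment whose pods opt in to scraping via annotations.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
      annotations:
        prometheus.io/scrape: 'true'    # opt in to scraping
        prometheus.io/port: '8080'      # where metrics are exposed
        prometheus.io/path: '/metrics'  # default path, shown for clarity
    spec:
      containers:
        - name: payments-api
          image: registry.example.com/payments-api:1.2.3
          ports:
            - containerPort: 8080
```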
Prometheus includes a custom time-series database designed specifically for metrics workloads. Understanding its architecture helps you tune performance and plan capacity.
The Data Model:
Every Prometheus time series is uniquely identified by a metric name and a set of key-value labels:
http_requests_total{method="GET", endpoint="/api/users", status="200"}
Internally, this becomes a series identifier that indexes a sequence of (timestamp, value) pairs. The combination of metric name and labels is fingerprinted into a unique series ID.
Storage Architecture:
Prometheus TSDB organizes data into blocks:
```
data/
├── 01BKGV7JBM69T2G1BGBGM6KB12/   # Block (2 hours of data)
│   ├── chunks/                   # Compressed sample data
│   │   └── 000001
│   ├── index                     # Label indexes for fast lookup
│   ├── meta.json                 # Block metadata
│   └── tombstones                # Deleted series markers
├── 01BKGTZQ1SYQJTR4PB43C8PD98/   # Another block
├── 01BKGTZQ1XYQJTR4PB43C8PD99/   # Another block
├── lock
└── wal/                          # Write-ahead log (recent data)
    ├── 00000002
    └── checkpoint.00000001
```
Key Concepts:

- Head block: the most recent samples (roughly the last two hours) live in memory, protected by the write-ahead log (WAL) for crash recovery
- Blocks: data is persisted as immutable two-hour blocks, each with its own chunks, index, and metadata
- Compaction: background jobs merge small blocks into larger ones to keep the index compact and queries fast
- Tombstones: deletions are recorded as markers and physically removed during compaction
- Retention: blocks older than the configured retention period are deleted
| Operation | Typical Performance | Key Factors |
|---|---|---|
| Write (ingest) | 100K+ samples/second | Disk I/O, series churn |
| Query (instant) | <100ms typical | Series count, time range |
| Query (range) | Variable | Series count, duration, step |
| Compaction | Background, low priority | Block count, disk I/O |
| Memory usage | ~1-2KB per active series | Series count, scrape interval |
| Disk usage | ~1-2 bytes per sample | Compression ratio, retention |
Compression:
Prometheus uses specialized compression for time-series data:

- Timestamps are delta-of-delta encoded; regular scrape intervals compress to almost nothing
- Sample values use XOR-based float compression, a technique popularized by Facebook's Gorilla paper
- In practice this yields on the order of 1-2 bytes per sample, as reflected in the table above
This aggressive compression enables Prometheus to store weeks of data for millions of time series on modest hardware.
Capacity Planning Formula:
disk_space = retention_seconds × samples_per_second × bytes_per_sample
Example:
- 10,000 series, 15s scrape interval, 15 days retention, ~1.5 bytes/sample
- samples/second = 10,000 / 15s ≈ 667
- disk = (15 days × 86,400 s/day) × 667 samples/s × 1.5 bytes ≈ 1.3 GB
Series churn occurs when label values constantly change (e.g., using request IDs as labels). Each unique label combination creates a new series. High churn defeats compression, bloats the index, and degrades performance. Avoid labels with unbounded cardinality.
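When an offending label comes from code you don't control, you can strip it at scrape time before it reaches storage. A sketch using metric_relabel_configs; the target address, label name request_id, and metric prefix are hypothetical:

```yaml
# Sketch: defend against unbounded-cardinality labels at ingestion time.
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api.example.com:8080']   # illustrative target
    metric_relabel_configs:
      # Drop series from a hypothetical debug metric family entirely
      - source_labels: [__name__]
        regex: 'debug_per_request_.*'
        action: drop
      # Remove an unwanted label by name (only safe if series do not
      # differ solely by this label, otherwise they would collide)
      - action: labeldrop
        regex: request_id
```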
PromQL (Prometheus Query Language) is a powerful functional language for querying time series data. It's the foundation for dashboards, alerts, and ad-hoc analysis.
Query Types:
PromQL supports four data types:

- Instant vector: a set of time series, each with a single sample at the evaluation timestamp
- Range vector: a set of time series, each with a range of samples over a time window (e.g., [5m])
- Scalar: a single numeric value
- String: a string value (rarely used)
```promql
# SELECTORS
# -----------------------------------------

# Select all series for a metric
http_requests_total

# Filter by label equality
http_requests_total{method="GET"}

# Filter by regex match
http_requests_total{endpoint=~"/api/.*"}

# Negative match
http_requests_total{status!="200"}

# Multiple conditions (AND)
http_requests_total{method="GET", status="200"}

# FUNCTIONS ON COUNTERS
# -----------------------------------------

# Rate: per-second rate over 5 minutes
rate(http_requests_total[5m])

# Increase: total increase over 1 hour
increase(http_requests_total[1h])

# irate: instant rate (last two samples)
irate(http_requests_total[5m])

# AGGREGATIONS
# -----------------------------------------

# Sum across all series
sum(rate(http_requests_total[5m]))

# Sum grouped by method
sum by (method) (rate(http_requests_total[5m]))

# Sum excluding specific labels
sum without (instance, pod) (rate(http_requests_total[5m]))

# Average, min, max, count
avg(rate(http_requests_total[5m]))
max by (endpoint) (rate(http_requests_total[5m]))
count(rate(http_requests_total[5m]) > 100)

# HISTOGRAMS
# -----------------------------------------

# Calculate p99 latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Aggregate across instances, then calculate percentile
histogram_quantile(0.95, sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m])))

# Average request duration
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

# OPERATORS
# -----------------------------------------

# Arithmetic
http_requests_total * 2
node_memory_used_bytes / node_memory_total_bytes * 100

# Comparison (filtering)
rate(http_requests_total[5m]) > 100

# Boolean comparison (returns 0 or 1)
rate(http_requests_total[5m]) > bool 100

# Vector matching
rate(http_requests_total[5m]) / on(endpoint) group_left(team) endpoint_ownership

# ADVANCED
# -----------------------------------------

# Absent: returns 1 if no series exist
absent(up{job="api"})

# Predict: linear prediction of value
predict_linear(node_disk_free_bytes[1h], 4 * 3600)

# Changes: number of value changes in range
changes(process_start_time_seconds[1d])

# Label manipulation
label_replace(up, "dc", "$1", "instance", "([^:]+):.*")
```

Common Query Patterns:
| Pattern | Query | Use Case |
|---|---|---|
| Request rate | sum(rate(http_requests_total[5m])) | Throughput monitoring |
| Error rate | sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) | Error percentage |
| Latency p99 | histogram_quantile(0.99, sum by(le)(rate(http_request_duration_seconds_bucket[5m]))) | SLO tracking |
| Saturation | sum(rate(container_cpu_usage_seconds_total[5m])) / sum(machine_cpu_cores) | Resource usage |
| Up time | avg_over_time(up[1d]) | Availability calculation |
rate() calculates the average per-second rate over the entire range window, smoothing spikes. irate() uses only the last two samples, showing the instantaneous rate. Rule of thumb: use rate() for dashboards and alerts (more stable), and reach for irate() when debugging or zooming in on fast-changing spikes.
Recording rules allow you to precompute frequently used or computationally expensive expressions and save them as new time series. This is critical for maintaining dashboard performance and reducing query load.
Why Recording Rules Matter:

- Dashboards re-run the same aggregations on every refresh; precomputing them keeps panels responsive
- Alert rules are evaluated continuously and should stay cheap and predictable
- Shared, well-named aggregations keep teams querying the same definitions
```yaml
# Recording rules for API metrics
groups:
  - name: api_metrics
    interval: 30s  # Evaluation interval
    rules:
      # Request rate by endpoint (used in many dashboards)
      - record: job:http_requests:rate5m
        expr: sum by (job, endpoint) (rate(http_requests_total[5m]))

      # Error ratio by service
      - record: job:http_errors:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

      # P99 latency by endpoint (expensive to compute)
      - record: job:http_request_duration:p99
        expr: |
          histogram_quantile(0.99,
            sum by (job, endpoint, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )

      # P50 latency by endpoint
      - record: job:http_request_duration:p50
        expr: |
          histogram_quantile(0.50,
            sum by (job, endpoint, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )

  - name: infrastructure_metrics
    interval: 1m
    rules:
      # CPU usage per node
      - record: node:cpu_usage:ratio1m
        expr: |
          1 - avg by (node) (
            rate(node_cpu_seconds_total{mode="idle"}[1m])
          )

      # Memory usage per node
      - record: node:memory_usage:ratio
        expr: |
          1 - (
            node_memory_MemAvailable_bytes
            /
            node_memory_MemTotal_bytes
          )

      # Pod restart rate (signals instability)
      - record: namespace:pod_restarts:increase1h
        expr: |
          sum by (namespace) (
            increase(kube_pod_container_status_restarts_total[1h])
          )
```

Naming Convention:
Prometheus recommends a consistent naming pattern for recording rules:
level:metric:operations
Where:

- level: the aggregation level and labels of the rule output (job, instance, namespace)
- metric: the metric name
- operations: the operations applied to the metric, newest first (rate5m, ratio, increase1h)
Examples:
- job:http_requests:rate5m — Requests per second by job
- instance:cpu_seconds:rate1m — CPU usage by instance
- namespace:pod_restarts:increase24h — Pod restarts by namespace over 24h

Recording rules consume storage (new time series) and computation (rule evaluation). Don't pre-compute everything—focus on expensive queries, frequently used aggregations, and metrics needed for alerts. A good rule of thumb: if a query takes >500ms or appears in multiple dashboards, consider a recording rule.
Prometheus was designed as a single-server system with local storage. For production reliability and scale, you need to understand HA patterns and federation.
High Availability Pattern:
The simplest HA approach is running two identical Prometheus instances scraping the same targets. Both instances have complete data, and you query either one (behind a load balancer).
Important considerations for HA:

- The replicas scrape independently, so samples have slightly different timestamps and values; graphs from the two instances will not be bit-identical
- Both replicas should send to the same (clustered) Alertmanager, which deduplicates the resulting alerts
- There is no backfill: if one replica is down, it simply has a gap for that window
- Distinguish the replicas with external labels so downstream systems can tell them apart, as shown in the sketch below
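A minimal sketch of that convention, assuming hypothetical cluster and Alertmanager names; each replica carries the same scrape configuration and differs only in its replica external label:

```yaml
# Sketch for replica A (replica B is identical except replica: 'B').
global:
  scrape_interval: 15s
  external_labels:
    cluster: 'prod-us-east'   # illustrative
    replica: 'A'

alerting:
  # Drop the replica label on outgoing alerts so Alertmanager sees the
  # two replicas' alerts as duplicates and deduplicates them.
  alert_relabel_configs:
    - regex: replica
      action: labeldrop
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager-1:9093', 'alertmanager-2:9093']
```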
Federation: Hierarchical Prometheus:
Federation allows a Prometheus server to scrape selected time series from another Prometheus server. This enables hierarchical architectures:
```yaml
# Global Prometheus configuration - scrapes from regional instances
scrape_configs:
  - job_name: 'federate-region-us-east'
    honor_labels: true
    metrics_path: '/federate'
    params:
      # Only pull aggregated metrics (recording rules)
      'match[]':
        - '{__name__=~"job:.*"}'
        - '{__name__=~"node:.*"}'
    static_configs:
      - targets:
          - 'prometheus-us-east.example.com:9090'

  - job_name: 'federate-region-us-west'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
        - '{__name__=~"node:.*"}'
    static_configs:
      - targets:
          - 'prometheus-us-west.example.com:9090'

  - job_name: 'federate-region-eu'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
        - '{__name__=~"node:.*"}'
    static_configs:
      - targets:
          - 'prometheus-eu.example.com:9090'
```

Federation Use Cases:

- Hierarchical federation: regional or per-cluster Prometheus servers scrape everything locally, while a global instance federates only their aggregated recording rules for a worldwide view
- Cross-service federation: one team's Prometheus pulls a small, selected set of series from another team's server, for example to correlate with an upstream dependency
Thanos / Cortex / Mimir:
For enterprise scale, consider Prometheus-compatible systems that add:

- Long-term storage in object stores (S3, GCS, Azure Blob)
- A global query view across many Prometheus instances
- Downsampling of historical data for fast long-range queries
- Multi-tenancy and horizontally scalable ingestion and querying
Thanos, Cortex, and Mimir all maintain PromQL compatibility while adding these enterprise features.
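The usual integration point for Cortex and Mimir (and Thanos Receive) is Prometheus remote write. A minimal sketch; the endpoint URL, credentials, and relabel rule are placeholders:

```yaml
# Sketch: ship samples to a Prometheus-compatible long-term store.
remote_write:
  - url: 'https://metrics.example.com/api/v1/push'   # placeholder endpoint
    basic_auth:
      username: 'tenant-1'
      password_file: '/etc/prometheus/remote-write-password'
    write_relabel_configs:
      # Optionally ship only aggregated (recording-rule) series to control cost
      - source_labels: [__name__]
        regex: '(job|node|namespace):.*'
        action: keep
```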
Only federate recording rules (pre-aggregated data), not raw metrics. Federating raw data defeats the purpose—you'd recreate scalability bottlenecks at the global level. Use naming conventions (job:, node:) to easily match aggregated metrics.
Prometheus is powerful but not universal. Understanding its strengths and limitations helps you choose the right tool.
Key Limitations:

- Local storage only: durability and retention are bounded by a single node's disk; long-term or durable storage requires remote write or systems like Thanos
- No built-in clustering: a single server does not shard or replicate its data
- Sampled, not event-level: scraped metrics cannot answer per-request questions and are not suitable for 100%-accurate billing
- Cardinality sensitivity: high-cardinality labels inflate memory and index size quickly
Choosing Between Systems:
| Requirement | Prometheus | InfluxDB | Datadog | CloudWatch |
|---|---|---|---|---|
| Self-hosted | ✅ | ✅ | ❌ SaaS | ❌ SaaS |
| Kubernetes-native | ✅ | ⚠️ | ⚠️ | ⚠️ |
| Long-term storage | ⚠️ (Thanos) | ✅ | ✅ | ✅ |
| Cost (at scale) | Low | Medium | High | Medium |
| Operational burden | Medium | Medium | Low | Low |
| Query language | PromQL | Flux/InfluxQL | Custom | Custom |
Prometheus handles 80% of metrics use cases exceptionally well. For the other 20% (billing accuracy, years of retention, event-level data), combine Prometheus with specialized systems. Most organizations run Prometheus alongside CloudWatch, Datadog, or custom solutions for specific needs.
Prometheus's architecture choices—pull-based collection, service discovery, local TSDB, and the powerful PromQL language—make it the standard for cloud-native metrics. Understanding these components enables effective instrumentation, querying, and scaling.
What's Next:
With Prometheus architecture understood, the next page focuses on metric naming conventions. Proper naming is critical for query consistency, dashboard reuse, and maintainable observability infrastructure. We'll cover standard conventions, anti-patterns, and how to design metric names that scale with your organization.
You now understand Prometheus's core architecture and how its components work together for scalable metrics collection. This knowledge forms the foundation for configuring, querying, and scaling Prometheus in production environments.