When Prometheus emerged from SoundCloud in 2012, it challenged the dominant paradigm of metrics collection. Rather than having applications push metrics to a central collector, Prometheus pulls metrics from applications. This seemingly simple architectural choice has profound implications for reliability, discovery, and operational simplicity.
Prometheus has since become the de facto standard for cloud-native metrics collection, with adoption spanning from startups to the largest technology companies. Its influence is so significant that the Cloud Native Computing Foundation (CNCF) graduated Prometheus as only its second project after Kubernetes.
Understanding Prometheus architecture isn't just about learning one tool—it's about understanding the design principles that govern modern observability systems.
By the end of this page, you will understand Prometheus's core architecture: the pull model, service discovery, the time-series database (TSDB), PromQL query language, and how these components work together. You'll also learn about federation, high availability patterns, and when Prometheus is (or isn't) the right choice.
The most distinctive architectural decision in Prometheus is its pull-based collection model. Traditional monitoring systems like Graphite use a push model—applications send metrics to the monitoring system. Prometheus inverts this: the server periodically scrapes metrics from targets.
Why Pull Works Better for Service Monitoring:
The pull model provides several operational advantages that may not be immediately obvious:
| Dimension | Pull Model (Prometheus) | Push Model (Traditional) |
|---|---|---|
| Target Health | Instantly detect down targets (scrape fails) | Silent failure if target stops pushing |
| Discovery | Central discovery of all targets | Targets must know where to push |
| Rate Control | Server controls scrape rate | Clients can overwhelm server |
| Development | Easy local testing (just expose endpoint) | Need running collector to test |
| Configuration | Centralized at Prometheus server | Distributed across all clients |
| Network | Server initiates connections (easier firewall) | All clients need outbound access |
The Scrape Model:
At its core, Prometheus performs this loop for each target:
1. Discover the target (from static configuration or service discovery)
2. Connect to the target's /metrics endpoint via HTTP(S)
3. Parse the text exposition format into samples
4. Store the samples in the local time-series database, attaching target labels

This model means that your application only needs to expose an HTTP endpoint that returns current metric values. No client library configuration for endpoints, no network issues reaching collectors, no concerns about buffering metrics during outages.
Pull isn't universally better. For short-lived jobs (batch processes, serverless functions), Prometheus provides the Pushgateway—an intermediary that accepts pushed metrics and holds them for Prometheus to scrape. Push also works better across network boundaries where Prometheus can't reach targets. However, for long-running services in modern orchestration environments, pull dominates.
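The Pushgateway itself is then just another scrape target. A minimal configuration sketch (the Pushgateway hostname is an assumption); honor_labels preserves the job and instance labels set by the pushing jobs:

```yaml
# Sketch: scrape the Pushgateway so pushed batch-job metrics reach Prometheus.
# The pushgateway hostname below is illustrative.
scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true   # keep job/instance labels set by the pushing batch jobs
    static_configs:
      - targets: ['pushgateway:9091']
```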
Prometheus's architecture consists of several interconnected components, each with distinct responsibilities:

Core Components:

- Retrieval (scraper): discovers targets and pulls metrics on the configured interval
- TSDB: stores samples in a custom time-series database on local disk
- HTTP server: serves the PromQL API, the expression browser, and the /federate endpoint

External Components:

- Alertmanager: receives alerts from Prometheus and handles deduplication, grouping, and routing to receivers
- Pushgateway: accepts pushed metrics from short-lived jobs for Prometheus to scrape
- Exporters: expose metrics on behalf of third-party systems (node_exporter, blackbox_exporter, database exporters)
- Client libraries: instrument application code and expose the /metrics endpoint
- Visualization tools such as Grafana: query Prometheus through its HTTP API and PromQL
In dynamic environments like Kubernetes, service instances come and go constantly. Static configuration doesn't scale. Prometheus addresses this with powerful service discovery integrations that automatically find and track targets.
Service Discovery Mechanisms:
Prometheus supports numerous discovery mechanisms out of the box:
| Type | Use Case | Key Information Discovered |
|---|---|---|
| kubernetes_sd | K8s pods, services, endpoints | Pod name, namespace, labels, annotations |
| consul_sd | Consul service catalog | Service name, tags, node, datacenter |
| ec2_sd | AWS EC2 instances | Instance ID, tags, availability zone |
| gce_sd | Google Compute Engine | Instance name, labels, zone, project |
| dns_sd | DNS SRV/A records | Hostnames, ports from DNS |
| file_sd | File-based (JSON/YAML) | Manually managed target lists |
| azure_sd | Azure VMs | VM name, resource group, tags |
| openstack_sd | OpenStack instances | Instance name, flavor, metadata |
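Where no discovery API exists, file_sd is the escape hatch: Prometheus watches target files that you generate by any means (cron job, CI pipeline, configuration management). A minimal sketch with hypothetical paths and hostnames; the fuller Kubernetes-based configuration follows below:

```yaml
# prometheus.yml excerpt: watch target files maintained by an external process
scrape_configs:
  - job_name: 'file-discovered'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.yml'
        refresh_interval: 5m
```

```yaml
# /etc/prometheus/targets/workers.yml: the target list itself (picked up on change)
- targets: ['worker-01.example.com:9100', 'worker-02.example.com:9100']
  labels:
    env: 'staging'
    team: 'data'
```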
```yaml
# Prometheus configuration with service discovery
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Rule files
rule_files:
  - '/etc/prometheus/rules/*.yml'

# Scrape configurations
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Kubernetes pods with prometheus.io annotations
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use custom path if specified
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Use custom port if specified
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Add pod labels as metric labels
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      # Add namespace and pod name labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod

  # Kubernetes services for endpoint discovery
  - job_name: 'kubernetes-services'
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_name]
        target_label: service
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
```

Relabeling: The Secret Weapon
Relabeling is Prometheus's powerful mechanism for transforming discovered metadata. It enables:

- Filtering which discovered targets actually get scraped (action: keep or drop)
- Rewriting the scrape address, metrics path, or scheme from discovery metadata
- Mapping discovery metadata (pod labels, instance tags) onto metric labels
- Dropping noisy or high-cardinality series before they are stored

Relabeling happens in three phases:

1. Target relabeling (relabel_configs): applied to discovered targets before each scrape
2. Metric relabeling (metric_relabel_configs): applied to scraped samples before they are written to storage
3. Remote-write relabeling (write_relabel_configs): applied to samples before they are sent to remote storage
Use annotations on your Kubernetes pods to control scraping: prometheus.io/scrape: 'true', prometheus.io/port: '8080', prometheus.io/path: '/metrics'. This makes applications self-describing—Prometheus discovers everything automatically without central configuration changes.
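As an illustration, a workload that opts in might look like the sketch below; the application name, image, and port are hypothetical, and the annotations only take effect because the kubernetes-pods relabel rules shown earlier look for them:

```yaml
# Hypothetical Deployment whose pods opt in to scraping via annotations.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
      annotations:
        prometheus.io/scrape: 'true'    # opt in to scraping
        prometheus.io/port: '8080'      # where metrics are exposed
        prometheus.io/path: '/metrics'  # default path, shown for clarity
    spec:
      containers:
        - name: payments-api
          image: registry.example.com/payments-api:1.2.3
          ports:
            - containerPort: 8080
```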
Prometheus includes a custom time-series database designed specifically for metrics workloads. Understanding its architecture helps you tune performance and plan capacity.
The Data Model:
Every Prometheus time series is uniquely identified by a metric name and a set of key-value labels:
http_requests_total{method="GET", endpoint="/api/users", status="200"}
Internally, this becomes a series identifier that indexes a sequence of (timestamp, value) pairs. The combination of metric name and labels is fingerprinted into a unique series ID.
Storage Architecture:
Prometheus TSDB organizes data into blocks:
```
data/
├── 01BKGV7JBM69T2G1BGBGM6KB12/   # Block (2 hours of data)
│   ├── chunks/                   # Compressed sample data
│   │   └── 000001
│   ├── index                     # Label indexes for fast lookup
│   ├── meta.json                 # Block metadata
│   └── tombstones                # Deleted series markers
├── 01BKGTZQ1SYQJTR4PB43C8PD98/   # Another block
├── 01BKGTZQ1XYQJTR4PB43C8PD99/   # Another block
├── lock
└── wal/                          # Write-ahead log (recent data)
    ├── 00000002
    └── checkpoint.00000001
```
Key Concepts:

- Head block: the most recent samples (roughly the last two hours) live in memory, protected by the write-ahead log (WAL) for crash recovery
- Blocks: data is persisted as immutable two-hour blocks, each with its own chunks, index, and metadata
- Compaction: background jobs merge small blocks into larger ones to keep the index compact and queries fast
- Tombstones: deletions are recorded as markers and physically removed during compaction
- Retention: blocks older than the configured retention period are deleted
| Operation | Typical Performance | Key Factors |
|---|---|---|
| Write (ingest) | 100K+ samples/second | Disk I/O, series churn |
| Query (instant) | <100ms typical | Series count, time range |
| Query (range) | Variable | Series count, duration, step |
| Compaction | Background, low priority | Block count, disk I/O |
| Memory usage | ~1-2KB per active series | Series count, scrape interval |
| Disk usage | ~1-2 bytes per sample | Compression ratio, retention |
Compression:
Prometheus uses specialized compression for time-series data:

- Timestamps are delta-of-delta encoded; regular scrape intervals compress to almost nothing
- Sample values use XOR-based float compression, a technique popularized by Facebook's Gorilla paper
- In practice this yields on the order of 1-2 bytes per sample, as reflected in the table above
This aggressive compression enables Prometheus to store weeks of data for millions of time series on modest hardware.
Capacity Planning Formula:
disk_space = retention_seconds × samples_per_second × bytes_per_sample
Example:
- 10,000 series, 15s scrape interval, 15 days retention, ~1.5 bytes/sample
- samples/second = 10,000 / 15s ≈ 667
- disk = (15 days × 86,400 s/day) × 667 samples/s × 1.5 bytes ≈ 1.3 GB
Series churn occurs when label values constantly change (e.g., using request IDs as labels). Each unique label combination creates a new series. High churn defeats compression, bloats the index, and degrades performance. Avoid labels with unbounded cardinality.
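When an offending label comes from code you don't control, you can strip it at scrape time before it reaches storage. A sketch using metric_relabel_configs; the target address, label name request_id, and metric prefix are hypothetical:

```yaml
# Sketch: defend against unbounded-cardinality labels at ingestion time.
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api.example.com:8080']   # illustrative target
    metric_relabel_configs:
      # Drop series from a hypothetical debug metric family entirely
      - source_labels: [__name__]
        regex: 'debug_per_request_.*'
        action: drop
      # Remove an unwanted label by name (only safe if series do not
      # differ solely by this label, otherwise they would collide)
      - action: labeldrop
        regex: request_id
```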
PromQL (Prometheus Query Language) is a powerful functional language for querying time series data. It's the foundation for dashboards, alerts, and ad-hoc analysis.
Query Types:
PromQL supports four data types:

- Instant vector: a set of time series, each with a single sample at the evaluation timestamp
- Range vector: a set of time series, each with a range of samples over a time window (e.g., [5m])
- Scalar: a single numeric value
- String: a string value (rarely used)
```promql
# SELECTORS
# -----------------------------------------

# Select all series for a metric
http_requests_total

# Filter by label equality
http_requests_total{method="GET"}

# Filter by regex match
http_requests_total{endpoint=~"/api/.*"}

# Negative match
http_requests_total{status!="200"}

# Multiple conditions (AND)
http_requests_total{method="GET", status="200"}

# FUNCTIONS ON COUNTERS
# -----------------------------------------

# Rate: per-second rate over 5 minutes
rate(http_requests_total[5m])

# Increase: total increase over 1 hour
increase(http_requests_total[1h])

# irate: instant rate (last two samples)
irate(http_requests_total[5m])

# AGGREGATIONS
# -----------------------------------------

# Sum across all series
sum(rate(http_requests_total[5m]))

# Sum grouped by method
sum by (method) (rate(http_requests_total[5m]))

# Sum excluding specific labels
sum without (instance, pod) (rate(http_requests_total[5m]))

# Average, min, max, count
avg(rate(http_requests_total[5m]))
max by (endpoint) (rate(http_requests_total[5m]))
count(rate(http_requests_total[5m]) > 100)

# HISTOGRAMS
# -----------------------------------------

# Calculate p99 latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Aggregate across instances, then calculate percentile
histogram_quantile(0.95, sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m])))

# Average request duration
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

# OPERATORS
# -----------------------------------------

# Arithmetic
http_requests_total * 2
node_memory_used_bytes / node_memory_total_bytes * 100

# Comparison (filtering)
rate(http_requests_total[5m]) > 100

# Boolean comparison (returns 0 or 1)
rate(http_requests_total[5m]) > bool 100

# Vector matching
rate(http_requests_total[5m]) / on(endpoint) group_left(team) endpoint_ownership

# ADVANCED
# -----------------------------------------

# Absent: returns 1 if no series exist
absent(up{job="api"})

# Predict: linear prediction of value
predict_linear(node_disk_free_bytes[1h], 4 * 3600)

# Changes: number of value changes in range
changes(process_start_time_seconds[1d])

# Label manipulation
label_replace(up, "dc", "$1", "instance", "([^:]+):.*")
```

Common Query Patterns:
| Pattern | Query | Use Case |
|---|---|---|
| Request rate | sum(rate(http_requests_total[5m])) | Throughput monitoring |
| Error rate | sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) | Error percentage |
| Latency p99 | histogram_quantile(0.99, sum by(le)(rate(http_request_duration_seconds_bucket[5m]))) | SLO tracking |
| Saturation | sum(rate(container_cpu_usage_seconds_total[5m])) / sum(machine_cpu_cores) | Resource usage |
| Up time | avg_over_time(up[1d]) | Availability calculation |
rate() calculates the average per-second rate over the entire range window, smoothing spikes. irate() uses only the last two samples, showing the instantaneous rate. Rule of thumb: use rate() for dashboards and alerts (more stable), and reach for irate() when debugging or zooming in on fast-changing spikes.
Recording rules allow you to precompute frequently used or computationally expensive expressions and save them as new time series. This is critical for maintaining dashboard performance and reducing query load.
Why Recording Rules Matter:

- Dashboards re-run the same aggregations on every refresh; precomputing them keeps panels responsive
- Alert rules are evaluated continuously and should stay cheap and predictable
- Shared, well-named aggregations keep teams querying the same definitions
```yaml
# Recording rules for API metrics
groups:
  - name: api_metrics
    interval: 30s  # Evaluation interval
    rules:
      # Request rate by endpoint (used in many dashboards)
      - record: job:http_requests:rate5m
        expr: sum by (job, endpoint) (rate(http_requests_total[5m]))

      # Error ratio by service
      - record: job:http_errors:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

      # P99 latency by endpoint (expensive to compute)
      - record: job:http_request_duration:p99
        expr: |
          histogram_quantile(0.99,
            sum by (job, endpoint, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )

      # P50 latency by endpoint
      - record: job:http_request_duration:p50
        expr: |
          histogram_quantile(0.50,
            sum by (job, endpoint, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )

  - name: infrastructure_metrics
    interval: 1m
    rules:
      # CPU usage per node
      - record: node:cpu_usage:ratio1m
        expr: |
          1 - avg by (node) (
            rate(node_cpu_seconds_total{mode="idle"}[1m])
          )

      # Memory usage per node
      - record: node:memory_usage:ratio
        expr: |
          1 - (
            node_memory_MemAvailable_bytes
            /
            node_memory_MemTotal_bytes
          )

      # Pod restart rate (signals instability)
      - record: namespace:pod_restarts:increase1h
        expr: |
          sum by (namespace) (
            increase(kube_pod_container_status_restarts_total[1h])
          )
```

Naming Convention:
Prometheus recommends a consistent naming pattern for recording rules:
level:metric:operations
Where:

- level: the aggregation level and labels of the rule output (job, instance, namespace)
- metric: the metric name
- operations: the operations applied to the metric, newest first (rate5m, ratio, increase1h)
Examples:
- job:http_requests:rate5m — Requests per second by job
- instance:cpu_seconds:rate1m — CPU usage by instance
- namespace:pod_restarts:increase24h — Pod restarts by namespace over 24h

Recording rules consume storage (new time series) and computation (rule evaluation). Don't pre-compute everything—focus on expensive queries, frequently used aggregations, and metrics needed for alerts. A good rule of thumb: if a query takes >500ms or appears in multiple dashboards, consider a recording rule.
Prometheus was designed as a single-server system with local storage. For production reliability and scale, you need to understand HA patterns and federation.
High Availability Pattern:
The simplest HA approach is running two identical Prometheus instances scraping the same targets. Both instances have complete data, and you query either one (behind a load balancer).
Important considerations for HA:

- The replicas scrape independently, so samples have slightly different timestamps and values; graphs from the two instances will not be bit-identical
- Both replicas should send to the same (clustered) Alertmanager, which deduplicates the resulting alerts
- There is no backfill: if one replica is down, it simply has a gap for that window
- Distinguish the replicas with external labels so downstream systems can tell them apart, as shown in the sketch below
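A minimal sketch of that convention, assuming hypothetical cluster and Alertmanager names; each replica carries the same scrape configuration and differs only in its replica external label:

```yaml
# Sketch for replica A (replica B is identical except replica: 'B').
global:
  scrape_interval: 15s
  external_labels:
    cluster: 'prod-us-east'   # illustrative
    replica: 'A'

alerting:
  # Drop the replica label on outgoing alerts so Alertmanager sees the
  # two replicas' alerts as duplicates and deduplicates them.
  alert_relabel_configs:
    - regex: replica
      action: labeldrop
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager-1:9093', 'alertmanager-2:9093']
```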
Federation: Hierarchical Prometheus:
Federation allows a Prometheus server to scrape selected time series from another Prometheus server. This enables hierarchical architectures:
```yaml
# Global Prometheus configuration - scrapes from regional instances
scrape_configs:
  - job_name: 'federate-region-us-east'
    honor_labels: true
    metrics_path: '/federate'
    params:
      # Only pull aggregated metrics (recording rules)
      'match[]':
        - '{__name__=~"job:.*"}'
        - '{__name__=~"node:.*"}'
    static_configs:
      - targets:
          - 'prometheus-us-east.example.com:9090'

  - job_name: 'federate-region-us-west'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
        - '{__name__=~"node:.*"}'
    static_configs:
      - targets:
          - 'prometheus-us-west.example.com:9090'

  - job_name: 'federate-region-eu'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
        - '{__name__=~"node:.*"}'
    static_configs:
      - targets:
          - 'prometheus-eu.example.com:9090'
```

Federation Use Cases:

- Hierarchical federation: regional or per-cluster Prometheus servers scrape everything locally, while a global instance federates only their aggregated recording rules for a worldwide view
- Cross-service federation: one team's Prometheus pulls a small, selected set of series from another team's server, for example to correlate with an upstream dependency
Thanos / Cortex / Mimir:
For enterprise scale, consider Prometheus-compatible systems that add:

- Long-term storage in object stores (S3, GCS, Azure Blob)
- A global query view across many Prometheus instances
- Downsampling of historical data for fast long-range queries
- Multi-tenancy and horizontally scalable ingestion and querying
Thanos, Cortex, and Mimir all maintain PromQL compatibility while adding these enterprise features.
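The usual integration point for Cortex and Mimir (and Thanos Receive) is Prometheus remote write. A minimal sketch; the endpoint URL, credentials, and relabel rule are placeholders:

```yaml
# Sketch: ship samples to a Prometheus-compatible long-term store.
remote_write:
  - url: 'https://metrics.example.com/api/v1/push'   # placeholder endpoint
    basic_auth:
      username: 'tenant-1'
      password_file: '/etc/prometheus/remote-write-password'
    write_relabel_configs:
      # Optionally ship only aggregated (recording-rule) series to control cost
      - source_labels: [__name__]
        regex: '(job|node|namespace):.*'
        action: keep
```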
Only federate recording rules (pre-aggregated data), not raw metrics. Federating raw data defeats the purpose—you'd recreate scalability bottlenecks at the global level. Use naming conventions (job:, node:) to easily match aggregated metrics.
Prometheus is powerful but not universal. Understanding its strengths and limitations helps you choose the right tool.
Key Limitations:

- Local storage only: durability and retention are bounded by a single node's disk; long-term or durable storage requires remote write or systems like Thanos
- No built-in clustering: a single server does not shard or replicate its data
- Sampled, not event-level: scraped metrics cannot answer per-request questions and are not suitable for 100%-accurate billing
- Cardinality sensitivity: high-cardinality labels inflate memory and index size quickly
Choosing Between Systems:
| Requirement | Prometheus | InfluxDB | Datadog | CloudWatch |
|---|---|---|---|---|
| Self-hosted | ✅ | ✅ | ❌ SaaS | ❌ SaaS |
| Kubernetes-native | ✅ | ⚠️ | ⚠️ | ⚠️ |
| Long-term storage | ⚠️ (Thanos) | ✅ | ✅ | ✅ |
| Cost (at scale) | Low | Medium | High | Medium |
| Operational burden | Medium | Medium | Low | Low |
| Query language | PromQL | Flux/InfluxQL | Custom | Custom |
Prometheus handles 80% of metrics use cases exceptionally well. For the other 20% (billing accuracy, years of retention, event-level data), combine Prometheus with specialized systems. Most organizations run Prometheus alongside CloudWatch, Datadog, or custom solutions for specific needs.
Prometheus's architecture choices—pull-based collection, service discovery, local TSDB, and the powerful PromQL language—make it the standard for cloud-native metrics. Understanding these components enables effective instrumentation, querying, and scaling.
What's Next:
With Prometheus architecture understood, the next page focuses on metric naming conventions. Proper naming is critical for query consistency, dashboard reuse, and maintainable observability infrastructure. We'll cover standard conventions, anti-patterns, and how to design metric names that scale with your organization.
You now understand Prometheus's core architecture and how its components work together for scalable metrics collection. This knowledge forms the foundation for configuring, querying, and scaling Prometheus in production environments.