In a monolithic application, understanding system behavior is relatively straightforward: a single process, a single log file, a single memory space to profile. But when a user request traverses ten microservices across three availability zones, questions like "Which service added the latency?" and "Where did the error originate?" become hard to answer.
Traditional observability requires instrumenting each service—adding tracing libraries, exposing metrics endpoints, configuring logging. This creates inconsistent coverage, maintenance burden, and gaps where instrumentation was forgotten.
Service mesh flips this model. Because every request flows through sidecar proxies, the mesh observes every interaction automatically. Metrics, traces, and logs emerge from infrastructure, not application code. Observability becomes a platform capability, not a per-service development task.
By the end of this page, you will understand how service mesh automatically generates metrics, traces, and access logs; the three pillars of observability and how mesh addresses each; service graph visualization and dependency mapping; integration with observability platforms (Prometheus, Jaeger, Grafana, Kiali); and debugging approaches enabled by mesh observability.
Modern observability practice centers on three complementary signal types, often called the "three pillars." Service mesh contributes to all three, but each serves different diagnostic purposes.
1. Metrics (What is happening?)
Metrics are numerical measurements aggregated over time—request counts, error rates, latency percentiles. They answer questions like "Is the system healthy?" and "How has performance trended?"
2. Traces (Why is it happening?)
Traces follow individual requests across service boundaries, showing the path, timing, and relationships. They answer "Why was this specific request slow?" and "What services were involved?"
3. Logs (What happened in detail?)
Logs are timestamped records of discrete events—errors, state changes, significant actions. They answer "What exactly occurred at 15:32:47?" and "What was the request payload?"
| Aspect | Metrics | Traces | Logs |
|---|---|---|---|
| Data Type | Numeric, aggregated | Structured, per-request | Text/structured, per-event |
| Cardinality | Low (aggregated) | Medium (sampled) | High (every event) |
| Storage Cost | Low | Medium | High |
| Analysis Type | Statistical, trending | Causal, path analysis | Forensic, debugging |
| Typical Questions | "What's the error rate?" | "Why was this request slow?" | "What was the exact error message?" |
| Mesh Contribution | Automatic golden signals | Automatic trace propagation | Access logging per request |
Mesh Observability Philosophy:
Service mesh observability is infrastructure-generated, not application-generated. The sidecar proxy sees every request and response—it generates telemetry as a byproduct of its routing function.
This provides consistent coverage across every meshed service, uniform metric names and labels regardless of language or framework, and telemetry that requires no per-service instrumentation work.
The trade-off: mesh observability is primarily network-level. It sees request/response patterns but not application internals. Deep profiling, business logic tracing, and application-specific metrics still require application instrumentation.
Don't view mesh observability as replacing application observability. They're complementary. Mesh provides the network view (service-to-service communication). Applications provide the business view (user sessions, transactions, domain events). The complete picture requires both.
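To make that split concrete, here is a minimal sketch of the application side using the Node prom-client library with an Express app (the library choice, metric name, and labels are illustrative assumptions, not prescribed by any mesh):

```javascript
// Business-level metric the mesh cannot see: the sidecar only knows an HTTP POST
// succeeded, not that an order was placed or how it was paid for.
const express = require('express');
const client = require('prom-client');

const app = express();
app.use(express.json());

const ordersPlaced = new client.Counter({
  name: 'shop_orders_placed_total',            // hypothetical metric name
  help: 'Orders successfully placed, by payment method',
  labelNames: ['payment_method'],
});

app.post('/orders', (req, res) => {
  // ... order-handling business logic would run here ...
  ordersPlaced.inc({ payment_method: req.body.payment_method || 'unknown' });
  res.status(201).json({ status: 'created' });
});

// Expose application metrics; Prometheus scrapes this endpoint alongside
// the sidecar's automatically generated mesh metrics.
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.send(await client.register.metrics());
});

app.listen(8080);
```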
Sidecar proxies automatically generate metrics for every request they handle. These metrics follow standard formats (Prometheus exposition format) and capture the essential health indicators.
The Four Golden Signals:
Google SRE popularized the "Four Golden Signals"—latency, traffic, errors, and saturation—as the essential metrics for understanding service health. Service mesh proxies generate all four automatically:
```text
# Request count by source, destination, response code
istio_requests_total{
  source_workload="api-gateway",
  source_workload_namespace="gateway",
  destination_service="product-service.ecommerce.svc.cluster.local",
  destination_workload="product-service-v1",
  destination_version="v1",
  request_protocol="http",
  response_code="200",
  response_flags="-",
  connection_security_policy="mutual_tls"
} 1542367

# Request duration histogram
istio_request_duration_milliseconds_bucket{
  source_workload="api-gateway",
  destination_service="product-service.ecommerce.svc.cluster.local",
  request_protocol="http",
  response_code="200",
  le="10"    # <= 10ms bucket
} 1234567

istio_request_duration_milliseconds_bucket{
  ...
  le="50"    # <= 50ms bucket
} 1498765

istio_request_duration_milliseconds_bucket{
  ...
  le="100"   # <= 100ms bucket
} 1532456

# TCP connection metrics
istio_tcp_connections_opened_total{
  source_workload="order-service",
  destination_service="database.ecommerce.svc.cluster.local"
} 45678

istio_tcp_connections_closed_total{...} 45234

# Request/response byte counts
istio_request_bytes_sum{...} 234567890
istio_response_bytes_sum{...} 987654321

# Linkerd equivalent metrics (similar structure)
request_total{
  deployment="product-service",
  direction="inbound",
  tls="true",
  status_code="200"
} 1542367

response_latency_ms_bucket{
  deployment="product-service",
  direction="inbound",
  le="100"
} 1498765
```

Metric Labels (Dimensions):
Metrics are dimensioned by labels that enable powerful filtering and aggregation:
| Label Category | Example Labels | Analysis Use |
|---|---|---|
| Source identity | source_workload, source_namespace, source_principal | "Who is calling?" |
| Destination identity | destination_service, destination_version | "What are they calling?" |
| Protocol/Response | request_protocol, response_code, response_flags | "How did it go?" |
| Security | connection_security_policy, source_principal | "Was it encrypted? Authenticated?" |
These dimensions enable queries like "What is the error rate from api-gateway to product-service?", "How does p99 latency differ between v1 and v2 of a workload?", and "How much traffic is still not using mutual TLS?"
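For instance, a hedged sketch of such queries in PromQL against the Istio metrics shown above (label values are taken from this page's examples; adjust names to your environment):

```promql
# Error rate from api-gateway to product-service over the last 5 minutes
sum(rate(istio_requests_total{
      source_workload="api-gateway",
      destination_service="product-service.ecommerce.svc.cluster.local",
      response_code=~"5.."}[5m]))
/
sum(rate(istio_requests_total{
      source_workload="api-gateway",
      destination_service="product-service.ecommerce.svc.cluster.local"}[5m]))

# p99 latency per destination service and version
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket[5m]))
  by (le, destination_service, destination_version))

# Traffic not protected by mutual TLS, by destination and caller
sum(rate(istio_requests_total{connection_security_policy!="mutual_tls"}[5m]))
  by (destination_service, source_workload)
```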
High cardinality metrics (unique label combinations) consume significant memory in Prometheus and increase storage costs. Avoid adding dynamic labels like user IDs or request IDs to mesh metrics. Istio and Linkerd carefully curate their default metric labels for sustainable cardinality.
While metrics tell you what is happening (aggregate behavior), traces tell you why specific requests behaved as they did. A trace follows a single request through the entire system, showing every service involved, the timing of each span, and where delays occurred.
Trace Anatomy:
A trace consists of spans—units of work with start time, duration, and context. Spans form a tree reflecting the call hierarchy:
```text
DISTRIBUTED TRACE: Order Creation Request

Trace ID:       7b3f2a1c9d4e5f6a8b7c6d5e4f3a2b1c
Total Duration: 247ms

│ Timeline (ms)                    0     50    100   150   200   250
│                                  │     │     │     │     │     │
│
├─ api-gateway                     ████████████████████████████████████████ (247ms)
│    └─ Parse request              █ (3ms)
│    └─ Auth check                 ███ (12ms)
│    └─ Call order-service         ██████████████████████████████████ (228ms)
│
├─ order-service                   ██████████████████████████████ (220ms)
│    └─ Validate order             ██ (8ms)
│    └─ Reserve inventory          ██████████████ (75ms)
│    └─ Process payment            ████████████ (95ms)
│    └─ Send confirmation          ██ (12ms)
│
├─ inventory-service               ██████████████ (75ms)
│    └─ Check stock                ████ (23ms)
│    └─ Reserve units              ████████ (48ms)
│
├─ payment-service                 ████████████ (95ms)
│    └─ Validate card              ███ (18ms)
│    └─ Charge card (external API) ██████ (65ms)
│    └─ Record transaction         ██ (8ms)
│
└─ notification-service            ██ (12ms)
     └─ Send email                 █ (10ms)

SPAN DETAILS: payment-service / Charge card
  Span ID:      a1b2c3d4e5f6
  Parent Span:  9f8e7d6c5b4a (order-service / Process payment)
  Start Time:   2024-01-15T14:32:47.142Z
  Duration:     65ms
  Status:       OK

  Attributes:
    http.method:      POST
    http.url:         https://api.stripe.com/v1/charges
    http.status_code: 200
    payment.amount:   99.95
    payment.currency: USD
    retry.count:      0
```

How Mesh Enables Tracing:
Service mesh proxies automatically participate in distributed tracing by:
Generating spans: Each proxy creates spans for requests it handles—both inbound (receiving) and outbound (forwarding).
Propagating context: Proxies copy trace headers (like x-request-id, x-b3-traceid, or W3C traceparent) from incoming requests to outgoing requests, maintaining trace continuity.
Exporting spans: Proxies send span data to trace collectors (Jaeger, Zipkin, Lightstep, Datadog) where they're assembled into complete traces.
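For reference, this is roughly what the propagated headers look like on the wire — a sketch using this page's example trace ID; the span ID value is illustrative:

```text
# Zipkin/B3 headers (commonly used by Istio/Envoy setups)
x-request-id: 7b3f2a1c-9d4e-5f6a-8b7c-6d5e4f3a2b1c
x-b3-traceid: 7b3f2a1c9d4e5f6a8b7c6d5e4f3a2b1c
x-b3-spanid: a1b2c3d4e5f6a7b8
x-b3-sampled: 1

# W3C Trace Context equivalent: version-traceid-parentid-flags
traceparent: 00-7b3f2a1c9d4e5f6a8b7c6d5e4f3a2b1c-a1b2c3d4e5f6a7b8-01
```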
The Application Participation Requirement:
Here's an important nuance: while proxies handle span creation and export, applications must propagate trace headers for traces to connect properly.
If Service A calls Service B, A's sidecar records a span for the outbound call and B's sidecar records a span for the inbound call. But unless Service A's application code copies the trace headers from the request it received onto the request it sends to B, the tracing backend has no way to link those spans, and what should be one trace splinters into disconnected fragments.
This header propagation is the minimal application responsibility. Most frameworks have one-line middleware for this.
```javascript
// Express.js middleware for trace header propagation
const TRACE_HEADERS = [
  'x-request-id',
  'x-b3-traceid',
  'x-b3-spanid',
  'x-b3-parentspanid',
  'x-b3-sampled',
  'x-b3-flags',
  'x-ot-span-context',
  'traceparent', // W3C Trace Context
  'tracestate',
];

// Middleware to capture trace headers
app.use((req, res, next) => {
  // Store trace headers in request context
  req.traceHeaders = {};
  TRACE_HEADERS.forEach(header => {
    if (req.headers[header]) {
      req.traceHeaders[header] = req.headers[header];
    }
  });
  next();
});

// When making outbound calls, include trace headers
async function callInventoryService(orderId, traceHeaders) {
  return await fetch(`http://inventory-service/check/${orderId}`, {
    method: 'GET',
    headers: {
      'Content-Type': 'application/json',
      ...traceHeaders, // Propagate trace headers
    },
  });
}

// Usage in route handler
app.post('/orders', async (req, res) => {
  const order = req.body;

  // Pass trace headers to downstream call
  const inventoryCheck = await callInventoryService(
    order.id,
    req.traceHeaders // <-- Propagate!
  );

  // Continue processing...
});
```

Tracing every request generates enormous data volume. Production systems typically sample 1-10% of requests. Mesh configuration controls the sampling rate. For debugging specific issues, you can increase sampling temporarily or trace specific requests on-demand using headers.
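As one way to control that rate, a hedged sketch using Istio's Telemetry API (field names as of recent Istio releases; Linkerd and other meshes expose sampling through their own configuration):

```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: tracing-sampling
  namespace: ecommerce
spec:
  tracing:
  - randomSamplingPercentage: 1.0   # steady state: trace ~1% of requests
    # Temporarily raise this (e.g. to 100.0) while debugging a specific issue
```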
Access logs provide detailed records of individual requests—the who, what, when, and how of every interaction. Unlike aggregated metrics, access logs preserve individual request details for forensic analysis.
What Access Logs Capture:
Mesh access logs typically include:
```json
{
  "start_time": "2024-01-15T14:32:47.142Z",
  "request_method": "POST",
  "path": "/api/v1/orders",
  "protocol": "HTTP/2",
  "response_code": 201,
  "response_flags": "-",
  "response_code_details": "via_upstream",
  "connection_termination_details": "-",
  "upstream_transport_failure_reason": "-",
  "bytes_received": 1024,
  "bytes_sent": 256,
  "duration": 247,
  "x_envoy_upstream_service_time": 245,
  "x_forwarded_for": "10.0.0.1",
  "user_agent": "Mozilla/5.0...",
  "x_request_id": "7b3f2a1c-9d4e-5f6a-8b7c-6d5e4f3a2b1c",
  "authority": "api-gateway.ecommerce.svc.cluster.local",
  "upstream_host": "10.1.2.3:8080",
  "upstream_cluster": "outbound|8080||order-service.ecommerce.svc.cluster.local",
  "upstream_local_address": "10.0.0.5:45678",
  "downstream_local_address": "10.0.0.5:8080",
  "downstream_remote_address": "10.0.0.1:54321",
  "requested_server_name": "api-gateway.ecommerce.svc.cluster.local",
  "route_name": "order-api-route",
  "traceId": "7b3f2a1c9d4e5f6a8b7c6d5e4f3a2b1c"
}
```

```yaml
# Istio: Enable access logging via Telemetry API
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: access-logging
  namespace: ecommerce
spec:
  accessLogging:
  - providers:
    - name: envoy
    # Disable for specific workloads
    disabled: false
    # Filter to log only errors (reduce volume)
    filter:
      expression: "response.code >= 400"
---
# Custom log format via EnvoyFilter
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: custom-access-log
  namespace: istio-system
spec:
  configPatches:
  - applyTo: NETWORK_FILTER
    match:
      listener:
        filterChain:
          filter:
            name: "envoy.filters.network.http_connection_manager"
    patch:
      operation: MERGE
      value:
        typed_config:
          "@type": "type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager"
          access_log:
          - name: envoy.access_loggers.file
            typed_config:
              "@type": "type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog"
              path: /dev/stdout
              log_format:
                json_format:
                  timestamp: "%START_TIME%"
                  request_id: "%REQ(X-REQUEST-ID)%"
                  method: "%REQ(:METHOD)%"
                  path: "%REQ(:PATH)%"
                  status: "%RESPONSE_CODE%"
                  duration_ms: "%DURATION%"
                  upstream: "%UPSTREAM_HOST%"
                  source: "%DOWNSTREAM_REMOTE_ADDRESS%"
                  trace_id: "%REQ(X-B3-TRACEID)%"
---
# Linkerd: Access logging configuration
# Linkerd logs to pod stdout by default
# Configure log level via proxy annotations
apiVersion: v1
kind: Pod
metadata:
  annotations:
    config.linkerd.io/proxy-log-level: "warn,linkerd=info,linkerd_proxy=debug"
    config.linkerd.io/access-log: "apache"  # Or "json" for structured
```

Access logs for every request generate massive data volume. At 10,000 RPS with 1KB per log entry, that's 10MB/second, or roughly 26TB/month. Production deployments often: (1) sample logs rather than logging everything, (2) filter to only errors or slow requests, (3) use short retention periods, or (4) stream to cost-efficient storage (S3/GCS).
Perhaps the most immediately valuable observability feature of service mesh is the service graph—a visual representation of how services interconnect. Because the mesh sees all traffic, it can automatically construct and maintain an accurate service topology.
What Service Graphs Show:
```text
SERVICE GRAPH VISUALIZATION

ingress
  │ 1.2K rps
  ▼
api-gateway ──500 rps──▶ auth-service   ✓ 99.9%
  │
  ├──▶ user-service       ✓ 99.8%   45ms p99
  │      └──▶ user-db (mysql)
  │
  ├──▶ product-service    ✓ 99.9%   23ms p99
  │      └──▶ product-db (postgres)
  │
  └──▶ order-service      ⚠ 97.2%  180ms p99   ◀── Degraded!
         ├──▶ inventory-service   ✓ 99.9%   12ms p99
         │      └──▶ inventory-db (redis)
         └──▶ payment-service     ⚠ 95.1%  450ms p99   ◀── Problem!
                └──▶ stripe-api (external)

Legend:  ✓ Healthy (>99%)   ⚠ Degraded (<99%)   ✗ Failing (<95%)
         Edge thickness = relative traffic volume
         Edge color: green=healthy, yellow=degraded, red=failing
```

Kiali: Istio's Service Graph Tool:
Kiali is the standard visualization tool for Istio. It provides a live topology graph with health and mTLS indicators, per-edge request rates, error rates, and latencies, validation of Istio configuration, and links into Jaeger traces and Grafana dashboards for deeper drill-down.
Linkerd Viz:
Linkerd's visualization extension (linkerd-viz) provides a web dashboard with a live service topology and golden-metric summaries, plus CLI tooling—stat, edges, tap, and routes—for inspecting traffic directly from the terminal, as shown below.
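Both toolsets ship a local entry point — a quick sketch, assuming the Kiali addon and the linkerd-viz extension are already installed:

```bash
# Istio: port-forward and open the Kiali UI in a browser
istioctl dashboard kiali

# Linkerd: open the viz dashboard (service graph, stats, tap UI)
linkerd viz dashboard
```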
```text
# View service-to-service traffic with success rates and latencies
linkerd viz stat deploy -n ecommerce

NAME                MESHED   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99
api-gateway            1/1    99.98%   1234.5           5ms          12ms          45ms
order-service          1/1    97.23%    756.2          23ms          87ms         180ms  ⚠️
product-service        1/1    99.95%    523.4           8ms          21ms          34ms
inventory-service      1/1    99.92%    312.1           4ms          10ms          18ms
payment-service        1/1    95.12%    198.7          89ms         320ms         450ms  ⚠️
user-service           1/1    99.87%    445.6          12ms          34ms          67ms

# View traffic between services (edges)
linkerd viz edges deploy -n ecommerce

SRC            DST                 SRC_NS      DST_NS      SECURED
api-gateway    order-service       ecommerce   ecommerce   √
api-gateway    product-service     ecommerce   ecommerce   √
api-gateway    user-service        ecommerce   ecommerce   √
order-service  inventory-service   ecommerce   ecommerce   √
order-service  payment-service     ecommerce   ecommerce   √

# Real-time traffic inspection (tap)
linkerd viz tap deploy/order-service -n ecommerce

req id=0:0 proxy=in  src=10.0.0.5:45678 dst=10.0.0.6:8080 tls=true :method=POST :path=/api/orders
rsp id=0:0 proxy=in  src=10.0.0.5:45678 dst=10.0.0.6:8080 tls=true :status=201 latency=45ms
req id=0:1 proxy=out src=10.0.0.6:34567 dst=10.0.0.7:8080 tls=true :method=GET :path=/inventory/check
rsp id=0:1 proxy=out src=10.0.0.6:34567 dst=10.0.0.7:8080 tls=true :status=200 latency=12ms

# Per-route statistics
linkerd viz routes deploy/order-service -n ecommerce

ROUTE              SERVICE         SUCCESS      RPS   LATENCY_P50   LATENCY_P99
POST /api/orders   order-service    97.23%    756.2          23ms         180ms
GET /api/orders    order-service    99.95%    123.4           8ms          34ms
GET /health        order-service   100.00%     10.0           1ms           2ms
```

During incidents, the service graph is your first diagnostic tool. It immediately shows: (1) which service is failing (red nodes), (2) what depends on it (upstream edges), (3) what it depends on (downstream edges—often the root cause), and (4) traffic distribution, revealing whether the problem affects all traffic or specific paths.
Service mesh generates observability data, but you need platforms to store, query, and visualize it. The mesh integrates with the broader observability ecosystem.
Common Observability Stack:
| Signal | Collector/Store | Query/Visualize | Mesh Integration |
|---|---|---|---|
| Metrics | Prometheus | Grafana | Proxies expose /stats/prometheus endpoint; Prometheus scrapes |
| Traces | Jaeger, Zipkin, Tempo | Jaeger UI, Grafana Tempo | Proxies export spans via OTLP/Zipkin protocol |
| Logs | Elasticsearch, Loki | Kibana, Grafana | Proxy stdout → log collector → storage |
| All-in-One | Datadog, New Relic, Dynatrace | Native dashboards | Direct agent or OTLP export |
```yaml
# Prometheus: Scrape config for mesh metrics
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    scrape_configs:
      # Scrape Istio control plane
      - job_name: 'istiod'
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names: ['istio-system']
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_name]
            action: keep
            regex: istiod

      # Scrape Envoy sidecars via Prometheus annotation
      - job_name: 'envoy-stats'
        metrics_path: /stats/prometheus
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: 'true'
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: (.+)
            replacement: $1:15090   # Envoy stats port
---
# Jaeger: Distributed tracing backend
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    # Enable distributed tracing
    enableTracing: true
    # Tracing provider configuration
    defaultConfig:
      tracing:
        sampling: 10.0   # 10% sampling rate
        zipkin:
          address: jaeger-collector.observability:9411
        # OR use OpenTelemetry Protocol (preferred)
        # openTelemetryCollector:
        #   address: otel-collector.observability:4317
---
# OpenTelemetry Collector: Unified telemetry pipeline
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      prometheus:
        config:
          scrape_configs:
            - job_name: 'mesh-metrics'
              kubernetes_sd_configs:
                - role: pod

    processors:
      batch:
        timeout: 10s

    exporters:
      prometheus:
        endpoint: "0.0.0.0:9090"
      jaeger:
        endpoint: jaeger-collector:14250
        tls:
          insecure: true
      logging:
        loglevel: info

    service:
      pipelines:
        metrics:
          receivers: [otlp, prometheus]
          processors: [batch]
          exporters: [prometheus]
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [jaeger]
```

Grafana Dashboards:
Both Istio and Linkerd provide pre-built Grafana dashboards for common metrics: Istio ships mesh-, service-, workload-, and control-plane-level dashboards, while linkerd-viz provides top-line health and per-workload dashboards.
These dashboards give immediate visibility without custom dashboard development.
OpenTelemetry (OTEL) is becoming the standard for observability instrumentation. Both Istio and Linkerd support OTLP export, allowing you to use a single collector for metrics, traces, and logs. OTEL provides vendor-neutral instrumentation that works with any observability backend.
Let's walk through a debugging scenario using mesh observability to diagnose a production issue.
Scenario: Users report slow order placement
Users complain that placing orders takes 10+ seconds when it used to take 2 seconds. How do we diagnose this using mesh observability?
```text
DEBUGGING WORKFLOW: SLOW ORDER PLACEMENT

STEP 1: Check Service Graph for Anomalies
─────────────────────────────────────────────────────────────────────────
Open Kiali/Linkerd dashboard. Look for:
  ✓ Red/yellow services (elevated error rates or latency)
  ✓ Unusual traffic patterns

Observation:
  order-service shows p99 latency of 8500ms (normally 200ms)
  payment-service shows p99 latency of 7200ms (normally 150ms)
  payment-service success rate dropped to 92% (normally 99.9%)

Hypothesis: Payment service is slow, causing order service slowdown

STEP 2: Drill into Payment Service Metrics (Grafana/Prometheus)
─────────────────────────────────────────────────────────────────────────
Query:
  rate(istio_request_duration_milliseconds_bucket{
    destination_service="payment-service",
    le="1000"
  }[5m])

Observation:
  Traffic to payment-service shows bimodal latency:
  - 60% of requests complete in <200ms (normal)
  - 40% of requests time out at 10s (retry timeout)

  Secondary metrics:
  - istio_tcp_connections_opened_total spike at 14:23
  - Connection pool at max (100 connections)

Hypothesis: Payment service backend (Stripe?) is partially failing,
            causing connection pool exhaustion

STEP 3: Examine Distributed Traces (Jaeger)
─────────────────────────────────────────────────────────────────────────
Filter: service=order-service AND duration>5s

Trace analysis (sample slow request):
  │ order-service                ████████████████████████████████ 8423ms
  │   └─ inventory-service       ██ 45ms ✓
  │   └─ payment-service         █████████████████████████████ 7890ms ⚠️
  │       └─ [retry 1]           ████████████████████████ ~7500ms
  │           └─ stripe-api      (no response - timeout)
  │       └─ [retry 2]           ██ 280ms ✓
  │           └─ stripe-api      █ 180ms

Finding: Payment service retries to Stripe API. First attempt times out,
         retry succeeds. Total time = timeout + successful retry.

STEP 4: Check Access Logs for Error Details
─────────────────────────────────────────────────────────────────────────
Filter: destination_service="payment-service" AND response_code>=500

Log entry:
{
  "timestamp": "2024-01-15T14:23:47.142Z",
  "response_code": 504,
  "response_flags": "UT",        // Upstream timeout
  "upstream_host": "10.1.2.5:8080",
  "duration": 10001,             // 10s timeout
  "x_request_id": "abc123...",
  "trace_id": "7b3f2a1c..."
}

response_flags="UT" confirms upstream timeout.

STEP 5: Correlate with External Service Status
─────────────────────────────────────────────────────────────────────────
Check Stripe status page: "Elevated latency in US-East region"

ROOT CAUSE IDENTIFIED:
  Stripe API experiencing intermittent slowness. Our 10s timeout triggers
  retries, which mostly succeed, but overall order time = timeout + retry.

REMEDIATION:
1. Immediate:   Announce known issue to customer support
2. Short-term:  Reduce timeout to 5s, increase retry budget
3. Long-term:   Add circuit breaker to fail fast during Stripe outages
                Consider payment queue for resilience
```

Effective debugging correlates multiple signals: (1) metrics show WHAT changed (latency spike), (2) traces show WHERE in the call chain (payment service → Stripe), and (3) logs show the details (timeout flag, specific error). Use request IDs/trace IDs to jump between signals for the same request.
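As a small illustration of that last point, if the proxy access logs are shipped to Loki (one of the log stores listed above), a single trace ID pulls up every hop of the request — a sketch; the label names depend on how your log collector is configured:

```logql
# All access-log lines from the ecommerce namespace that mention this trace ID
{namespace="ecommerce"} |= "7b3f2a1c9d4e5f6a8b7c6d5e4f3a2b1c"
```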
We've explored how service mesh transforms observability from a per-service burden into a platform capability. The mesh sees all traffic, enabling automatic generation of the signals needed to understand and debug distributed systems.
Module Complete:
You've now completed the Service Mesh module. You understand what service mesh is, the major implementations (Istio, Linkerd, Consul Connect), the sidecar proxy pattern that enables it, traffic management capabilities, and observability features. This knowledge equips you to evaluate, implement, and operate service mesh infrastructure in production microservices environments.