Microservices architecture represents a fundamental shift in how we design, build, and operate software applications. Rather than constructing a single monolithic unit, microservices decompose applications into a constellation of small, autonomous services that communicate over the network to deliver functionality.
This architectural style emerged from the practical experiences of companies like Netflix, Amazon, and Google—organizations that discovered that traditional monolithic architectures couldn't scale to meet their needs. The transition from in-process method calls to network-based inter-service communication represents one of the most significant paradigm shifts in modern software engineering.
At its core, microservices architecture transforms an application layer problem into a distributed systems problem. Where a monolith's internal communication occurs in nanoseconds with perfect reliability, microservices introduce network latency, partial failures, and eventual consistency as fundamental characteristics of the system.
Understanding microservices from a computer networks perspective is essential because the network becomes the system's connective tissue—the medium through which every interaction occurs. Network design, protocol selection, failure handling, and performance optimization become primary concerns rather than afterthoughts.
By the end of this page, you will understand the defining characteristics of microservices architecture, the network protocols and patterns that enable inter-service communication, the fundamental challenges introduced by distribution, and the infrastructure required to operate microservices at scale. You'll develop the knowledge to reason about the trade-offs between microservices and monolithic designs.
Microservices architecture structures an application as a collection of loosely coupled, independently deployable services, each implementing a specific business capability, communicating through well-defined network interfaces.
Core Characteristics:
Service Independence: Each microservice is a separate deployable unit with its own codebase, build process, and deployment pipeline. Services can be deployed, scaled, and updated independently.
Single Responsibility: Each service focuses on one business capability or domain context. A service should be small enough that its purpose is immediately clear—typically describable in one sentence.
Decentralized Data Management: Each service owns and manages its own data store. There is no shared database—services communicate through APIs rather than shared data.
Network-Based Communication: Services communicate exclusively through network protocols (HTTP/REST, gRPC, messaging systems) rather than in-process calls.
Technology Heterogeneity: Different services can use different programming languages, frameworks, and data stores based on their specific requirements.
Autonomous Teams: Services are typically owned by small, cross-functional teams that control the entire lifecycle from development to production.
The Size Question:
One of the most debated aspects of microservices is determining appropriate service boundaries. The 'micro' prefix is misleading—size isn't measured in lines of code:
"A microservice should be small enough that a single team can own it, large enough to be independently valuable, and bounded by a coherent business capability."
Practical Sizing Heuristics:
| Consideration | Guidance |
|---|---|
| Team Size | 2-8 engineers can fully own and operate the service |
| Cognitive Load | A new developer can understand the service in days, not weeks |
| Deployment Independence | The service can be deployed without coordinating with other services |
| Data Ownership | The service owns a clear, bounded set of data |
| Business Capability | The service maps to a recognizable business function |
The network implications are profound. Where a monolith might have one external HTTP endpoint, a microservices system might have hundreds of internal network paths. A single user request might traverse ten or more services, each communication adding latency, introducing potential failure points, and consuming network resources.
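The compounding effect can be made concrete with simple arithmetic. The numbers below are illustrative assumptions, not measurements, but the shape of the result is general: a serial chain of services is less available and slower than any single service in it.

```python
# Rough arithmetic for a request that traverses N services in sequence.
# Per-service availability and per-hop latency are illustrative values.

def chain_availability(per_service: float, n: int) -> float:
    """Availability of a serial chain of n services (each must succeed)."""
    return per_service ** n

def chain_latency(per_hop_ms: float, n: int) -> float:
    """Network latency added by n sequential hops."""
    return per_hop_ms * n

# 10 services, each 99.9% available, each hop adding ~2 ms:
print(round(chain_availability(0.999, 10), 4))  # ~0.99 — worse than any single service
print(chain_latency(2.0, 10))                   # 20.0 ms of pure network overhead
```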
In microservices, the network is not an implementation detail—it's a fundamental architectural element. Every design decision must account for network latency, partial failures, and the overhead of serialization/deserialization. Engineers designing microservices must think like network engineers.
Communication between microservices occurs over the network using well-defined protocols. The choice of communication pattern fundamentally shapes system behavior, performance characteristics, and failure modes.
Synchronous Communication:
In synchronous (request-response) communication, the calling service waits for a response from the called service before continuing execution.
HTTP/REST (Representational State Transfer):
REST over HTTP is the most common synchronous protocol for microservices:
Order Service → HTTP GET /users/123 → User Service
↓
HTTP 200 OK
{"id": 123, "name": "Alice", ...}
↓
Order Service ← Response received ← User Service
Characteristics:
- Human-readable JSON payloads over ubiquitous HTTP
- Stateless request-response, straightforward to cache and debug
- Universal client and tooling support
gRPC (gRPC Remote Procedure Call):
gRPC is a high-performance, binary protocol designed for inter-service communication:
Order Service → gRPC Call: GetUser(123) → User Service
↓
Binary Protobuf Response
↓
Order Service ← Response received ← User Service
Characteristics:
- Binary serialization with Protocol Buffers, reducing payload size and parse time
- Contract-first development: service interfaces are defined in .proto files
- HTTP/2 transport with multiplexing and native streaming

| Aspect | REST/HTTP | gRPC |
|---|---|---|
| Serialization | JSON/XML (text) | Protocol Buffers (binary) |
| Transport | HTTP/1.1 or HTTP/2 | HTTP/2 only |
| Contract | OpenAPI/Swagger (optional) | Proto files (required) |
| Streaming | Limited | Native bidirectional streaming |
| Browser Support | Native | Requires gRPC-Web proxy |
| Debugging | Easy (readable payloads) | Requires tooling |
| Latency | Higher (text parsing) | Lower (binary + HTTP/2) |
| Adoption | Universal | Growing, especially internal |
Asynchronous Communication:
Asynchronous (message-based) communication decouples services in time—the sender doesn't wait for a response.
Message Queues (Point-to-Point):
Order Service → [Order Queue] → Fulfillment Service
│ │
│ (fire and forget) │ (process when ready)
▼ ▼
Continues immediately Processes message
Characteristics:
- Each message is delivered to exactly one consumer
- The queue buffers messages, absorbing load spikes
- Sender and receiver are decoupled in time
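The point-to-point semantics can be sketched in-process with Python's standard library. This is a stand-in for a real broker (RabbitMQ, SQS, and similar), but the behavior shown—sender continues immediately, each message processed by one consumer when it's ready—is the same.

```python
import queue
import threading

# In-process stand-in for a message broker. The order service enqueues
# and moves on; the fulfillment worker processes messages at its own pace.

order_queue: "queue.Queue" = queue.Queue()
processed = []

def fulfillment_worker():
    while True:
        msg = order_queue.get()   # blocks until a message arrives
        if msg is None:           # sentinel: shut down cleanly
            break
        processed.append(msg["orderId"])
        order_queue.task_done()

worker = threading.Thread(target=fulfillment_worker)
worker.start()

# "Fire and forget": no waiting for the consumer.
order_queue.put({"orderId": 1})
order_queue.put({"orderId": 2})
order_queue.put(None)
worker.join()

print(processed)  # [1, 2]
```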
Publish/Subscribe (Event-Driven):
┌─→ Inventory Service
│
Order Service → [Order Events] ─→ Shipping Service
│
└─→ Analytics Service
Characteristics:
- One event fans out to many independent subscribers
- Publishers don't know who consumes their events
- New subscribers can be added without changing the publisher
Event Sourcing Pattern:
Instead of storing current state, store a sequence of events:
Event Store:
1. OrderCreated {orderId: 1, items: [...], timestamp: T1}
2. PaymentReceived {orderId: 1, amount: 99.99, timestamp: T2}
3. OrderShipped {orderId: 1, trackingId: "...", timestamp: T3}
→ Current state reconstructed by replaying events
This pattern enables:
- A complete audit trail of every state change
- Reconstructing state as of any point in time
- Replaying events to build new read models or debug issues
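A minimal sketch of the replay step, using the event names from the example above (field values are illustrative):

```python
# Event sourcing sketch: current order state is derived by replaying
# the event log, never stored directly.

events = [
    {"type": "OrderCreated",    "orderId": 1, "items": ["book"]},
    {"type": "PaymentReceived", "orderId": 1, "amount": 99.99},
    {"type": "OrderShipped",    "orderId": 1, "trackingId": "TRK-42"},
]

def replay(event_log):
    """Fold the event sequence into the current state."""
    state = {}
    for e in event_log:
        if e["type"] == "OrderCreated":
            state = {"orderId": e["orderId"], "items": e["items"], "status": "created"}
        elif e["type"] == "PaymentReceived":
            state["status"] = "paid"
        elif e["type"] == "OrderShipped":
            state["status"] = "shipped"
            state["trackingId"] = e["trackingId"]
    return state

print(replay(events))      # full history -> status "shipped"
print(replay(events[:2]))  # replaying a prefix gives the state at that point in time
```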
Use synchronous communication when immediate response is required (user-facing requests, queries). Use asynchronous communication for commands that don't need immediate confirmation, event propagation, and to decouple services that shouldn't block each other. Many systems use both—synchronous for queries, asynchronous for commands.
In a microservices environment where services scale dynamically and instances come and go, service discovery becomes essential infrastructure. Unlike monolithic deployments with static IP addresses, microservices require dynamic mechanisms to locate service instances.
The Service Discovery Problem:
When the Order Service needs to call the User Service:
Client-Side Discovery:
The calling service is responsible for discovering and selecting target instances:
┌─────────────────────────────────────────────────────────┐
│ Service Registry │
│ ┌─────────────────────────────────────────────────┐ │
│ │ User Service: │ │
│ │ - instance-1: 10.0.1.10:8080 (healthy) │ │
│ │ - instance-2: 10.0.1.11:8080 (healthy) │ │
│ │ - instance-3: 10.0.1.12:8080 (unhealthy) │ │
│ └─────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────┘
▲
│ Query
│
┌──────────────────────────┼──────────────────────────────┐
│ Order Service │ │
│ ┌───────────────────────┴─────────────────────────┐ │
│ │ Discovery Client │ │
│ │ - Cache registry data │ │
│ │ - Select healthy instance (load balancing) │ │
│ │ - Route request directly │ │
│ └──────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
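The discovery client's job in the diagram—filter to healthy instances, then load-balance—reduces to a few lines. This sketch uses a local dictionary in place of a real registry query (to Consul, Eureka, etc.); the registry contents mirror the diagram above.

```python
import random

# Local stand-in for cached registry data.
registry = {
    "user-service": [
        {"addr": "10.0.1.10:8080", "healthy": True},
        {"addr": "10.0.1.11:8080", "healthy": True},
        {"addr": "10.0.1.12:8080", "healthy": False},
    ]
}

def pick_instance(service: str) -> str:
    """Filter to healthy instances, then load-balance randomly."""
    healthy = [i["addr"] for i in registry[service] if i["healthy"]]
    if not healthy:
        raise RuntimeError(f"no healthy instances of {service}")
    return random.choice(healthy)

addr = pick_instance("user-service")
print(addr)  # one of the two healthy addresses, never 10.0.1.12
```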
Technologies: Netflix Eureka, HashiCorp Consul, etcd, Apache ZooKeeper.
Server-Side Discovery:
A dedicated load balancer handles discovery, and clients connect to a stable endpoint:
Order Service → Load Balancer → User Service Instances
│
├─→ 10.0.1.10:8080
├─→ 10.0.1.11:8080
└─→ 10.0.1.12:8080
Technologies: AWS Elastic Load Balancing, Kubernetes Services, NGINX, HAProxy.
Health Checking:
Service discovery must distinguish healthy from unhealthy instances:
| Health Check Type | Description | Use Case |
|---|---|---|
| Liveness | Is the process running? | Restart crashed containers |
| Readiness | Can the service handle requests? | Remove from load balancing during startup |
| Deep Health | Are dependencies (DB, cache) accessible? | Detect cascade failures |
DNS-Based Discovery in Kubernetes:
Kubernetes provides built-in DNS for service discovery:
Service Name: user-service
Namespace: production
→ DNS Name: user-service.production.svc.cluster.local
→ Resolves to: ClusterIP (virtual IP)
→ kube-proxy routes to healthy pod IPs
This approach combines service discovery with load balancing at the network layer, transparent to application code.
Service discovery is a critical dependency—if discovery fails, services can't communicate. Design for discovery unavailability: cache registry data, fail gracefully, and monitor registry health closely. A discovery outage can cascade into a complete system failure.
The API Gateway serves as the single entry point for external clients accessing a microservices system. It abstracts the internal service topology, providing a unified interface while handling cross-cutting concerns.
Core Functions:
┌─────────────────────────────────┐
│ Internal Services │
│ │
┌──────────────┐ │ ┌────────────┐ ┌────────────┐ │
│ Client │ ──HTTP/REST──→ │ │ User │ │ Order │ │
│ (Browser, │ │ │ Service │ │ Service │ │
│ Mobile) │ │ └────────────┘ └────────────┘ │
└──────────────┘ │ │
│ │ ┌────────────┐ ┌────────────┐ │
│ │ │ Payment │ │ Inventory │ │
▼ │ │ Service │ │ Service │ │
┌──────────────────────────────┐ │ └────────────┘ └────────────┘ │
│ API Gateway │────│ │
│ ┌────────────────────────┐ │ └─────────────────────────────────┘
│ │ • Request Routing │ │
│ │ • Authentication │ │
│ │ • Rate Limiting │ │
│ │ • Request/Response │ │
│ │ Transformation │ │
│ │ • SSL Termination │ │
│ │ • Caching │ │
│ │ • Monitoring/Logging │ │
│ └────────────────────────┘ │
└──────────────────────────────┘
Request Routing:
The gateway routes requests to appropriate backend services based on path, headers, or other criteria:
/api/users/* → User Service
/api/orders/* → Order Service
/api/payments/* → Payment Service
/graphql → GraphQL Federation Service
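The routing rules above amount to a prefix-match table. A production gateway (Kong, Envoy) implements this with efficient trie-based matching, but a minimal sketch makes the semantics clear—the service names here follow the examples in this section:

```python
# Path-prefix routing table mirroring the rules above.
ROUTES = [
    ("/api/users/",    "user-service"),
    ("/api/orders/",   "order-service"),
    ("/api/payments/", "payment-service"),
    ("/graphql",       "graphql-federation"),
]

def route(path: str) -> str:
    """Return the backend service for a request path."""
    for prefix, service in ROUTES:
        if path.startswith(prefix):
            return service
    return "not-found"

print(route("/api/orders/42"))  # order-service
print(route("/api/users/123"))  # user-service
```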
Authentication and Authorization:
Rather than each service implementing authentication:
1. Client → Gateway: Request with JWT token
2. Gateway validates token (signature, expiration, claims)
3. Gateway enriches request with user context
4. Gateway → Service: Request with validated identity
This centralizes security logic and ensures consistent enforcement.
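To show what "gateway validates token" means concretely, here is a hand-rolled HS256 JWT check. Production gateways use vetted libraries (PyJWT, jose) and typically asymmetric keys; the shared secret and claim names below are illustrative assumptions.

```python
import base64, hashlib, hmac, json, time

SECRET = b"demo-shared-secret"  # assumption: symmetric key known to the gateway

def b64url(data: bytes) -> bytes:
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def b64url_decode(data: str) -> bytes:
    return base64.urlsafe_b64decode(data + "=" * (-len(data) % 4))

def sign(claims: dict) -> str:
    """Mint a token: base64url(header).base64url(payload).base64url(hmac)."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    sig = hmac.new(SECRET, header + b"." + payload, hashlib.sha256).digest()
    return (header + b"." + payload + b"." + b64url(sig)).decode()

def validate(token: str) -> dict:
    """Check signature and expiration; return claims if valid."""
    header, payload, sig = token.split(".")
    expected = hmac.new(SECRET, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(b64url(expected).decode(), sig):
        raise ValueError("bad signature")
    claims = json.loads(b64url_decode(payload))
    if claims.get("exp", 0) < time.time():
        raise ValueError("token expired")
    return claims

token = sign({"sub": "user-789", "exp": time.time() + 3600})
print(validate(token)["sub"])  # user-789
```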
| Gateway | Type | Strengths | Considerations |
|---|---|---|---|
| Kong | Open Source/Enterprise | Plugin ecosystem, Lua extensibility | Operational complexity |
| AWS API Gateway | Managed Cloud | Deep AWS integration, serverless | Vendor lock-in |
| Nginx/OpenResty | Traditional/Extended | Performance, wide adoption | Limited dynamic routing |
| Envoy | Cloud Native Proxy | L7 proxy, service mesh foundation | Complexity for simple cases |
| Spring Cloud Gateway | Java Ecosystem | Tight Spring integration | JVM overhead |
| GraphQL Federation | Query Language | Unified schema, type safety | Learning curve |
Rate Limiting and Throttling:
API Gateways protect backend services from overload:
| Algorithm | Behavior | Use Case |
|---|---|---|
| Token Bucket | Allows bursts up to bucket size | API rate limiting |
| Leaky Bucket | Smooth, constant output rate | Protecting fragile backends |
| Fixed Window | Count requests per time window | Simple quota enforcement |
| Sliding Window | Rolling count over time | More accurate rate limiting |
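The token bucket (first row of the table) is compact enough to sketch in full. Tokens refill at a steady rate; a client can burst up to `capacity` requests at once, then is throttled until tokens accumulate again. The numbers below are illustrative.

```python
import time

class TokenBucket:
    """Token bucket rate limiter: refills continuously, allows bursts."""

    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = capacity          # start full: a burst is allowed immediately
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Add tokens for the elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=3, refill_per_sec=1.0)
results = [bucket.allow() for _ in range(5)]
print(results)  # first 3 allowed (burst), remainder rejected until tokens refill
```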
Response Aggregation (Backend for Frontend - BFF):
For mobile or specific clients, the gateway can aggregate multiple service responses:
┌─────────────────────────────────────────────────────────────┐
│ Traditional Approach │
│ │
│ Mobile App → GET /user/123 → User Service │
│ → GET /user/123/orders → Order Service │
│ → GET /recommendations → Recommendation Service │
│ │
│ Result: 3 round trips, higher latency │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ BFF Pattern │
│ │
│ Mobile App → GET /mobile/home → Mobile BFF Gateway │
│ │ │
│ ├─→ User Service │
│ ├─→ Order Service │
│ └─→ Recommendation │
│ │
│ Result: 1 round trip, lower latency, optimized payload │
└─────────────────────────────────────────────────────────────┘
Avoid putting business logic in the API Gateway—it should focus on cross-cutting concerns. Also avoid 'gateway monolith' where all customization accumulates in one place. Consider multiple specialized gateways (mobile BFF, partner API gateway) rather than one overloaded gateway.
In monolithic applications, failure is binary—the application works or it doesn't. In microservices, partial failure is the norm. Services fail, networks partition, and latency spikes occur constantly. Designing for resilience is not optional—it's essential for system survival.
Understanding Failure Modes:
| Failure Type | Description | Example |
|---|---|---|
| Crash Failure | Service terminates unexpectedly | Out of memory, unhandled exception |
| Latency Degradation | Service responds but slowly | Database connection pool exhaustion |
| Partial Failure | Some requests fail, others succeed | One container overloaded |
| Byzantine Failure | Service returns incorrect results | Bug in business logic |
| Network Partition | Services can't reach each other | Switch failure, DNS issue |
Circuit Breaker Pattern:
Prevent cascade failures by stopping requests to failing services:
┌─────────────────────────┐
│ Circuit Breaker │
│ │
│ ┌───────────────────┐ │
Order Service ──────│──│ CLOSED (normal) │──│────► User Service
│ │ └─────────┬─────────┘ │ │
│ │ │ │ │
│ │ Failures exceed │ ✗ Fails
│ │ threshold │
│ │ │ │
│ │ ▼ │
│ │ ┌───────────────────┐ │
│ │ │ OPEN (failing) │──┼────► Immediate failure
│ │ │ No requests │ │ (no connection attempt)
│ │ └─────────┬─────────┘ │
│ │ │ │
│ │ Timeout expires │
│ │ │ │
│ │ ▼ │
│ │ ┌───────────────────┐ │
│ │ │ HALF-OPEN │──┼────► Limited test requests
│ │ │ (testing) │ │
│ │ └───────────────────┘ │
└─────────────────────────┘
Benefits:
- Fails fast instead of letting requests pile up behind a failing service
- Gives the failing service time to recover without additional load
- Prevents thread and connection exhaustion in the caller
Implementing Resilience in Practice:
1. Timeouts (Defensive Coding):
import requests

# Without timeout: can hang indefinitely
response = requests.get('http://user-service/users/123')

# With timeout: fails fast if the service is slow
response = requests.get(
    'http://user-service/users/123',
    timeout=(1.0, 5.0)  # (connect timeout, read timeout)
)
2. Retry with Exponential Backoff:
Attempt 1: Immediate
Attempt 2: Wait 100ms
Attempt 3: Wait 200ms
Attempt 4: Wait 400ms
+ Jitter: Random component to avoid synchronized retries
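The schedule above (base 100 ms, doubling per attempt) plus "full jitter" can be sketched as follows; the cap value is an illustrative assumption:

```python
import random

def backoff_ms(attempt: int, base_ms: float = 100.0, cap_ms: float = 10_000.0) -> float:
    """Delay before retry `attempt` (1-indexed), with full jitter.

    Sampling uniformly in [0, base * 2^(attempt-1)] spreads out clients
    that failed at the same moment, avoiding synchronized retry storms.
    """
    exp = min(cap_ms, base_ms * (2 ** (attempt - 1)))
    return random.uniform(0, exp)

for attempt in range(1, 5):
    print(f"attempt {attempt}: up to {100 * 2 ** (attempt - 1)} ms, "
          f"sampled {backoff_ms(attempt):.0f} ms")
```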
3. Circuit Breaker State Transitions:
CLOSED → OPEN: 5 failures within 30 seconds
OPEN → HALF-OPEN: After 30 seconds timeout
HALF-OPEN → CLOSED: 3 consecutive successes
HALF-OPEN → OPEN: Any failure
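The state machine above fits in a small class. This is a single-threaded sketch of the pattern, not a production implementation (real libraries add thread safety, sliding failure windows, and metrics); the thresholds mirror the transitions listed above.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: CLOSED -> OPEN -> HALF-OPEN -> CLOSED."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0, success_threshold=3):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.success_threshold = success_threshold
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "HALF-OPEN"      # timeout expired: probe the service
                self.successes = 0
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # Any failure while half-open, or too many while closed, trips the breaker.
            if self.state == "HALF-OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        if self.state == "HALF-OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "CLOSED"
                self.failures = 0
        else:
            self.failures = 0                 # a success resets the failure count
        return result

breaker = CircuitBreaker(reset_timeout=0.1)   # short timeout for the demo
def flaky():
    raise ConnectionError("service down")
for _ in range(5):
    try:
        breaker.call(flaky)
    except Exception:
        pass
print(breaker.state)  # OPEN
```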
Libraries: Resilience4j (Java), Polly (.NET), gobreaker (Go), Hystrix (Java, now in maintenance mode).
Without resilience patterns, a single slow service can take down an entire microservices system. Requests pile up waiting for the slow service, consuming threads and connections. Other services become slow, triggering more cascades. This 'gray failure' is often worse than a clean crash because it's harder to detect and recover from.
A service mesh is a dedicated infrastructure layer for handling service-to-service communication in microservices. It provides a uniform way to connect, secure, and observe services without requiring changes to application code.
The Sidecar Pattern:
The service mesh deploys a proxy (sidecar) alongside each service instance:
┌──────────────────────────────────────────────────────────────┐
│ Pod/Container │
│ ┌────────────────────┐ ┌────────────────────────────┐ │
│ │ Application │ │ Sidecar Proxy │ │
│ │ (User Service) │────▶│ (Envoy) │ │
│ │ │ │ │ │
│ │ localhost:8080 │ │ • mTLS │ │
│ │ │◀────│ • Load balancing │ │
│ └────────────────────┘ │ • Circuit breaking │ │
│ │ • Observability │ │
│ │ • Traffic control │ │
│ └────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
│
│ Outbound traffic
▼
┌──────────────────────────────────────────────────────────────┐
│ Pod/Container │
│ ┌────────────────────┐ ┌────────────────────────────┐ │
│ │ Sidecar Proxy │ │ Application │ │
│ │ (Envoy) │────▶│ (Order Service) │ │
│ └────────────────────┘ └────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
Data Plane vs. Control Plane:
| Component | Function | Example |
|---|---|---|
| Data Plane | Sidecar proxies that handle actual traffic | Envoy, Linkerd-proxy |
| Control Plane | Management layer that configures proxies | Istio, Linkerd, Consul Connect |
Traffic Management:
Service meshes provide sophisticated traffic control:
Canary Deployments:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: user-service
spec:
  hosts:
  - user-service
  http:
  - route:
    - destination:
        host: user-service
        subset: v1
      weight: 90
    - destination:
        host: user-service
        subset: v2-canary
      weight: 10
A/B Testing:
http:
- match:
- headers:
x-experiment-group:
exact: "treatment"
route:
- destination:
host: user-service
subset: experimental
Fault Injection (Chaos Engineering):
http:
- fault:
delay:
percentage:
value: 10
fixedDelay: 5s
abort:
percentage:
value: 1
httpStatus: 503
| Feature | Istio | Linkerd | Consul Connect |
|---|---|---|---|
| Data Plane | Envoy | Linkerd-proxy (Rust) | Envoy or built-in |
| Complexity | High | Low-Medium | Medium |
| Resource Usage | Higher | Lower | Medium |
| mTLS | Yes | Yes | Yes |
| Traffic Management | Extensive | Basic | Good |
| Multi-cluster | Yes | Yes | Yes |
| Best For | Full feature set | Simplicity, Kubernetes | Multi-platform |
Service meshes add operational complexity and resource overhead. Consider a mesh when you have 20+ services needing consistent security (mTLS), observability, or sophisticated traffic management. For smaller deployments, application-level libraries (Resilience4j, Polly) may be more appropriate.
In monolithic applications, debugging means looking at one process. In microservices, understanding system behavior requires observability—the ability to understand internal state from external outputs. Observability rests on three pillars:
The Three Pillars:
- Metrics: numeric time-series data (request rates, latencies, error counts)
- Logs: discrete, timestamped event records
- Traces: the path of a single request as it crosses service boundaries
Distributed Tracing:
Tracing is especially critical for microservices because a single user request may touch many services:
┌────────────────────────────────────────────────────────────────────────┐
│ Trace ID: abc-123 │
│ │
│ API Gateway [████████████████████████████████████] 200ms │
│ │ │
│ ├──▶ User Service [██████████] 45ms │
│ │ │ │
│ │ └──▶ Redis Cache [██] 5ms │
│ │ │
│ ├──▶ Order Service [████████████████████████] 120ms │
│ │ │ │
│ │ ├──▶ Inventory Service [████████] 40ms │
│ │ │ │ │
│ │ │ └──▶ PostgreSQL [███] 15ms │
│ │ │ │
│ │ └──▶ Payment Service [██████████] 50ms │
│ │ │ │
│ │ └──▶ External Payment API [████████] 35ms │
│ │ │
│ └──▶ Notification Service [████] 20ms (async) │
└────────────────────────────────────────────────────────────────────────┘
Trace Context Propagation:
Traces work by propagating context across service boundaries:
Service A Service B
│ │
│ HTTP Request │
│ Headers: │
│ traceparent: 00-abc123-... │
│ tracestate: vendor=value │
│ ─────────────────────────────────▶│
│ │
│ Extract trace context
│ Create child span
│ Include in outgoing requests
W3C Trace Context (the traceparent and tracestate headers above) is the standard for propagating trace context across different tracing systems.
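The traceparent header has the shape version-traceid-spanid-flags. A minimal sketch of propagation: a service receiving a request keeps the trace-id but mints a fresh span-id for each outgoing call. (In practice, instrumentation libraries such as OpenTelemetry do this automatically.)

```python
import secrets

def new_traceparent() -> str:
    """Start a new trace: 16-byte trace-id, 8-byte span-id, sampled flag."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"

def child_traceparent(incoming: str) -> str:
    """Propagate: keep trace-id and flags, mint a new span-id."""
    version, trace_id, _parent_span, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

root = new_traceparent()
child = child_traceparent(root)
print(root)
print(child)  # same trace-id, new span-id
```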
Observability Stack:
| Component | Open Source Options | Cloud Options |
|---|---|---|
| Metrics Collection | Prometheus, StatsD | CloudWatch, Datadog |
| Metrics Storage | Prometheus, VictoriaMetrics | Managed services |
| Log Aggregation | ELK Stack, Loki | CloudWatch Logs, Splunk |
| Distributed Tracing | Jaeger, Zipkin | X-Ray, Honeycomb |
| Visualization | Grafana | Built into cloud services |
| Alerting | Alertmanager, PagerDuty | Cloud-native alerting |
Structured Logging:
For effective log analysis, use structured (JSON) logging:
{
"timestamp": "2024-01-15T10:30:00Z",
"level": "INFO",
"service": "order-service",
"traceId": "abc-123",
"spanId": "def-456",
"userId": "user-789",
"message": "Order created",
"orderId": "order-101",
"amount": 99.99,
"duration_ms": 45
}
Structured logs enable querying, aggregation, and correlation that free-text logs can't support.
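Emitting records in that shape takes only a custom formatter on top of the standard library. This is a sketch: the field names (traceId, spanId) follow the example above, and a real service would pull them from the active trace context rather than passing them by hand.

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "order-service",
            "message": record.getMessage(),
        }
        # Merge any structured fields attached via `extra`.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("order-service")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("Order created",
         extra={"fields": {"traceId": "abc-123", "orderId": "order-101"}})
```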
Don't wait until problems occur to build observability. Instrument services from the start, establish SLOs (Service Level Objectives), and practice debugging before production incidents. The cost of observability infrastructure is trivial compared to the cost of extended outages without visibility.
We've conducted an extensive exploration of microservices architecture—understanding not just what it is, but the profound network and operational implications of distributing application components across services. Let's consolidate the essential insights:
| Aspect | Monolithic | Microservices |
|---|---|---|
| Deployment | Single artifact | Many independent services |
| Scaling | Uniform | Per-service |
| Internal Communication | In-process | Network-based |
| Data Management | Shared database | Database per service |
| Technology Stack | Uniform | Polyglot |
| Team Structure | Feature teams in shared codebase | Service-owning teams |
| Operational Complexity | Lower | Higher |
| Failure Modes | Binary (works/fails) | Partial failures |
When to Choose Microservices:
✅ Large organizations with multiple autonomous teams
✅ Need for independent scaling of specific components
✅ Polyglot requirements (different services suit different technologies)
✅ Complex domains benefiting from bounded context isolation
✅ High availability requirements with graceful degradation
When to Avoid Microservices:
❌ Small teams (fewer than 10-15 engineers)
❌ Early-stage products with unclear requirements
❌ Simple domains without complex scaling needs
❌ Organizations without DevOps maturity
❌ When distributed system expertise is lacking
Transition to Web Applications:
With our understanding of both monolithic and microservices architectures complete, we'll next explore web applications—examining how these architectural patterns manifest in the specific context of HTTP-based applications serving browsers and providing APIs.
You now possess comprehensive knowledge of microservices architecture from a computer networks perspective—understanding not just the conceptual model but the network protocols, infrastructure requirements, and operational patterns that make distributed systems work. This knowledge is essential for designing, building, and operating modern application layer systems.