Throughout this module, we've built a comprehensive understanding of hybrid cloud connectivity, data strategies, and migration patterns. But theory only becomes valuable when applied to real business problems.
Hybrid cloud isn't just a transition phase—it's a strategic architecture that solves specific business challenges better than pure on-premises or pure cloud alternatives. Understanding these use cases helps architects recognize when hybrid is the right answer and how to implement it effectively.
This page examines five principal use cases where hybrid cloud delivers unique value:

1. Disaster Recovery (DR) in the cloud
2. Cloud bursting for peak demand
3. Development and testing in the cloud
4. Compliance-driven hybrid architecture
5. Hybrid analytics and machine learning
By the end of this page, you will understand the business drivers, technical architectures, and implementation considerations for each major hybrid cloud use case. You'll be able to identify hybrid opportunities in your own organization and design solutions that optimize for cost, performance, and risk.
Disaster Recovery (DR) is often the gateway to hybrid cloud adoption. Traditional DR requires a secondary data center—massive capital expense for infrastructure that sits idle 99% of the time. Cloud-based DR fundamentally changes the economics.
| Pattern | RTO | RPO | Cost | Best For |
|---|---|---|---|---|
| Backup & Restore | Hours to Days | 24 hours | Lowest | Non-critical systems, archives |
| Pilot Light | 10-30 minutes | Minutes | Low | Critical systems, database-centric |
| Warm Standby | Minutes | Seconds | Medium | Business-critical systems |
| Multi-Site Active-Active | Seconds | Zero | High | Mission-critical, zero tolerance |
Pilot Light Pattern Deep Dive:
The Pilot Light pattern is particularly popular for hybrid DR. It keeps the minimum necessary infrastructure running in the cloud—like a furnace pilot light that can ignite full capacity when needed.
Components:
Failover Process:
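The specifics vary by stack, but a pilot-light failover typically promotes the replicated database, scales up the dormant application tier, and repoints DNS. Below is a minimal sketch using boto3; all resource identifiers and endpoints are hypothetical placeholders, not a prescribed runbook.

```python
# Pilot-light failover sketch. All names (DB_REPLICA_ID, ASG_NAME,
# HOSTED_ZONE_ID, domains) are hypothetical placeholders.
DB_REPLICA_ID = "app-db-replica"       # cross-region read replica, kept in sync
ASG_NAME = "app-web-asg"               # pre-baked AMIs, normally scaled to zero
HOSTED_ZONE_ID = "Z0000000000EXAMPLE"  # Route 53 zone for the app domain


def dns_failover_change(record_name: str, target: str, ttl: int = 60) -> dict:
    """Build the Route 53 ChangeBatch that repoints the app at the DR endpoint."""
    return {
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "TTL": ttl,
                "ResourceRecords": [{"Value": target}],
            },
        }]
    }


def fail_over_to_cloud():
    """Ignite the pilot light: promote the DB, scale out, repoint DNS."""
    import boto3  # imported here so the pure helper above needs no AWS SDK
    rds = boto3.client("rds")
    asg = boto3.client("autoscaling")
    r53 = boto3.client("route53")

    # 1. Promote the replicated database to a standalone primary.
    rds.promote_read_replica(DBInstanceIdentifier=DB_REPLICA_ID)

    # 2. Scale the dormant application tier from zero to full capacity.
    asg.update_auto_scaling_group(
        AutoScalingGroupName=ASG_NAME, MinSize=2, MaxSize=8, DesiredCapacity=4)

    # 3. Repoint DNS at the cloud load balancer (a low TTL keeps cutover fast).
    r53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch=dns_failover_change("app.example.com.",
                                        "dr-lb.us-east-1.elb.amazonaws.com"),
    )
```

Scripting each step is what makes the quarterly drills described below repeatable: the runbook becomes code you can time, test, and improve.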
A DR plan that isn't tested is a DR plan that won't work. Schedule quarterly DR drills. Practice the runbook. Time the recovery. Find gaps before a real disaster reveals them. Cloud makes DR testing affordable—there's no excuse not to test.
Cloud bursting extends on-premises capacity by seamlessly expanding into cloud when demand exceeds local capacity. It combines the predictable costs of owned infrastructure for baseline load with cloud elasticity for peaks.
Implementation Pattern:
┌─────────────────┐
│ Load Balancer │
│ (Cloud-based) │
└────────┬────────┘
│
┌─────────────────┼─────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ On-Prem │ │ On-Prem │ │ Cloud │
│ Servers │ │ Servers │ │ Instances │
│ (baseline) │ │ (baseline) │ │ (burst) │
└──────────────┘ └──────────────┘ └──────────────┘
│ │ │
└─────────────────┴─────────────────┘
│
┌────────▼────────┐
│ Shared Data │
│ (On-prem + cache)│
└─────────────────┘
The load balancer (often cloud-hosted with endpoints in both environments) distributes traffic. Auto-scaling policies trigger cloud capacity when on-prem utilization exceeds thresholds.
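The scaling decision itself is simple threshold logic. A hysteresis band (scale out at one threshold, scale in at a lower one) prevents instances from flapping when utilization hovers at the boundary. The thresholds and step sizes below are illustrative assumptions:

```python
def burst_capacity_needed(onprem_utilization: float,
                          current_burst: int = 0,
                          burst_threshold: float = 0.80,
                          release_threshold: float = 0.60,
                          step: int = 2,
                          max_burst: int = 10) -> int:
    """Return the desired count of cloud burst instances.

    Scale out when on-prem utilization crosses burst_threshold; scale in
    only once it drops below release_threshold (hysteresis avoids flapping).
    """
    if onprem_utilization >= burst_threshold:
        return min(current_burst + step, max_burst)
    if onprem_utilization < release_threshold:
        return max(current_burst - step, 0)
    return current_burst


# Peak traffic pushes on-prem servers past 80%: add burst capacity.
print(burst_capacity_needed(0.92, current_burst=0))  # -> 2
# Utilization inside the hysteresis band: hold steady.
print(burst_capacity_needed(0.70, current_burst=4))  # -> 4
# Demand has subsided: release cloud instances.
print(burst_capacity_needed(0.45, current_burst=4))  # -> 2
```

In practice this function would run inside a monitoring loop or be expressed as cloud auto-scaling policies keyed to an on-prem utilization metric pushed to the provider's monitoring service.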
Sometimes cloud bursting's complexity isn't worth it. If peaks are frequent, or on-prem capacity sits chronically underutilized, moving entirely to cloud with auto-scaling may be simpler and cheaper. Do the TCO analysis before architecting a complex bursting solution.
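A back-of-envelope comparison makes the trade-off concrete. Every dollar figure below is an illustrative assumption, not a benchmark; plug in your own numbers:

```python
HOURS_PER_YEAR = 8760


def hybrid_tco(onprem_annual: float, burst_hours: float,
               burst_instances: int, cloud_hourly: float,
               hybrid_ops_overhead: float) -> float:
    """Annual cost of baseline on-prem capacity plus occasional cloud bursts,
    including the extra operational overhead of running two environments."""
    burst_cost = burst_hours * burst_instances * cloud_hourly
    return onprem_annual + burst_cost + hybrid_ops_overhead


def full_cloud_tco(avg_instances: float, cloud_hourly: float) -> float:
    """Annual cost of running the whole workload on auto-scaled cloud capacity."""
    return avg_instances * cloud_hourly * HOURS_PER_YEAR


# Illustrative scenario: 300 burst hours per year, 10 burst instances.
hybrid = hybrid_tco(onprem_annual=140_000, burst_hours=300,
                    burst_instances=10, cloud_hourly=1.0,
                    hybrid_ops_overhead=30_000)           # 173,000
cloud = full_cloud_tco(avg_instances=20, cloud_hourly=1.0)  # 175,200
print(f"hybrid: ${hybrid:,.0f}  full cloud: ${cloud:,.0f}")
```

Notice how close the two figures can land once hybrid operational overhead is counted: small changes in burst frequency or on-prem utilization flip the answer, which is exactly why the analysis must come before the architecture.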
Cloud-based development and testing is one of the lowest-risk hybrid use cases. Development environments can be ephemeral, isolated, and experimental without affecting production systems.
| Pattern | Description | Trigger | Lifecycle |
|---|---|---|---|
| Feature Branch Env | Complete stack per feature branch | PR opened | Auto-destroy on merge/close |
| Scheduled Shared Dev | Shared dev servers with shutdown schedules | Business hours | Running 10h/day, stopped 14h |
| On-Demand Load Test | Scale-out environment for performance testing | Manual trigger | Hours—destroy after test |
| Persistent Integration | Always-on integration environment | Continuous | Permanent, right-sized |
| Preview Environments | Customer-visible staging per feature | Release candidate | Days—until promoted |
```yaml
# GitHub Actions workflow for ephemeral preview environments
# Creates cloud environment on PR, destroys on merge/close

name: Preview Environment

on:
  pull_request:
    types: [opened, synchronize, reopened, closed]

env:
  AWS_REGION: us-east-1
  ENVIRONMENT_NAME: preview-${{ github.event.pull_request.number }}

jobs:
  deploy-preview:
    if: github.event.action != 'closed'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/github-actions-deploy
          aws-region: ${{ env.AWS_REGION }}

      - name: Deploy preview environment
        run: |
          # Using Terraform to provision ephemeral environment
          cd infrastructure/preview
          terraform init
          terraform workspace select ${{ env.ENVIRONMENT_NAME }} || \
            terraform workspace new ${{ env.ENVIRONMENT_NAME }}
          terraform apply -auto-approve \
            -var="environment_name=${{ env.ENVIRONMENT_NAME }}" \
            -var="git_sha=${{ github.sha }}"

      - name: Comment PR with preview URL
        uses: actions/github-script@v7
        with:
          script: |
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '🚀 Preview environment deployed: https://${{ env.ENVIRONMENT_NAME }}.preview.example.com'
            })

  destroy-preview:
    if: github.event.action == 'closed'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/github-actions-deploy
          aws-region: ${{ env.AWS_REGION }}

      - name: Destroy preview environment
        run: |
          cd infrastructure/preview
          terraform init
          terraform workspace select ${{ env.ENVIRONMENT_NAME }}
          terraform destroy -auto-approve
          terraform workspace select default
          terraform workspace delete ${{ env.ENVIRONMENT_NAME }}
```

If developers need access to on-premises databases or services, establish secure connectivity (VPN, Direct Connect) from cloud dev environments to on-prem resources. Use read replicas or data subsets to avoid impacting production systems with dev traffic.
Regulatory compliance often mandates hybrid architecture—not as a transition phase, but as a permanent design. When laws require specific data to remain in specific locations or under specific controls, hybrid becomes the only compliant option.
Architecture Pattern: Regulatory Boundary
The key design pattern separates workloads by regulatory scope:
┌────────────────────────────────────────────────────────────────┐
│ CLOUD (Unrestricted Zone) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ Public Web │ │ Analytics │ │ Non-sensitive │ │
│ │ Application │ │ & ML │ │ Microservices │ │
│ └──────────────┘ └──────────────┘ └──────────────────────┘ │
│ │ │
│ │ Anonymized / Tokenized Data │
│ ▼ │
└────────────────────────────────────────────────────────────────┘
│
│ Encrypted API Calls
│ VPN / Direct Connect
▼
┌────────────────────────────────────────────────────────────────┐
│ ON-PREMISES (Regulated Zone) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ Customer │ │ Transaction │ │ Identity & │ │
│ │ PII Database │ │ Processing │ │ Access Mgmt │ │
│ └──────────────┘ └──────────────┘ └──────────────────────┘ │
│ │
│ Tokenization Service: Replaces PII with tokens before │
│ data leaves regulated zone │
└────────────────────────────────────────────────────────────────┘
Cloud applications receive only tokenized or anonymized data. Real PII remains on-premises. When cloud needs actual data (e.g., customer name for display), it makes a secure API call to the on-prem tokenization service.
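The core contract of a tokenization service is small: hand out a stable, meaningless token for each PII value, and resolve tokens back to values only inside the regulated zone. The in-memory sketch below illustrates that contract; a real on-prem implementation would back the vault with an encrypted database and expose it only through an authenticated API.

```python
import secrets


class TokenizationService:
    """Minimal in-memory sketch of an on-prem tokenization vault.

    Illustration only: a production vault persists mappings in an
    encrypted store and enforces authentication and audit logging.
    """

    def __init__(self):
        self._token_to_pii: dict[str, str] = {}
        self._pii_to_token: dict[str, str] = {}

    def tokenize(self, pii_value: str) -> str:
        """Return a stable random token for a PII value (called before
        any data is allowed to leave the regulated zone)."""
        if pii_value in self._pii_to_token:
            return self._pii_to_token[pii_value]   # same value, same token
        token = "tok_" + secrets.token_hex(16)     # no derivable link to PII
        self._token_to_pii[token] = pii_value
        self._pii_to_token[pii_value] = token
        return token

    def detokenize(self, token: str) -> str:
        """Resolve a token back to the real value; in production this is
        the secured API endpoint the cloud application calls."""
        return self._token_to_pii[token]
```

Because tokens are random rather than derived from the data, a breach of the cloud environment exposes nothing reversible; the mapping exists only inside the regulated zone.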
| Requirement | On-Prem Implementation | Cloud Extension |
|---|---|---|
| Data Residency | All PII stored locally | Only tokens/aggregates in cloud |
| Audit Logging | Immutable on-prem log storage | Logs stream to SIEM; retain locally |
| Encryption Keys | On-prem HSM manages keys | BYOK to cloud KMS; or cloud HSM in region |
| Access Control | On-prem identity provider (AD) | Federation to cloud IAM via SAML/OIDC |
| Network Isolation | Air-gapped or highly segmented | Private subnets, no internet gateway |
Never design cloud architecture and then 'add compliance later.' Regulatory requirements must be understood upfront and drive architectural decisions. A design that violates compliance requirements must be rejected, regardless of how elegant or cost-effective it appears.
Analytics and Machine Learning represent perhaps the most compelling hybrid use case. Organizations have massive data repositories on-premises—historical transactions, manufacturing telemetry, customer interactions—that they want to analyze using cloud-scale compute and specialized ML services.
Moving all this data to cloud is often impractical (petabytes, ongoing accumulation) or prohibited (compliance). Hybrid enables cloud intelligence on on-premises data.
```python
# Hybrid ML Pipeline: Train in Cloud, Deploy On-Prem
# Using SageMaker for training, export model for edge deployment

import tarfile

import boto3
from sagemaker import get_execution_role
from sagemaker.sklearn import SKLearn


def train_model_in_cloud(training_data_s3_path):
    """
    Train model using SageMaker managed training.
    Training data was synced from on-prem via DataSync.
    """
    sklearn_estimator = SKLearn(
        entry_point='train.py',
        source_dir='model_code/',
        role=get_execution_role(),
        instance_count=1,
        instance_type='ml.m5.xlarge',
        framework_version='1.2-1',
        hyperparameters={
            'max_depth': 10,
            'n_estimators': 100
        }
    )
    sklearn_estimator.fit({'train': training_data_s3_path})
    return sklearn_estimator.model_data


def export_model_for_onprem(model_artifact_s3, output_path):
    """
    Download model artifact from S3 and extract.
    This model.tar.gz can be deployed to on-prem servers.
    """
    s3 = boto3.client('s3')
    bucket, key = model_artifact_s3.replace('s3://', '').split('/', 1)
    local_artifact = '/tmp/model.tar.gz'
    s3.download_file(bucket, key, local_artifact)

    with tarfile.open(local_artifact, 'r:gz') as tar:
        tar.extractall(output_path)

    print(f"Model extracted to {output_path}")
    print("Transfer this directory to on-prem inference servers.")
    return output_path


def deploy_model_onprem_inference(model_path):
    """
    On-premises inference code.
    Model runs locally on production servers.
    """
    import joblib
    model = joblib.load(f'{model_path}/model.joblib')

    # Production inference function
    def predict(features):
        return model.predict(features)

    return predict


if __name__ == "__main__":
    # 1. Train in cloud using on-prem data (synced to S3)
    model_artifact = train_model_in_cloud('s3://onprem-data-sync/training/')

    # 2. Export model for on-prem deployment
    export_model_for_onprem(model_artifact, '/shared/models/latest')

    # 3. On-prem servers pick up the new model and serve predictions locally.
    #    No cloud dependency for production inference.
```

Cloud providers now offer "cloud in your data center": AWS Outposts, Azure Stack, and Google Anthos on bare metal. These run cloud APIs on-premises, enabling cloud services (including ML inference) to run locally while maintaining cloud management. Consider these for complex hybrid analytics scenarios.
Beyond the five primary use cases, other hybrid patterns, such as staged migration, address specific business needs. The decision matrix below maps business drivers to recommended patterns and their primary trade-offs:
| Business Driver | Recommended Pattern | Primary Trade-off |
|---|---|---|
| Business continuity | Disaster Recovery | Ongoing replication cost vs. recovery capability |
| Variable demand | Cloud Bursting | Architecture complexity vs. capacity flexibility |
| Developer agility | Dev/Test in Cloud | Connectivity setup vs. developer productivity |
| Regulatory mandate | Compliance-Driven | Operational overhead vs. regulatory compliance |
| Data intelligence | Analytics Hybrid | Data movement costs vs. analytical capabilities |
| Technical debt | Staged Migration | Transition complexity vs. long-term architecture |
Most organizations don't fit cleanly into a single pattern. Production environments often combine multiple use cases—DR in cloud, dev/test in cloud, compliance-sensitive systems on-prem, analytics spanning both. Architecture must accommodate this complexity.
Regardless of the specific use case, successful hybrid implementations follow common practices and, just as importantly, avoid common mistakes:
| Mistake | Consequence | Prevention |
|---|---|---|
| No IP planning | Address conflicts, routing failures | Complete IP audit before any connectivity |
| Inconsistent security | Gaps at boundaries, audit failures | Unified security policy across environments |
| Manual operations | Slow, error-prone, doesn't scale | Automate from day one; IaC mandatory |
| Separate tooling | Context switching, skill gaps | Single platforms for observability, deployment |
| No DR testing | Plan fails when disaster strikes | Quarterly DR drills; iterate on runbooks |
Every hybrid use case adds operational complexity. Connections to maintain, data to sync, security to monitor, skills to develop. Ensure the business value justifies this complexity. Sometimes the right answer is full migration or staying purely on-prem.
We've completed our comprehensive journey through hybrid cloud architecture. This module has covered the foundational knowledge required to design, implement, and operate hybrid environments at enterprise scale.
The Hybrid Cloud Architect's Mindset:
Expert hybrid architects don't view hybrid as a compromise or a transition state. They recognize it as a powerful architecture pattern in its own right: one that places each workload where it runs best while keeping the whole estate manageable.
Mastering hybrid cloud means understanding that modern enterprises don't live in a single world—they operate across multiple infrastructure paradigms simultaneously. The ability to bridge these worlds elegantly, securely, and efficiently is a defining skill for cloud architects.
Congratulations! You've completed the Hybrid Cloud module. You now have comprehensive knowledge of connectivity technologies, data strategies, migration patterns, and real-world use cases. This foundation equips you to design hybrid architectures that meet complex business requirements while maintaining operational excellence.