The most valuable interview questions aren't about isolated facts—they're scenarios that mirror real production situations. These questions test your ability to synthesize knowledge across multiple domains, prioritize under uncertainty, and communicate effectively while problem-solving.
This page presents the most common practical scenarios encountered in network engineering interviews, complete with the interviewer's prompt, what they are looking for, and a structured example response for each.
By the end of this page, you will be able to confidently approach troubleshooting scenarios, design questions, and production incident simulations. You'll understand how to structure responses that demonstrate both technical depth and practical experience.
Troubleshooting scenarios test your diagnostic methodology, ability to prioritize, and capacity to work systematically under pressure. These are the bread and butter of network engineering interviews.
Interviewer Says:
"Users are reporting that our internal CRM application is very slow. Sometimes it takes 30 seconds to load a page. The application team says the server is fine. How would you investigate?"
• Systematic problem-solving approach vs. jumping to conclusions
• Ability to divide the problem between network and application
• Understanding of latency sources (DNS, TCP, TLS, routing)
• Communication skills while troubleshooting
• Experience with appropriate diagnostic tools
# STRUCTURED RESPONSE APPROACH

STEP 1: GATHER INFORMATION
"First, I'd clarify the symptoms:
- Is it slow for all users, or specific groups/locations?
- Is it slow at certain times, or consistently?
- When did it start? Any recent changes?
- Are other applications affected?

Let's say: It's slow for all users, started 3 days ago, only the CRM is affected, no known recent changes."

STEP 2: ISOLATE THE PROBLEM DOMAIN
"Since the app team says the server is fine, let's verify that and narrow down network vs. application:

I'd run tests from a workstation:
- ping crm-server.internal        → Tests basic L3 reachability
- traceroute crm-server.internal  → Identifies slow hop if any
- curl -w '%{time_connect} %{time_starttransfer} %{time_total}' https://crm/healthcheck

The curl timing breakdown shows:
- time_connect: TCP handshake latency
- time_starttransfer: Time to first byte (server processing)
- time_total: Full request time"

STEP 3: INTERPRET AND DRILL DOWN
"Suppose results show:
- Ping: 2ms (normal)
- Traceroute: All hops <5ms (normal)
- Curl: connect=0.002s, starttransfer=28s, total=28.5s

This tells me: Network is fine (fast connect), but time-to-first-byte is 28 seconds. The delay is in server processing, not network.

I'd push back to the app team with data: 'Network latency is 2ms, but time-to-first-byte is 28 seconds. Can you check database queries or external service calls?'"

STEP 4: ALTERNATIVE PATH (if network symptoms)
"If curl showed connect=25s, that indicates TCP handshake delay.
I'd investigate:
- DNS resolution time (dig crm-server.internal)
- Firewall processing (any new rules? Rate limiting?)
- Server load (is it ACKing slowly?)
- MTU issues (packet fragmentation, retransmissions)"

STEP 5: CONCLUDE WITH VERIFICATION
"Once root cause is found and fixed, I'd:
- Re-run timing tests to confirm improvement
- Set up monitoring for this metric going forward
- Document the incident for future reference"
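
To make the curl step repeatable, here is a minimal shell sketch that runs the same timing breakdown several times against the health-check URL from the scenario, so a one-off spike can be separated from a consistently slow time-to-first-byte (the URL is the scenario's placeholder, not a real endpoint):

```bash
#!/usr/bin/env bash
# Minimal sketch: repeat curl's timing breakdown to distinguish one-off
# spikes from consistently slow time-to-first-byte. The URL is the
# scenario's placeholder endpoint, not a real service.
URL="https://crm/healthcheck"

echo "dns_s  connect_s  first_byte_s  total_s"
for i in 1 2 3 4 5; do
  curl -s -o /dev/null \
       -w "%{time_namelookup}  %{time_connect}  %{time_starttransfer}  %{time_total}\n" \
       "$URL"
  sleep 2
done
```

Interviewer Says: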
"We're getting reports of intermittent connectivity issues. Some users lose connection to everything, then it works again a minute later. This has been happening for a week. How would you approach this?"
• Pattern recognition for intermittent issues
• Understanding of Layer 2 failure modes (STP, ARP, DHCP)
• Ability to triangulate from incomplete information
• Experience with root cause analysis over time
# INTERMITTENT CONNECTIVITY INVESTIGATION

PHASE 1: PATTERN IDENTIFICATION
"Intermittent issues require pattern analysis:
- Are the same users always affected, or random users?
- Is there a pattern in timing (time of day, duration)?
- Does it correlate with any other events (backups, scans)?
- What 'everything' means: Internet? Internal only? Both?

Key question: When connectivity fails, can users ping their default gateway? This isolates L2/L3 local issues from routing."

PHASE 2: HYPOTHESIS FORMATION

Based on symptoms, top hypotheses:

1. SPANNING TREE RECONVERGENCE
   - Symptoms: All users on a switch/VLAN lose connectivity briefly
   - Cause: Topology change → STP recalculates → 30-50s outage
   - Check: Switch logs for topology change notifications
   - Often caused by: Unmanaged switch plugged in, port flapping

2. DHCP ISSUES
   - Symptoms: Users lose connectivity, 'Network Limited'
   - Cause: Lease renewal failures, rogue DHCP, IP conflicts
   - Check: DHCP server logs, scope exhaustion, lease times

3. ARP TABLE ISSUES
   - Symptoms: Users can ping gateway IP but not beyond
   - Cause: ARP cache poisoning, duplicate IPs, flapping
   - Check: ARP tables on switches and gateway, gratuitous ARP

4. DEFAULT GATEWAY REDUNDANCY
   - Symptoms: Traffic fails during failover
   - Cause: VRRP/HSRP misconfiguration, preemption battles
   - Check: FHRP logs, virtual IP advertisement

PHASE 3: DATA COLLECTION
"I'd collect:
- Time-correlated logs from affected switches
- Spanning tree events: show spanning-tree detail
- MAC address table changes: show mac address-table count
- Router/switch CPU utilization (high CPU → slow to respond)
- Syslog correlation across infrastructure

For STP specifically: enable 'spanning-tree logging' and look for TCN (Topology Change Notification) events."

PHASE 4: ROOT CAUSE AND FIX
"If STP confirmed:
- Identify the port causing topology changes
- Enable BPDU Guard on access ports
- Enable PortFast on access ports
- Consider RSTP if using legacy 802.1D"
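
If the switches already send syslog to a central collector, the STP hypothesis can be tested quickly by bucketing topology-change and port-flap messages by hour and comparing those buckets against the reported outage times. A rough sketch, assuming Cisco-style log keywords and a hypothetical /var/log/network/switches.log:

```bash
#!/usr/bin/env bash
# Rough sketch: count spanning-tree topology-change / port up-down events
# per hour from a central syslog file. The log path and message keywords
# are assumptions; adjust for your platform's log format.
LOG="/var/log/network/switches.log"

# Classic syslog lines start with "Mon DD HH:MM:SS"; bucket matches by hour.
grep -Ei 'topology change|spantree|updown' "$LOG" |
  awk '{ print $1, $2, substr($3, 1, 2) ":00" }' |
  sort | uniq -c | sort -rn | head -20
```

Interviewer Says: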
"A developer says they can't reach a server at 10.50.20.100 from their workstation. But they can reach other servers in that same 10.50.20.0/24 network. What do you check?"
When one server is unreachable but others in the same subnet work, the problem is almost always: (1) The server itself (down, firewall, wrong IP), (2) A duplicate IP address situation, or (3) A very specific ACL targeting that IP. It's rarely a network routing issue since other IPs in the same subnet work.
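
A quick way to test the duplicate-IP hypothesis from a Linux host on the same subnet is to probe the address at layer 2 and see whether more than one MAC answers. A minimal sketch (the interface name is an assumption; flags follow the iputils version of arping):

```bash
#!/usr/bin/env bash
# Sketch: check whether 10.50.20.100 is answered by more than one MAC.
# Interface name is an assumption; arping flags follow iputils-arping.
TARGET="10.50.20.100"
IFACE="eth0"

# Layer-3 reachability first
ping -c 3 "$TARGET"

# Layer-2 probe: replies from two different MACs indicate a duplicate IP
arping -I "$IFACE" -c 4 "$TARGET"

# Compare the MAC our ARP cache holds against the server's real NIC address
# (taken from its console or out-of-band management)
ip neigh | grep "$TARGET"
```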
Design scenarios test architectural thinking, ability to balance trade-offs, and understanding of real-world constraints. These are common in senior and architect-level interviews.
Interviewer Says:
"We're a company with 3 offices: headquarters (500 users), a branch office (50 users), and a remote development team (20 users). We have an on-prem data center at HQ and use AWS for some workloads. Design the network connectivity."
• Ability to gather requirements before designing
• Understanding of WAN options (MPLS, SD-WAN, VPN)
• Cloud connectivity knowledge (Direct Connect, VPN)
• Redundancy and failover planning
• Cost-awareness in design decisions
# NETWORK DESIGN SCENARIO RESPONSE

STEP 1: REQUIREMENTS GATHERING
"Before designing, I'd ask:
- What applications do remote sites access? Latency-sensitive?
- What bandwidth is needed at each site?
- Uptime requirements? Is 99.9% acceptable, or need 99.99%?
- Budget constraints? Enterprise WAN can be expensive.
- Compliance requirements? Any data that can't traverse internet?
- Future growth? Are additional sites planned?

Assumptions for this exercise:
- Standard office applications + VoIP at branches
- 100 Mbps at HQ, 25 Mbps at branch, 10 Mbps at remote site
- 99.9% uptime acceptable
- Cost-conscious but not minimal
- No strict compliance requiring private circuits"

STEP 2: HIGH-LEVEL ARCHITECTURE

HEADQUARTERS (500 users)
  Core Switches (L3, redundant)
    └─ Data Center (on-prem apps)
  Firewalls (HA, Active/Standby)
    └─ DMZ (web servers)
  WAN Edge (SD-WAN)
    └─ Primary ISP + Backup ISP

Connected to HQ over the SD-WAN overlay via the dual ISPs:
  AWS Cloud        → VPC + Transit Gateway
  Branch Office    → SD-WAN Edge + Local ISP + LTE Backup
  Remote Dev Team  → SD-WAN Appliance or VPN Client (ZTNA)

STEP 3: COMPONENT DECISIONS

WAN Technology: SD-WAN + Dual ISP
- More cost-effective than MPLS for this size
- Application-aware routing for VoIP QoS
- Automatic failover between links
- Encrypted overlay for security

AWS Connectivity:
- Primary: AWS Site-to-Site VPN over SD-WAN fabric
- If latency-critical or high-volume: Consider Direct Connect later
- Transit Gateway for centralized cloud networking

Redundancy Strategy:
- HQ: Dual ISPs, active/active SD-WAN
- Branch: Primary ISP + LTE failover
- Remote: SD-WAN appliance or Zero Trust Client (Zscaler/Cloudflare)

Routing:
- BGP between SD-WAN edges (overlay routing)
- OSPF internally at HQ
- Static or simple at remote sites

Interviewer Says:
"Our web application needs to handle 10,000 concurrent users with 99.99% uptime. We're currently on single servers for each tier. How would you design the network to support high availability?"
# HIGH AVAILABILITY NETWORK DESIGN

UNDERSTANDING THE REQUIREMENT
"99.99% uptime = 52.6 minutes downtime per year.
This requires no single point of failure in the critical path."

MULTI-TIER HA ARCHITECTURE

Internet (Multiple ISPs for ingress diversity)
      │
      ▼
EDGE LAYER
  Edge RTR-1 (ISP-A) ◄── BGP Anycast ──► Edge RTR-2 (ISP-B)
  ECMP/LAG on the links toward the core
      │
      ▼
LOAD BALANCER LAYER
  LB-1 (Active) ◄── VRRP/GARP, health-sync ──► LB-2 (Standby)
  VIP: 10.0.1.100 (floats between LB-1/LB-2)
      │
      ▼
WEB TIER
  Web-1 (AZ-1)   Web-2 (AZ-1)   Web-3 (AZ-2)   Web-N (AZ-2)
  - Deployed across Availability Zones
  - Server count based on capacity planning
  - Health checks remove failed instances
      │
      ▼
APPLICATION TIER
  (Similar pattern: multiple instances across AZs)
  Internal load balancer for app tier
      │
      ▼
DATABASE TIER
  DB Primary (AZ-1) ◄── sync replication ──► DB Replica (AZ-2)
  - Synchronous replication for zero data loss
  - Automatic failover (Patroni, RDS Multi-AZ, etc.)

NETWORK HA ELEMENTS

1. EDGE REDUNDANCY
   - Multiple ISP connections with BGP
   - Different physical paths (diverse entry points)
   - Fast convergence tuning (BFD + tuned BGP timers)

2. CORE NETWORK REDUNDANCY
   - Dual spine switches in leaf-spine topology
   - ECMP for load distribution and failover
   - All links in LAG (Link Aggregation Groups)

3. LOAD BALANCER REDUNDANCY
   - Active/Standby or Active/Active pair
   - Shared VIP with VRRP or vendor equivalent
   - Session state synchronization for stateful failover

4. SERVER FARM REDUNDANCY
   - Minimum 2 servers per tier
   - Spread across failure domains (racks, AZs)
   - Health checks with quick detection (5-10 sec)

5. DATABASE REDUNDANCY
   - Synchronous replication for RPO=0
   - Automated failover for RTO < 30 seconds
   - Read replicas for read scaling (separate concern)

Mention that HA isn't just about component redundancy—it's about failure domain isolation. If both web servers are on the same switch, switch failure takes both down. If both AZs share the same power grid, that's a shared failure domain. Strong answers show awareness of blast radius and failure domain thinking.
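
To put the availability target in numbers (the 52.6 minutes mentioned above), here is a small sketch that converts an uptime percentage into a yearly downtime budget:

```bash
#!/usr/bin/env bash
# Sketch: convert an availability target into a downtime budget.
# 99.99% of a 525,600-minute year leaves roughly 52.6 minutes of downtime.
for sla in 99.9 99.95 99.99 99.999; do
  awk -v sla="$sla" 'BEGIN {
    minutes_per_year = 365 * 24 * 60
    downtime = minutes_per_year * (1 - sla / 100)
    printf "%-7s%% -> %6.1f minutes/year (%5.2f minutes/month)\n", sla, downtime, downtime / 12
  }'
done
```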
Security scenarios test both defensive thinking and understanding of attack vectors. They're increasingly common as security becomes integrated into all network roles.
Interviewer Says:
"Our security team has detected unusual outbound traffic from a server—large data transfers to an unknown external IP at 3 AM. As the network engineer, how would you respond?"
• Incident response priorities (contain, preserve, investigate)
• Network forensics capabilities
• Understanding of data exfiltration patterns
• Coordination with security team
• Calm, methodical approach under pressure
# SECURITY INCIDENT RESPONSE

PHASE 1: IMMEDIATE ACTIONS (CONTAIN + PRESERVE)

"First, I'd coordinate with the security team—they may already have a response plan. My network-specific actions:

1. DON'T immediately block or shut down
   - May tip off attacker, trigger destructive action
   - Need to preserve evidence
   - Confirm with incident commander first

2. CAPTURE NETWORK EVIDENCE
   - Start packet capture on the server's switch port (mirror/SPAN)
   - Export NetFlow/sFlow data for the timeframe
   - Save current connection states (netstat output from server if possible)
   - Document the external IP and look it up (whois, threat intel)

3. CONTAIN WHEN APPROVED
   - Apply ACL to block the specific external IP
   - Or: VLAN isolation (move server to quarantine VLAN)
   - Maintain logging to observe attacker response"

PHASE 2: INVESTIGATION

"From the network perspective, I'd analyze:

1. CONNECTION ANALYSIS
   - What protocol? (80/443 might be tunneling, 22 might be SSH exfil)
   - Connection patterns (persistent vs. bursting?)
   - Volume of data transferred

2. HISTORICAL ANALYSIS
   - NetFlow data: Has this server talked to this IP before?
   - Has this IP communicated with any other internal hosts?
   - When did this communication pattern start?

3. LATERAL MOVEMENT CHECK
   - Review firewall logs for this server's internal connections
   - Has it connected to unusual internal resources?
   - Are there authentication logs from this server to other systems?"

PHASE 3: LONGER-TERM ACTIONS

"After the immediate incident:
- Review and tighten egress filtering
- Implement DLP (Data Loss Prevention) if not present
- Consider DNS inspection (exfil via DNS tunneling)
- Review the server's access patterns—should it have internet access?
- Network segmentation review: was this server properly isolated?"
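
For the "capture network evidence" step, this is roughly what preservation looks like from a Linux box attached to the SPAN/mirror session. The interface, the suspect address (a documentation-range placeholder), and the output paths are all assumptions:

```bash
#!/usr/bin/env bash
# Sketch: preserve evidence from a SPAN/mirror port on a Linux capture host.
# Run as root. Interface, suspect IP, and paths are placeholders.
CAP_IF="eth1"                # NIC receiving the mirrored traffic
SUSPECT_IP="203.0.113.50"    # placeholder for the unknown external IP
OUTDIR="/var/evidence/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$OUTDIR"

# Capture only traffic involving the suspect IP; -C rotates files at ~500 MB
# so a long-running exfiltration doesn't fill the disk.
tcpdump -i "$CAP_IF" -s 0 -w "$OUTDIR/exfil.pcap" -C 500 host "$SUSPECT_IP" &

# Record context alongside the capture
whois "$SUSPECT_IP"  > "$OUTDIR/whois.txt"
dig -x "$SUSPECT_IP" > "$OUTDIR/reverse_dns.txt"
date -u              > "$OUTDIR/capture_started_utc.txt"
```

Interviewer Says: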
"We want to segment our network so that the accounting department can't directly access engineering resources, and neither can access the production servers directly. How would you design this?"
# NETWORK SEGMENTATION DESIGN

CORE FIREWALL (central policy enforcement point) connects four zones:

  ACCOUNTING ZONE        VLAN 100   10.10.100.0/24
  - Finance apps
  - Accounting workstations

  ENGINEERING ZONE       VLAN 200   10.10.200.0/24
  - Dev tools
  - Git
  - CI/CD

  SHARED SERVICES ZONE   VLAN 300   10.10.30.0/24
  - AD/DC
  - DNS
  - Email
  - File services

  PRODUCTION ZONE        VLAN 400   10.10.40.0/24
  - App servers
  - Databases
  - API Gateway

FIREWALL RULES (Simplified):

# Default: DENY all inter-zone traffic

# Accounting Zone Rules:
ALLOW Accounting → Shared_Services (TCP 389,636,88,53,445)   # AD/DNS/File
DENY  Accounting → Engineering
DENY  Accounting → Production

# Engineering Zone Rules:
ALLOW Engineering → Shared_Services (TCP 389,636,88,53)      # AD/DNS
ALLOW Engineering → Bastion_Host (TCP 22)                    # SSH to jump box
DENY  Engineering → Production (direct)                      # Must use bastion

# Bastion Host (in DMZ or separate segment):
ALLOW Bastion → Production (TCP 22, with session logging)
ALLOW Bastion ← Engineering (TCP 22)
# All bastion sessions logged, recorded, MFA required

# Shared Services:
ALLOW All_Zones → Shared_Services (DNS, AD auth ports)
# But Shared Services can initiate to zones (e.g., AD replication)
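
The rule set above is written vendor-neutrally; on a Linux-based firewall it might look roughly like the iptables sketch below. The subnets match the zone plan above, and the bastion address is a hypothetical host in the shared-services segment:

```bash
#!/usr/bin/env bash
# Sketch: the zone policy expressed as Linux iptables FORWARD rules.
# Subnets match the segmentation plan; the bastion address is hypothetical.
ACCT="10.10.100.0/24"    # Accounting, VLAN 100
ENG="10.10.200.0/24"     # Engineering, VLAN 200
SHARED="10.10.30.0/24"   # Shared services, VLAN 300
PROD="10.10.40.0/24"     # Production, VLAN 400
BASTION="10.10.30.10"    # hypothetical jump host in shared services

# Allow return traffic for established sessions
iptables -A FORWARD -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT

# Accounting -> shared services only (AD, DNS, file sharing)
iptables -A FORWARD -s "$ACCT" -d "$SHARED" -p tcp -m multiport --dports 389,636,88,53,445 -j ACCEPT

# Engineering -> shared services (AD, DNS) and SSH to the bastion only
iptables -A FORWARD -s "$ENG" -d "$SHARED" -p tcp -m multiport --dports 389,636,88,53 -j ACCEPT
iptables -A FORWARD -s "$ENG" -d "$BASTION" -p tcp --dport 22 -j ACCEPT

# Bastion -> production over SSH (session recording handled on the bastion)
iptables -A FORWARD -s "$BASTION" -d "$PROD" -p tcp --dport 22 -j ACCEPT

# Default: log and deny everything else between zones
iptables -A FORWARD -j LOG --log-prefix "INTERZONE-DENY: "
iptables -A FORWARD -j DROP
```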
Performance scenarios test your understanding of throughput, latency, and optimization techniques. They often involve quantitative analysis.

Interviewer Says:
"We have a 100 Mbps WAN link to our disaster recovery site 1000 miles away. Users are complaining that file transfers are very slow, but monitoring shows the link is only 10% utilized. What's happening?"
• Bandwidth-Delay Product (BDP) understanding
• TCP window size limitations
• WAN optimization techniques
• Ability to diagnose non-obvious performance issues
# WAN PERFORMANCE ANALYSIS

THE PROBLEM: LOW LINK UTILIZATION WITH SLOW TRANSFERS

This is a classic Bandwidth-Delay Product (BDP) problem.

STEP 1: CALCULATE THE PHYSICS

Link: 100 Mbps
Distance: ~1000 miles
Estimated RTT: ~30-40ms (speed of light + router delays)
Let's use 40ms RTT

BDP = Bandwidth × RTT
    = 100,000,000 bits/sec × 0.040 sec
    = 4,000,000 bits
    = 500,000 bytes
    = 500 KB

This means: To fully utilize the link, we need 500 KB of data "in flight" (sent but not yet acknowledged) at all times.

STEP 2: CHECK TCP WINDOW SIZE

Default TCP receive window: 64 KB (without window scaling)

Maximum throughput = Window Size / RTT
                   = 64,000 bytes / 0.040 sec
                   = 1,600,000 bytes/sec
                   = 12.8 Mbps

This explains 10-15% link utilization with 100 Mbps available!

STEP 3: VERIFY WITH PACKET CAPTURE

"I'd capture packets during a file transfer and check:
- Are window scale options being negotiated?
- What's the actual advertised window size?
- Is the receiver advertising zero window? (receiver can't keep up)
- Are there retransmissions? (causing timeouts, window reduction)"

STEP 4: SOLUTIONS

1. ENABLE WINDOW SCALING (OS tuning)

   Linux:
     sysctl -w net.ipv4.tcp_window_scaling=1
     sysctl -w net.core.rmem_max=16777216
     sysctl -w net.core.wmem_max=16777216

   Windows:
     netsh int tcp set global autotuninglevel=normal

2. WAN OPTIMIZATION APPLIANCES
   - Data deduplication (only send unique data blocks)
   - Protocol spoofing (local ACKs, eliminates RTT impact)
   - Compression (reduce data volume)
   Example: Riverbed, Silver Peak/Aruba, Cisco WAAS

3. APPLICATION-LEVEL SOLUTIONS
   - Parallel transfers (multiple TCP connections)
   - Use UDP-based transfer protocols (Aspera)
   - Pre-positioning data during off-hours

For every 10ms of RTT on a 100 Mbps link, you need ~125 KB of TCP window to fully utilize the bandwidth. If window size is limited, calculate: Max throughput = Window / RTT. This formula explains why satellite links (600ms RTT) and transcontinental links are so challenging for single TCP streams.
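
The same arithmetic generalizes to any link; here is a small sketch that takes bandwidth, RTT, and window size (defaults are the scenario's values) and prints the BDP and the window-limited throughput:

```bash
#!/usr/bin/env bash
# Sketch: bandwidth-delay product and window-limited TCP throughput.
# Usage: ./bdp.sh [bandwidth_mbps] [rtt_ms] [window_kb]
BW_MBPS="${1:-100}"   # link bandwidth (scenario: 100 Mbps)
RTT_MS="${2:-40}"     # round-trip time (scenario: ~40 ms)
WIN_KB="${3:-64}"     # TCP window without window scaling

awk -v bw="$BW_MBPS" -v rtt="$RTT_MS" -v win="$WIN_KB" 'BEGIN {
  bdp_bytes = bw * 1000000 * (rtt / 1000) / 8     # bytes that must be in flight
  max_bps   = (win * 1000 * 8) / (rtt / 1000)     # throughput ceiling set by the window
  printf "BDP: %.0f KB must be in flight to fill the link\n", bdp_bytes / 1000
  printf "A %d KB window caps throughput at %.1f Mbps\n", win, max_bps / 1000000
}'
```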
Interviewer Says:
"Our monitoring shows a 1 Gbps link is averaging 200 Mbps, well below capacity. But applications complain about packet drops, and we see brief interface output drops in switch statistics. What's going on?"
# MICROBURSTING DIAGNOSIS AND SOLUTIONS

DIAGNOSIS

1. Check interface counters for drops
   show interface gi0/1 | include output drops

2. Use sub-second monitoring if available
   - Some switches support 1-second interface statistics
   - Streaming telemetry can capture bursts

3. Identify traffic patterns
   - NetFlow with short active timeouts
   - Packet capture with timestamps

SOLUTIONS

1. INCREASE BUFFER (if possible)
   - Some switches allow buffer allocation per port
   - Trade-off: More buffer = more latency

2. TRAFFIC SHAPING
   - Shape outbound traffic to smooth bursts
   - Example: Cisco MQC shaping config

   policy-map SHAPER
    class class-default
     shape average 800000000   # Shape to 800 Mbps

3. UPGRADE LINK SPEED
   - 10 Gbps link can absorb 1 Gbps bursts
   - Buffer provides more "time worth" at higher speed

4. SPREAD THE LOAD
   - Stagger application timers
   - Use multiple egress paths (ECMP)
   - Randomize batch job start times

5. QoS PRIORITIZATION
   - Prioritize latency-sensitive traffic
   - Let burst-tolerant traffic absorb drops

   policy-map QOS-POLICY
    class VOICE
     priority percent 20
    class BUSINESS-CRITICAL
     bandwidth percent 30
     random-detect
    class class-default
     fair-queue
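
Sub-second counters on the switch itself depend on the platform, but the averaging problem is easy to demonstrate from a Linux host on the same link: sampling byte and drop counters every 100 ms exposes bursts that a 5-minute average flattens out. A minimal sketch (the interface name is an assumption):

```bash
#!/usr/bin/env bash
# Sketch: sample NIC counters every 100 ms to expose microbursts that
# 5-minute utilization averages hide. Interface name is an assumption.
IFACE="${1:-eth0}"
STATS="/sys/class/net/$IFACE/statistics"

prev_bytes=$(cat "$STATS/tx_bytes")
prev_drops=$(cat "$STATS/tx_dropped")

while true; do
  sleep 0.1
  bytes=$(cat "$STATS/tx_bytes")
  drops=$(cat "$STATS/tx_dropped")
  # Mbps over this 100 ms window = delta_bytes * 8 / 0.1 s / 1e6
  awk -v d=$((bytes - prev_bytes)) -v dr=$((drops - prev_drops)) \
      'BEGIN { printf "%8.1f Mbps   drops: +%d\n", d * 8 / 0.1 / 1000000, dr }'
  prev_bytes=$bytes
  prev_drops=$drops
done
```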
Cloud networking questions are increasingly common as organizations adopt hybrid architectures. These scenarios test understanding of cloud networking constructs and their mapping to traditional concepts.

Interviewer Says:
"We're migrating our web application to AWS. It needs to access our on-premises database during migration, and we want to keep the database on-prem permanently for compliance. How would you design the connectivity?"
# HYBRID CLOUD CONNECTIVITY DESIGN

REQUIREMENTS ANALYSIS
- Web app in AWS needs to access the on-prem database
- Database must stay on-prem (compliance)
- Need secure, reliable, low-latency connection
- Migration phase + long-term steady state

CONNECTIVITY OPTIONS

OPTION 1: AWS Site-to-Site VPN
  Pros:
  - Fast to deploy (~1 hour)
  - Low cost ($0.05/hr/tunnel)
  - Redundant tunnels available
  - Encrypted by default
  Cons:
  - Shared internet path (variable latency, ~20-50ms typically)
  - Max 1.25 Gbps per tunnel
  - Internet dependency
  Best for: Proof of concept, dev/test, lower-bandwidth production

OPTION 2: AWS Direct Connect
  Pros:
  - Dedicated bandwidth (1/10/100 Gbps)
  - Consistent latency
  - Lower data transfer costs
  Cons:
  - 2-4 week provisioning time
  - Monthly commitment + port fees
  - Requires cross-connect at colocation
  - Single path (add redundancy extra)
  Best for: Production, high-bandwidth, latency-sensitive, cost at scale

RECOMMENDED ARCHITECTURE

  On-Premises:
  - Database Servers
  - On-Prem Firewall (terminates the backup VPN)
  - Router to the Direct Connect (DX) location

  AWS Cloud (VPC):
  - Transit Gateway (central hub)
  - VPN Gateway (backup path)
  - Private Subnet: Web App on ECS (Fargate)

  Connectivity:
  - Primary: Direct Connect, 1 Gbps (on-prem DX router ◄──► Transit Gateway)
  - Backup:  Site-to-Site VPN (on-prem firewall ◄──► VPN Gateway)

KEY DESIGN DECISIONS

1. Use Direct Connect (1 Gbps) as primary for production
   - Low latency for database queries
   - Predictable performance

2. VPN as backup (automatic failover via BGP)
   - Covers Direct Connect maintenance windows
   - Faster to provision initially

3. Transit Gateway as hub
   - Future-proofs for additional VPCs
   - Centralized routing and security

4. Private subnet for the application
   - No direct internet access from the app tier
   - Outbound via NAT Gateway if needed

5. Security controls
   - Security Groups: Allow only the DB port (3306/5432) from the app subnet
   - On-prem firewall: Allow only from known AWS CIDR ranges
   - Encryption: Consider database connection TLS in addition to DX/MACsec

Know the cloud equivalents: VPC = Virtual LAN/Network, Subnet = VLAN segment, Security Group = Stateful host firewall, NACL = Stateless subnet ACL, Internet Gateway = Edge router to internet, NAT Gateway = PAT for private subnets, Transit Gateway = WAN hub for VPC interconnection, Peering = Direct VPC-to-VPC link.
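
After the link is in place, it's worth measuring the latency the application actually sees to the on-prem database, both on the Direct Connect path and after forcing a failover to the VPN. A minimal sketch run from an instance in the private subnet; the database address and port are placeholders:

```bash
#!/usr/bin/env bash
# Sketch: measure TCP connect latency from an AWS instance to the on-prem
# database. Host and port are placeholders; run before and after a
# Direct Connect -> VPN failover test to compare the two paths.
DB_HOST="10.20.0.15"   # placeholder on-prem database address
DB_PORT="5432"         # PostgreSQL here; 3306 for MySQL

for i in $(seq 1 10); do
  start=$(date +%s%N)
  # Opening /dev/tcp/<host>/<port> makes bash attempt a TCP connection
  if timeout 3 bash -c "exec 3<>/dev/tcp/$DB_HOST/$DB_PORT" 2>/dev/null; then
    end=$(date +%s%N)
    echo "connect $i: $(( (end - start) / 1000000 )) ms"
  else
    echo "connect $i: FAILED or timed out"
  fi
  sleep 1
done
```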
Beyond specific technical knowledge, certain approaches consistently help in scenario-based questions. These meta-strategies make you more effective regardless of the specific scenario.
| Aspect | Junior Response | Senior Response | Principal Response |
|---|---|---|---|
| Problem Framing | Accepts problem as stated | Asks clarifying questions | Identifies unstated assumptions and constraints |
| Solution Approach | Provides a single solution | Compares 2-3 options with trade-offs | Considers organizational/political factors too |
| Technical Depth | High-level, conceptual | Specific commands, configs | Design patterns, architecture implications |
| Risk Awareness | Focuses on solving problem | Notes potential risks of solution | Proposes mitigation strategies proactively |
| Communication | Technical details only | Technical + business impact | Tailored to audience, executive-ready summary |
Many scenarios end with: 'What would you do next?' If you've covered immediate troubleshooting, appropriate responses include: (1) Documentation—'Document the incident for future reference and post-mortem.' (2) Monitoring—'Set up alerting to catch this earlier next time.' (3) Prevention—'Review why this wasn't caught in testing and improve our processes.' (4) Knowledge sharing—'Share findings with the team so others learn from this.'
We've covered a range of practical scenarios that represent real interview challenges. The key is not memorizing specific answers but developing a systematic approach that works for any scenario.
You now have practical experience with common interview scenarios. The final page covers Career Guidance—how to navigate the network engineering career path, target appropriate roles, and continue your professional development after landing the job.