Feature flags create a testing challenge that many teams underestimate. Each boolean flag doubles the number of possible code paths: with 5 independent flags you have 2⁵ = 32 possible states; with 10 flags, 2¹⁰ = 1,024. Testing every combination quickly becomes infeasible, but testing none of them is negligent.
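To make the arithmetic concrete, here is a tiny sketch (plain TypeScript, nothing project-specific) contrasting exhaustive state coverage with per-flag isolation testing, which grows linearly instead of exponentially:

```typescript
// Number of distinct flag states for n independent boolean flags: 2^n.
function exhaustiveStates(flagCount: number): number {
  return 2 ** flagCount;
}

// Testing each flag's two variants in isolation needs only 2n test groups.
function isolatedVariantTests(flagCount: number): number {
  return 2 * flagCount;
}

console.log(exhaustiveStates(5));      // 32
console.log(exhaustiveStates(10));     // 1024
console.log(isolatedVariantTests(10)); // 20 test groups instead of 1024 states
```

The gap only widens: at 20 flags, exhaustive coverage would need over a million states while isolation testing needs 40 test groups.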
The question isn't whether to test flagged code, but how to test it intelligently. This page covers strategies that maximize test coverage while keeping the test suite maintainable. We'll explore how to test both flag variants, mock flag state effectively, verify flag removal safety, and design test infrastructure that scales with your flag usage.
By the end of this page, you will understand how to structure tests for feature-flagged code, mocking strategies for flag state, testing strategies for multi-flag scenarios, integration testing approaches, and techniques for safely verifying flag removal.
The core principle of testing flagged code is simple: test both variants, but not all combinations.
For each flag, create tests that verify:

- correct behavior when the flag is OFF (the existing path), and
- correct behavior when the flag is ON (the new path).

You do NOT need to test every combination of multiple flags. Instead, test each flag's behavior in isolation, then use integration tests to verify the most critical multi-flag scenarios.
```typescript
// Basic pattern: Test both variants of a feature flag

describe("CheckoutService", () => {
  let checkoutService: CheckoutService;
  let mockFeatureFlags: MockFeatureFlagService;

  beforeEach(() => {
    mockFeatureFlags = new MockFeatureFlagService();
    checkoutService = new CheckoutService(mockFeatureFlags, ...otherDeps);
  });

  describe("when new-checkout-v2 flag is OFF", () => {
    beforeEach(() => {
      mockFeatureFlags.setFlag("new-checkout-v2", false);
    });

    it("should use legacy payment processor", async () => {
      const result = await checkoutService.processPayment(order);
      expect(result.processor).toBe("stripe-v1");
    });

    it("should calculate tax using legacy algorithm", async () => {
      const result = await checkoutService.calculateTotal(order);
      expect(result.taxCalculation).toBe("legacy");
    });

    it("should render legacy checkout UI template", async () => {
      const html = await checkoutService.renderCheckout(user);
      expect(html).toContain("checkout-legacy-form");
    });
  });

  describe("when new-checkout-v2 flag is ON", () => {
    beforeEach(() => {
      mockFeatureFlags.setFlag("new-checkout-v2", true);
    });

    it("should use new payment processor", async () => {
      const result = await checkoutService.processPayment(order);
      expect(result.processor).toBe("stripe-v2");
    });

    it("should calculate tax using new algorithm", async () => {
      const result = await checkoutService.calculateTotal(order);
      expect(result.taxCalculation).toBe("tax-service-v2");
    });

    it("should render new checkout UI template", async () => {
      const html = await checkoutService.renderCheckout(user);
      expect(html).toContain("checkout-v2-form");
    });
  });
});
```

Grouping tests by flag state (as shown above) makes it immediately clear what behavior is expected for each variant. When the flag is removed, you can delete the entire "flag OFF" describe block and know exactly which tests to update.
Effective testing requires easy, reliable ways to control flag state. Never call the real flag service during unit tests—always use mocks.
Create a mock implementation of your feature flag abstraction:
```typescript
// Comprehensive mock feature flag service for testing

export class MockFeatureFlagService implements FeatureFlagService {
  private flags = new Map<string, any>();
  private evaluationLog: FlagEvaluation[] = [];
  private contextualFlags = new Map<
    string,
    { matcher: (ctx: EvaluationContext) => boolean; value: any }
  >();

  // Simple flag setting
  setFlag(flagKey: string, value: any): void {
    this.flags.set(flagKey, value);
  }

  // Set multiple flags at once
  setFlags(flags: Record<string, any>): void {
    for (const [key, value] of Object.entries(flags)) {
      this.flags.set(key, value);
    }
  }

  // Contextual flag setting (different values for different users)
  setFlagForContext(
    flagKey: string,
    contextMatcher: (ctx: EvaluationContext) => boolean,
    value: any
  ): void {
    // Store the matcher for evaluation time
    this.contextualFlags.set(flagKey, { matcher: contextMatcher, value });
  }

  // Core evaluation
  isEnabled(flagKey: string, context?: EvaluationContext): boolean {
    const result = this.getValue(flagKey, context, false);

    // Log evaluation for assertions
    this.evaluationLog.push({
      flagKey,
      context,
      returnedValue: result,
      timestamp: new Date(),
    });

    return result;
  }

  getString(flagKey: string, context?: EvaluationContext, defaultValue = ""): string {
    return this.getValue(flagKey, context, defaultValue);
  }

  getNumber(flagKey: string, context?: EvaluationContext, defaultValue = 0): number {
    return this.getValue(flagKey, context, defaultValue);
  }

  getJSON<T>(flagKey: string, context?: EvaluationContext, defaultValue?: T): T {
    return this.getValue(flagKey, context, defaultValue);
  }

  private getValue<T>(flagKey: string, context?: EvaluationContext, defaultValue?: T): T {
    // Check contextual flags first
    const contextual = this.contextualFlags.get(flagKey);
    if (contextual && context && contextual.matcher(context)) {
      return contextual.value;
    }

    // Fall back to simple flags
    if (this.flags.has(flagKey)) {
      return this.flags.get(flagKey);
    }

    // Return default
    return defaultValue as T;
  }

  // Test utilities
  reset(): void {
    this.flags.clear();
    this.contextualFlags.clear();
    this.evaluationLog = [];
  }

  getEvaluationLog(): FlagEvaluation[] {
    return [...this.evaluationLog];
  }

  assertFlagWasEvaluated(flagKey: string): void {
    const found = this.evaluationLog.some(e => e.flagKey === flagKey);
    if (!found) {
      throw new Error(`Expected flag "${flagKey}" to be evaluated, but it was not`);
    }
  }

  assertFlagWasNotEvaluated(flagKey: string): void {
    const found = this.evaluationLog.some(e => e.flagKey === flagKey);
    if (found) {
      throw new Error(`Expected flag "${flagKey}" to NOT be evaluated, but it was`);
    }
  }
}

interface FlagEvaluation {
  flagKey: string;
  context?: EvaluationContext;
  returnedValue: any;
  timestamp: Date;
}
```

Create helper functions that make setting up flag states ergonomic:
```typescript
// Test helpers for feature flag testing

// Helper to run a test with specific flag state
export function withFlags(
  flagService: MockFeatureFlagService,
  flags: Record<string, any>,
  testFn: () => Promise<void> | void
): () => Promise<void> {
  return async () => {
    const previousFlags = new Map(flagService["flags"]);
    try {
      flagService.setFlags(flags);
      await testFn();
    } finally {
      flagService["flags"] = previousFlags;
    }
  };
}

// Usage
it(
  "should handle premium features when enabled",
  withFlags(mockFlags, { "premium-features": true }, async () => {
    const result = await service.getPricingTier(user);
    expect(result).toBe("premium");
  })
);

// Parameterized testing for both flag states
export function testBothFlagStates(
  description: string,
  flagKey: string,
  testCases: {
    whenOff: (flagService: MockFeatureFlagService) => void;
    whenOn: (flagService: MockFeatureFlagService) => void;
  }
): void {
  describe(description, () => {
    describe(`when ${flagKey} is OFF`, () => {
      beforeEach(() => {
        mockFeatureFlags.setFlag(flagKey, false);
      });
      testCases.whenOff(mockFeatureFlags);
    });

    describe(`when ${flagKey} is ON`, () => {
      beforeEach(() => {
        mockFeatureFlags.setFlag(flagKey, true);
      });
      testCases.whenOn(mockFeatureFlags);
    });
  });
}

// Usage
testBothFlagStates("payment processing", "stripe-v2-integration", {
  whenOff: (flags) => {
    it("uses legacy Stripe API", async () => { /* ... */ });
    it("handles legacy error codes", async () => { /* ... */ });
  },
  whenOn: (flags) => {
    it("uses new Stripe API", async () => { /* ... */ });
    it("handles new error codes", async () => { /* ... */ });
  },
});
```

When multiple flags interact, testing every combination is impractical. Instead, use strategic sampling:
Only test combinations that represent real production scenarios or known interaction points:
```typescript
// Strategic multi-flag testing

describe("Checkout with multiple flags", () => {
  // Define the critical combinations to test
  const criticalCombinations = [
    // Baseline: all flags off (current production behavior)
    {
      name: "all legacy systems",
      flags: {
        "new-checkout-ui": false,
        "stripe-v2": false,
        "dynamic-pricing": false,
      },
    },
    // Target: all flags on (full new experience)
    {
      name: "all new systems",
      flags: {
        "new-checkout-ui": true,
        "stripe-v2": true,
        "dynamic-pricing": true,
      },
    },
    // Known interaction: new UI with legacy payment
    {
      name: "new UI with legacy payment",
      flags: {
        "new-checkout-ui": true,
        "stripe-v2": false,
        "dynamic-pricing": false,
      },
    },
    // Risky combination identified in design review
    {
      name: "dynamic pricing with legacy payment",
      flags: {
        "new-checkout-ui": false,
        "stripe-v2": false,
        "dynamic-pricing": true,
      },
    },
  ];

  criticalCombinations.forEach(({ name, flags }) => {
    describe(`with ${name}`, () => {
      beforeEach(() => {
        mockFeatureFlags.setFlags(flags);
      });

      it("should complete checkout successfully", async () => {
        const result = await checkoutService.completeCheckout(order);
        expect(result.success).toBe(true);
      });

      it("should calculate correct total", async () => {
        const result = await checkoutService.calculateTotal(order);
        expect(result.total).toBeGreaterThan(0);
      });

      it("should send correct confirmation email", async () => {
        await checkoutService.completeCheckout(order);
        expect(mockEmailService.sentEmails).toHaveLength(1);
      });
    });
  });
});
```

Instead of testing all combinations, you can also use pairwise testing, which covers every pair of flag values. This drastically reduces test cases while catching most interaction bugs:
```typescript
// Pairwise testing for multi-flag scenarios
// With 4 flags and 2 values each, full coverage = 16 tests
// Pairwise coverage = 6 tests (covering all pairs)

const pairwiseCombinations = [
  { flagA: true, flagB: true, flagC: true, flagD: true },
  { flagA: true, flagB: false, flagC: false, flagD: false },
  { flagA: false, flagB: true, flagC: false, flagD: true },
  { flagA: false, flagB: false, flagC: true, flagD: false },
  { flagA: true, flagB: true, flagC: false, flagD: false },
  { flagA: false, flagB: false, flagC: true, flagD: true },
];

// Tools like PICT (Microsoft) or jenny can generate these automatically.
// For complex flag sets, see: https://www.pairwise.org/tools.html

pairwiseCombinations.forEach((flags, index) => {
  describe(`pairwise combination ${index + 1}`, () => {
    beforeEach(() => {
      mockFeatureFlags.setFlags({
        "feature-a": flags.flagA,
        "feature-b": flags.flagB,
        "feature-c": flags.flagC,
        "feature-d": flags.flagD,
      });
    });

    it("should not throw errors during processing", async () => {
      await expect(service.process()).resolves.not.toThrow();
    });

    it("should maintain data consistency", async () => {
      const result = await service.process();
      expect(invariantHolds(result)).toBe(true);
    });
  });
});
```

Resist the temptation to test every flag combination. With 10 flags, that's over 1,000 test cases. Instead, test: (1) the all-off state, (2) the all-on state, (3) each flag individually, (4) known risky combinations, and (5) pairwise coverage if needed. This provides 80% of the value with 5% of the tests.
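If you hand-maintain a pairwise table like the one above, it is worth asserting in a test that it really does cover every pair. A small checker might look like this (a sketch, not tied to any particular tool; `Combo` and the flag names are illustrative):

```typescript
type Combo = Record<string, boolean>;

// Returns true if every (flagX = a, flagY = b) pair, for a, b in {true, false},
// appears in at least one combination.
function coversAllPairs(combos: Combo[], flagNames: string[]): boolean {
  for (let i = 0; i < flagNames.length; i++) {
    for (let j = i + 1; j < flagNames.length; j++) {
      for (const a of [true, false]) {
        for (const b of [true, false]) {
          const hit = combos.some(
            c => c[flagNames[i]] === a && c[flagNames[j]] === b
          );
          if (!hit) return false; // this pair is never exercised
        }
      }
    }
  }
  return true;
}

const combos: Combo[] = [
  { flagA: true, flagB: true, flagC: true, flagD: true },
  { flagA: true, flagB: false, flagC: false, flagD: false },
  { flagA: false, flagB: true, flagC: false, flagD: true },
  { flagA: false, flagB: false, flagC: true, flagD: false },
  { flagA: true, flagB: true, flagC: false, flagD: false },
  { flagA: false, flagB: false, flagC: true, flagD: true },
];

console.log(coversAllPairs(combos, ["flagA", "flagB", "flagC", "flagD"])); // true
```

Run this as a regular unit test so that adding a fifth flag (or deleting a row) fails loudly instead of silently shrinking coverage.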
Integration tests verify that flag-controlled behavior works correctly in end-to-end scenarios. The challenge is controlling flag state in environments with real flag services.
Most flag platforms support test-specific overrides:
```typescript
// Integration test with flag overrides

describe("Checkout Integration Tests", () => {
  let app: TestApp;

  beforeAll(async () => {
    app = await createTestApp({
      // Connect to test flag environment
      flagEnvironment: "integration-test",
    });
  });

  describe("new checkout flow", () => {
    beforeEach(async () => {
      // Override flag for this test session
      await app.flagService.overrideFlag({
        flagKey: "new-checkout-v2",
        value: true,
        scope: "test-session",
        sessionId: currentTestSessionId,
      });
    });

    afterEach(async () => {
      // Clear overrides
      await app.flagService.clearOverrides(currentTestSessionId);
    });

    it("should complete checkout through new flow", async () => {
      const response = await app.request()
        .post("/api/checkout")
        .set("X-Test-Session-Id", currentTestSessionId)
        .send(validOrder);

      expect(response.status).toBe(200);
      expect(response.body.checkoutVersion).toBe("v2");
    });
  });
});

// Alternative: Header-based flag overrides for E2E tests
describe("E2E Checkout Tests", () => {
  it("should use new checkout when override header is set", async () => {
    const response = await fetch("/checkout", {
      headers: {
        // Special header recognized by flag service
        "X-Feature-Override": "new-checkout-v2=true",
      },
    });

    const html = await response.text();
    expect(html).toContain("checkout-v2-container");
  });
});
```

Use dedicated test users that are targeted by flags:
```typescript
// Test users with specific flag targeting

// In flag configuration
const newCheckoutFlag = {
  key: "new-checkout-v2",
  rules: [
    // Integration test users always get new checkout
    {
      type: "attribute",
      attribute: "email",
      operator: "matches",
      value: ".*@test\\.internal\\.company\\.com$",
      return: true,
    },
    // Production users get gradual rollout
    {
      type: "percentage",
      percentage: 25,
    },
  ],
};

// In integration tests
describe("Checkout Integration with Test Users", () => {
  const testUser = {
    email: "checkout-v2-tester@test.internal.company.com",
    id: "test-user-123",
  };

  it("should always get new checkout for test user", async () => {
    const response = await api
      .post("/api/checkout")
      .authenticate(testUser)
      .send(validOrder);

    expect(response.body.checkoutVersion).toBe("v2");
  });
});

// Dedicated test user pools per flag variant
const testUserPools = {
  "new-checkout-v2": {
    enabled: [
      "checkout-v2-enabled-1@test.internal.company.com",
      "checkout-v2-enabled-2@test.internal.company.com",
    ],
    disabled: [
      "checkout-v2-disabled-1@test.internal.company.com",
      "checkout-v2-disabled-2@test.internal.company.com",
    ],
  },
};
```

Integration tests can interfere with each other if they modify shared flag state. Use test-session-scoped overrides, dedicated test users, or reset flag state between tests. Flaky tests are often caused by leaked flag state.
The most dangerous moment in a flag's lifecycle is removal. The code changes, dead code paths are deleted, and there's always the chance something was depending on that "dead" code. Here's how to test flag removal safely.
Before removing the flag from code, verify it's safe:
```typescript
// Pre-removal verification checklist (automated)

interface FlagRemovalSafetyCheck {
  flagKey: string;
  checks: SafetyCheck[];
}

interface SafetyCheck {
  name: string;
  check: (flagKey: string) => Promise<{ passed: boolean; message: string }>;
}

const safetyChecks: SafetyCheck[] = [
  {
    name: "Flag always returns expected value",
    check: async (flagKey: string) => {
      const stats = await analytics.getFlagStats(flagKey, {
        timeRange: days(30),
      });

      // Should always return true if we're removing the "off" path
      const expectedValue = true;
      const alwaysReturnsExpected = stats.returnedValues.every(
        v => v.value === expectedValue
      );

      return {
        passed: alwaysReturnsExpected,
        message: alwaysReturnsExpected
          ? "Flag has returned consistent value for 30 days"
          : `Flag returned unexpected values: ${JSON.stringify(stats.returnedValues)}`,
      };
    },
  },
  {
    name: "No error rate increase when enabled",
    check: async (flagKey: string) => {
      const metrics = await observability.getErrorMetrics({
        flagKey,
        compareEnabledVsDisabled: true,
        timeRange: days(14),
      });

      const errorRateIncrease = metrics.enabledErrorRate - metrics.disabledErrorRate;
      const threshold = 0.01; // 1% increase threshold

      return {
        passed: errorRateIncrease < threshold,
        message: errorRateIncrease < threshold
          ? "Error rate is stable when flag is enabled"
          : `Error rate increased by ${(errorRateIncrease * 100).toFixed(2)}% when enabled`,
      };
    },
  },
  {
    name: "All tests pass with flag hardcoded",
    check: async (flagKey: string) => {
      // Run test suite with flag forced to target value
      const testResult = await ci.runTestsWithFlagOverride({
        [flagKey]: true, // Hardcode to the value we're keeping
      });

      return {
        passed: testResult.failed === 0,
        message: testResult.failed === 0
          ? "All tests pass with flag hardcoded to target value"
          : `${testResult.failed} tests failed`,
      };
    },
  },
  {
    name: "No code references to disabled path",
    check: async (flagKey: string) => {
      const references = await codeSearch.findFlagReferences(flagKey);
      const disabledPathRefs = references.filter(
        ref => ref.isDisabledPath
      );

      return {
        passed: disabledPathRefs.length === 0,
        message: disabledPathRefs.length === 0
          ? "No active code depends on disabled path"
          : `Found ${disabledPathRefs.length} references that may depend on disabled path`,
      };
    },
  },
];

// Run before creating removal PR
async function verifyFlagRemovalSafety(flagKey: string): Promise<SafetyReport> {
  const results = [];

  for (const check of safetyChecks) {
    const result = await check.check(flagKey);
    results.push({ name: check.name, ...result });
  }

  const allPassed = results.every(r => r.passed);

  return {
    flagKey,
    safe: allPassed,
    checks: results,
    recommendation: allPassed
      ? "Safe to remove"
      : "Address failed checks before removal",
  };
}
```

Remove the flag from a small percentage of traffic first, then expand:
```typescript
// Canary flag removal process

/**
 * Instead of removing the flag entirely, first hardcode
 * the value for a percentage of traffic
 */

// Step 1: Modify flag to hardcode for 5% of users
const checkoutFlag = {
  key: "new-checkout-v2",
  rules: [
    // 5% "canary" - behave as if flag is removed (always true)
    {
      type: "percentage",
      percentage: 5,
      return: true,
      variant: "canary-removal",
    },
    // Remaining 95% - normal flag evaluation
    {
      type: "percentage",
      percentage: 100,
      return: featureFlags.isEnabled("new-checkout-v2"),
      variant: "normal",
    },
  ],
};

// Step 2: Monitor canary group for issues
async function monitorCanaryRemoval(flagKey: string): Promise<void> {
  const metrics = await observability.compareVariants({
    flagKey,
    variants: ["canary-removal", "normal"],
    metrics: ["error_rate", "latency_p99", "conversion_rate"],
    alertThreshold: 0.05, // 5% degradation triggers alert
  });

  if (metrics.degradation > 0.05) {
    await alerting.sendCritical({
      message: `Flag removal canary showing degradation: ${flagKey}`,
      details: metrics,
    });
  }
}

// Step 3: Gradually increase canary percentage
// 5% → 25% → 50% → 100% → remove code

// Step 4: Only after 100% canary is stable, remove the flag code
```

Ironically, you can use a meta-flag to control flag removal. Create a 'remove-checkout-v2-flag' flag that, when enabled, causes the checkout-v2 flag to behave as if it's always enabled. This gives you instant rollback capability during the removal process.
Beyond testing behavior, you need to verify that the right users get the right flag values. Targeting rule bugs can expose features to wrong audiences or exclude intended users.
```typescript
// Testing flag targeting logic

describe("new-checkout-v2 targeting rules", () => {
  const flagKey = "new-checkout-v2";

  describe("internal user targeting", () => {
    it("should enable for internal users", () => {
      const result = featureFlags.isEnabled(flagKey, {
        userId: "user-123",
        email: "engineer@company.com",
      });
      expect(result).toBe(true);
    });

    it("should not auto-enable for external users with similar domains", () => {
      const result = featureFlags.isEnabled(flagKey, {
        userId: "user-456",
        email: "hacker@malicious-company.com",
      });
      // Should fall through to percentage targeting
      expect(result).toBeDefined();
    });
  });

  describe("percentage rollout", () => {
    it("should be deterministic for same user", () => {
      const context = { userId: "consistent-user-123" };

      // Run multiple times - should always return same value
      const results = Array(100).fill(null).map(() =>
        featureFlags.isEnabled(flagKey, context)
      );

      const uniqueResults = [...new Set(results)];
      expect(uniqueResults).toHaveLength(1); // All same value
    });

    it("should match expected rollout percentage approximately", () => {
      // Generate many unique users and check distribution
      const sampleSize = 10000;
      const users = Array(sampleSize).fill(null).map((_, i) => ({
        userId: `test-user-${i}`,
      }));

      const enabledCount = users.filter(user =>
        featureFlags.isEnabled(flagKey, user)
      ).length;

      const actualPercentage = enabledCount / sampleSize;
      const expectedPercentage = 0.25; // 25% rollout
      const tolerance = 0.03; // ±3%

      expect(actualPercentage).toBeGreaterThan(expectedPercentage - tolerance);
      expect(actualPercentage).toBeLessThan(expectedPercentage + tolerance);
    });
  });

  describe("segment targeting", () => {
    it("should enable for premium tier users", () => {
      const result = featureFlags.isEnabled(flagKey, {
        userId: "premium-user",
        subscriptionTier: "premium",
      });
      expect(result).toBe(true);
    });

    it("should not enable for free tier users (outside rollout)", () => {
      // Use user ID known to be outside rollout percentage
      const result = featureFlags.isEnabled(flagKey, {
        userId: "free-user-outside-rollout",
        subscriptionTier: "free",
      });
      expect(result).toBe(false);
    });
  });

  describe("geographic targeting", () => {
    it("should enable for US users", () => {
      const result = featureFlags.isEnabled(flagKey, {
        userId: "us-user",
        country: "US",
      });
      expect(result).toBe(true);
    });

    it("should not enable for EU users yet", () => {
      const result = featureFlags.isEnabled(flagKey, {
        userId: "eu-user",
        country: "DE",
      });
      expect(result).toBe(false);
    });
  });
});
```

Targeting rule tests are distinct from behavior tests. Behavior tests verify what happens when the flag is on or off; targeting tests verify who gets which value. Both are necessary for complete coverage.
As flag usage grows, invest in test infrastructure that makes flag testing frictionless:
```typescript
// Comprehensive flag test infrastructure

// 1. Jest/Vitest setup with automatic flag mocking
// jest.setup.ts
import { mockFeatureFlagService } from "./test-utils/mock-flags";

beforeEach(() => {
  // Reset flags before each test
  mockFeatureFlagService.reset();

  // Fail the test if a flag is evaluated without explicit test state
  jest.spyOn(console, "warn").mockImplementation((msg) => {
    if (msg.includes("Flag evaluated without explicit test state")) {
      throw new Error(msg);
    }
  });
});

// 2. Flag coverage tracking
class FlagCoverageTracker {
  private covered = new Map<string, Set<any>>();

  recordEvaluation(flagKey: string, returnedValue: any): void {
    if (!this.covered.has(flagKey)) {
      this.covered.set(flagKey, new Set());
    }
    this.covered.get(flagKey)!.add(returnedValue);
  }

  getReport(): FlagCoverageReport {
    const registeredFlags = flagRegistry.getAllFlags();

    return registeredFlags.map(flag => {
      const testedValues = this.covered.get(flag.key) || new Set();
      const expectedValues = flag.possibleValues;

      return {
        flagKey: flag.key,
        testedValues: [...testedValues],
        expectedValues,
        coverage: testedValues.size / expectedValues.length,
        missing: expectedValues.filter(v => !testedValues.has(v)),
      };
    });
  }
}

// 3. CI configuration for matrix testing (.github/workflows/test.yml):
//
// name: Tests with Flag Matrix
// jobs:
//   test:
//     strategy:
//       matrix:
//         flag-state: [all-off, all-on, production-config]
//     steps:
//       - run: npm test
//         env:
//           FLAG_TEST_MODE: ${{ matrix.flag-state }}

// 4. Flag state determinism checking
export function ensureDeterministicFlagState(testFn: () => void): void {
  // Run the test twice and verify flag evaluations are identical
  const run1 = captureEvaluations(testFn);
  const run2 = captureEvaluations(testFn);

  expect(run1).toEqual(run2);
}
```

We've covered comprehensive strategies for testing feature-flagged code: test both variants of every flag, mock flag state rather than calling the real service, sample multi-flag combinations strategically, control flag state in integration tests, and verify removal safety before deleting code.
Module Complete:
You now have a comprehensive understanding of feature flags: what they are, how to design them, how to manage their lifecycle, and how to test them effectively. Feature flags are a powerful tool for modern software delivery, enabling safer deployments, experimentation, and operational control. With the patterns and practices covered in this module, you can adopt flags confidently while avoiding the common pitfalls that lead to technical debt.
Congratulations! You've completed the Feature Flags module. You now understand the complete feature flag lifecycle: from creation and design patterns, through governance and lifecycle management, to comprehensive testing strategies. Apply these practices to enjoy the benefits of flags—progressive rollouts, instant rollback, experimentation—without accumulating crippling technical debt.