The product catalog is the foundation upon which all e-commerce functionality is built. Every search query, every product page view, every recommendation, every price display—all depend on a catalog system that can serve accurate, up-to-date product information at massive scale.
Amazon's catalog contains over 350 million unique products from 2+ million active sellers. Each product has dozens of attributes, multiple variants, high-resolution images, customer reviews, pricing rules, and availability information that varies by region and fulfillment center. The catalog must support:
This page will take you through the complete architecture of a catalog system designed for this scale.
By the end of this page, you will understand how to design a product catalog architecture that separates concerns between primary storage, search, and caching; how to model complex product data with variants and attributes; and how to maintain consistency across a distributed catalog system.
Before diving into architecture, we must understand what we're storing. Product data is surprisingly complex—far more than a simple table of items with names and prices.
The Core Entities:
A product in an e-commerce catalog is actually a hierarchical structure:
```typescript
// Core Product Entity - The "Parent" product
interface Product {
  id: string;                     // Globally unique identifier (ASIN equivalent)
  sellerId: string;               // Merchant/brand who owns the listing

  // Basic Information
  title: string;                  // "Sony WH-1000XM5 Wireless Noise Canceling Headphones"
  brand: string;                  // "Sony"
  manufacturer: string;           // May differ from brand
  description: string;            // HTML-formatted long description
  bulletPoints: string[];         // Key feature highlights

  // Categorization
  categoryPath: string[];         // ["Electronics", "Audio", "Headphones", "Over-Ear"]
  categoryIds: string[];          // Internal category identifiers

  // Attributes (category-specific)
  attributes: Record<string, Attribute>;

  // Search optimization
  keywords: string[];             // Seller-provided search terms

  // State management
  status: 'draft' | 'pending_review' | 'active' | 'suppressed' | 'archived';
  createdAt: Timestamp;
  updatedAt: Timestamp;

  // Relationships
  variants: ProductVariant[];     // Color/size variations

  // Aggregate data (computed)
  averageRating: number;          // 4.7
  reviewCount: number;            // 12,847
  priceRange: PriceRange;         // Min/max across variants
}

// Product Variants - Each purchasable item
interface ProductVariant {
  id: string;                     // SKU-level identifier
  productId: string;              // Parent product reference

  // Variant-defining attributes
  variantAttributes: {
    color?: string;               // "Black", "Silver", "Midnight Blue"
    size?: string;                // "Small", "Medium", "Large"
    configuration?: string;       // "256GB", "512GB"
    style?: string;               // "Standard", "Premium Edition"
  };

  // Pricing (can vary by variant)
  listPrice: Money;               // MSRP
  currentPrice: Money;            // Active selling price
  dealPrice?: Money;              // Special promotion price
  dealEndTime?: Timestamp;        // When deal expires

  // Availability
  inStock: boolean;               // Aggregated stock status
  stockLevel: StockLevel;         // 'in_stock' | 'low_stock' | 'out_of_stock'

  // Media
  images: ProductImage[];         // Variant-specific images

  // Shipping
  dimensions: Dimensions;         // For shipping calculation
  weight: Weight;
  fulfillmentOptions: FulfillmentOption[];
}

// Complex attribute structure supporting rich product data
interface Attribute {
  name: string;                         // "Noise Cancellation Type"
  value: string | string[] | number;    // "Active" or ["Active", "Passive"]
  unit?: string;                        // "hours", "inches", "GB"
  filterable: boolean;                  // Can be used in search facets
  displayOrder: number;                 // UI rendering order
  normalizedValue?: string;             // Standardized for comparison
}
```

This parent-variant model is crucial for UX and data management. When a customer searches for 'Sony headphones', they want to see one result per product, not separate listings for every color. But when they add to cart, they're buying a specific variant. The model must support both views efficiently.
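The two views can be sketched as follows. This is a minimal illustration, not the catalog's actual code: `ProductLite`, `VariantLite`, `SearchCard`, `toSearchCard`, and `resolveVariant` are hypothetical names, the shapes are trimmed down from the interfaces above, prices are made-up values in minor units, and all variants are assumed to share one currency.

```typescript
// Hypothetical, trimmed-down shapes for illustration only.
interface Money { amountMinor: number; currency: string } // minor units, e.g. cents

interface VariantLite {
  id: string;
  productId: string;
  currentPrice: Money;
  inStock: boolean;
}

interface ProductLite {
  id: string;
  title: string;
  variants: VariantLite[];
}

// One row per parent product, as a search results page would show it.
interface SearchCard {
  productId: string;
  title: string;
  minPriceMinor: number;
  maxPriceMinor: number;
  anyInStock: boolean;
  variantCount: number;
}

// Search view: collapse all variants into a single summary row.
function toSearchCard(product: ProductLite): SearchCard {
  if (product.variants.length === 0) {
    throw new Error(`Product ${product.id} has no purchasable variants`);
  }
  const prices = product.variants.map((v) => v.currentPrice.amountMinor);
  return {
    productId: product.id,
    title: product.title,
    minPriceMinor: Math.min(...prices),
    maxPriceMinor: Math.max(...prices),
    anyInStock: product.variants.some((v) => v.inStock),
    variantCount: product.variants.length,
  };
}

// Cart view: the customer always buys one specific variant.
function resolveVariant(product: ProductLite, variantId: string): VariantLite {
  const variant = product.variants.find((v) => v.id === variantId);
  if (!variant) {
    throw new Error(`Variant ${variantId} not found on product ${product.id}`);
  }
  return variant;
}

// Made-up sample data.
const demoHeadphones: ProductLite = {
  id: "P1",
  title: "Sony WH-1000XM5 Wireless Noise Canceling Headphones",
  variants: [
    { id: "P1-black", productId: "P1", currentPrice: { amountMinor: 34800, currency: "USD" }, inStock: false },
    { id: "P1-silver", productId: "P1", currentPrice: { amountMinor: 39800, currency: "USD" }, inStock: true },
  ],
};
```

Calling `toSearchCard(demoHeadphones)` yields one row with a price range and an aggregated stock flag, while `resolveVariant(demoHeadphones, "P1-silver")` returns the exact item that goes into the cart.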
Category-Specific Attributes:
One of the most challenging aspects of catalog design is that different product categories have vastly different attributes. Electronics have specifications (battery life, connectivity); clothing has sizes; furniture has dimensions; food has nutritional information.
This requires a flexible schema approach:
```typescript
// Category defines which attributes are relevant and how they're validated
interface CategorySchema {
  categoryId: string;
  name: string;
  parentId?: string;                // For hierarchy

  // Required attributes for this category
  requiredAttributes: AttributeDefinition[];

  // Optional but recommended attributes
  optionalAttributes: AttributeDefinition[];

  // Validation rules
  validationRules: ValidationRule[];
}

interface AttributeDefinition {
  name: string;
  displayName: string;              // User-facing name
  type: 'string' | 'number' | 'boolean' | 'enum' | 'multi-enum' | 'range';
  enumValues?: string[];            // For enum types
  unit?: string;
  // The flags below are optional so the example compiles; treat them as false when omitted
  filterable?: boolean;             // Show in faceted search
  comparable?: boolean;             // Show in product comparison
  searchable?: boolean;             // Include in full-text search
}

// Shape inferred from the example rule below
interface ValidationRule {
  rule: string;
  condition: string;
  requirement: string;
}

// Example: Electronics > Audio > Headphones category schema
const headphonesCategorySchema: CategorySchema = {
  categoryId: "electronics_audio_headphones",
  name: "Headphones",
  parentId: "electronics_audio",

  requiredAttributes: [
    {
      name: "headphone_type",
      displayName: "Type",
      type: "enum",
      enumValues: ["Over-Ear", "On-Ear", "In-Ear", "Earbuds"],
      filterable: true
    },
    {
      name: "connectivity",
      displayName: "Connectivity",
      type: "multi-enum",
      enumValues: ["Wireless", "Bluetooth", "3.5mm Jack", "USB-C"],
      filterable: true
    },
    {
      name: "noise_cancellation",
      displayName: "Noise Cancellation",
      type: "enum",
      enumValues: ["Active", "Passive", "None"],
      filterable: true
    },
  ],

  optionalAttributes: [
    { name: "battery_life", displayName: "Battery Life", type: "number", unit: "hours", filterable: true, searchable: false },
    { name: "driver_size", displayName: "Driver Size", type: "number", unit: "mm", filterable: true },
    { name: "frequency_response", displayName: "Frequency Response", type: "range", unit: "Hz" },
  ],

  validationRules: [
    {
      rule: "battery_life_required_for_wireless",
      condition: "connectivity includes 'Wireless'",
      requirement: "battery_life is required"
    }
  ]
};
```

A production catalog system uses multiple storage layers optimized for different access patterns. No single database can efficiently handle all catalog operations.
The Three-Layer Storage Pattern:
```
              CATALOG STORAGE ARCHITECTURE

┌──────────────────┐
│    WRITE PATH    │
│                  │
│  Seller Portal   │──────┐
│  Admin Tools     │      │
│  Bulk Import     │      │
└──────────────────┘      │
                          ▼
┌───────────────────────────────────────────┐
│      PRIMARY STORE (Source of Truth)      │
│  ┌────────────────────────────────────┐   │
│  │ PostgreSQL / DynamoDB              │   │
│  │ • All product data with history    │   │
│  │ • Strong consistency on writes     │   │
│  │ • Complex seller queries           │   │
│  │ • ACID transactions for updates    │   │
│  └────────────────────────────────────┘   │
└─────────────────────┬─────────────────────┘
                      │
                      │ Change Data Capture (CDC)
                      │
       ┌──────────────┴──────────────┐
       │                             │
       ▼                             ▼
┌─────────────────────┐   ┌─────────────────────┐
│    SEARCH STORE     │   │     READ CACHE      │
│  ┌───────────────┐  │   │  ┌───────────────┐  │
│  │ Elasticsearch │  │   │  │     Redis     │  │
│  │               │  │   │  │               │  │
│  │• Full-text    │  │   │  │• Hot products │  │
│  │• Faceted nav  │  │   │  │• Sub-ms reads │  │
│  │• Aggregations │  │   │  │• TTL-based    │  │
│  │• Analytics    │  │   │  │• Cache-aside  │  │
│  └───────────────┘  │   │  └───────────────┘  │
└──────────┬──────────┘   └─────────┬───────────┘
           │                        │
           └────────────┬───────────┘
                        │
                        ▼
          ┌────────────────────────────┐
          │      READ PATH (APIs)      │
          │                            │
          │ • Product pages → Cache    │
          │ • Search → Elasticsearch   │
          │ • Browse → Elasticsearch   │
          └────────────────────────────┘
```

Primary Store Choice: PostgreSQL vs DynamoDB
The choice of primary store depends on your specific requirements:
PostgreSQL with JSON columns works well when:
DynamoDB works well when:
For Amazon-scale catalogs, a hybrid approach is common: DynamoDB for the high-throughput primary product data with PostgreSQL for seller analytics and complex reporting.
| Characteristic | Primary Store | Search Store | Cache Layer |
|---|---|---|---|
| Technology | PostgreSQL or DynamoDB | Elasticsearch | Redis Cluster |
| Data Volume | 350M products × 50KB = 17.5TB | 350M docs × 5KB = 1.75TB (indexed) | Hot 10M × 5KB = 50GB |
| Write Pattern | Thousands of updates/second | Async sync from CDC | On-demand populate |
| Read Latency | 5-50ms | 20-100ms | <5ms |
| Consistency | Strong (source of truth) | Eventual (seconds lag) | Eventual (TTL-based) |
| Primary Use | Authoritative data | Search & browse | Product page speed |
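The data-volume row in the table is simple multiplication, and it can help to keep that arithmetic in a small script so the estimates are easy to redo when the assumptions change. The sketch below uses decimal units (1 KB = 1,000 bytes) and the per-item sizes from the table; the helper name `estimateBytes` is made up for this example.

```typescript
// Back-of-envelope sizing using the figures from the table above.
// Decimal units are assumed: 1 KB = 1e3 bytes, 1 GB = 1e9, 1 TB = 1e12.
const BYTES_PER_KB = 1e3;
const BYTES_PER_GB = 1e9;
const BYTES_PER_TB = 1e12;

// Hypothetical helper: total bytes for `itemCount` items of `kbPerItem` KB each.
function estimateBytes(itemCount: number, kbPerItem: number): number {
  return itemCount * kbPerItem * BYTES_PER_KB;
}

const primaryStoreTB = estimateBytes(350e6, 50) / BYTES_PER_TB; // full product records
const searchIndexTB = estimateBytes(350e6, 5) / BYTES_PER_TB;   // trimmed search documents
const hotCacheGB = estimateBytes(10e6, 5) / BYTES_PER_GB;       // hot subset held in Redis

console.log({ primaryStoreTB, searchIndexTB, hotCacheGB });
```

These figures are raw payload sizes only. Replication, indexes, and storage overhead would push the real footprint higher.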
Search is the primary way customers interact with the catalog. A well-designed search system must handle:
Elasticsearch Architecture:
For a catalog of 350M+ products, a single Elasticsearch node isn't sufficient. We need a distributed architecture, with the index split into shards and replicated across the cluster:
```json
{
  "settings": {
    "number_of_shards": 48,
    "number_of_replicas": 2,
    "refresh_interval": "30s",
    "analysis": {
      "analyzer": {
        "product_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "snowball", "synonym_filter"]
        },
        "autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "edge_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15,
          "token_chars": ["letter", "digit"]
        }
      },
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms_path": "synonyms.txt"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product_id": { "type": "keyword" },
      "title": {
        "type": "text",
        "analyzer": "product_analyzer",
        "fields": {
          "autocomplete": {
            "type": "text",
            "analyzer": "autocomplete_analyzer"
          },
          "exact": { "type": "keyword" }
        }
      },
      "description": { "type": "text", "analyzer": "product_analyzer" },
      "brand": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "category_path": { "type": "keyword" },
      "price": { "type": "scaled_float", "scaling_factor": 100 },
      "rating": { "type": "float" },
      "review_count": { "type": "integer" },
      "in_stock": { "type": "boolean" },
      "attributes": { "type": "flattened" },
      "popularity_score": { "type": "float" },
      "search_keywords": { "type": "text", "analyzer": "product_analyzer" },
      "suggestion": {
        "type": "completion",
        "analyzer": "product_analyzer"
      }
    }
  }
}
```

```json
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            {
              "multi_match": {
                "query": "wireless headphones",
                "fields": ["title^3", "brand^2", "description", "search_keywords"],
                "type": "best_fields",
                "fuzziness": "AUTO"
              }
            }
          ],
          "filter": [
            { "term": { "in_stock": true } },
            { "range": { "price": { "lte": 200 } } },
            { "terms": { "attributes.noise_cancellation": ["Active"] } }
          ]
        }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "popularity_score",
            "factor": 1.2,
            "modifier": "log1p"
          }
        },
        {
          "field_value_factor": {
            "field": "rating",
            "factor": 1.5,
            "modifier": "sqrt"
          }
        },
        {
          "filter": { "range": { "review_count": { "gte": 100 } } },
          "weight": 1.3
        }
      ],
      "score_mode": "multiply",
      "boost_mode": "multiply"
    }
  },
  "aggs": {
    "brand_facet": {
      "terms": { "field": "brand.keyword", "size": 20 }
    },
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 50 },
          { "from": 50, "to": 100 },
          { "from": 100, "to": 200 },
          { "from": 200 }
        ]
      }
    },
    "rating_histogram": {
      "histogram": { "field": "rating", "interval": 1 }
    }
  },
  "highlight": {
    "fields": {
      "title": {},
      "description": { "number_of_fragments": 2 }
    }
  },
  "size": 24,
  "from": 0
}
```

The function_score query is the secret to good search relevance. It combines text relevance with business signals (popularity, ratings, review count) to rank results. Tuning these weights is an ongoing process based on A/B testing click-through rates and conversion.
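To build intuition for how those weights interact, the combination can be approximated outside Elasticsearch. The sketch below is an illustration under stated assumptions, not Elasticsearch's implementation. It assumes `field_value_factor` multiplies the field by `factor` before applying the modifier, that `log1p` is a base-10 logarithm of one plus that value, and that a filtered `weight` function only contributes when its filter matches. `RankingSignals` and `approximateFunctionScore` are hypothetical names, and `textScore` stands in for the relevance score of the inner `bool` query.

```typescript
// Hypothetical container for the business signals used by the query above.
interface RankingSignals {
  popularityScore: number;
  rating: number;
  reviewCount: number;
}

// Approximates score_mode=multiply combined with boost_mode=multiply.
function approximateFunctionScore(textScore: number, signals: RankingSignals): number {
  // field_value_factor: popularity_score, factor 1.2, modifier log1p
  const popularityFactor = Math.log10(1 + 1.2 * signals.popularityScore);
  // field_value_factor: rating, factor 1.5, modifier sqrt
  const ratingFactor = Math.sqrt(1.5 * signals.rating);
  // weight 1.3 applies only when the review_count >= 100 filter matches
  const reviewFactor = signals.reviewCount >= 100 ? 1.3 : 1;

  // score_mode multiply: the function values are multiplied together;
  // boost_mode multiply: that product then scales the text score.
  return textScore * popularityFactor * ratingFactor * reviewFactor;
}

const establishedProduct = approximateFunctionScore(2.0, { popularityScore: 50, rating: 4.7, reviewCount: 12847 });
const brandNewProduct = approximateFunctionScore(2.0, { popularityScore: 0, rating: 4.7, reviewCount: 3 });
console.log({ establishedProduct, brandNewProduct });
```

Under these assumptions a product with a popularity score of 0 ends up with a final score of 0, because `log1p` of zero is zero and every factor is multiplied. If that reading holds for your cluster, it is worth checking how newly listed products are ranked before settling on purely multiplicative modes.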
Caching is critical for catalog performance. With 70,000 QPS hitting product pages, even a small cache improvement has massive impact. We employ multiple caching layers:
| Layer | Technology | TTL | Hit Rate Target | Purpose |
|---|---|---|---|---|
| CDN Edge | CloudFront/Fastly | 5 min | 40% | Static assets, popular product pages |
| Application Cache | Redis Cluster | 15 min | 85% | Product data, computed fields |
| Local Cache | In-process LRU | 1 min | 60% | Extremely hot keys, reduce Redis hops |
| Search Cache | Elasticsearch cache | 30 sec | 70% | Common search queries, facet aggregations |
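A quick way to see what these targets mean for the primary store is to chain the miss rates. The sketch below assumes a product page request passes through the CDN, then the local cache, then Redis, and that each hit rate is measured on the traffic that actually reaches that layer. Both the ordering and that interpretation are assumptions for illustration, and `remainingQps` is a made-up helper name.

```typescript
// Hit-rate targets for the product page path, taken from the table above.
// The search cache is left out because it serves a different request path.
const layerHitRates: Array<{ layer: string; hitRate: number }> = [
  { layer: "CDN edge", hitRate: 0.40 },
  { layer: "Local in-process LRU", hitRate: 0.60 },
  { layer: "Redis cluster", hitRate: 0.85 },
];

// Traffic that misses every layer and falls through to the primary store.
function remainingQps(incomingQps: number, hitRates: number[]): number {
  return hitRates.reduce((qps, hitRate) => qps * (1 - hitRate), incomingQps);
}

const primaryStoreQps = remainingQps(70_000, layerHitRates.map((l) => l.hitRate));
console.log(Math.round(primaryStoreQps)); // roughly 2,520 requests per second
```

Under these assumptions only about 3.6% of product page traffic reaches the primary store, and raising the Redis hit rate from 85% to 90% would cut that remainder by a further third.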
```typescript
// Cache key design for product catalog
// (`redis` and `primaryStore` are client instances assumed to be initialized elsewhere)

interface CacheKeyStrategy {
  // Product data - versioned to enable instant invalidation
  productData: (productId: string, version: number) => string;

  // Price data - separate cache with shorter TTL (prices change frequently)
  productPrice: (productId: string, variantId: string, userId?: string) => string;

  // Search results - include normalized query and filters
  searchResults: (queryHash: string, page: number) => string;

  // Category data - hierarchical with parent invalidation
  categoryProducts: (categoryId: string, filters: string, page: number) => string;
}

const cacheKeys: CacheKeyStrategy = {
  productData: (productId, version) =>
    `product:data:${productId}:v${version}`,

  productPrice: (productId, variantId, userId) =>
    userId
      ? `product:price:${productId}:${variantId}:user:${userId}`  // Personalized
      : `product:price:${productId}:${variantId}:default`,        // Standard

  searchResults: (queryHash, page) =>
    `search:results:${queryHash}:page:${page}`,

  categoryProducts: (categoryId, filters, page) =>
    `category:${categoryId}:filters:${filters}:page:${page}`
};

// Cache-aside pattern with version checking
async function getProduct(productId: string): Promise<Product> {
  // Get current version from lightweight metadata
  const currentVersion = await redis.get(`product:version:${productId}`);

  // Try cache with version (explicit radix avoids surprises when parsing)
  const cacheKey = cacheKeys.productData(productId, parseInt(currentVersion || '0', 10));
  const cached = await redis.get(cacheKey);

  if (cached) {
    return JSON.parse(cached);
  }

  // Cache miss - fetch from primary store
  const product = await primaryStore.getProduct(productId);

  // Store in cache with TTL
  await redis.setex(cacheKey, 900, JSON.stringify(product)); // 15 min TTL

  return product;
}

// On product update, just bump the version
async function invalidateProduct(productId: string): Promise<void> {
  await redis.incr(`product:version:${productId}`);
  // Old cached entries will naturally expire
  // New requests get new version number, triggering cache miss
}
```

Cache invalidation is one of the hardest problems in distributed systems. With products in CDN, Redis, Elasticsearch, and local caches simultaneously, ensuring consistency is complex. The version-based approach above provides eventual consistency: bumping the version makes the next Redis read miss and reload from the primary store, while layers that rely on TTL expiry alone can keep serving the old copy until their TTL runs out. In the worst case a product update might take up to 15 minutes to propagate to all layers, but this is acceptable for catalog data.
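The local in-process layer from the caching table is the one that usually relies on a short TTL rather than explicit invalidation. A minimal sketch of such a cache follows, assuming a small fixed capacity and an injectable clock for testing; `TtlLruCache` is a hypothetical class written for this example, not a library API.

```typescript
// Minimal in-process LRU cache with a per-entry TTL.
// A Map iterates in insertion order, so the first key is the least recently used.
class TtlLruCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(maxEntries: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    // Re-insert to mark the key as most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      const oldestKey = this.entries.keys().next().value;
      if (oldestKey !== undefined) this.entries.delete(oldestKey);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

// Usage: a 1-minute TTL, matching the local cache row in the table above.
const hotProducts = new TtlLruCache<string>(10_000, 60_000);
hotProducts.set("product:data:P1:v3", JSON.stringify({ id: "P1" }));
```

Because each application instance holds its own copy, a stale entry here lives at most one TTL, which is why the TTL for this layer is kept much shorter than the Redis one.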
With data distributed across primary store, search index, and cache, keeping them synchronized is a critical architectural concern. We use Change Data Capture (CDC) as the foundation for reliable sync.
```
            CATALOG SYNCHRONIZATION PIPELINE

┌──────────────┐
│  PostgreSQL  │
│  (Primary)   │
└──────┬───────┘
       │ WAL (Write-Ahead Log)
       │
       ▼
┌──────────────┐
│   Debezium   │  Captures all changes as events
│ (CDC Engine) │
└──────┬───────┘
       │
       ▼
┌──────────────────────────────────────────────────────┐
│                     Apache Kafka                     │
│   ┌─────────────────────────────────────────────┐    │
│   │ catalog.products.changes (partitioned by    │    │
│   │ productId for ordering guarantees)          │    │
│   └─────────────────────────────────────────────┘    │
└─────────────────────────┬────────────────────────────┘
                          │
     ┌────────────────────┼────────────────────┐
     │                    │                    │
     ▼                    ▼                    ▼
┌───────────┐       ┌───────────┐       ┌───────────┐
│    ES     │       │   Redis   │       │    CDN    │
│  Indexer  │       │  Updater  │       │  Purger   │
│           │       │           │       │           │
│• Batch    │       │• Increment│       │• API call │
│  indexing │       │  version  │       │  to purge │
│• Bulk API │       │• Update   │       │• Selective│
│• Retry on │       │  price if │       │ invalidate│
│  failure  │       │  changed  │       │           │
└───────────┘       └───────────┘       └───────────┘
```
```typescript
// Kafka consumer processing catalog change events

interface ProductChangeEvent {
  eventId: string;
  timestamp: number;
  operation: 'INSERT' | 'UPDATE' | 'DELETE';
  productId: string;
  before?: Partial<Product>;   // Previous state (for updates)
  after?: Product;             // New state (for inserts/updates)
  changedFields?: string[];    // Which fields changed
}

class CatalogSyncConsumer {
  // The elasticsearch, redis, cdn, metrics, and deadLetterQueue clients are
  // assumed to be injected elsewhere; their wiring is omitted here.

  async processEvent(event: ProductChangeEvent): Promise<void> {
    const { productId, operation, after, changedFields } = event;

    try {
      // Always sync to Elasticsearch (full document)
      if (operation === 'DELETE') {
        await this.elasticsearch.delete('products', productId);
      } else if (after) {
        await this.elasticsearch.index('products', productId,
          this.transformForSearch(after));
      }

      // Increment cache version (lazy invalidation)
      await this.redis.incr(`product:version:${productId}`);

      // If price changed, also update price cache immediately
      if (after && changedFields?.includes('currentPrice')) {
        await this.updatePriceCache(productId, after);
      }

      // If it's a popular product, purge CDN cache
      if (await this.isHighTrafficProduct(productId)) {
        await this.cdn.purge(`/products/${productId}*`);
      }

      // Record sync completion for monitoring
      await this.metrics.recordSync(productId, event.timestamp);

    } catch (error) {
      // Send to dead-letter queue for retry
      await this.deadLetterQueue.send(event, error);
      throw error;
    }
  }

  // Transform product for search-optimized format
  private transformForSearch(product: Product): SearchDocument {
    return {
      product_id: product.id,
      title: product.title,
      description: product.description,
      brand: product.brand,
      category_path: product.categoryPath,
      price: this.getLowestVariantPrice(product),
      rating: product.averageRating,
      review_count: product.reviewCount,
      in_stock: this.hasAnyInStockVariant(product),
      attributes: this.flattenAttributes(product.attributes),
      popularity_score: this.calculatePopularity(product),
      search_keywords: product.keywords.join(' '),
      updated_at: new Date().toISOString()
    };
  }
}
```

Kafka provides ordered message delivery (within partitions) and persistent storage of events. Delivery to consumers is at-least-once by default. Exactly-once semantics are available through idempotent producers and transactions, but those guarantees cover processing inside Kafka and do not extend to external sinks such as Elasticsearch or Redis, so the sync consumers should be idempotent. If the Elasticsearch indexer goes down, it can resume from its last committed offset without losing events. This durability is essential for keeping stores in sync.
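A consumer that resumes from its last committed offset may re-read events it had already applied before the crash, so sync handlers are usually written to tolerate duplicates. One common approach is sketched below under stated assumptions: events for a product arrive in order (which partitioning by `productId` is meant to provide), event timestamps increase per product, and the watermark lives in memory, whereas a production system would persist it. `ChangeEventLite` and `IdempotentApplier` are hypothetical names for this example.

```typescript
// Trimmed-down event shape; only the fields needed for de-duplication.
interface ChangeEventLite {
  eventId: string;
  productId: string;
  timestamp: number;
}

// Skips events that are duplicates of, or older than, the last one applied per product.
class IdempotentApplier {
  private readonly lastApplied = new Map<string, number>();

  // Returns true if the handler ran, false if the event was skipped.
  apply(event: ChangeEventLite, handler: (e: ChangeEventLite) => void): boolean {
    const watermark = this.lastApplied.get(event.productId);
    if (watermark !== undefined && event.timestamp <= watermark) {
      return false; // replayed or stale event: nothing to do
    }
    handler(event);
    // Advance the watermark only after the handler succeeded,
    // so a failed event is retried rather than silently dropped.
    this.lastApplied.set(event.productId, event.timestamp);
    return true;
  }
}

// Usage: the second delivery of the same event is ignored.
const applier = new IdempotentApplier();
const appliedIds: string[] = [];
const record = (e: ChangeEventLite) => { appliedIds.push(e.eventId); };

applier.apply({ eventId: "e1", productId: "P1", timestamp: 100 }, record);
applier.apply({ eventId: "e1", productId: "P1", timestamp: 100 }, record); // redelivery
applier.apply({ eventId: "e2", productId: "P1", timestamp: 200 }, record);
```

Operations that are naturally idempotent, such as indexing a full document by ID, need less protection than counters like the cache version, where a replay would at worst trigger one extra cache miss.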
A catalog outage means customers can't browse, search, or view products—effectively shutting down the entire e-commerce operation. The catalog service requires 99.99% availability, allowing only ~52 minutes of downtime per year.
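The downtime figure follows directly from the availability target. A small helper makes it easy to compare targets; `downtimeBudgetMinutes` is a made-up name, and a 365-day year is assumed.

```typescript
// Allowed downtime, in minutes, for a given availability target over a period.
function downtimeBudgetMinutes(availabilityPercent: number, periodDays: number = 365): number {
  const unavailableFraction = 1 - availabilityPercent / 100;
  return unavailableFraction * periodDays * 24 * 60;
}

const yearlyBudget = downtimeBudgetMinutes(99.99);      // about 52.6 minutes per year
const monthlyBudget = downtimeBudgetMinutes(99.99, 30); // about 4.3 minutes per 30 days
const threeNines = downtimeBudgetMinutes(99.9);         // about 525.6 minutes per year

console.log({ yearlyBudget, monthlyBudget, threeNines });
```

Seen per month, the budget is only a few minutes, which is too short for a manual failover and is part of the case for running active deployments in several regions.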
Multi-Region Deployment Strategy:
```
                                     ┌───────────────┐
                                     │    Global     │
                                     │ Load Balancer │
                                     │  (Route 53)   │
                                     └───────┬───────┘
                                             │
            ┌────────────────────────────────┼────────────────────────────────┐
            │                                │                                │
            ▼                                ▼                                ▼
    ┌───────────────┐                ┌───────────────┐                ┌───────────────┐
    │    US-EAST    │                │    EU-WEST    │                │ AP-SOUTHEAST  │
    │ (Primary for  │                │ (Primary for  │                │ (Primary for  │
    │   Americas)   │                │   Europe)     │                │   Asia)       │
    └───────┬───────┘                └───────┬───────┘                └───────┬───────┘
            │                                │                                │
    ┌───────┴───────┐                ┌───────┴───────┐                ┌───────┴───────┐
    │               │                │               │                │               │
    ▼               ▼                ▼               ▼                ▼               ▼
┌───────┐       ┌───────┐        ┌───────┐       ┌───────┐        ┌───────┐       ┌───────┐
│ Redis │       │  ES   │        │ Redis │       │  ES   │        │ Redis │       │  ES   │
│Cluster│       │Cluster│        │Cluster│       │Cluster│        │Cluster│       │Cluster│
└───────┘       └───────┘        └───────┘       └───────┘        └───────┘       └───────┘
            │                                │                                │
            └────────────────────────────────┼────────────────────────────────┘
                                             │
                            ┌────────────────┴────────────────┐
                            │                                 │
                            ▼                                 ▼
                    ┌───────────────┐                 ┌───────────────┐
                    │  Primary DB   │ ───────────────→│ Read Replicas │
                    │   (US-EAST)   │  Cross-region   │  (EU, APAC)   │
                    │               │   replication   │               │
                    └───────────────┘                 └───────────────┘
```

We've covered the complete architecture of a product catalog system designed for Amazon-scale operations. Let's consolidate the key architectural decisions:
What's Next:
With the catalog architecture established, we'll dive into the Shopping Cart Service—one of the most stateful components in e-commerce. You'll learn how to design a cart that persists across sessions and devices, handles race conditions when inventory changes, and seamlessly merges guest and authenticated user carts.
You now understand how to architect a product catalog system that can serve 350M+ products to millions of concurrent users with sub-200ms latency. The key insight is that 'the catalog' is actually multiple specialized systems working together, each optimized for its specific purpose.