Loading system design...
Design a distributed task scheduling system (like Airflow, Celery, or Temporal) that reliably schedules and executes tasks at specified times across a fleet of workers. The system supports one-time tasks, recurring cron jobs, task DAGs (workflows), priority queues, retry with backoff, and guarantees at-least-once execution even under node failures.
| Metric | Value |
|---|---|
| Tasks scheduled per day | 100 million |
| Peak task submissions per second | 10,000 |
| Concurrent running tasks | 500,000 |
| Worker nodes | 1,000+ |
| Cron jobs registered | 5 million |
| DAG workflows | 100,000 |
| Average task duration | 5 seconds (range: 100ms – 1 hour) |
| Scheduling accuracy | ≤ 1 second from scheduled time |
| Task state queries per second | 50,000 |
| Retention of execution history | 30 days |
One-time task scheduling: submit a task to execute at a specific future time (e.g., 'send email at 2025-03-15 09:00 UTC'); task stored durably and executed exactly at the scheduled time (within 1-second accuracy)
Recurring / cron scheduling: define tasks that repeat on a schedule (cron expression: '0 9 * * MON' = every Monday 9 AM); system generates the next execution time after each run; support cron, fixed-rate, and fixed-delay intervals
At-least-once execution guarantee: every scheduled task executes at least once; no task silently dropped due to node crashes, network failures, or restarts; duplicate execution acceptable (idempotent tasks preferred)
Task lifecycle management: tasks go through states — SCHEDULED → QUEUED → RUNNING → SUCCEEDED / FAILED / RETRYING; support cancel, pause, resume; queryable task status and history
Distributed execution: tasks distributed across a fleet of worker nodes; no single point of failure; adding workers increases capacity linearly (horizontal scaling)
Retry with backoff: on task failure, retry with configurable exponential backoff (1s, 2s, 4s, 8s…) and max retry count; dead-letter queue for permanently failed tasks after exhausting retries
Priority and queues: support multiple priority levels (critical, high, normal, low); higher-priority tasks preempt lower-priority ones in the execution queue; named queues for task isolation (e.g., 'email-queue', 'report-queue')
Rate limiting and concurrency control: limit the number of concurrent tasks per queue, per tenant, or per task type; prevent a single heavy task type from starving others
Task dependencies (DAG): define that task B depends on task A (B runs only after A succeeds); support directed acyclic graphs (DAGs) of tasks; a DAG is a workflow (like Airflow)
Multi-tenancy: support multiple tenants (teams/services) sharing the scheduler; tenant-level isolation for quotas, rate limits, and task visibility; fair scheduling across tenants
Non-functional requirements define the system qualities critical to your users. Frame them as 'The system should be able to...' statements. These will guide your deep dives later.
Think about CAP theorem trade-offs, scalability limits, latency targets, durability guarantees, security requirements, fault tolerance, and compliance needs.
Frame NFRs for this specific system. 'Low latency search under 100ms' is far more valuable than just 'low latency'.
Add concrete numbers: 'P99 response time < 500ms', '99.9% availability', '10M DAU'. This drives architectural decisions.
Choose the 3-5 most critical NFRs. Every system should be 'scalable', but what makes THIS system's scaling uniquely challenging?