In the previous page, we established that information is data with context and meaning. But how exactly does raw data become meaningful information? The answer lies in data processing—the systematic sequence of operations that collects, manipulates, stores, retrieves, and disseminates data to produce useful outputs.
Data processing is not a modern invention. Humans have processed data for millennia—from ancient census counts to merchant ledgers to library catalogs. What modern database systems provide is the ability to perform these operations at unprecedented scale, speed, and reliability. Understanding data processing is understanding the operational core of every information system.
By the end of this page, you will understand the complete data processing cycle, different processing methodologies (batch, real-time, stream), the operations that constitute data processing, and how modern database systems implement these concepts at scale.
The Data Processing Cycle (also called the Information Processing Cycle) is the fundamental sequence of stages through which data passes from raw input to meaningful output. While specific implementations vary, all data processing follows this general pattern.
Every data processing operation—whether performed by a pencil-and-paper clerk or a distributed database cluster—follows these fundamental stages:
Stage 1: Collection (Data Gathering)
The cycle begins with collecting raw data from its sources. In modern systems, collection happens through:
Collection quality directly impacts all subsequent stages. Errors introduced here propagate through the entire cycle. This is where the principle "garbage in, garbage out" originates—low-quality collection produces low-quality results regardless of processing sophistication.
Stage 2: Preparation (Data Cleaning and Validation)
Raw collected data rarely arrives in perfect form. Preparation involves:
This stage is often underestimated but critically important. Studies consistently show that data professionals spend 60-80% of their time on data preparation. Quality preparation enables quality processing.
```sql
-- Example: Preparing raw customer data for processing

-- Raw imported data (often messy)
CREATE TEMPORARY TABLE raw_customers (
    name      TEXT,
    email     TEXT,
    phone     TEXT,
    join_date TEXT,
    status    TEXT
);

-- Preparation: Clean, validate, standardize
INSERT INTO customers (
    first_name, last_name, email, phone, created_at, is_active
)
SELECT
    -- Split and clean name
    TRIM(SPLIT_PART(name, ' ', 1)) AS first_name,
    TRIM(SPLIT_PART(name, ' ', 2)) AS last_name,
    -- Standardize email to lowercase
    LOWER(TRIM(email)) AS email,
    -- Normalize phone format (remove non-digits)
    REGEXP_REPLACE(phone, '[^0-9]', '', 'g') AS phone,
    -- Parse date with multiple format support
    COALESCE(
        TO_DATE(join_date, 'YYYY-MM-DD'),
        TO_DATE(join_date, 'MM/DD/YYYY'),
        TO_DATE(join_date, 'DD-Mon-YYYY'),
        CURRENT_DATE  -- Default if all parsing fails
    ) AS created_at,
    -- Standardize status to boolean
    CASE UPPER(TRIM(status))
        WHEN 'ACTIVE' THEN true
        WHEN 'YES'    THEN true
        WHEN '1'      THEN true
        ELSE false
    END AS is_active
FROM raw_customers
WHERE
    -- Validate: must have email
    email IS NOT NULL
    AND email LIKE '%@%.%'
    -- Validate: must have name
    AND name IS NOT NULL
    AND LENGTH(TRIM(name)) > 0;
```
Stage 3: Input (Data Entry)
Input is the formal introduction of prepared data into the processing system. This involves:
In database systems, input often manifests as INSERT, LOAD DATA, or bulk import operations. The input stage is where data crosses the boundary from "outside" to "inside" the system.
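For instance, here is a minimal sketch of the input stage in PostgreSQL-style SQL: one bulk import followed by one row-level insert. The staging table, target columns, and file path are illustrative assumptions, not details from a specific system.

```sql
-- Input-stage sketch (hypothetical tables and file path)

-- Bulk import: raw CSV rows cross the boundary into a staging table
COPY raw_customers (name, email, phone, join_date, status)
FROM '/data/imports/customers.csv'
WITH (FORMAT csv, HEADER true);

-- Row-level input: a single prepared record entered directly
INSERT INTO customers (first_name, last_name, email, is_active)
VALUES ('Ada', 'Lovelace', 'ada@example.com', true);
```

Bulk loading typically suits collection stages that produce files; row-level INSERTs suit interactive data entry.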
Stage 4: Processing (Data Manipulation)
This is the core stage where data is transformed into information. Processing operations include:
Modern database systems provide powerful processing capabilities through SQL and query engines. A single query can perform multiple processing operations:
```sql
SELECT
    department,
    COUNT(*) AS employee_count,
    AVG(salary) AS avg_salary,
    MAX(salary) - MIN(salary) AS salary_range
FROM employees
WHERE hire_date >= '2020-01-01'
GROUP BY department
HAVING COUNT(*) >= 5
ORDER BY avg_salary DESC;
```
This single statement performs filtering, grouping, counting, averaging, calculation, conditional filtering, and sorting—multiple processing operations composed together.
Stage 5: Output (Information Delivery)
Output transforms processed results into forms suitable for consumption:
Output is where data officially becomes information—presented in context for human understanding or system action. The same underlying data might produce different outputs for different audiences:
Stage 6: Storage (Information Preservation)
Storage preserves both raw data and processed results for future use:
Storage enables the feedback loop that makes continuous processing possible. Today's outputs become tomorrow's inputs for further analysis.
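As a small illustration of this feedback loop, here is a hedged sketch that persists a processed result so later cycles can reuse it as input. The monthly_revenue table and its columns are assumptions made for the example.

```sql
-- Storage-stage sketch (hypothetical summary table)
CREATE TABLE IF NOT EXISTS monthly_revenue (
    month_start   DATE PRIMARY KEY,
    total_revenue NUMERIC(14, 2) NOT NULL,
    computed_at   TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Persist a processed result; the upsert lets the same statement
-- run every cycle, refreshing stored results without duplicates
INSERT INTO monthly_revenue (month_start, total_revenue)
SELECT
    DATE_TRUNC('month', transaction_date)::date AS month_start,
    SUM(total_amount)                           AS total_revenue
FROM transactions
GROUP BY 1
ON CONFLICT (month_start) DO UPDATE
    SET total_revenue = EXCLUDED.total_revenue,
        computed_at   = CURRENT_TIMESTAMP;
```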
Note the feedback arrow from Storage back to Collection. Data processing is inherently cyclical:
This cyclical nature is why we call it a "cycle" rather than a linear pipeline. Each iteration through the cycle can refine and improve the process.
Each stage depends on the quality of the previous stage. Collection errors corrupt preparation. Preparation failures contaminate input. Input problems break processing. Processing errors produce wrong outputs. Poor outputs become poor stored data. The cycle amplifies quality—both good and bad.
Not all data processing happens the same way. Different use cases demand different approaches to when and how data is processed. Understanding these methodologies is essential for designing appropriate database solutions.
Batch processing collects data over a period and processes it as a single unit. This is the oldest and still most common processing methodology.
Characteristics:
Example Use Cases:
```sql
-- Batch Processing Example: Nightly Sales Summary
-- Runs once per day after business close

-- Step 1: Aggregate daily transactions
INSERT INTO daily_sales_summary (
    summary_date,
    store_id,
    total_transactions,
    total_revenue,
    total_items_sold,
    avg_transaction_value,
    created_at
)
SELECT
    CURRENT_DATE - 1 AS summary_date,
    store_id,
    COUNT(*) AS total_transactions,
    SUM(total_amount) AS total_revenue,
    SUM(item_count) AS total_items_sold,
    AVG(total_amount) AS avg_transaction_value,
    CURRENT_TIMESTAMP AS created_at
FROM transactions
WHERE transaction_date = CURRENT_DATE - 1
  AND status = 'COMPLETED'
GROUP BY store_id;

-- Step 2: Update running totals
UPDATE store_metrics sm
SET mtd_revenue = mtd_revenue + ds.total_revenue,
    mtd_transactions = mtd_transactions + ds.total_transactions,
    last_updated = CURRENT_TIMESTAMP
FROM daily_sales_summary ds
WHERE sm.store_id = ds.store_id
  AND ds.summary_date = CURRENT_DATE - 1;

-- Step 3: Archive processed transactions
INSERT INTO transaction_archive
SELECT * FROM transactions
WHERE transaction_date < CURRENT_DATE - 90;

DELETE FROM transactions
WHERE transaction_date < CURRENT_DATE - 90;
```
Real-time processing handles data immediately as it arrives, producing results with minimal delay.
Characteristics:
Example Use Cases:
```sql
-- Real-Time Processing Example: E-Commerce Order
-- Executes immediately when customer places order

BEGIN TRANSACTION;

-- Step 1: Validate and reserve inventory (immediate)
UPDATE inventory
SET reserved_quantity = reserved_quantity + @order_quantity
WHERE product_id = @product_id
  AND available_quantity >= @order_quantity
RETURNING *;
-- If no rows updated, insufficient inventory
-- ROLLBACK and return error to user immediately

-- Step 2: Create order (immediate)
INSERT INTO orders (
    customer_id,
    order_status,
    created_at
) VALUES (
    @customer_id,
    'PENDING',
    CURRENT_TIMESTAMP
) RETURNING order_id;

-- Step 3: Create order items (immediate)
INSERT INTO order_items (
    order_id,
    product_id,
    quantity,
    unit_price
) VALUES (
    @new_order_id,
    @product_id,
    @order_quantity,
    @current_price
);

-- Step 4: Process payment (immediate, external call)
-- ... payment gateway integration ...

-- Step 5: Confirm order (immediate)
UPDATE orders
SET order_status = 'CONFIRMED',
    confirmed_at = CURRENT_TIMESTAMP
WHERE order_id = @new_order_id;

COMMIT;
-- Customer sees confirmation within 2-3 seconds of click
```
Stream processing continuously processes data as an unbounded, flowing sequence of events.
Characteristics:
Example Use Cases:
| Aspect | Batch Processing | Real-Time Processing | Stream Processing |
|---|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds | Milliseconds to seconds |
| Throughput | Very high | Moderate | High |
| Data Handling | Accumulated, then processed | One at a time, immediately | Continuous flow |
| Resource Usage | Periodic spikes | Constant, moderate | Constant, scalable |
| Complexity | Lower | Moderate | Higher |
| Error Handling | Retry entire batch | Per-transaction retry | Event replay, watermarks |
| State Management | In-memory or temp tables | ACID transactions | Windowed state, checkpoints |
| Typical Tools | Stored procedures, ETL | OLTP databases | Kafka, Flink, Spark Streaming |
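To give the stream-processing column above a concrete shape, here is a sketch of a one-minute tumbling-window aggregation written in Flink-style streaming SQL. The orders_stream source, its columns, and the exact dialect details are assumptions for illustration only.

```sql
-- Stream-processing sketch (Flink-style SQL; stream and columns are hypothetical)
-- Continuously aggregates an unbounded stream of order events into
-- one-minute tumbling windows as the events arrive
SELECT
    window_start,
    window_end,
    store_id,
    COUNT(*)          AS order_count,
    SUM(total_amount) AS revenue
FROM TABLE(
    TUMBLE(TABLE orders_stream, DESCRIPTOR(order_time), INTERVAL '1' MINUTES)
)
GROUP BY window_start, window_end, store_id;
```

Unlike the batch example earlier, this query never finishes: it keeps emitting a result row per window as new events flow in.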
Modern systems often combine methodologies:
Lambda Architecture maintains parallel batch and real-time processing paths:
Kappa Architecture simplifies by using only stream processing:
The choice between methodologies depends on:
Modern database systems increasingly support multiple processing modes. PostgreSQL can handle batch ETL and real-time OLTP. Apache Kafka enables stream processing. Cloud data warehouses like Snowflake and BigQuery support both scheduled batch and interactive real-time queries.
Regardless of methodology, all data processing involves a set of fundamental operations. Understanding these operations provides a vocabulary for describing any processing task.
Recording: Converting real-world events into data entries
Coding: Transforming data into standardized formats
Verification: Confirming data accuracy at source
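A brief sketch shows how these origin operations can be enforced declaratively at the point of entry; the survey_responses table and its status codes are hypothetical.

```sql
-- Origin-operations sketch (hypothetical table; codes are illustrative)
CREATE TABLE survey_responses (
    response_id    SERIAL PRIMARY KEY,
    -- Coding: free-form answers stored as standardized codes
    satisfaction   CHAR(1) NOT NULL CHECK (satisfaction IN ('H', 'M', 'L')),
    -- Verification: reject clearly invalid values at entry time
    respondent_age INTEGER CHECK (respondent_age BETWEEN 13 AND 120),
    -- Recording: capture when the real-world event was entered
    recorded_at    TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Recording: converting a real-world event into a data entry
INSERT INTO survey_responses (satisfaction, respondent_age)
VALUES ('H', 34);
```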
```sql
-- Demonstration of fundamental processing operations

-- FILTERING: Select matching records
SELECT * FROM orders
WHERE status = 'SHIPPED'
  AND total > 1000;

-- SORTING: Order by specified criteria
SELECT * FROM products
ORDER BY category, price DESC;

-- CALCULATION: Derive new values
SELECT
    product_name,
    unit_price,
    quantity,
    unit_price * quantity AS line_total,
    unit_price * quantity * 0.08 AS tax_amount
FROM order_items;

-- AGGREGATION: Summarize records
SELECT
    category,
    COUNT(*) AS product_count,
    AVG(price) AS avg_price,
    SUM(stock_quantity) AS total_stock
FROM products
GROUP BY category;

-- MERGING (JOIN): Combine related datasets
SELECT
    o.order_id,
    c.customer_name,
    SUM(oi.quantity * oi.unit_price) AS order_total
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.order_id, c.customer_name;

-- TRANSFORMATION: Convert data formats
SELECT
    UPPER(first_name) AS first_name_upper,
    DATE_PART('year', AGE(date_of_birth)) AS age_years,
    CASE status
        WHEN 'A' THEN 'Active'
        WHEN 'I' THEN 'Inactive'
        ELSE 'Unknown'
    END AS status_text
FROM employees;
```
Storing: Writing data to persistent media
Retrieving: Reading data from storage
Updating: Modifying existing stored data
Deleting: Removing data from storage
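These storage-side operations map directly onto the core SQL statements; here is a compact sketch against the hypothetical products table used in the examples above.

```sql
-- Storage-operations sketch (hypothetical products table)

-- Storing: write a new row to persistent media
INSERT INTO products (product_name, category, price, stock_quantity)
VALUES ('USB-C Cable', 'Accessories', 9.99, 250);

-- Retrieving: read data back from storage
SELECT product_name, price FROM products WHERE category = 'Accessories';

-- Updating: modify existing stored data
UPDATE products SET price = 8.49 WHERE product_name = 'USB-C Cable';

-- Deleting: remove data from storage
DELETE FROM products WHERE stock_quantity = 0;
```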
Reporting: Generating formatted output for humans
Transmitting: Sending data to other systems
Displaying: Presenting data on screens
Real-world processing combines multiple operations. A typical report might filter data, join related tables, calculate derived values, aggregate results, sort the output, and format for display—all in a single processing flow. SQL's power lies in composing these operations declaratively.
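As a sketch of such a composition, reusing the hypothetical orders, customers, and order_items tables from the examples above:

```sql
-- Composed report sketch: filter, join, calculate, aggregate, format, sort
SELECT
    c.customer_name,
    COUNT(DISTINCT o.order_id) AS orders,
    -- Calculate and format a display-ready total
    TO_CHAR(SUM(oi.quantity * oi.unit_price), 'FM999,999,990.00') AS total_spent
FROM orders o
JOIN customers c    ON c.customer_id = o.customer_id   -- join related tables
JOIN order_items oi ON oi.order_id   = o.order_id
WHERE o.status = 'SHIPPED'                              -- filter
GROUP BY c.customer_name                                -- aggregate
ORDER BY SUM(oi.quantity * oi.unit_price) DESC;         -- sort
```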
Database management systems are specialized data processing engines. They provide optimized implementations of processing operations along with critical supporting features.
When you submit a SQL query, the DBMS executes an internal processing pipeline:
1. Parsing: Converts SQL text into an internal representation (parse tree)
2. Semantic Analysis: Validates table/column names, checks permissions
3. Query Optimization: Determines the most efficient execution strategy
4. Execution Plan Generation: Creates a step-by-step processing recipe
5. Execution: Carries out the plan, reading/writing data
6. Result Delivery: Returns processed results to the client
Database systems employ sophisticated techniques to optimize processing:
Indexing: Pre-organized access structures that speed retrieval
Caching: Keeping frequently accessed data in memory
Parallel Processing: Distributing work across multiple cores/nodes
Push-Down Optimization: Moving processing closer to data
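To make the indexing technique above concrete, here is a hedged PostgreSQL-style sketch; the index, table, and column names are assumptions for illustration.

```sql
-- Optimization sketch (hypothetical object names)
-- Indexing: a pre-organized access structure for a common filter pattern
CREATE INDEX idx_orders_customer_date
    ON orders (customer_id, order_date);

-- The planner can now satisfy this query with an index scan
-- instead of scanning the whole table
EXPLAIN
SELECT order_id, total_amount
FROM orders
WHERE customer_id = 42
  AND order_date >= DATE '2024-01-01';
```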
```sql
-- View how the database processes a query
EXPLAIN ANALYZE
SELECT
    c.customer_name,
    COUNT(o.order_id) AS order_count,
    SUM(o.total_amount) AS total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE c.registration_date >= '2023-01-01'
GROUP BY c.customer_id, c.customer_name
HAVING COUNT(o.order_id) > 5
ORDER BY total_spent DESC
LIMIT 100;

/* Sample execution plan output:
Limit (cost=1250.43..1250.46 rows=100 actual time=12.45..12.48 rows=100)
  -> Sort (cost=1250.43..1252.93 rows=1000 actual time=12.44..12.46)
       Sort Key: (sum(o.total_amount)) DESC
       -> HashAggregate (cost=1200.00..1225.00 rows=1000 actual time=11.90..12.10)
            Group Key: c.customer_id
            Filter: (count(o.order_id) > 5)
            -> Hash Left Join (cost=120.00..1050.00 rows=15000)
                 Hash Cond: (c.customer_id = o.customer_id)
                 -> Seq Scan on customers c (cost=0.00..45.00 rows=2000)
                      Filter: (registration_date >= '2023-01-01')
                 -> Hash (cost=85.00..85.00 rows=10000)
                      -> Seq Scan on orders o (cost=0.00..85.00 rows=10000)
Planning Time: 0.85 ms
Execution Time: 12.95 ms
*/
```
The query optimizer is perhaps the most sophisticated component of a DBMS. It evaluates potentially millions of execution strategies to find the most efficient one. Understanding execution plans helps developers write queries that the optimizer can process efficiently.
As data volumes and processing requirements have grown, various architectural patterns have emerged to address different needs.
OLTP systems optimize for frequent, small, fast transactions:
Characteristics:
Typical Use Cases:
OLAP systems optimize for complex analytical queries over large datasets:
Characteristics:
Typical Use Cases:
| Characteristic | OLTP | OLAP |
|---|---|---|
| Primary Purpose | Day-to-day operations | Analysis and reporting |
| User Type | Clerks, customers, applications | Analysts, managers, data scientists |
| Data Volume per Operation | Small (single rows) | Large (millions of rows) |
| Query Complexity | Simple predicates | Complex aggregations, joins |
| Transaction Duration | Milliseconds | Seconds to hours |
| Concurrency | Thousands of users | Tens of users |
| Data Currency | Real-time, current | Historical, periodic refresh |
| Schema Design | Normalized (3NF+) | Denormalized (Star/Snowflake) |
| Optimization Focus | Update speed | Query speed |
| Storage Model | Row-oriented | Column-oriented |
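To make the contrast concrete, here is a sketch of a typical query from each world; the tables and columns are hypothetical.

```sql
-- OLTP: small, fast, current-state lookup touching a single row
SELECT order_status, total_amount
FROM orders
WHERE order_id = 1048576;

-- OLAP: wide historical aggregation scanning millions of rows
SELECT
    DATE_TRUNC('quarter', o.order_date) AS quarter,
    p.category,
    SUM(oi.quantity * oi.unit_price)    AS revenue
FROM orders o
JOIN order_items oi ON oi.order_id  = o.order_id
JOIN products p     ON p.product_id = oi.product_id
WHERE o.order_date >= DATE '2020-01-01'
GROUP BY 1, 2
ORDER BY 1, 2;
```

The first query touches one row by primary key and must return in milliseconds; the second scans years of history and is judged by throughput rather than per-row latency.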
HTAP systems attempt to handle both workloads in a single system:
Characteristics:
Examples:
Modern data volumes often exceed single-machine capacity. Distributed processing architectures address this:
Shared-Nothing Architecture:
Shared-Storage Architecture:
Architecture choice depends on workload patterns. High-frequency transactions favor OLTP. Heavy analytics favor OLAP. Mixed workloads may benefit from HTAP or maintaining separate systems with data synchronization. There is no universal best architecture—only appropriate choices for specific requirements.
Data processing must not only be efficient but also correct. Incorrect processing produces incorrect information, which can be worse than no information at all. Database systems provide mechanisms to ensure processing quality.
The ACID properties guarantee reliable transaction processing:
Atomicity: A transaction is all-or-nothing. Either all operations complete successfully, or none do. Partial updates never persist.
Consistency: A transaction brings the database from one valid state to another. All constraints, triggers, and rules are enforced.
Isolation: Concurrent transactions don't interfere with each other. Each transaction sees a consistent database state.
Durability: Once a transaction commits, its effects are permanent, surviving system failures.
```sql
-- Example: Bank transfer demonstrating ACID properties

BEGIN TRANSACTION;

-- ATOMICITY: Both operations or neither
-- Debit source account
UPDATE accounts
SET balance = balance - 500.00
WHERE account_id = 'ACC-001'
  AND balance >= 500.00;  -- Ensure sufficient funds

-- Credit destination account
UPDATE accounts
SET balance = balance + 500.00
WHERE account_id = 'ACC-002';

-- Log the transfer
INSERT INTO transfer_log (
    source_account,
    dest_account,
    amount,
    timestamp
) VALUES (
    'ACC-001',
    'ACC-002',
    500.00,
    CURRENT_TIMESTAMP
);

COMMIT;
-- DURABILITY: Transfer is now permanent

-- If any step fails, ROLLBACK ensures CONSISTENCY
-- No partial state where money disappears or duplicates

-- ISOLATION: Other transactions see either
-- the state before the transfer or after, never during
```
Quality processing requires validation at multiple levels:
Type Validation: Data conforms to declared types
Range Validation: Values fall within acceptable ranges
Referential Validation: References point to valid targets
Business Rule Validation: Domain-specific rules
Cross-Field Validation: Related fields are consistent
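Several of these validation levels can be declared directly in the schema rather than re-checked in every program. Here is a minimal DDL sketch; the order_lines table, the reference to orders, and the constraint expressions are assumptions for illustration.

```sql
-- Validation sketch (hypothetical table; constraints are illustrative)
CREATE TABLE order_lines (
    order_id     INTEGER NOT NULL
        REFERENCES orders (order_id),               -- Referential validation
    quantity     INTEGER NOT NULL
        CHECK (quantity > 0),                       -- Range validation
    unit_price   NUMERIC(10, 2) NOT NULL,           -- Type validation (declared type)
    discount_pct NUMERIC(5, 2) DEFAULT 0
        CHECK (discount_pct BETWEEN 0 AND 100),     -- Business rule validation
    line_total   NUMERIC(12, 2) NOT NULL,
    -- Cross-field validation: related fields must be consistent
    CHECK (line_total = ROUND(quantity * unit_price * (1 - discount_pct / 100.0), 2))
);
```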
The most dangerous processing errors are those that don't raise exceptions but silently produce incorrect results. A string truncated to fit a column, a numeric overflow that wraps around to a negative value, a time zone conversion applied twice: these create subtle data corruption that may not be discovered for months.
Data processing continues to evolve rapidly. Understanding current trends helps you design systems that will remain relevant.
Modern processing increasingly happens in cloud environments:
Serverless Processing: Execute processing logic without managing servers
Managed Services: Cloud providers handle infrastructure
Elastic Scaling: Resources scale with demand
Organizations are moving from centralized to distributed data ownership:
Machine learning is becoming embedded in data processing:
Despite technological evolution, the fundamental data processing cycle—collect, prepare, input, process, output, store—remains constant. New technologies offer new implementations of these timeless stages, not replacements for the underlying concepts.
We've explored the complete landscape of data processing—from fundamental cycles to modern architectures. Let's consolidate the key insights:
What's Next:
With an understanding of data processing, we'll next explore Structured vs Unstructured Data—the two fundamental categories of data that databases must handle, each with its own storage, processing, and retrieval challenges.
You now understand how raw data is systematically transformed into useful information through the data processing cycle. This knowledge underpins every database operation you'll ever perform—from simple queries to complex data pipeline architectures.