DDIA: Chapter 11 - Stream Processing
Daily batch processes introduce significant latency: changes in the input are reflected in the output only a day later. For fast-paced businesses, this is too slow. To reduce delays, stream processing runs more frequently (e.g., every second) or continuously, handling events as they happen.
In stream processing, a record is typically called an event—a small, immutable object containing details of an occurrence, often with a timestamp. Polling for new events becomes costly when striving for low-latency continuous processing. Frequent polling increases overhead as most requests return no new data. Instead, systems should notify consumers when new events are available. Messaging systems handle this by pushing events from producers to consumers.
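As a rough illustrative sketch (not from the book), an event can be modeled as a small immutable record carrying a timestamp, and a producer can push it to subscribed consumers instead of making them poll:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List

@dataclass(frozen=True)
class Event:
    """A small, immutable record of something that happened."""
    payload: dict
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class Producer:
    """Pushes events to subscribers rather than making them poll for new data."""
    def __init__(self) -> None:
        self._subscribers: List[Callable[[Event], None]] = []

    def subscribe(self, callback: Callable[[Event], None]) -> None:
        self._subscribers.append(callback)

    def emit(self, payload: dict) -> None:
        event = Event(payload)
        for notify in self._subscribers:
            notify(event)  # push: consumer code runs as soon as the event occurs

producer = Producer()
producer.subscribe(lambda e: print("got", e.payload, "at", e.timestamp))
producer.emit({"action": "page_view", "user": "alice"})
```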
Direct messaging systems require application code to handle message loss and assume producers and consumers are always online, limiting fault tolerance. Message brokers (or message queues) improve reliability by acting as intermediaries: producers write messages to the broker, and consumers read them from it. Examples include RabbitMQ, ActiveMQ, Azure Service Bus, and Google Cloud Pub/Sub.
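The following toy sketch (an illustration of the broker contract, not how RabbitMQ or ActiveMQ are implemented) shows the basic idea: producers enqueue messages, consumers receive them, and a message is only discarded once it is acknowledged, so unacknowledged messages can be redelivered after a consumer crash:

```python
from collections import deque

class SimpleBroker:
    """Toy in-memory broker: a message stays 'in flight' until acknowledged,
    so it can be redelivered if the consumer fails before finishing."""
    def __init__(self) -> None:
        self._queue: deque = deque()
        self._in_flight: dict[int, object] = {}
        self._next_id = 0

    def publish(self, message) -> None:
        self._queue.append(message)

    def receive(self):
        """Hand the next message to a consumer without deleting it yet."""
        if not self._queue:
            return None
        msg_id, message = self._next_id, self._queue.popleft()
        self._next_id += 1
        self._in_flight[msg_id] = message
        return msg_id, message

    def ack(self, msg_id: int) -> None:
        """Consumer confirms processing; the broker can now discard the message."""
        self._in_flight.pop(msg_id, None)

    def requeue_unacked(self) -> None:
        """If a consumer crashed, put its unacknowledged messages back in the queue."""
        for message in self._in_flight.values():
            self._queue.append(message)
        self._in_flight.clear()
```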
Partitioned logs store messages as append-only logs, assigning offsets to messages within each partition. Log-based brokers like Apache Kafka, Amazon Kinesis Streams, and Twitter’s DistributedLog prioritize high throughput and message ordering. Google Cloud Pub/Sub offers a similar architecture but exposes a JMS-style API. Log-based brokers suit high-throughput scenarios with fast, ordered message processing. In contrast, JMS/AMQP brokers work better when messages are expensive to process, order isn't critical, and parallel processing is needed.
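A minimal sketch of the log abstraction (loosely modeled on the partition/offset scheme, not any broker's actual implementation): messages are appended to a partition chosen by key, each append gets a monotonically increasing offset within its partition, and a consumer reads sequentially by remembering the last offset it processed:

```python
import hashlib

class PartitionedLog:
    """Append-only log split into partitions; each message gets an offset
    that is unique and ordered within its partition."""
    def __init__(self, num_partitions: int = 4) -> None:
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key: str, value) -> tuple[int, int]:
        # The same key always maps to the same partition, preserving per-key order.
        p = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(self.partitions)
        self.partitions[p].append(value)
        offset = len(self.partitions[p]) - 1
        return p, offset

    def read(self, partition: int, from_offset: int):
        """Consumer reads sequentially from its last committed offset."""
        return self.partitions[partition][from_offset:]

log = PartitionedLog()
log.append("user-42", {"action": "click"})
log.append("user-42", {"action": "purchase"})
p, _ = log.append("user-42", {"action": "logout"})
print(log.read(p, from_offset=0))   # events for user-42, in order
```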
Databases and Streams
Dual writes, where applications update multiple systems (e.g., a database, search index, and cache) concurrently or sequentially, are error-prone. Race conditions can occur, leading to inconsistent states (see Fig. 11-4). A better approach is to designate one system as the leader (e.g., the database) and make others, like the search index, its followers.
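The race condition in Fig. 11-4 can be sketched as follows (a hypothetical interleaving, with plain dictionaries standing in for the database and cache): two clients perform the same dual write, but their writes reach the two systems in different orders, leaving them permanently inconsistent:

```python
# Hypothetical sketch of the dual-write race (no real database or cache involved).
database = {}
cache = {}

# Client 1 and client 2 both update key "x", each writing to both systems.
# Nothing forces the two systems to see the writes in the same order:
database["x"] = "from client 1"   # client 1 writes the database first
database["x"] = "from client 2"   # client 2 overwrites it
cache["x"]    = "from client 2"   # client 2 updates the cache
cache["x"]    = "from client 1"   # client 1's delayed cache write lands last

# The database says client 2 won, the cache says client 1 won, and
# neither system knows anything is wrong -- a permanent inconsistency.
assert database["x"] != cache["x"]
```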
Traditionally, databases treat replication logs as internal details, not public APIs. However, Change Data Capture (CDC) extracts data changes from a database, often as a stream, enabling replication to other systems. For example, a database change log can update a search index in the same order as the changes occur, ensuring consistency (see Fig. 11-5). This makes derived data systems, like search indexes, consumers of the change stream.
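As a hedged sketch (not any particular CDC tool's API), a consumer of the change stream applies each change event to the derived index in log order, so the index converges to the same state as the source database:

```python
# Hypothetical change events as they might appear in a CDC stream.
change_stream = [
    {"op": "insert", "id": 1, "doc": {"title": "Streams"}},
    {"op": "update", "id": 1, "doc": {"title": "Stream Processing"}},
    {"op": "delete", "id": 1},
]

search_index = {}   # stand-in for a real search index

def apply_change(change: dict) -> None:
    """Apply one change event; processing in log order keeps the
    derived index consistent with the source database."""
    if change["op"] in ("insert", "update"):
        search_index[change["id"]] = change["doc"]
    elif change["op"] == "delete":
        search_index.pop(change["id"], None)

for change in change_stream:
    apply_change(change)
```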
CDC designates the source database as the leader, with derived systems as followers. Log-based message brokers like Kafka are ideal for delivering CDC events, as they maintain order. Large-scale CDC implementations include LinkedIn's Databus, Facebook's Wormhole, and Debezium for MySQL. Tools like Kafka Connect offer connectors for various databases.
To manage disk space, CDC uses snapshots in conjunction with logs. A snapshot corresponds to a specific log position, ensuring changes after the snapshot can be applied correctly. Some CDC tools automate this; others require manual handling.
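A sketch of the idea (all names are illustrative): the snapshot records the log offset it corresponds to, and recovery loads the snapshot and replays only the changes that come after that offset:

```python
def restore(snapshot: dict, snapshot_offset: int, change_log: list) -> dict:
    """Rebuild current state from a snapshot plus the changes made after it.

    snapshot        -- full copy of the data at a known log position
    snapshot_offset -- the offset the snapshot corresponds to
    change_log      -- the retained change log, indexed by offset
    """
    state = dict(snapshot)
    # Only changes recorded *after* the snapshot need to be replayed.
    for change in change_log[snapshot_offset + 1:]:
        state[change["key"]] = change["value"]
    return state

change_log = [
    {"key": "a", "value": 1},   # offset 0
    {"key": "b", "value": 2},   # offset 1
    {"key": "a", "value": 3},   # offset 2
]
snapshot = {"a": 1, "b": 2}     # taken at offset 1
print(restore(snapshot, snapshot_offset=1, change_log=change_log))  # {'a': 3, 'b': 2}
```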
Event sourcing builds on immutability, recording all changes as an append-only log of events. This approach mirrors accounting, where transactions are never altered but corrected with new entries if mistakes occur. Immutable logs improve auditability, ease recovery from bugs, and preserve historical data for analytics. For instance, a customer adding and removing an item from a shopping cart generates two events. Though the cart's final state is empty, the log records the customer’s interest, which is valuable for analytics.
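The shopping-cart example can be sketched like this (a minimal illustration, not a full event-sourcing framework): the log keeps both events even though the derived cart state ends up empty, so the customer's interest remains visible for analytics:

```python
# Append-only event log: nothing is ever updated or deleted.
events = [
    {"type": "ItemAdded",   "cart": "c1", "item": "book-123"},
    {"type": "ItemRemoved", "cart": "c1", "item": "book-123"},
]

def cart_contents(events: list, cart: str) -> set:
    """Derive the current cart state by replaying the event log."""
    items = set()
    for e in events:
        if e["cart"] != cart:
            continue
        if e["type"] == "ItemAdded":
            items.add(e["item"])
        elif e["type"] == "ItemRemoved":
            items.discard(e["item"])
    return items

print(cart_contents(events, "c1"))                        # set() -- the cart ends up empty
added = [e["item"] for e in events if e["type"] == "ItemAdded"]
print(added)                                              # ['book-123'] -- interest still recorded
```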
Event sourcing simplifies concurrency control. Instead of multi-object transactions, a single event encapsulates a user action, requiring only an atomic append to the log. Partitioning the log and application state similarly (e.g., by customer) enables single-threaded processing per partition, eliminating concurrency issues. For cross-partition events, additional coordination is needed.
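A sketch of the partitioning idea (illustrative only): each user action becomes one event appended to a partition chosen by customer ID, and a single worker consumes each partition, so events for a given customer are processed in order without locks or multi-object transactions:

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4
partition_logs = defaultdict(list)   # one append-only log per partition

def append_event(customer_id: str, event: dict) -> None:
    """A user action becomes a single event; one atomic append is the only write."""
    p = int(hashlib.md5(customer_id.encode()).hexdigest(), 16) % NUM_PARTITIONS
    partition_logs[p].append({"customer": customer_id, **event})

def process_partition(p: int, state: dict) -> None:
    """One single-threaded worker per partition: a customer's events are
    always applied in order, so no concurrency control is needed here."""
    for event in partition_logs[p]:
        state[event["customer"]] = event   # record the latest event per customer
```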
Processing Streams
Stream processing supports use cases like monitoring (e.g., fraud detection, trading systems, manufacturing) and often involves joins, which combine data across streams and databases. Three join types are stream-stream joins, which match related events within a time window; stream-table joins, which enrich events using database changelogs; and table-table joins, which produce materialized view updates. Fault tolerance in stream processing involves techniques like microbatching, checkpointing, transactions, and idempotent writes to ensure reliability in long-running processes.
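As an illustrative sketch (names and structure are assumptions, not any specific framework's API): a stream-table join keeps a local copy of the table up to date from its changelog and enriches each incoming event with a lookup, while an idempotent sink keyed by event offset ensures that reprocessing after a failure does not double-apply writes:

```python
# --- Stream-table join: enrich activity events with user profiles ---
user_table = {}   # local copy of the users table, kept fresh from its changelog

def apply_user_change(change: dict) -> None:
    """Consume the database changelog to keep the local table current."""
    user_table[change["user_id"]] = change["profile"]

def enrich(activity: dict) -> dict:
    """Join each activity event with the latest known profile."""
    return {**activity, "profile": user_table.get(activity["user_id"])}

apply_user_change({"user_id": "u1", "profile": {"city": "Oslo"}})
print(enrich({"user_id": "u1", "action": "click"}))

# --- Idempotent writes: replaying events after a crash is harmless ---
applied_offsets = set()
sink = {}

def idempotent_write(offset: int, key: str, value) -> None:
    """Skip writes that were already applied, so at-least-once delivery
    still produces exactly-once effects in the sink."""
    if offset in applied_offsets:
        return
    sink[key] = value
    applied_offsets.add(offset)

idempotent_write(0, "clicks:u1", 1)
idempotent_write(0, "clicks:u1", 1)   # redelivered after a failure: no double count
```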