DDIA: Chp 4. Encoding and Evolution (Part 1)

This first part of Chapter 4 of the Designing Data Intensive Applications (DDIA) book discusses the concepts of data encoding and evolution in data-intensive applications. As applications inevitably change over time, it's important to build systems that can adapt to these changes easily, a property referred to as evolvability (under maintainability) in Chapter 1 of the book.

Different data models handle change differently. Relational databases typically enforce a single schema at any given time, with changes implemented through schema migrations. In contrast, schema-on-read databases allow for a mix of older and newer data formats.

Data format or schema change often necessitates corresponding application code changes. However, in large applications, code changes can't always happen instantaneously. This leads to situations where old and new versions of code and data formats coexist in the system. To handle this, we need to maintain both backward compatibility (newer code can read data written by older code) and forward compatibility (older code can read data written by newer code).

The chapter dives into various formats for encoding data (JSON, XML, Protocol Buffers, Thrift, and Avro), focusing on how they handle schema changes and support systems where old and new data and code coexist. Let's get started.


Encoding/Decoding

Programs typically work with data in two different representations:

  1. In-memory: Data is kept in objects, structs, lists, arrays, hash tables, trees, etc., optimized for efficient access and manipulation by the CPU.
  2. Byte sequences: When data needs to be written to a file or sent over a network, it must be encoded as a self-contained sequence of bytes.

The process of translating between these representations is called encoding (or serialization) when going from in-memory to byte sequence, and decoding (or deserialization) when going the other way.
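
To make the distinction concrete, here is a minimal sketch in Python (with made-up field names) that encodes an in-memory dictionary into a JSON byte sequence and decodes it back:

import json

# In-memory representation: a Python dict holding native types.
person = {"user_name": "Martin", "favorite_number": 1337, "interests": ["hacking"]}

# Encoding (serialization): in-memory object -> self-contained byte sequence
# that can be written to a file or sent over the network.
encoded = json.dumps(person).encode("utf-8")

# Decoding (deserialization): byte sequence -> in-memory object.
decoded = json.loads(encoded)
assert decoded == person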

While many programming languages offer built-in support for encoding in-memory objects (like Java's Serializable, Ruby's Marshal, Python's pickle), these encodings are language-specific and come with drawbacks around versioning, efficiency, and security.
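
For instance, Python's pickle makes this look effortless, but the resulting bytes can only be read back by Python, and unpickling untrusted input is a known security hazard; a minimal sketch:

import pickle

person = {"user_name": "Martin", "interests": ["hacking"]}

# Encode with the language-specific built-in format...
blob = pickle.dumps(person)

# ...and decode it. Only Python code can do this, which couples your data to one
# language (and pickle.loads on untrusted bytes can execute arbitrary code).
assert pickle.loads(blob) == person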

For standardized encodings that work across multiple programming languages, JSON and XML are popular choices, especially for data interchange between organizations. Despite their verbosity, they remain widely used because of their simplicity and because getting different organizations to agree on a standard at all is a huge accomplishment.


Binary encoding formats

The chapter then introduces the binary encoding formats designed to address the space inefficiency of JSON and XML.

Apache Thrift and Protocol Buffers are two popular options. Both require a schema for encoded data, and both handle schema changes while maintaining backward and forward compatibility through the tag numbers assigned to each field. New fields can be added with fresh tag numbers, as long as they are optional or have default values: old code simply ignores tags it doesn't recognize (forward compatibility), and new code reads old data by filling in defaults for the missing fields (backward compatibility). Fields can only be removed if they are optional, and their tag numbers can never be reused.

For example, a schema definition in Protocol Buffers would look like this (Apache Thrift's interface definition language is very similar).

message Person {
    required string user_name       = 1;
    optional int64  favorite_number = 2;
    repeated string interests       = 3;
}
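
To see how tag numbers buy compatibility, here is a toy sketch in Python. It is not the actual Protocol Buffers or Thrift wire format (those use varints and type codes); it is a simplified tag-length-value encoding, just to show that a reader built against an old schema can skip tags it doesn't recognize:

# Toy tag-length-value encoding (string values only), for illustration.
def encode(fields):
    """fields: list of (tag, value) pairs, values are strings."""
    out = bytearray()
    for tag, value in fields:
        data = value.encode("utf-8")
        out.append(tag)        # one-byte field tag
        out.append(len(data))  # one-byte length (real formats use varints)
        out.extend(data)
    return bytes(out)

def decode(buf, known_tags):
    """Readers skip tags they don't recognize -> forward compatibility."""
    record, i = {}, 0
    while i < len(buf):
        tag, length = buf[i], buf[i + 1]
        if tag in known_tags:
            record[known_tags[tag]] = buf[i + 2 : i + 2 + length].decode("utf-8")
        # unknown tag: silently skipped
        i += 2 + length
    return record

# A "new" writer adds a field with tag 4 that "old" readers don't know about.
encoded = encode([(1, "Martin"), (3, "hacking"), (4, "some new field")])
old_schema = {1: "user_name", 2: "favorite_number", 3: "interests"}
print(decode(encoded, old_schema))  # {'user_name': 'Martin', 'interests': 'hacking'}

The reverse direction is backward compatibility: a new reader that expects tag 4 but receives old data without it simply falls back to that field's default value, which is why newly added fields must be optional or have defaults.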


Apache Avro also uses a schema, but it doesn't include tag numbers in the encoded data. It supports schema evolution by distinguishing between the writer's schema (used to encode the data) and the reader's schema (expected by the decoding application); the two don't have to be identical, only compatible (see Fig 4-6 in the book). Forward compatibility means you can have a new version of the schema as writer and an old version of the schema as reader; conversely, backward compatibility means you can have a new version of the schema as reader and an old version as writer. Avro maintains compatibility by only allowing the addition or removal of fields that have default values. Because it has no tag numbers, Avro is better suited for dynamically generated schemas.
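
A small sketch of that reader/writer schema resolution, assuming the third-party fastavro library (field names are illustrative): a newer writer schema encodes a record, and an older reader schema resolves it by simply dropping the field it doesn't know about.

import io
import fastavro  # third-party: pip install fastavro

# Newer writer schema: adds favorite_number, with a default value.
writer_schema = fastavro.parse_schema({
    "type": "record", "name": "Person",
    "fields": [
        {"name": "user_name", "type": "string"},
        {"name": "favorite_number", "type": ["null", "long"], "default": None},
    ],
})

# Older reader schema: has never heard of favorite_number.
reader_schema = fastavro.parse_schema({
    "type": "record", "name": "Person",
    "fields": [{"name": "user_name", "type": "string"}],
})

buf = io.BytesIO()
fastavro.schemaless_writer(buf, writer_schema,
                           {"user_name": "Martin", "favorite_number": 1337})
buf.seek(0)

# Schema resolution: the writer's extra field is ignored by the old reader.
print(fastavro.schemaless_reader(buf, writer_schema, reader_schema))
# {'user_name': 'Martin'}

Swapping the roles (old writer, new reader) also works: the new reader fills in favorite_number from its default value, which is exactly why Avro insists that added or removed fields carry defaults.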

The book also briefly mentions Apache Parquet, a columnar storage format that supports nested data structures and offers various compression and encoding options.
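
As a rough illustration (assuming the third-party pyarrow library), writing a small table to Parquet with a chosen compression codec looks like this:

import pyarrow as pa
import pyarrow.parquet as pq  # third-party: pip install pyarrow

# Columnar layout: values are organized per column rather than per row.
table = pa.table({
    "user_name": ["Martin", "Ada"],
    "favorite_number": [1337, 42],
})

# Compression and encoding options are configurable.
pq.write_table(table, "people.parquet", compression="zstd")
print(pq.read_table("people.parquet").to_pydict())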

The chapter highlights some merits of schema-based approaches: schemas serve as valuable documentation, and they make it possible to check forward and backward compatibility of schema changes before anything is deployed.

Key takeaways are:

  1. Consider both backward and forward compatibility when designing data models and APIs.
  2. Be aware of the trade-offs between human-readable formats like JSON and XML, and more efficient binary formats like Thrift, Protocol Buffers, and Avro.
  3. Remember that the choice of data format can significantly impact system performance, especially in data-intensive applications.
  4. When working on systems that involve data exchange between different organizations or services, consider the ease of adoption and existing standards in your choice of data format.


Some questions

As ML/LLMs become more prevalent, how might our approach to data encoding and schema evolution change? Could we imagine self-adapting schemas or encoding formats that automatically evolve based on the data they encounter?

With the increasing importance of data privacy and regulations like GDPR, how can encoding formats incorporate built-in privacy features? (Maybe things like Europe-only fields, or labels for the privacy needs/levels of fields.) Could these formats obscure sensitive information based on the schema and the reader's permissions?

The chapter did not talk about encryption. How could these formats support encryption, or even computation on encrypted data? Are any formats particularly better suited for this? Plug: MongoDB has been shipping queryable encryption, and widening the range of queries it supports.

