
"Designing Data-Intensive Applications" by Martin Kleppmann - A comprehensive summary

Key Insights and Takeaways from "Designing Data-Intensive Applications" Book


I’m diving into the book “Designing Data-Intensive Applications” by Martin Kleppmann, a must-read for anyone interested in building data-intensive distributed systems. Over the next few weeks, I’ll be sharing my key takeaways and insights from this fantastic book on LinkedIn. 🚀

I’ll be breaking my learnings into bite-sized posts. If you’re interested in leveling up your knowledge, follow along, and let’s learn together! Feel free to connect with me on LinkedIn.

https://www.linkedin.com/posts/sahilbhosale63_softwarearchitecture-webapplications-systemdesign-activity-7282430702086733824-b5z-?utm_source=share&utm_medium=member_desktop

Chapter 1: Reliable, Scalable, and Maintainable Applications

In this chapter, we will focus on three concerns that matter in most software systems: reliability, scalability, and maintainability.

Reliability

Reliability is one of the most critical aspects of any data-intensive application. A reliable system continues to function correctly, delivering the desired performance, even when faced with challenges like hardware or software faults, or human errors.
However, reliability can be compromised by faults and failures in the system. Understanding the difference between the two is crucial:

  • 𝗙𝗮𝘂𝗹𝘁: A fault occurs when a component deviates from its expected behavior (e.g., hardware issues, bugs, or human mistakes).

  • 𝗙𝗮𝗶𝗹𝘂𝗿𝗲: A failure happens when the system as a whole stops delivering the required service to users, impacting overall functionality.

Scalability

Scalability is the key to ensuring your system can handle growth in data volume, traffic, or complexity; it describes a system’s ability to cope with increased load.

Describing Load

Before we can think about scaling, we need to define the current load on the system. Only then can we explore what happens when the load increases. 𝗟𝗼𝗮𝗱 𝗽𝗮𝗿𝗮𝗺𝗲𝘁𝗲𝗿𝘀 help describe this. The choice of parameters depends on your system’s architecture. Examples include:

  • Requests per second to a web server

  • Read/write ratio in a database

  • Number of simultaneously active users

  • Cache hit rate
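To make these load parameters concrete, here is a minimal sketch that derives them from a request log. The log format and the ten-second window are made up for illustration:

```python
from collections import Counter

# Hypothetical request log: (operation, cache_hit) tuples over a 10-second window.
log = [
    ("read", True), ("read", False), ("write", False),
    ("read", True), ("read", True), ("write", False),
    ("read", False), ("read", True), ("read", True), ("write", False),
]
window_seconds = 10

requests_per_second = len(log) / window_seconds
ops = Counter(op for op, _ in log)
read_write_ratio = ops["read"] / ops["write"]
read_hits = [hit for op, hit in log if op == "read"]
cache_hit_rate = sum(read_hits) / len(read_hits)

print(requests_per_second)   # 1.0
print(read_write_ratio)      # 7 reads / 3 writes ≈ 2.33
print(cache_hit_rate)        # 5 hits out of 7 reads ≈ 0.71
```

The point is that different systems care about different parameters; picking the right ones is the first step of any scalability discussion.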

Describing Performance

Once the load is defined, it's important to consider how performance changes as the load increases:

  • What happens to performance if you increase load (e.g., double the traffic) without changing system resources like CPU, memory, and bandwidth?

  • How much more resources (CPU, memory, etc.) are needed to maintain performance when the load increases?

Scalability ensures that your system grows efficiently without compromising performance.

Response Time

Response time is the total time a user waits after making a request until they get a response. It includes:

  • The time it takes for the system to actually process the request (called service time).

  • Any delays caused by the network.

  • Time spent waiting in line if other requests are being processed (queueing delays).

Random additional latency could be introduced by a context switch to a background process, the loss of a network packet and TCP retransmission, a garbage collection pause, a page fault forcing a read from disk, mechanical vibrations in the server rack, or many other causes.

Latency

Latency is specifically the time a request spends waiting to be processed. It's the "idle" time before the actual work on the request starts.
𝗜𝗻 𝘀𝗵𝗼𝗿𝘁,

Response time = Processing time + Network delays + Queueing time.
Latency = Waiting time before processing starts.

Key points for understanding response time and their impact on user experience:

  • Average Response Time: Commonly reported but not always accurate in reflecting typical user experience. It’s often calculated as the arithmetic mean.

  • Mean vs. Percentiles: The mean doesn’t show how many users experience a particular delay. Percentiles are better for understanding typical and worst-case response times.

  • Median (50th Percentile): The median response time is a good measure of typical user experience, where half the requests are faster and half are slower.

  • Higher Percentiles (p95, p99, p999): These show the response times for the slowest 5%, 1%, and 0.1% of requests, respectively, and are crucial for understanding the "worst-case" scenarios (tail latencies).

  • Impact on User Experience: High percentile response times affect user satisfaction and business outcomes. For instance, Amazon found that slower responses reduce sales and customer satisfaction.

  • Challenges with High Percentiles: Optimizing extremely high percentiles (e.g., 99.99th) is costly and yields diminishing returns, often influenced by random external factors.

  • Service Level Agreements (SLAs): Percentiles are used in SLAs to define acceptable service performance and availability, often including specific percentile thresholds for response times.

  • Queueing Delays and Scalability Testing: Queueing delays contribute significantly to high percentile response times. When testing scalability, ensure load generation is independent of response time to get accurate measurements.
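As an illustration of the percentile metrics above, a small nearest-rank percentile function over a list of response times (in milliseconds) might look like this:

```python
import math

def percentile(samples, p):
    """p-th percentile by the nearest-rank method: the smallest sample such
    that at least p% of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# 20 hypothetical response times in milliseconds.
response_times_ms = list(range(10, 201, 10))   # 10, 20, ..., 200

print(percentile(response_times_ms, 50))   # 100 (median: half the requests are faster)
print(percentile(response_times_ms, 95))   # 190 (the slowest 5% are above this)
print(percentile(response_times_ms, 99))   # 200
```

Note how the tail percentiles are dominated by the few slowest requests, which is exactly why the mean alone hides the worst-case user experience.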

Maintainability

Over time, various people will work on the system (both in engineering and operations), maintaining its current behavior and adapting it to new use cases.

The 3 key principles to minimize maintenance issues and avoid creating outdated (legacy) systems:

  1. 𝗢𝗽𝗲𝗿𝗮𝗯𝗶𝗹𝗶𝘁𝘆: Ensure the system is easy for operations teams to keep running smoothly.

  2. 𝗦𝗶𝗺𝗽𝗹𝗶𝗰𝗶𝘁𝘆: Make the system easy for new engineers to understand by removing unnecessary complexity.

  3. 𝗘𝘃𝗼𝗹𝘃𝗮𝗯𝗶𝗹𝗶𝘁𝘆: Facilitate future changes to the system, allowing it to adapt to new, unforeseen use cases as requirements evolve.

Maintainability covers many aspects, but ultimately, it’s about making life easier for the engineering and operations teams working with the system.

Chapter 2: Data Models and Query Languages

This chapter discusses various data models and query languages in database design.

Object-Relational Mismatch

A common criticism of the SQL data model is that when data is stored in relational tables, an awkward translation layer is needed between the objects in the application code and the database model of tables, rows, and columns. This disconnect between the models is often referred to as an impedance mismatch.

Object-relational mapping (ORM) frameworks like ActiveRecord, Hibernate and many others help minimize the boilerplate code needed for this translation layer, but they cannot entirely conceal the differences between the two models.

Relationships in Database

  • 1 : 𝗠𝗮𝗻𝘆 𝗿𝗲𝗹𝗮𝘁𝗶𝗼𝗻𝘀𝗵𝗶𝗽: One entity is linked to many instances of the other, while each of those instances links back to just one. Ex: a user lives in a single city, but many users can live in that same city.

  • 𝗠𝗮𝗻𝘆 : 𝗠𝗮𝗻𝘆 𝗿𝗲𝗹𝗮𝘁𝗶𝗼𝗻𝘀𝗵𝗶𝗽: Both entities participate in many relationships. Ex: a user can buy many courses, and one course can be bought by many users.

  • 1 : 1 𝗿𝗲𝗹𝗮𝘁𝗶𝗼𝗻𝘀𝗵𝗶𝗽: Each entity participates in exactly one relationship. Ex: an employee has a single ID card, and that ID card is assigned to a single employee.
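In a relational database, the many-to-many case is typically modeled with a join table. Here is a minimal sketch using SQLite from Python; the users/courses/enrollments schema is a made-up example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE courses (id INTEGER PRIMARY KEY, title TEXT);
-- Join table: one row per (user, course) pair models the many-to-many link.
CREATE TABLE enrollments (
    user_id   INTEGER REFERENCES users(id),
    course_id INTEGER REFERENCES courses(id),
    PRIMARY KEY (user_id, course_id)
);
""")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ada"), (2, "Alan")])
conn.executemany("INSERT INTO courses VALUES (?, ?)", [(10, "DDIA"), (11, "SQL")])
conn.executemany("INSERT INTO enrollments VALUES (?, ?)",
                 [(1, 10), (1, 11), (2, 10)])  # Ada takes both; DDIA has two students

rows = conn.execute("""
    SELECT u.name FROM users u
    JOIN enrollments e ON e.user_id = u.id
    WHERE e.course_id = 10 ORDER BY u.name
""").fetchall()
print(rows)  # [('Ada',), ('Alan',)]
```

The same pattern works in any relational database; only the connection setup changes.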

Choosing the Right Data Model for Your Application

  • Relational model: If your application uses many-to-many relationships, prefer the relational model over the document model.

  • Document model: Ideal for applications with one-to-many relationships (tree-structured data) or where records have minimal relationships.

  • Graph model: When your data has highly interconnected relationships and many-to-many connections are complex, graph models become a good choice.

However, the final choice might depend on specific application requirements, such as the need for scalability, flexibility, or performance considerations.

Chapter 3: Storage and Retrieval

Index in DB:

  • An index is an additional structure (that acts as metadata) that is derived from the primary data in DB.

  • Well chosen indexes speed up read queries.

  • Index usually slows down writes, because the index also needs to be updated every time data is written.

Underlying data structures used to implement indexing:

1. 𝗸𝗲𝘆 𝘃𝗮𝗹𝘂𝗲 𝗯𝗮𝘀𝗲𝗱 𝗶𝗻𝗱𝗲𝘅𝗶𝗻𝗴

  • To avoid running out of disk space, we use compaction: throwing away duplicate keys in the log and keeping only the most recent update for each key.
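Compaction can be sketched in a few lines. This is a toy model in which the log is just a list of (key, value) pairs; real storage engines work on segment files on disk:

```python
def compact(log):
    """Keep only the most recent value for each key in an append-only log."""
    latest = {}
    for key, value in log:   # later entries overwrite earlier ones
        latest[key] = value
    return list(latest.items())

log = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
print(compact(log))  # [('a', 5), ('b', 4)]
```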

2. 𝗦𝗦𝗧𝗮𝗯𝗹𝗲𝘀 𝗮𝗻𝗱 𝗟𝗦𝗠-𝗧𝗿𝗲𝗲𝘀

  • SSTables (Sorted String Tables) store key-value pairs in sorted order, which allows efficient lookups and range scans (retrieving records within a specific range).

  • LSM-Trees (Log-Structured Merge-Trees) are optimized for write-heavy workloads and periodically merge data in the background using compaction.

  • Both store data in sorted order and on disk.
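The merging step of LSM-tree compaction can be sketched as a k-way merge of sorted runs, where the newest run wins on duplicate keys. This is a toy model; real SSTables live in segment files and also carry tombstones for deletions:

```python
import heapq

def merge_sstables(runs):
    """Merge sorted (key, value) runs into one sorted run.
    runs are ordered oldest-first; on duplicate keys the newest value wins."""
    # Tag each entry with the run's age so entries for the same key arrive
    # oldest-first from the merged stream.
    tagged = ([(key, i, value) for key, value in run] for i, run in enumerate(runs))
    result = []
    for key, _, value in heapq.merge(*tagged):
        if result and result[-1][0] == key:
            result[-1] = (key, value)   # same key from a newer run: overwrite
        else:
            result.append((key, value))
    return result

old = [("a", 1), ("c", 3)]
new = [("a", 9), ("b", 2)]
print(merge_sstables([old, new]))  # [('a', 9), ('b', 2), ('c', 3)]
```

Because each input run is already sorted, the merge runs in a single pass, which is why sorted segments make compaction cheap.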

3. 𝗜𝗻𝗱𝗲𝘅𝗶𝗻𝗴 𝗶𝗻 𝗗𝗕 𝘄𝗶𝘁𝗵 𝗕 𝘁𝗿𝗲𝗲

  • B-trees store key-value pairs in sorted order, enabling efficient lookups and range queries.

  • Data is organized into fixed-size blocks (called pages), each page is typically 4 KB, with one page read or written at a time.

  • The branching factor refers to the number of child page references in a single B-tree page.

  • Most databases use B-trees with 3–4 levels, minimizing page references. A 4-level tree with 4 KB pages and a branching factor of 500 can store up to 256 TB.
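That capacity figure checks out with a quick back-of-the-envelope calculation, counting 500-way branching at each of the four levels and using decimal terabytes:

```python
page_size_bytes = 4 * 1024       # 4 KB pages
branching_factor = 500
levels = 4

# Four levels of 500-way branching reach 500**4 data pages.
data_pages = branching_factor ** levels
capacity_bytes = data_pages * page_size_bytes
print(capacity_bytes // 10**12)  # 256 (decimal terabytes)
```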

Types of indexes in a DB:

  1. 𝗣𝗿𝗶𝗺𝗮𝗿𝘆 𝗜𝗻𝗱𝗲𝘅
    The primary index is created automatically on the primary key column. Only one primary index is allowed per table.

  2. 𝗖𝗹𝘂𝘀𝘁𝗲𝗿𝗲𝗱 𝗶𝗻𝗱𝗲𝘅
    A clustered index stores all row data within the index.

  3. 𝗡𝗼𝗻-𝗰𝗹𝘂𝘀𝘁𝗲𝗿𝗲𝗱 𝗶𝗻𝗱𝗲𝘅
    A non-clustered index stores only references to the data within the index.

  4. 𝗠𝘂𝗹𝘁𝗶 𝗰𝗼𝗹𝘂𝗺𝗻 𝗶𝗻𝗱𝗲𝘅𝗲𝘀
    If we need to query on a combination of columns of a table, then we use multi-column indexes.

  5. 𝗙𝘂𝗹𝗹-𝘁𝗲𝘅𝘁 𝘀𝗲𝗮𝗿𝗰𝗵 𝗮𝗻𝗱 𝗳𝘂𝘇𝘇𝘆 𝗶𝗻𝗱𝗲𝘅𝗲𝘀
    All the above indexes are used to query exact data. In contrast, a full-text search index retrieves relevant records by analyzing the search query, including synonyms, ignoring grammatical variations, and finding words that appear near each other in a document.
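To see a multi-column index in action, here is a small SQLite sketch (the table and index names are made up). EXPLAIN QUERY PLAN reports whether the query uses the index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (last_name TEXT, first_name TEXT, email TEXT)")
# Multi-column index: useful when queries filter on (last_name, first_name).
conn.execute("CREATE INDEX idx_name ON users (last_name, first_name)")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT email FROM users
    WHERE last_name = 'Kleppmann' AND first_name = 'Martin'
""").fetchall()
print(plan)  # the plan's detail column mentions "USING INDEX idx_name"
```

The column order matters: this index also serves queries filtering on last_name alone, but not queries filtering only on first_name.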

Chapter 4: Encoding and Evolution

Programs store data in memory in the form of data structures like objects, arrays, trees, graphs, etc. However, if you want to send this data over the network or write it to a file, that isn’t directly possible. You first have to convert the data into a sequence of bytes.

This translation from the in-memory representation to a byte sequence is called encoding (serialization or marshalling), and the reverse is called decoding (parsing, deserialization, unmarshalling).

When it comes to encoding, there are two formats in which we represent data:
1. 𝗧𝗲𝘅𝘁-𝗯𝗮𝘀𝗲𝗱 𝗲𝗻𝗰𝗼𝗱𝗶𝗻𝗴: JSON, XML, CSV
2. 𝗕𝗶𝗻𝗮𝗿𝘆 𝗲𝗻𝗰𝗼𝗱𝗶𝗻𝗴: Protobuf, Avro
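A minimal example of text-based encoding with JSON, showing the in-memory object becoming a byte sequence and back:

```python
import json

record = {"user_id": 42, "name": "Martin", "interests": ["databases"]}

encoded = json.dumps(record).encode("utf-8")   # in-memory object -> byte sequence
print(type(encoded))                           # <class 'bytes'>

decoded = json.loads(encoded)                  # byte sequence -> in-memory object
print(decoded == record)                       # True
```

Binary formats like Protobuf and Avro follow the same encode/decode shape but produce a compact binary representation driven by a schema.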

Compatibility in Binary Encoding

Compatibility refers to the relationship between a process that encodes data and another that decodes it. To ensure smooth data exchange, the format or schema used for encoding and decoding must remain synchronized, even as the schema evolves.

Types of Compatibility

  1. Backward Compatibility: Newer versions of the system can read data written by older versions.

  2. Forward Compatibility: Older versions of the system can read data written by newer versions.
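Both kinds of compatibility can be illustrated with a toy schema change over JSON; the optional email field and the record shapes are made-up examples:

```python
import json

# v1 writers know only user_id and name; v2 adds an optional email field.
v1_record = json.dumps({"user_id": 1, "name": "Ada"})
v2_record = json.dumps({"user_id": 2, "name": "Alan", "email": "alan@example.com"})

def v2_read(data):
    """Backward compatibility: new code reads old data by defaulting missing fields."""
    obj = json.loads(data)
    return obj["user_id"], obj["name"], obj.get("email")  # None if absent

def v1_read(data):
    """Forward compatibility: old code reads new data by ignoring unknown fields."""
    obj = json.loads(data)
    return obj["user_id"], obj["name"]

print(v2_read(v1_record))  # (1, 'Ada', None)
print(v1_read(v2_record))  # (2, 'Alan')
```

Schema-driven formats like Protobuf and Avro enforce these rules systematically (e.g. via field tags or schema resolution) instead of relying on defensive application code.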

Challenges with RPC:

RPC lets the client (the one requesting a resource) directly execute a function that lives on the server (the one serving that request).

  • Unpredictability: Unlike local function calls, RPCs can fail due to network issues or remote machine unavailability, which are outside your control.

  • Timeouts and Unknown Outcomes: RPC requests may return without a result due to timeouts, leaving uncertainty about whether the request was processed.

  • Retry Issues: Retrying failed requests can lead to duplicate actions if responses are lost but requests succeed.

  • Latency Variability: Network latency is variable and significantly slower than local function calls.

  • Data Encoding Challenges: Parameters must be encoded for network transmission, which can be problematic with complex data types.

  • Interoperability Issues: Different programming languages require datatype translations, which can be complex and error-prone.

Despite these challenges, RPC remains a valuable tool in distributed systems because it simplifies the development of remote interactions.

Chapter 5: Replication

Replication involves creating multiple copies of the same data and storing them across different nodes to enhance data availability and fault tolerance.

Each node that stores the copy of the database is called as a replica.

𝗧𝘆𝗽𝗲𝘀 𝗼𝗳 𝗥𝗲𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻

1. 𝗟𝗲𝗮𝗱𝗲𝗿 𝗮𝗻𝗱 𝗙𝗼𝗹𝗹𝗼𝘄𝗲𝗿 𝗥𝗲𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻

  • In this model, a leader node handles both read and write requests, while follower nodes can only accept read requests.

  • The leader is often referred to as the master or primary node, while followers may be called read replicas, slaves, secondaries, or hot standbys.

  • Here, the leader first writes new data to its local storage and then sends a data change to its followers as part of the replication log or change stream.

  • Each follower takes the log from the leader and updates its local copy of the data by applying all writes in the same order as they were processed by the leader.

  • This type of replication is typically configured to be completely asynchronous.

  • Also known as active/passive or master-slave replication.

Various replication methods used in leader and follower replication are as follows:

  1. Statement based replication

    • Here the leader logs every write request (statement) that is executed and then sends the statement log to its followers.

    • Issues with this approach come from nondeterministic functions like NOW(), statements with auto-incrementing columns, user-defined functions, and so on: such statements can produce a different value each time they execute, so replicas can diverge from the leader.

    • Because of such cases, other replication methods are mostly preferred over this.

  2. Write ahead log (WAL) replication

    • Storage engines that use B-trees or log-structured storage (SSTables and LSM-trees) maintain a write-ahead log to make writes durable. The same log can be used to build a replica on another node.

    • The drawback is that the log describes the data at a very low level, coupling replication to the storage engine’s on-disk format.

    • If the replication protocol allows the follower to run a newer software version than the leader, you can upgrade with zero downtime: upgrade the followers first, then fail over to make an upgraded follower the new leader. If the protocol does not allow a version mismatch, downtime is required.

  3. Logical (row based) log replication

    • Here we keep the storage engine and replication logs separate (decoupling). This kind of replication log is called logical log to differentiate it from storage engine (physical) data representation.

    • A logical log for relational DB is sequence of records describing writes to the DB tables.

      • For inserted row, the log contains new values of all columns.

      • For deleted row, log contains enough information to uniquely identify the row that was deleted.

      • For an updated row, the log contains enough information to uniquely identify the row that was updated, along with the new values of the changed columns.

    • Because this log is decoupled from the storage engine internals, the leader and follower can run different versions of the database software, or even different storage engines.

  4. Trigger based replication

    • The above replication methods are implemented by the database system itself. If you need custom behavior, such as replicating only a subset of the data, this method is used.

    • Here, replication is done at the application layer.

    • Databases offer triggers and stored procedures, which let custom application code run automatically when a write transaction happens in the database.

    • The trigger logs these changes to a separate table, which an external process (application code) can read.

    • More prone to bugs and higher overhead than built-in replication.
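A toy model of a follower applying a logical (row-based) log as described above; the record shapes here are assumptions for illustration, not any real database’s log format:

```python
# Each log record describes a write to one table row, identified by primary key.
log = [
    ("insert", "users", {"id": 1, "name": "Ada", "city": "London"}),
    ("update", "users", {"id": 1, "city": "Cambridge"}),
    ("insert", "users", {"id": 2, "name": "Alan", "city": "London"}),
    ("delete", "users", {"id": 2}),
]

tables = {"users": {}}   # follower's local copy: table -> {primary key -> row}

for op, table, row in log:   # apply writes in the leader's order
    rows = tables[table]
    if op == "insert":
        rows[row["id"]] = dict(row)
    elif op == "update":
        rows[row["id"]].update(row)   # only the changed columns are in the record
    elif op == "delete":
        del rows[row["id"]]

print(tables["users"])  # {1: {'id': 1, 'name': 'Ada', 'city': 'Cambridge'}}
```

Because the records describe rows rather than storage pages, any consumer that understands the record format can apply them, which is what enables mixed versions and mixed storage engines.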

Replication lag:

In an asynchronous setup, the followers may fall behind the leader, i.e., they may be missing recent writes for some period of time. This is replication lag (the follower lags behind the leader). Eventually, though, the followers catch up with the leader; this is called eventual consistency.

Approaches to deal with replication lag:

  1. Read-after-write consistency: If a user reads data right after writing it, they should see the updated data. One technique: read the user’s own data from the leader and other users’ data from a follower.

  2. Monotonic reads: A user should not see older data after having seen newer data (e.g., loading the latest message, then refreshing and seeing it disappear).

  3. Consistent prefix reads: If a sequence of writes happens in a certain order, anyone reading those writes sees them in that same order (a consistent prefix of the writes).
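One common way to get monotonic reads is to route each user’s reads to the same replica, for example by hashing the user ID. A sketch with placeholder replica names:

```python
import hashlib

replicas = ["replica-0", "replica-1", "replica-2"]

def replica_for(user_id):
    """Route each user's reads to a fixed replica so their reads never go
    backwards in time, even if different replicas lag by different amounts."""
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return replicas[int(digest, 16) % len(replicas)]

# The same user always hits the same replica:
print(replica_for(42) == replica_for(42))  # True
```

A hash is used rather than random choice so the routing is deterministic; if the chosen replica fails, the user must be rerouted, at which point monotonicity needs other mechanisms.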

𝗦𝘆𝗻𝗰𝗵𝗿𝗼𝗻𝗼𝘂𝘀 𝗥𝗲𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻:

  • The leader sends each write to all followers and waits for every follower to acknowledge it.

  • The advantage is that all followers are guaranteed to have an up-to-date copy of the data.

  • In practice this is infeasible: if any follower is down or slow, the leader cannot complete the write, since it must wait for every follower to respond. During that time, the leader can’t accept any write requests.

Semi-synchronous Replication

  • One follower is made synchronous and the others are asynchronous. The leader waits only for this one follower to acknowledge the write.

  • If the synchronous follower becomes slow or unavailable, one of the asynchronous followers is made synchronous in its place.

  • In practice, this is what you get when you enable “synchronous” replication in a database.

  • This guarantees that at least two nodes have an up-to-date copy of the data: the leader and one synchronous follower.

𝗔𝘀𝘆𝗻𝗰𝗵𝗿𝗼𝗻𝗼𝘂𝘀 𝗥𝗲𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻:

  • In this approach, data is first written to the leader and then replicated to the followers asynchronously.

  • While this method focuses on speed and efficiency, it carries a risk of data loss if the leader node fails before the latest changes are replicated to followers.

2. Multi Leader Replication

___________________________________________________

REMAINING CONTENT COMING EVERY WEEK…

___________________________________________________

Noteworthy Points (WIP)

I will continue adding more points to the list below as I progress through the book.

  • Horizontal scaling: Distributing load across multiple machines is also known as a shared-nothing architecture.

  • Some systems are elastic, meaning that they can automatically add computing resources when they detect a load increase, whereas other systems are scaled manually (a human analyzes the capacity and decides to add more machines to the system). An elastic system can be useful if load is highly unpredictable, but manually scaled systems are simpler and may have fewer operational surprises.

  • Polyglot persistence is a hybrid approach that uses different data storage technologies to handle different data storage needs within a software application.

  • In both relational and document DBs, the related item is referenced by a unique identifier, which is called a foreign key in the relational model and a document reference in the document model.

  • Document databases are sometimes called schemaless, but that’s misleading, as the code that reads the data usually assumes some kind of structure—i.e., there is an implicit schema, but it is not enforced by the database.

  • Schema-on-read: the structure of the data is implicit and only interpreted when the data is read. This contrasts with schema-on-write, the traditional approach of relational databases, where the schema is explicit and the database ensures all written data conforms to it.

  • Schema-on-read is similar to dynamic (runtime) type checking in programming languages, whereas schema-on-write is similar to static (compile-time) type checking.