Fundamentals of System Design

Before we start out, why is it important to understand system design? How does YouTube stream 4K videos in several GBs to billions of users across the globe? How did Instagram handle over 50 million likes within a day of Lionel Messi's world cup winning post?

Turns out, things become especially complex when you're serving the same interactions that you'd do in any normal CRUD application to billions (or even millions) of users.

The goal of designing a scalable system is to NOT design it to handle a billion users on day one. That would be over engineering your app to scale it might never see.

Instead, the key is to start simple and keep things extensible. During the initial stages of the product, you're better off focusing on the features and getting the user base. Follow good coding practices and some design patterns wherever needed.

Okay, now what if your app is growing and you need to build that scale? Knowing these concepts would help in identifying the situations where your app might actually need to apply the scalability concepts.

The Very Basics

At a fundamental level, you need a strong hold of how data is stored and how data moves across your system. A crucial part of getting this right is capacity estimation. How much space does the data take up?

char = 1 byte
int = 4 bytes
bigint = 8 bytes
date = 4 bytes
timestamp = 8 bytes

When you estimate the scale for a million users, you'd be using these as multipliers to compute the final space.

Building Blocks

When learning system design, there are certain terms that you'd hear being thrown around. Some of these terms are used interchangeably which does cause a point of confusion to many. These terms are used across various distributed systems components like database, message brokers, microservices, etc.

Server

This is the physical hardware layer. We use multiple servers to prevent hardware failures.

Node

This is a software process. In the context of databases, it handles the request, storage on the disk, manage indexes, communication with other nodes and much more. One server can run multiple nodes. Today, the line between a server and node is blurred by virtualization and containerization. One server usually runs one node in highly distributed architecture.

Cluster

A cluster is a group of nodes. The group of nodes would be connected to a proxy which acts like an entry point to manage the requests.

Partition

A partition is logical grouping of related data. A data in a database is partitioned w.r.t. certain partition keys. Data that is frequently accessed together are partitioned together. This reduces the number of row scans on the database enabling faster queries.

Sharding

Sharding is partitioning on steroids or horizontally scaling partitions. Instead of splitting data logically in one database server, you split them across multiple physical servers (or nodes). If one shard fails, only that data is affected. Eg. One shard of data would be users from 1 - 1M.

Replication

Replication is duplication of shards to better reliability. Sharding and replication often go hand in hand. If you shard without replication, all data on the shard is gone forever if the shard dies. If you replicate without sharding, you'd completely rely on vertical scaling to handle the traffic.

Think of it like a matrix - sharding are the logical columns of the divided data, replication is duplication of those nodes for each shard. There are different forms of replication but commonly you'd have a leader node that writes data and follower nodes that handle the reads.

The main replication methods are

Single Leader Replication
Multi Leader Replication
Leaderless Replication

Postgres uses Single Leader Replication strategy. If you look at a 3 node Postgres setup, you can see how both concepts exist:

The Sharding: You decide Node A will handle one-third of the data, Node B the second third, and Node C the final third.
The Replication: You then add a "Follower Node" for each of those.

Now, Shard 1 is replicated across Node A (Leader) and Node D (Follower).
Shard 2 is replicated across Node B (Leader) and Node E (Follower).
Shard 3 is replicated across Node C (Leader) and Node F (Follower).

In this scenario, you have 6 nodes total (3 shards, each with 2-way replication).

Caching

According to Pareto's principle -

A small subset of causes are usually responsible for a larger phenomenon.

This is true in case of building scalable apps. In case of Instagram, certain posts by celebrities often account for a major portion of the interactions on the system but account for a small number of users on the app.

To mitigate this, caching is often employed as a strategy to limit the number of trips to the database. Redis is the commonly used tool for caching. The too main types of caching include:

Look Aside Cache: This is the industry default way of caching. The application server checks the cache first. If not present, it goes to the database, fetches the data and populates the cache.
Read Through Cache: Here, the cache is in charge of everything. If a data requested by the application server is not present in the cache, the cache itself fetches the data from the database, caches it and returns it back to the incoming request. This requires a custom embedding layer that enables the cache to query the database on its own.

Object Store

An object store is used to store blobs of unstructured data. Commonly used to store static data like files. They are much cheaper than databases because there's very little compute involved. Most of the cost is for the disk space and network traffic to move the bytes. The operations are relatively simple compared to complex database queries.

Some examples of object store include AWS S3, Cloudflare R2 and Azure Blob Store.

Content Delivery Network (CDN)

A CDN is a geo distributed caching layer specifically meant for static content. This enables websites to load static assets like images, videos, files quickly. This happens because the files are cached in a data center close to the host system. Examples of CDN include AWS Cloudfront, Cloudflare CDN and Azure Frontdoor.

OLTP vs OLAP Databases

Online Transactional Processing (OLTP) databases are the commonly used databases the applications use to store and retrieve data. Data is commonly stored in rows. They use normalized schemas optimized for fast reads and updates.

Online Analytical Processing (OLAP) databases are used for analytics and reporting. The data is stored in a column format for faster access to large amounts of specific information in a data warehouse.

Message Queues & Brokers

Message queues are simpler components within the message broker that manage the flow of data from one point to another. Message brokers use multiple message queues, exchanges and rules to allow decoupled modules pass messages to each other in an async manner. They are often used to reduce the write load on a database by acting as a buffer. Examples include RabbitMQ and Apache Kafka.

RabbitMQ was named after rabbits because of how fast they multiply.

Indexes

Indexes are data structures that store shortcuts to the data. Its main goal is allow the database to locate specific records without having to scan the entire table. Its very similar to the index present in a book.

The tradeoff when using an index is that they dramatically speed up READs but they add an overhead for WRITEs (INSERT, UPDATE and DELETE). Every time a record is updated, the database also has to update the corresponding index.

The commonly used indexing structures are

B-Tree
LSM Tree

B-Tree (and B+ Tree)

B-Tree indexes organizes data in a balanced tree like structure in the disk. B+ Trees are similar but faster for range queries (finding data between a certain range). Postgres uses B+ tree indexes. Its good for data that is frequently read (happens in O(log n) time). The write overhead increases because the data on the disk has to be updated whenever a record changes.

LSM Tree

LSM Tree Indexes are used for data that is write heavy. It makes use of a memtable to perform the initial set of writes. As this happens in memory, it makes writes extremely fast. The changes are then flushed to SS Tables (Sorted String Tables) which store the data on the disk. The memtable is cleaned up once its full. This does cause slower reads because data has to be read checked in both the memtable and SSTable.