Here is a summary, conclusions, and thoughts from the first chapter of Designing Data-Intensive Applications, with a few additions based on my experience.
What influences the system architecture?
Most systems that manage data need to store it (databases), remember the result of costly operations (cache), allow efficient searching of its resources (search indexes), communicate asynchronously with another process (stream processing) and once in a while process everything from the beginning (batch processing).
It may sound obvious, but there are many ways to achieve each of these goals, depending on your needs. Some tools can accomplish several of the tasks listed above, but always at some cost, while gaining from a single implementation and hiding several implementations in one deployment.
System architecture is a game of trade-offs. Sometimes we build a cinder block instead of a glass skyscraper because the expected payoff from fast delivery is high or the consequences of not meeting a deadline are severe.
There are other factors that influence the design of the system: the experience of the team, legal requirements, or dependencies on other parts of the system.
So how do you make decisions when there are so many options and constraints? This question is difficult to answer directly, but we have three guides that point the way to a good system. They are reliability, scalability and maintainability. Contrary to popular opinion, these are not empty names served only on a marketing platter. There is nothing left to do but to make them more specific.
A reliable system works even when things don’t go our way. Don’t be naive, problems will always happen. In other words, a fault-tolerant system is one that is resistant to failures. In technical nomenclature, faults are expected, but failures, i.e. situations when the whole system has failed, we avoid like hell. So we design in such a way that faults do not escalate to failures.
How do we test whether our system tolerates failures? A drastic but effective way is to intentionally break things down yourself, preferably on a production environment. What does this involve? Without warning, but with a finger on the pulse, we kill a given service and observe how it affected the whole system. Yes, there are tools for this, like Netflix Chaos Monkey.
Let’s also answer the question of what are the sources of the problems.
Despite the use of redundant devices like a backup power source or hot-swappable CPUs, it would be foolish to think that hardware won’t cripple your system. As of April 2022, the world’s largest virtual machine provider (AWS) gives you a guarantee that they will run 99.5% of the time, which translates to just over 3.5 hours of unavailability per month. For this reason, application instances should not be on the same server, translated to Kubernetes deployments, application instances should not be on Nodes that are on the same server.
An application that is resilient to hardware faults is much easier to maintain because you don’t have to schedule downtime to perform operating system version updates or change compute instances.
You may have heard the saying that people are divided into two types: those who do backups and those who will. One of the reasons we can lose data is because we only have one copy on a hard drive that just happened to fail. It’s becoming more and more common to use data storage services that offer backups out of the box, so you may not think about it, but it’s worth having in the back of your mind because data is usually a company’s biggest asset.
If we already have an application whose instances are not on the same server, the problem practically disappears. The chance that all our sheets will fail at the same time is already so low that we do not have to bother about it. It is worse with errors that occur everywhere at once. The worst thing is that we bring them on ourselves in the code. It turns out that the biggest problem is between the chair and the keyboard. Admittedly, “in the code” does not mean in the application that we create. After all, it can occur in the library or the operating system itself. These bugs are usually dormant and are just waiting for the right input.
Why is this important?
Let’s even skip the software that operates a nuclear power plant or emergency numbers.
Reliability is the promise of every business to the consumer. Even if you’re creating another database browser (or CRUD, if you prefer) in the form of a TO DO list for the App Store, think of the sclerotic user who added a note to call his mom after two years since they last spoke. In an unreliable system, it could be lost, causing you to plunge family relationships for months to come.
It is the ability to deal with an increasing load. So what is load? It’s the amount of data we handle in a given unit of time, while the very nature of the data affects what problems we will have to solve when we want to handle more of it. A system that has to handle 100 1Mb queries per second will look different from a system that handles 100,000 1Kb queries per second, even though the volume of data per unit of time will remain the same.
Here are some factors that influence the architecture of the system:
- number of reads to the database per second,
- number of writes to the database per second,
- average query size,
- number of queries per second.
Systems are composed of many components, such as operational services, non-relational database, queue, cache, and so on, and each has its own strengths and weaknesses in terms of the nature of growing data. Furthermore, it is not the nature of the components themselves that affects scalability, but also how they are connected. If an operational service needs to communicate with 10 others to handle a request, it will be harder to manage throughput than if it had only one dependency.
To consider growth, we first need to understand what the load is on the system. It is useful to answer questions about a specific set of applications:
- What happens if the number of requests per second to the cache increases 10 times?
- How will the system handle 10 times more registered users?
- What happens if 10 times more users log in than usual?
We can run two simulations to determine system performance. The first assumes that we have a fixed amount of computing resources, we raise the load and observe the performance. The second assumes that we want to maintain the same performance, we raise the load and adjust the number of computing resources.
When we think of workload, we naturally put our attention on the number of queries handled per second (throughput). However, should we consider a query that was served after 2 seconds a success? If the final application client is the user, Amazon’s research shows that it is not. Such a long response time affect both satisfaction and the value of the shopping cart itself.
Let’s redefine the example success metric, measuring performance this time with two conditions:
- median response less than 200 ms,
- 99.9% response less than 500 ms.
This approach is better because it takes into account the probability distribution of response time, so the trade-off of measuring just the mean is still too much of a generalization.
Importantly, when testing performance under load, artificially generated requests must be sent independently of incoming responses to replicate the real world.
How to deal with the growing load?
The rule of thumb is that when one of the load factors increases by an order of magnitude, the architecture itself will also change, probably even sooner. Until then, we have two strategies for increasing load: scaling-up also known as vertical scaling, which is increasing the computing power of the particular machine on which a system component runs, and scale-out also known as horizontal scaling, which is increasing the number of computing machines.
Horizontal scaling is basically straightforward for services without state, while services that store data, such as relational databases, face a partition redistribution problem that takes downtime into account. For this reason, the wisdom developed by practitioners is: “scale up as long as you can” for services with the state.
A system that automatically adapts to the load is called an elastic system, which is now the gold standard, as opposed to manual management of computing resources.
It is a truism that the initial cost of software development is much less than its later maintenance and extension. Maintenance means fixing bugs, adapting the system for new functionalities, as well as their implementation or ongoing fight with technical debt.
So what makes a system that is easy to maintain? First of all, how simple it is for us to work with it in production, that is:
- tasks that do not require human interaction are automated,
- has clear visibility into the real-time system health check, i.e., monitoring and logging,
- has clear and simple processes that define routine actions, such as patching or restarting services.
This is, of course, mainly the DevOps work.
In contrast, how can a software craftsman contribute to a better future? Reduce accidental complexity, i.e. logic that does not stem from the problem itself, but from its implementation.
How to prevent the inevitable accidental complexity? In this case, the panacea is an abstraction. An abstraction that hides implementation details and exposes a clear and consistent interfaces.
When we answer the question “how to design reliable, scalable, and maintainable applications?” we cannot look for ready-made solutions because there are too many variables involved. It is better to refer to best practices and consider a particular implementation against them. A certain amount of humility and patience is also an indispensable element of the long-term game on the technological chessboard because the best solutions are born during the evolutionary process.