Tons of articles and books have been written about scalability and capacity planning. However, sometimes the hardest part of the job is to avoid overengineering and keep the things clear and simple. This article briefly summarizes the basic rules that let us to achieve continuous exponential growth with linear effort.

Scalability vs performance

Let’s start from some sad news. Scalable code is usually computationally inefficient. If you split a big complex task to a set of small associative operations – it will probably run slower on one machine. But it’s preferable to support unlimited horizontal scaling. Obviously, hitting the ceiling of vertical growth is much more likely.

Moreover, it lets you avoid possible points of failure and reach reliability via redundancy. So swarms of small and simple tasks definitely overcome their efficient and complex alternatives.

Asynchronous communication

If a web server processes a request in 50ms – unfortunately 2 servers won’t necessarily do it twice faster. We can improve the overall throughput by spawning more instances, but the latency will reach its lower bound as soon as we upstage the queuing issues. Good developers tend to know over 9000 ways to overcome this, but before you dig into the garbage collector tuning – don’t hesitate to postpone the actions that don’t need immediate response.

Moreover, pumping all the possible actions through a message queue lets you reach the dizzy heights of seamless integration, smoothen the load peaks and calmly handle partial outages. If some service is down (planned or unplanned) – the requests stack up in the queue to be handled when it returns.

Asynchronous communication


Asynchronous stuff looks nice on paper, but if you need a DB query to present some data to a user – sending it to RabbitMQ won’t save any time. However, if your controller needs two independent queries and each of them takes 100ms – launching them concurrently will probably cut the response time in half. The same principle works for any other blocking operation.



Concurrent queries reduce the overall latency unless the DB server starts sweating. And usually it tends to be the key bottleneck. But fortunately enough – the clients prefer reading over writing. At least you don’t need to write anything to authenticate a user, display some page and load the comments. Therefore the most of the essential flows can be served from a read replica even if the main database is down. And you can launch as many read replicas as you want. If it is not enough – feel free to launch replicas of replicas.

For the same reason it makes sense to store the most of the session-related data on the client side. It lets different servers to handle requests for the same client and eliminates the notorious locks on the sessions table (unless you overcomplicate the things and run into client-side data migrations).


Eventual consistency

But what about replication lag? Welcome to the cruel world of eventual consistency. Some essential survival skills:

  1. When you change an object and trigger some remote asynchronous action – don’t expect the changes to arrive before the recipient executes it. Rather serialize the object and send it as an argument. Sad but true.
  2. Train some virtual dwarfs to check data for consistency, sweep garbage and kill zombies in background.
  3. Use two-phase commit for transactions between microservices. It’s usually easier to validate the data before performing an action than to serve a distributed rollback.
  4. Keep the transaction logs to retrospect discrepancies.

It’s also useful to leverage the power of conflict-free replicated data types (CRDT). For instance, if you have a counter that faces some parallel increment/decrement operations – track the number of increments and decrements in separate atomic counters. The result value will not depend on the order of operations.



Third normal form is beautiful, but the computations nowadays are much more expensive than HDD space. And it’s not about electricity – it’s rather about latency. If you can save 10ms of an average response time by storing 1TB of additional data – it’s usually a good idea to proceed. Our enemy is the seek time, but it can be defeated by proper indexing and partitioning.


Caching might become tricky when data comes from different services. If you ever need to cache some information owned by another service – that service should emit a “change” event to let you invalidate the corresponding cache entries asynchronously.

Obviously, it is vital to measure cache hits/misses and LRU age. And sometimes it makes sense to separate local and global cache.


Competing on a rapidly growing market is much like racing. If you keep everything under control – it means you don’t push hard enough. Failures happen, the thing is to minimize the impact and always have an option to rollback.

First of all, fail as fast as possible. Isolate the issues and don’t let them to spread by grabbing shared resources.

Be careful with retrying non-idempotent operations. If you did not receive a response – it does not guarantee that the action had not been taken.

Never trust external services. If you depend on 5 services, and each of them has 99% SLA – you can’t guarantee more than 0,995 = 95% of availability. It means you could be offline for 18 days a year.


Reproducing a tricky bug is usually harder than fixing it. The most severe outages happen when we don’t have enough evidence to understand what’s going on. It is important to retrospect the incidents and improve monitoring to get the clues faster and separate the reasons from the consequences. However, it should not lead to endless noise of meaningless alerts.

In general, it is very useful to have a tool to get all the logs for a particular request from multiple servers and microservices.


And obviously it makes sense to track the basic metrics (Apdex score, throughput, response time etc) from all the services on the single dashboard.

Capacity balancing

Imagine you suddenly got a queue of 1000 requests that should be processed by 20 workers. An average request takes 100ms and you have a circuit breaker with 500ms timeout. Unfortunately it means that 900 requests will fail with a timeout.

Sometimes it’s easier to adjust the circuit breaker settings than to balance the capacities in realtime, but in general you should have enough evidence to track down bottlenecks and adjust the auto-scaling accordingly.

It is worth to mention that adding more resources to a group might have negative impact due to resource contention – it is good to know the optimal ratios.


Small servers are good for smooth capacity curves  and precise scaling, whereas big servers are more efficient in terms of load balancing, monitoring and latency for heavy computations. Every service has its own optimal setup, and it makes no sense to pay for better CPU if you are bound to network or SSD performance.

Obviously, spawning a new server should be a trivial automated operation. Killing a server should not have any negative impact – you should not lose any data (including logs). It is relatively simple to achieve this with “share-nothing” approach. The logs could be sent to a central location in realtime.

It is also good to automate the routine operations like deployment (without any SSH!) and to make the configuration as dynamic as possible to avoid frequent restarts.

The recipes should not expect the servers of one group to have the same configuration. They should not stick to parameters like CPU count or amount of RAM – you might need to spawn a new server with a different configuration in case of emergency.


Deployment is usually easier than rollback, but there should always be a plan B. First of all, avoid irreversible migrations and massive changes that cause downtime. The old code should keep working after applying the new migrations. If it is not possible – rather break the change apart and split it to several releases.

Feature flags are very useful to enable features one by one for a particular group of users. Keep them in configuration to disable a particular feature and partially degrade the service if something suddenly goes wrong.

Obviously, it should be possible to deploy or rollback some services separately. This article is not about continuous delivery, microservices or service oriented architectures, but the system is expected to consist of some independent blocks with well-defined interconnections and separate release cycles. If you change some API at one side – don’t forget to provide the fallback unless all the dependencies get updated. But it should not lead to a hell of “fallback chains” – don’t forget to remove the deprecated stuff and keep the “fallback window” as short as possible (according to the release cycle).


  1. Sam Newman, “Building Microservices”, O’Reilly Media, 2015.
  2. John Allspaw, “The Art of Capacity Planning”, O’Reilly Media, 2008.
  3. Cal Henderson, “Building Scalable Web Sites”, O’Reilly Media, 2006.