One thing I've learned about writing blog posts is that you should start with something personal to grab the reader's attention. When I started working at Gett as a system architect half a year ago, we had about eight microservices in production. Within half a year we developed at least four more, and we have plans for twice that number. In this article I'd like to look back at the process, the troubles we had and how we overcame them, and a bit about the present state of things.
Looking at Gett's architecture diagram a year ago, we would have seen something like this:
That's the kind of big monolithic system a lot of you are familiar with. The main modules of the system are marked in green.
Gett operates in four regions, which multiplies the complexity.
This was a good decision five years ago, when the project was just getting started.
But after Gett grew from four developers to thirty in less than a year, we felt that we could do better by improving our architecture.
But before we jump to solutions, let's discuss in detail the problems we were having.
First: DB load and locks. Our monolith was doing everything: managing car classes, drivers, users, reports, dispatching. That became a critical problem, especially at peak hours. We saw that even if we kept adding more horsepower to our DB server, in the end it wouldn't be enough. That particular problem could actually be solved without moving to microservices, though, by smart sharding, for example.
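To illustrate, here's what such sharding could look like, a minimal sketch in Go (not Gett's actual implementation, and the shard names are made up): each record is routed to one of N database shards by hashing a stable key, such as a driver ID, so load spreads across servers instead of piling onto one.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor picks one of n database shards for a given key (e.g. a
// driver ID). The same key always maps to the same shard, so reads
// and writes for one driver stay on one server.
func shardFor(key string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(n))
}

func main() {
	// Hypothetical shard names, for the example only.
	shards := []string{"db-0", "db-1", "db-2", "db-3"}
	for _, id := range []string{"driver:1042", "driver:7"} {
		fmt.Printf("%s -> %s\n", id, shards[shardFor(id, len(shards))])
	}
}
```

The trade-off, of course, is that cross-shard queries and reports become harder, which is part of why sharding alone didn't solve our other problems.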
So let’s move to our second, and more serious problem.
The number of our programmers grew from five to thirty in a year. Not all of them could get familiar with every part of the system, and sprint after sprint each team was breaking other teams' features. That's very hard to avoid in a monolithic system, even one that is well divided into modules. You still have shared parts, and when they change, something may break, no matter how thoroughly you test.
Speaking of tests, they were a pain in themselves. The full suite could run for up to 40 minutes, and when a test failed, developers sometimes couldn't even understand how their change had affected other parts. It was time for a change.
We started by writing the Payment Module as a separate service, using the same stack as our monolith: RoR and MySQL.
After that came the Location Module, which stores the locations of our drivers. Since it had to endure very heavy loads, we decided that NodeJS and MongoDB would suit it.
Then we moved on to the Areas module, which defines the polygons on the global map where Gett provides service. Since we were familiar and comfortable with Rails, we developed this service with the same technology we used for our monolith. But for the database, this time we chose PostgreSQL, which had better support for geospatial features. And here we see the first advantage of microservices: you don't have to replace everything in order to adopt a new technology.
We developed the Area Service and the Price Lists Service, but then noticed that they were so tightly coupled that it made more sense to unify them into a single service.
The opposite happened with the Charging Service: its pricing module grew so large that we decided to split it out into a separate Pricing Service.
After a few weeks we noticed some memory leaks in our NodeJS implementation of the Location Service, and they proved very tricky to solve. So within a month or two we completely rewrote it in Go.
As you can see, the pattern repeats itself. With microservices, it's very easy to adapt.
Although our monolithic application didn’t disappear completely, it certainly became thinner and easier to maintain. What’s even more important, most of the new features are developed as part of existing services or added as new services.
Everything fails. Instead of having one troublesome DB per region, you now have 24.
If a service is slow, you're not guaranteed to get your critical data in time. And if the service is down, you may not get it at all. What do you do then?
Let’s discuss one specific problem we had:
See that Media Service at the bottom right? At least four other services access it for translation strings or images they require to operate correctly.
Usually, Media Service returns a response in 6ms. But one unlucky day we had a network latency problem, and the response time grew to 30ms. All hell broke loose.
The obvious result was that the Ordering, Identity, Pricing and Availability service responses became slower too. But that also affected the Charging Service, which was accessing the Pricing Service, and that in turn slowed down our monolith application. No service for our clients. Ouch.
What we did to solve this issue in the long term can be divided into a few phases.
Phase 1: all services that were accessing Media Service started doing so with bulk requests instead of a request per resource.
Phase 2: cache Media Service responses, and access the service with a very short timeout. When a timeout occurs, the service falls back to results from the cache.
Phase 3: implement a Circuit Breaker mechanism, which stops all requests to Media Service after numerous timeouts and resumes working with it after a certain period of time.
After this solution proved itself, we also implemented it in some other critical services.
When discussing Phase 2, I mentioned that we cache responses from most of the services. Actually, we even do it twice, in two different caches.
One is a short-term cache, used for performance reasons and to reduce load on the remote service. We usually invalidate its entries within seconds.
The other is a long-term cache, which holds the same data for hours. We start accessing it only when the Circuit Breaker is triggered, so that we can continue functioning, even if at a degraded level.
Should you start with micro services architecture?
Probably not. It adds a lot of complexity up front and comes with a lot of considerations. But when you start feeling that your developers are stepping on each other's toes most of the time, microservices are a well-proven solution.
You should consider adding caches and fallbacks at every integration point between microservices, and at the critical points you'll absolutely have to.