As developers, we are always searching for tools that will get the job done quickly, efficiently and elegantly. But when it comes to programming languages, each has its pros and cons. That said, for building high-performance services, there is one language that shines brighter than the rest. At Gett, we decided to use Go exclusively for all new development, only a few months after trying it for the first time. Here’s the story of why we came to that decision.

The C10K Problem

First, to understand the problem, we’ll take a quick look at the history of handling server requests, and the challenges that come with scaling to large numbers of connections.
Around the year 2000, the term C10K was coined by Dan Kegel, posing the problem of concurrently handling 10,000 connections on a single server. Until then, the default model was to create a new thread for each connection. For example, in an Apache web server, when a client connects, the server creates a thread which handles all the communication with that client. From a programming perspective, this is a very straightforward approach and easy to implement. So what are its drawbacks?

Problems with thread per connection

  • CPU – Thread switching is very expensive. The OS may switch between threads at any time, and needs to save the entire thread state so that it can restore it when it switches back. For 10,000 threads, the overhead of switching takes a significant amount of CPU resources.
  • Memory – Each thread needs a stack for all its local variables, function parameters and return values, which is large enough to prevent a stack overflow. If we take a standard stack size of 1MB, for example, we will be dealing with 10GB of memory for 10K connections.
  • I/O – In many situations today, most of the server activity is I/O, like calling a database server or sending a request to another microservice. Therefore, at any given time, most of the threads will be idle, waiting for some I/O to return.

The advent of event loops

To deal with these issues, technologies like Nginx and Node.js implemented a paradigm known as the event loop. These servers use a single thread that runs through a loop of CPU-bound code, using callbacks to manage I/O completion. In these cases, I/O is implemented using the operating system’s asynchronous APIs. This method deals with all the drawbacks previously discussed:

  • CPU – The single-thread approach minimizes context switching overhead.
  • Memory – One thread uses one stack, eliminating all the memory overhead from before.
  • I/O – All of the I/O is done asynchronously, and collected with callbacks, so there is no idle time of the main thread.
The Node.js event loop

At Gett, when we encountered scaling issues in our driver location service, we looked for an alternative. Because of the above advantages, Node.js was selected, and it served us well. Unfortunately, while event loops solved the main drawbacks of thread-per-connection servers, they introduced some new issues:

  • CPU – In the case of CPU intensive code, the event loop is blocked from proceeding. In other words, if one client has a large workload, all the others must wait.
  • Multiple cores – Only one thread means that only a single core is used, so in the case of a multi-core CPU, the server will not utilize all available resources, unless additional instances are used.
  • Code – Handling callbacks is very complicated on a programming level. Node.js and other event loop servers take a technical detail of asynchronous event handling from the OS, and place it in the hands of the developers. This requires writing code differently. Of course, with experience, programmers can get quite good at this, but it certainly isn’t as simple and straightforward as sequential code.

So far, we’ve seen two groups of solutions: multi-threading and event loops. While each solves the problems of the other, they each have their own drawbacks. A thread per request breaks down when there are too many requests, while event loops don’t utilize all the available resources and are very complicated to code, debug and maintain.

But, what if there was a language that could scale well, make efficient use of resources and use simple, sequential coding? This is where Go enters the picture. Here’s how Go attempts to get the best of both worlds.



Go dynamic stack size

In Go, each request from the server is handled using a Goroutine. These are similar to threads, only the language itself handles them, not the operating system. We’ll go into more detail on this in a minute, but first let’s see how Go tackles the problem of memory consumption. Each thread or Goroutine needs its own stack. We saw previously that for 10K threads using 1MB each we need 10GB, which is costly. Go’s approach is to allocate a very small stack by default for each routine, and then to increase and decrease the stack size dynamically according to the actual needs of the routine. The Go compiler analyzes the code and identifies how much stack space each function needs. If there is a chance that a function will need more stack space, the compiler inserts a runtime evaluation of the stack pointer to see if it is necessary to increase or decrease the stack size in real time. In this way, Go can start with a very small default stack size of only 2KB. Thus, for 10K server requests, we start out only using 20MB of memory for stack space instead of 10GB using threads with a standard fixed stack size of 1MB.

The dynamic stack is like a flexible, adaptable queue, instead of a rigid railing that can’t be changed

The Go scheduler

The next problem we had with threads was the high overhead of context switching. Threads are scheduled by the OS preemptively, meaning that a thread can be switched out at any time. Goroutines, by contrast, are scheduled by the Go runtime, and are switched cooperatively, meaning that the scheduler is aware of the code and only switches at specific points. These points are the places in the code most suitable for switching, and include I/O, channel sends and receives, sleeps, and runtime stack evaluation. By being aware of the code and only switching at these specific points, the Go scheduler needs to save as few as three registers to store and restore a switched Goroutine. By comparison, when the OS performs thread switching, because it is unaware of the code, it must store all the CPU registers, which is what makes thread switching so expensive. By using cooperative scheduling, the Go scheduler significantly reduces the overhead of context switching, and using a Goroutine per server request is much more efficient than the thread-per-request model.
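As a small illustration of those cooperative switch points, the sketch below runs a producer Goroutine that hands control back and forth with the main Goroutine at channel sends and receives, two of the switch points mentioned above:

```go
package main

import "fmt"

func main() {
	ch := make(chan int)

	// The producer Goroutine yields to the scheduler at each channel send.
	go func() {
		for i := 1; i <= 3; i++ {
			ch <- i
		}
		close(ch)
	}()

	// The main Goroutine yields at each channel receive.
	sum := 0
	for v := range ch {
		sum += v
	}
	fmt.Println(sum) // prints 6
}
```

No locks or callbacks are needed; the scheduler interleaves the two Goroutines at the channel operations.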

Because Goroutines ultimately execute on OS threads, it is important not to block those threads, otherwise the Go runtime has no way to switch the thread to other Goroutines. This means that every operation that might block needs to be implemented either using an asynchronous API, or using native Go code. To make sure mutexes do not block the thread, they are implemented in “user space” instead of using kernel-based mutexes. When a Goroutine waits on a mutex, the scheduler parks that routine and wakes it when the mutex is free. If a Goroutine makes a blocking system call, the Go scheduler creates a new thread to keep running other Goroutines, and reschedules the first Goroutine when the call returns. For network I/O, Go uses a subsystem called the Netpoller. The Netpoller runs on a single thread, similar to how the Node.js event loop is implemented, and uses the operating system’s native asynchronous network APIs. The difference is that the developer does not have to handle the callbacks. The code can be written sequentially, and the Netpoller takes care of all the callbacks behind the scenes. This way, Go enables the efficiency of asynchronous network I/O without the complexity of handling callbacks in the code.

Handling I/O is similar in Node.js and Go, but Go’s Netpoller frees the developer from handling asynchronous callbacks

As you can see, Go manages to get the best of both worlds. It allows a model very similar to thread-per-request, but enables scaling by minimizing stack overhead and context-switching overhead, and by scheduling tasks efficiently during I/O idle time. It prevents CPU blocking, uses all the available cores and offers a very simple coding model that is straightforward and easy to learn. You can see part of our learning process in a session where we discussed Object Oriented Programming in Go. After we replaced Node.js with Go for our driver location service, we achieved an average response time of only 2 milliseconds while handling more than 200K requests per minute! That explains our decision to use Go as our main programming language for all future development.

Learn more

If you want to see my full presentation on this topic from GoWayFest 2017, check it out here.

If you share the passion for efficient, productive programming languages, maybe there’s an open position for you here at Gett. See if something matches your skills on our careers page.