II. The era of concurrency
In Part I we discussed the technical secrets of semiconductor scaling that has kept Moore's Law going to this day. We learnt about such things as channel length scaling, high-K, leakage current, finFET, and timing closure. In particular we saw that as channel length is aggressively scaled down, the dynamic power dissipation goes up as the CUBE of clock frequency.
We will now apply these semiconductor insights to understand the shift to multi-core commodity processors, and the emergence of concurrent computing on the cloud.
[Note: This article is intended for a broad technical audience, so some concepts have been simplified.]
A chess playing service
Let's move away from semiconductors for a bit. Say we want to build a chess playing service where millions of people play online against our new algorithm. Each player can choose a level of difficulty based on how well they play chess (we may also choose to let the Wookie win). To keep up the excitement we offer only blitz (lightning) games.
Since our chess algorithm is proprietary we don't want to embed it in an app or browser-side script. Instead we choose a client-server architecture where the client part handles only the cosmetics of display. This architecture is also useful to upgrade our chess program in case of any bugs. Swapping out code on our central server is simpler than updating an app on every client device. However, 'hot swapping' code on a live server - the proverbial midstream change of horses - is a different matter altogether (in fact it is one of the benefits of building systems using Erlang).
In this article we'll assume that the client-side app or html5 widget has already been built for presenting the chess board graphic, animating piece movements, and handling communications to/from a web server. Our focus will be on this web server, which in its simplest form would be a single computer connected to the Internet using a static IP address. The job of the server is two-fold: (i) field a two-way communication link to each client, and (ii) compute new moves in response to each player.
[Note: Don't try this at home. Hosting a web server on a residential connection will lead to network traffic patterns that your internet service provider (ISP) is likely to disapprove of. For this reason web servers are usually installed in special data centres, but that is truly getting ahead of the story.]
We want our chess service to be performant (play chess fast and win), but we also want it to scale (play millions of people simultaneously). Note that performance and scalability are very different things. A scalable service may be envisaged to play millions of (ordinary) players, while a performant one might be designed outwit a (single) top grandmaster. Question is to what extent can we build a service that does both.
In practice the service will be more complex than a web server, but the simplest version would consist of only two parts: (i) network sockets, and (ii) main program.
Think of the network sockets as a row of telephone operators, behind whom sits our main program - a champion chess grandmaster who doubles as the CTO of our start-up. Players from around the world call our toll-free telephone number and our operators (who are standing by) field the calls. Each operator has a chess board and a two-way chess clock, and players indicate their moves using chess rank-and-file notation. Our grandmaster prowls the room studying the boards and making moves.
The service is online, a hundred eager players have called in, the game is afoot. What is the best strategy for our grandmaster? Should she go around the room playing each board in cyclic order? Remember this is a blitz game, so it's possible that when she moves to a board that player is still thinking. So we come up with a little lightbulb at each phone cubicle - the bulb lights up only when the clock is ticking our time, so the grandmaster knows which board to skip.
Performance vs. Scale
Fine. What happens when one of the boards is tough - another grandmaster has quietly called in on a line (chess is a mind game), and since the calls are anonymous we don't know who's playing. Our grandmaster is forced to think more on this board, PLUS the seconds are ticking away on all other boards that are lit up. Should she win this game (she has a good idea who has called in) by taking a risk on the other boards? What if the grandmaster wolf pack has gathered - now there seem to be two strong boards... then three, four,.... You can see why performance and scalability are often considered orthogonal goals.
Playing simultaneous games isn't easy, even for our grandmaster CTO. On moving to each new board she has to mentally switch context - she has to refresh her mind with the position: which opening is being played, the current line of play, any immediate threats, and possible continuations. This context switch takes some time and energy, and it wears out our CTO. We want to make things easier for her, so we look for ways in which we can help. We may not play chess well, but we can at least run around carrying messages or serving refreshment.
Lets now put this in the server context. Say we had a single main program that did both jobs: not only compute the chess moves but also service the sockets. Since we don't want a waiting socket to block the main program, we decide to use multiple processor threads. Think of a thread like time sharing a processor. We create a thread pool of resources (it's hard not to visualize a sewing basket with multiple needles and threads) and assign an available thread to every new player opening a socket (think incoming phone call). Our program then walks the sockets, and when a socket is lit the processor switches to the corresponding thread. If the transfer is synchronous, the thread has to wait till the data transfer from the client is complete - so the thread is said to block. The operating system (OS) switches to another thread, recall that this context switch comes with a cost. Every new thread comes with its own memory overhead, and as the number of simultaneous players goes up we start to empty our thread pool and use up all the memory.
In this situation we decide we need a machine with a faster processor and more memory. This is roughly how web services were operated about a decade ago, the emphasis was on machines that scaled up clock frequency and memory capability.
Workloads and Commodity Processors
It's important to emphasize that in every generation of Moore's Law there have always been custom-built high-end processors and machines. So it's not correct to claim that the path toward high-performance began with commodity processors. One way to look at this is to consider something like a telephone directory enquiry (411) service. In the past - say thirty years ago - people called this service with queries and operators looked up the information on a central computer. Some manufacturers specialized in 411 computers, some others focused on airline reservations, and so on... there were a heterogeneous mix of machines tuned for each application, business was good.
Then some smart people realized there was an enormous amount of duplication in these systems. They began to offer a standard database running on a standard machine, and these databases could be customized for applications using specialized schema, or data structures, optimized for fast query operations. Soon everything started looking like a database (the way everything looks like search today) and there was good business not only in database software but also in building high-performance processor chips capable of accessing large amounts of memory (DRAM).
However not every application is suited for such an architecture. In the computing world the sophisticated term for this is workload. This term conjures up a vision of sorting of one's laundry before a washing machine, amidst a fragrant ecosystem of detergent and related materials. Clearly a home washing machine (PC/laptop) is quite different from doing one's inners in the hotel sink (smartphone), and nowhere near the scale and scope of the large medical-grade machines in a hospital (supercomputer). The 'cloud' would be something like a very large laundromat, so hang on to this imagery.
Why is workload classification important? For example, the pattern of 411 calls over time typically follows some probability distribution, with some average volume of queries and a certain pattern of lulls (or pauses) between queries. These lulls are important, on the average they give our grandmaster a chance to take a break. But there are other systems where the workload is intensely continuous - such as for example a service that processes oil exploration logs. In oil exploration it's common to take seismic "ultrasound" sonograms of the underlying sand and rock in a particular region, and huge volumes of such data are recorded on magnetic tapes (now disks) and taken to a central computer. The tapes feed vast computers that run complex signal processing algorithms, typically correlations between sets of data. Then graphics workstations are used to visualize the processed data. An oil "find" could be worth several billions of dollars per year over a decade, so the stakes are high.
Hence, the correct question to ask whenever we hear buzz-words like 'concurrent' and 'cloud' is: "For what workload?"
Commodity processors are called that for a reason - they straddle a space between PC/laptop on the one hand and high-end specialized machines on the other. The marketing centre of gravity of these processors is what is sometimes called mid-range systems. You can think of these as a line of sports car models supported by a base business model of sedate family sedans and mini-vans.
It is in the particular context of web server workloads that concurrent clouds on commodity processors began to edge out their more sophisticated counterparts, and that's stuff like our chess server. There is an impending sense that this cloud apparatus will gradually consume other types of workloads, though that is an evolving situation and the jury is still out.
The processor power wall
Now we can begin to understand what fell apart in the scaling of commodity server processors. As the clock frequency was scaled up the heat dissipation went up as the CUBE of frequency. We noted earlier that the exact scaling depends on the logic/memory composition of the processor's architectural blocks, and the details of the pieces play a crucial role in the composition of the whole. So while a CUBE law is an extremely general hand-waving statement, it conveys the essence of the situation.
Heat management in commodity machines is not sophisticated for a reason - at the mid-low end the emphasis is entirely on air-cooled machines. In fact the the retail (laptop) end of the business wants to go fanless, since consumers find fans to be clunky and noisy.
The Thermal Design Power (TDP) of a microprocessor is the heat rate that it is designed for, at a specified supply voltage and clock frequency range. The commodity server designer must provide a heatsink and an air-cooled thermal pathway to remove this heat. The motherboard firmware (BIOS) monitors several physical parameters internal to the processor such as chip spot temperatures and bus voltage drops, with a range of alarm conditions if preset limits are exceeded. Overclockers revel in changing BIOS settings to get a performance jolt from their machines, at a potential cost to reliability. Processor manufacturers will tut-tut about dire consequences, which make sense from Part I, where we saw the effort that goes into transistor scaling. But no one really knows what happens when these reliability limits are exceeded, ymmv.
For all these reasons, traditional semiconductor scaling hit a "power wall" with frequency scaling.
Evented web servers and Operating systems
Coming back to web servers, some clever people figured out how to prevent context switches between processor threads. They said let's run only a single thread, and farm out the job of checking on sockets to the operating system (OS).
An OS is itself a program that co-ordinates the execution of other programs, and co-ordinates access to hardware such as the hard disk, network interface, keyboard, mouse etc. An OS also performs an important security function by providing restricted user accounts, as different from a privileged accounts, such as root, that run with special permissions. When a computer 'boots' the OS is 'bootstrapped' from the hard disk using a 'boot loader' (such as the GNU bootloader GRUB).
From a networking standpoint the OS is the set of phone operators and messengers scurrying around with chits of paper. Modern operating systems offer evented network sockets e.g; epoll (Linux), kqueue (FreeBSD), IOCP (Windows, AIX, Solaris). Think of an evented socket as a set of instructions to the gardener who will come when we are away from home. Rather than put a lightbulb on each phone, we set up a callback which is a set of instructions to be executed in the main (single) thread upon arrival of data. The OS buffers a transfer till it is complete (though this can be done in chunks), so this model is termed asynchronous.
Multiple asynchronous I/O calls run in an event loop on the single server thread. If an I/O blocks (e.g; reading from a network socket, or even from a file on hard disk) the event loop simply moves on to the next call on a queue. The asynchronous evented approach frees the main thread from the burden of walking sockets, context switches between threads are eliminated, and each new player can be accomodated with a small amount of memory to only hold state.
In particular, note the concurrent nature of the server. It appears to be handling all the sockets (players) at the same time, but only one of them is being serviced at a microscopic instance of time. In a parallel server we could have different machines handling different players at the same microscopic instance of time. This difference between concurrency and parallelism is clearly articulated by the Unix guru Rob Pike in this talk.
We've been flipping between web servers and processors for a good reason. Let's now see how we can make our chess service both scalable and performant. Recall that scalability is about handling more players simultaneously, so we can do that by running several evented single-threaded servers on different processor cores. If we did that we would no longer need to achieve scale by increasing the single-core clock frequency. We may even be able to run each core at a slightly smaller clock frequency!
What happens to other resources, like cache and memory? It is common for each processor core to have its own L1, or even L2, but share caches like L3. Some processor chips even offer an L4 last layer cache (LLC). The shared cache in a multicore processor is usually connected by a bus. Regarding memory (DRAM), the key point is that the OS flavour most commonly used for multicore servers is what is known as an SMP (Symmetric MultiProcessor). An SMP presents a uniform view of the DRAM for all the cores, as different from a NUMA (Non-Uniform Memory Access). For our chess server we may ask why an SMP is preferable, but suppose we start allowing players to play each other - then we would like to share state as stored in DRAM (note that an SMP multi-core system may exhibit contention of the memory controller or the memory bus). OTOH, a NUMA OS may be more suited for high performance computing (HPC) applications, such as the seismic log data processing that we talked about earlier.
Further scale out by distribution
A multi-core concurrent server allow us to scale over players, how do we get even more scaling out? The obvious way is to run more server boxes, with a load balancer in front. As the name suggests the job of a load balancer is to distribute incoming requests evenly among the available servers. Having a single load balancer is clearly a single point of failure, so they usually come in redundant configurations and could even be distributed geographically.
Load balancers also serve other important functions, such as protection against denial of service (DoS) attacks, or even distributed DoS (DDoS). A DoS attack is when innumerable malicious callers call the chess playing service and do nothing more than breathe heavily into the line. By the time operators hang up and move on, genuine callers may get an engaged (busy) signal. In such cases a load balancer could act as a first line of operators - "How may I direct your call?" Along with more incoming lines (more data centre bandwidth) a load balancer can ensure that the chess-game operators are saved the trouble of fielding the hoax calls.
Distributed systems and reliability
Again, if our chess game was multi-player i.e; allowed two callers to play each other, then we could have a situation where their connections are fielded by two evented servers running on different cores. How do they communicate with each other?
One way is to 'migrate' the state one of the players so that both are served by the same core, then they can share memory. This works, but is not suited for a massive-multi-player game, or even chat services where multiple simultaneous conversations may be ongoing. A more general solution is to have a separate distributed data store that saves state. Why distributed versus a single powerful database server?
One reason is the same old power wall for processor chips. Another good reason is reliability, or more accurately, survivability. What happens if one of our evented web servers crash? This failure could be due to a serious software exception, it could also be the physical failure of a server (say a capacitor on the motherboard power rail burned out). Sometimes a whole data centre may suffer a network outage. What happens to the chess games in progress? We could send the players a cryptic message holding them responsible: 'You have made an illegal move. Your session will now restart'. That wouldn't help the popularity of our game very much, so we need to do better.
A distributed database is a logically continuous data store spread over a cluster of several networked nodes (servers). The servers can even be distributed geographically, subject to a minimum data transport latency set by the speed of light. When some data is to be stored, it's saved in multiple nodes. If a node fails, the cluster heals itself around the failed node, and data stored on that node is duplicated onto other nodes. Riak is a good example of a distributed database, and we should mention in passing that it is written in Erlang.
Clearly, in a fast-paced game we don't want to wait until all the distributed servers acknowledge a 'write' of data. At the same time we don't want to 'fire and forget' (only spaceship gamers can do that). Hence it's common to have a fast cache that acts as the first-level store. The cache is periodically copied onto the distributed store within a data centre, and then duplicated across multiple geographical locations.
So at least the game data is stored in a redundant way, and a game can be resumed from where it crashed. But we still need to know when a game crashes. This means we need to supervise processes, keep an eye on them. Now imagine doing this for millions of simultaneous games with the chess clock running. We would also like to be able to hot-swap code to fix bugs, for example. This is the kind of scenario where Erlang excels.
That doesn't mean a great chess playing service cannot be written in C or Java or your favourite language. Hardly that. In Erlang there is a piece of folklore called Virding's Law, named after one of the creators of Erlang: in effect the law states that by the time one writes a concurrent, distributed system of interacting supervised processes with soft-real-time behaviour and hot swap capability... one has rewritten Erlang.
That's the point - why take the trouble when someone has put the stuff together for you? Erlang comes with everything, batteries included.
Now we have a basic idea of how to deal with scale for our service i.e; field a million simultaneous chess players. That leaves us with performance of our service, including playing the odd grandmaster who has called in.
Performance is a domain-specific topic, a chess-playing algorithm is very different from the signal processing ones used on seismic logs. A champion grandmaster is no guarantee for writing a strong chess-playing program. You could say that the technical role of our CTO is to in fact test our system: "You still haven't fixed the bug in the knight fork!", apart from her popularity in the chess world which impressed our investors tremendously.
So we'll just have to go hire people who specialize in chess playing algorithms, and boil that stuff down into code. But where do we put this code, how does it fit into our system architecture? It may be tempting to put the code in the callbacks. But that would be a poor choice! Remember that our evented server runs on a single thread, so any chess moves will steal hearbeats from the event loop and block other callbacks.
One solution is to run worker threads, which can block and do nothing more than think chess. These may even be written in C, with the depth of move search being optimized for processor cache sizes. We can integrate these worker threads into our main service using distributed communication protocols, since the worker thread may run on another physical core or machine. In Erlang we can run such programs as a Native Implemented Function (NIF).
Virtualization, Clouds, and Bare metal
Suppose we had only one server and two people wanted to use it simultaneously. Each is particular about having their own OS, separate file system, different IP address, root access, and being able to reboot whenever they want. Virtualization, specifically hardware virtualization, is a way to achieve this. In its simplest form it's about having a master program that runs (on the same hardware) what appears to be two separate machines. If an operating system is a program running on a processor, why not run two operating systems? Or more?
Now, at this point you may say that virtualization sounds like selling the same bridge to two people. Yes, it sounds that way, but the point is whether both people require the same performance from the bridge. In fact, consider a bridge with two traffic lanes - one going each way. That cuts the performance of the bridge in half, so why do we do that? Clearly for traffic management, otherwise it's going to be chaos. So that's the key point here - virtualization is not about the highest performance, it's about manageability at the highest required performance. We could, or would, build two bridges if we wanted performance in each direction - but the point is that we may not need that today.
In fact the trend today is not to even run two fully separate virtualized OS instances, but to share OS resources. This is the basis of what are called containerized clouds - you might have heard of docker.
What happens if we do have double the traffic, next year? Now you can see how this cloud business comes about. When we get to the point where we have more traffic, we go to our online panel and simply dial up another bridge. It's the scale-as-you-need with pay-as-you-go that makes the cloud such an attractive proposition.
The best part about cloud is that all the things we talked about - load balancers, distributed stores, worker queues,... - all these are available as services on clouds. You don't have to worry about installing, running, and maintaining these services.
But what happens when we need performance all out? Say there's an emergency on the far side and lots of traffic from there - we simply make the whole bridge one-way. This is called a 'bare metal' server. That may again sound like a fancy term for 'we're not selling this bridge to two people', but it's not.
What's happening nowadays is that people are asking what role does the OS play, can we have some of traditional OS functions handled by the distributed run-time, and let the OS manage the lowest level kernel functions. In fact, there is a corollary to Virding's Law (which we saw earlier) that says "Erlang has an OS-feel to it." So you could well think of a distributed virtualized Erlang runtime on a cluster of bare-metal servers as a separate OS, and operate efficiently within that sandbox.
This combination of virtualized hardware with integrated pay-as-you-go scalable services and the option to go bare metal in specific portions - that's what's called cloud. And it all came about through a combination of more-or-less unrelated developments - all propelled by the secrets of semiconductor scaling.
Thank you for reading so far.