Semiconductor scaling and concurrent clouds - Part I

About a decade back there was a major change in the way commodity microprocessors were designed. Until then Moore's Law had meant baking ever bigger and faster processor chips, with the focus on single-threaded performance - internal clock frequencies were expected to exceed 6 GHz within a few years. Around 2004 the design emphasis changed to multi-core, and clock frequencies actually dropped. The shift to multi-core coincided with two other streams - the rise of Linux, and a renewed Web. This confluence of mostly unrelated developments paved the way for today's prevalent theme of concurrent computing on the cloud.

To properly understand these developments we need to start with semiconductor scaling. Something happened to Moore's Law around 2004, and we need to understand that first. We'll then apply those principles to get a perspective on what's happening today, in 2014. To that end this article is divided into two parts:

Part I: The secrets of Moore's Law

Part II: The era of concurrency

[Note: This article is intended for a broad technical audience, so some concepts have been simplified.]

I. The secrets of Moore's Law

We routinely talk about Moore's Law, or the doubling of transistor count every 18 months, i.e. with every new semiconductor process node. Moore's Law has become an axiom in the industry, but an enormous amount of work goes into keeping it that way. In fact Moore's Law may have become a treadmill for semiconductor manufacturers. If they succeed in making it work in the present node then there is enormous cost pressure to move to the next node. And if there is any delay in the 18-month rhythm, the muttering starts.
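To get a feel for the numbers, here is the compounding arithmetic behind that 18-month doubling - a quick sketch in Python, where the doubling period is the only input taken from the text:

```python
# Compounding arithmetic behind Moore's Law as stated above:
# transistor count doubles every 18 months.
def transistor_multiple(years, doubling_months=18):
    """How many times the transistor count has multiplied after `years`."""
    return 2 ** (years * 12 / doubling_months)

print(f"After 10 years: ~{transistor_multiple(10):.0f}x the transistors")
```

A decade on the treadmill buys roughly a hundredfold increase in transistor count - which is why missing even one node hurts so much.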

To understand how Moore's Law is kept going we first need to understand how a transistor works.

Taps, gates, and processors

A transistor is like a water tap. It regulates the flow of electrons (water) from a battery (an overhead tank). The water flows from a source to a drain, under the supervision of a gate valve. In a tap the gate valve has a handle; in a transistor it is a metallic contact. If the gate valve is closed the tap is OFF. When the gate valve is opened the tap is ON. If the washer in the gate valve is porous then there is LEAKAGE. The region between source and drain is called the CHANNEL, and the physical distance between the two is called the CHANNEL LENGTH.


Once we have a tap we can build a logic circuit, also called a gate (the use of the same term 'gate' for both the valve handle and the circuit can be confusing, but is usually clear from context). The circuit below shows an AND gate - both taps must be turned on for water to flow.


Or we can have arrays of tiny water tanks, each of which we can fill or empty through a tiny tap. The water level can be read using the same tap - we turn it on, use a siphon to suck out the water (if any) from the tiny tank and measure the volume using a graduated cylinder. If the volume is greater than some pre-set mark, we call that a 1, else 0. The graduated cylinder is known in the industry as a sense amplifier, and the tiny tap is called a pass transistor. What we have is called a semiconductor memory. It is a Random Access Memory (RAM) because we can access any tiny tank at random, without going through them in order. In particular this type of memory is called a dynamic RAM (DRAM), because the tiny taps tend to leak slightly and the water in the associated tank escapes, requiring a periodic refresh. This is how the DRAM in your laptop or smartphone works.


We need local memory on the microprocessor to build a cache. A cache is like a scratchpad, where a microprocessor stores recently used values. Think of it like a large register. Without a cache, each time a value is needed the processor would have to fetch it from the DRAM (main memory) outside the processor chip - at a cost of about 50 nanosec per fetch. For a 1GHz clock the period is 1 nanosec, so a single fetch could stall the processor for about 50 clock cycles - we don't want the processor to idle waiting for data from the DRAM. Of course, the data in the DRAM can be changed by some other thread, so a value in a cache can grow stale. Modern processors and compilers are optimized for each other so that the processor can also prefetch data, thus beating the 50ns tax.
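That arithmetic can be sketched in a few lines of Python. The 50 nanosec fetch and 1GHz clock come from the discussion above; the cache hit time and hit rate are made-up illustrative numbers:

```python
# Back-of-the-envelope cache/DRAM stall arithmetic, using the
# figures from the text: ~50 ns per DRAM fetch, 1 GHz clock.
CLOCK_HZ = 1e9            # 1 GHz -> 1 ns clock period
DRAM_FETCH_S = 50e-9      # ~50 ns per main-memory fetch

cycle_s = 1.0 / CLOCK_HZ
stall_cycles = DRAM_FETCH_S / cycle_s
print(f"Cycles lost per DRAM fetch: {stall_cycles:.0f}")  # -> 50

# With a cache: assume (hypothetically) a 2-cycle cache hit and a
# 95% hit rate; the average memory access time drops sharply.
HIT_CYCLES, HIT_RATE = 2, 0.95
avg_cycles = HIT_RATE * HIT_CYCLES + (1 - HIT_RATE) * stall_cycles
print(f"Average cycles per access with cache: {avg_cycles:.1f}")  # -> 4.4
```

Even with those made-up cache numbers, the average cost per access falls by an order of magnitude - that's the whole point of the scratchpad.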

In today's processor chips, cache memory is usually built using transistors. There is no tiny water tank, the water is stored within a fancy circuit built only with taps, called a static RAM (SRAM) cell. Cache memories can also be built using embedded DRAM (eDRAM), where the DRAM is integrated with the transistor logic process.

Once we have logic and memory, we can build microprocessors.

Insects at a window

I was once shining a torchlight (flashlight) through a glass window, and an insect outside was attracted to the brightness. It couldn't get into the room, but it was held close to the glass by the attraction of the light. That's how the gate works in a transistor: a positive voltage on the gate attracts electrons into the channel, turning the transistor ON. Once in the channel these electrons also feel the sideways electrical force of the battery, so a lateral current flows from source to drain. However, just like the insect couldn't enter the room, no current flows into the gate because there's a glass plate in between. Remember that the control is exerted vertically, while the current flow is lateral - this is just like the water tap.

To turn OFF the current it usually suffices to set the gate voltage to zero (switch off the light and the insect flies away).

The gate controls the conduction state of the channel via a capacitive effect, with the thin glass plate acting as the dielectric. From basic physics, the gate capacitance scales directly as the area of the plate and inversely as the thickness of the plate. In a silicon transistor the glass is readily available - glass is nothing but silicon oxide. Because the transistor action is based on the gate's electric field effect through the oxide, such a device is called a Field-Effect Transistor (FET).


Transistor density

To make Moore's Law work we need to shrink the dimensions of the transistor. About twenty years ago the transistor channel length was 1 micrometer, or about 50 times thinner than a strand of human hair. In an advanced semiconductor node today the channel length could be 20 nanometers, or about 80 silicon atoms laid end to end! So we're talking extremely small distances here, comparable to inter-atomic distances. Looking forward, as the channel lengths shrink further we enter the nanotechnology regime. Depending on who you talk to, Moore's Law will either slow down - or get a new lease of life.


But there's one problem with transistor size scaling. As we reduce the transistor size the area of the gate decreases, so the capacitance decreases as well. This means the gate becomes less effective in attracting electrons to the channel. To maintain the gate's effectiveness as a control valve we also need to commensurately reduce the thickness of the glass plate. This is called gate oxide scaling.

Leakage & Static Power

Gate leakage
As transistor sizes became smaller, eventually the oxide became so thin that a new phenomenon raised its head - quantum mechanical tunnelling. Observe that the gate voltage stress (water pressure) is spread across the thickness of the oxide. With gate oxide scaling there are fewer atoms stacked end-to-end in the vertical dimension, so each atom experiences more of the stress. The electron clouds of the atoms feel the voltage stress (electrons being charged particles) and the clouds begin to elongate. The clouds from neighbouring atoms start overlapping - and a leakage current flows through the gate. This leakage current is strongly dependent on gate voltage, varying roughly as exp(-1/V), so a doubling of the voltage could lead to a 100X change in current.
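A toy calculation makes that exponential sensitivity concrete. The constant B below is purely illustrative - chosen so that doubling the voltage gives roughly a 100X change - and is not a physical parameter of any real oxide:

```python
import math

# Toy illustration of the exp(-B/V) form of the gate tunnelling
# current described above. B is an arbitrary illustrative constant,
# NOT a physical value for any real oxide.
B = 9.2  # assumed constant (volts), for illustration only

def leakage(v):
    """Relative tunnelling current, I ~ exp(-B/V)."""
    return math.exp(-B / v)

ratio = leakage(2.0) / leakage(1.0)  # double the voltage
print(f"Current ratio on doubling V: ~{ratio:.0f}x")
```

The point is the shape of the curve, not the numbers: a modest change in voltage swings the tunnelling current by orders of magnitude.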


Putting tunnelling to good use - Flash memory
This leakage path can divert channel current, so it's a nuisance. However, the same nuisance behaviour is also put to good use in making Flash memories. In a Flash memory cell there are two gates embedded in oxide - one gate is connected to a wire while the other is suspended within the oxide without any connections. Applying a voltage to the first gate causes an increased overlap of atomic clouds and a tunnelling current flows to the second gate, charging it. That is a logic ONE. When the gate voltage is removed the tunnelling stops and the inner gate retains its charge - silicon oxide is that good an insulator. It is a non-volatile memory since the charge is retained even after the battery is removed. To erase the logic ONE a reverse voltage is applied to *discharge* the inner gate, again by tunnelling. Doing this write-erase a few million times can rupture the oxide's insulating strength, so that's what limits a Flash memory's lifetime.


Now back to transistor gate leakage - the only effective way to drop this leakage is to increase the thickness of the oxide. This increases the number of atoms, so the voltage stress per atom is less, but if we do that the capacitance drops (gate becomes less effective). This problem was eventually solved by replacing the glass oxide with a new set of insulator materials called "high-K". In these materials there is additional electric polarization of molecules which increases the gate capacitance, so we can get lower leakage without giving up capacitance.


However, compared with silicon oxide, these new materials posed new manufacturing and reliability challenges. Enormous amounts of work go into solving such problems, but somehow the markets seem to discount these efforts under the banner of "Moore's Law".

Source-Drain leakage
In the OFF state there would ideally be no leakage current between source and drain (no dripping of the tap), but in practice there is. This leakage current drains your battery (empties the water tank) even when transistors are OFF.

As we reduce the channel length to stay on Moore's Law, we find that the leakage current goes up exponentially. This behaviour superficially resembles the quantum tunnelling seen in the gate, but it can be understood using classical physics. We won't worry about the details; we'll just say that this appears to be the reality of life - the stuff we want scales linearly, like X, while the stuff we don't want scales as exp(X). A vague way to understand this is to consider that for every lottery winner there are lots and lots more losers. Scientists call this entropy, but the ancients knew it too - they called it fate.

The leakage current is a sign that as the channel length decreases the gate loses control over the channel. Over billions of transistors the leakage also generates heat. Eventually the power dissipation of OFF transistors can become a sizeable fraction of the total power dissipated by a processor chip.

We would like the gate to exert stronger control over the channel and tame this leakage. This is the rationale behind the fin Field-Effect Transistor (finFET); you'll sometimes see this device called a tri-gate or 3D transistor. In a finFET, the gate (including the high-K dielectric) wraps around the channel on three sides - hence 'tri-gate'. You can see why it's called a fin: the channel region sticks out like the fin of a fish. The 3D nature is also apparent, because the transistor now sticks out in the third dimension, whereas previously it was flat (planar).


The increased gate overlap of the channel gives better gate control, which results in the tap turning well and truly OFF, thereby reducing the leakage current. For this reason several transistor technologies at the 20nm node and below tend to be based on finFETs.

[There are planar transistor technologies which reduce leakage current without resorting to finFETs, but those are beyond the scope of this discussion.]

Switching Speed

The final piece we need is switching speed. As the transistor channel length decreases with Moore's Law, there is every reason to believe that the transistor switching speed will increase. Intuitively, electrons have to travel a smaller distance while in transit from the source to drain, so that means a smaller transit time. However there are other limitations on the switching speed. Let's see what these are.

Electrical inertia and overclocking
Apart from transit time there is an additional electrical inertia in a transistor, due to the gate capacitance of the next logic gate. We say that the capacitance of the next stage loads the present stage. This capacitive load also includes the fine metal wires used to connect the transistors to form logic circuits. As the logic gate switches state in response to the clock frequency and inputs from previous logic stages, the capacitive load charges and discharges.

The rate at which this charge/discharge happens is limited by the current flowing through the transistor. Imagine filling a balloon with water through a drinking straw as opposed to a garden hose. The drinking straw represents Moore's Law scaling - a smaller transistor passes less current. The only way to fill the balloon faster is to increase the pressure of the water. And we know that in electrical circuits the equivalent of water pressure is the supply (battery) voltage.
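We can sketch this charge/discharge argument as a toy model. The textbook square-law drive current, the constants, and the logic depth per clock cycle are all illustrative assumptions here, not real process parameters:

```python
# Toy gate-delay model for the charge/discharge argument above.
# Assumptions (not from the text): drive current follows the classic
# square law I = beta * (V - Vt)^2, gate delay = C*V / I, and one
# clock period spans a fixed number of gate delays. All constants
# are illustrative, not real process parameters.
C_LOAD = 2e-15         # load capacitance (farads), assumed
BETA = 5e-4            # transconductance parameter, assumed
VT = 0.3               # threshold voltage (volts), assumed
DELAYS_PER_CYCLE = 30  # logic depth per clock period, assumed

def max_clock_hz(v):
    i_on = BETA * (v - VT) ** 2     # drive current at supply v
    gate_delay = C_LOAD * v / i_on  # time to charge the next gate's load
    return 1.0 / (DELAYS_PER_CYCLE * gate_delay)

for v in (0.9, 1.0, 1.1, 1.2):
    print(f"V = {v:.1f} V -> f_max ~ {max_clock_hz(v) / 1e9:.2f} GHz")
```

Raising V increases the drive current faster than the extra charge that must be moved, so the achievable clock frequency climbs with the supply voltage.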


This explains what overclockers have known intuitively for a long time - the clock frequency of a microprocessor can be hiked by increasing the supply voltage. The relationship is approximately linear - more supply, more clock. But this can't be done forever! Remember our old friend, gate oxide tunnelling? As the supply (and hence gate) voltage increases, eventually even high-K insulators are subject to quantum tunnelling and the gate oxide begins to rupture. This is why overclocking will void whatever warranty your chip may come with - the manufacturer cannot guarantee that the chip will continue to perform reliably.

Timing closure slack
All of us have had the experience of missing a connecting flight because the incoming flight was delayed. I remember my flight from A to B was once delayed by snow on the runway at A, and when I eventually reached the gate of the next flight at B, the aircraft was just starting to pull away from the gate. Seasoned travellers know to budget some slack time between flights to cover the possibility of a miss. Or they take a 'red-eye' non-stop flight.


It's the same scenario in a large processor chip. Multiple logic blocks and pipelines need to come together and hand off signals in a 'timely' manner. The wiring used to connect the blocks also adds to the delay. Hence the critical design phase of a large processor chip is called timing closure, when extra delays are deliberately added to the faster logic paths to get them in sync with the slower paths. This exercise needs to be repeated for the whole chip, and needs multiple passes. On top of this there are statistical variations in delay through multiple logic paths, caused by manufacturing process variations.

What happens if timing cannot be closed on a design? Then there is no alternative but to drop the clock frequency (increase the pulse period), or increase the supply voltage (reducing gate delay). But that's looking at things backwards - what actually happens is that the timing requirements are provided right when the transistor itself is being designed. The system architects run models and provide their inputs, the logic designers give theirs, and only then does the semiconductor engineer start to design the transistor. It's a complex process that not only requires theoretical knowledge, but also the tribal knowledge that comes from being involved with successive generations of process nodes.

An important area of research today is asynchronous logic circuits, where there is no global clock. Transfer of data chunks between processor blocks happens asynchronously - kind of similar to networking protocols. This type of logic may finally allow Moore's Law to break through the power wall.

Dynamic power and frequency scaling

We already looked at power dissipation due to leakage current. Now we consider dynamic power dissipation due to logic switching. The basic idea here is that logic circuits are very inefficient heat engines. They do virtually zero useful work (in the thermodynamic sense). To change a logic gate from ONE to ZERO (or vice versa) we have to move energy by charging or discharging a capacitor, and so far we've been dumping that energy as heat in the transistor. Could we be reusing that energy? We are still several orders of magnitude from building thermodynamically efficient processor chips. Turns out Nature has done much better with the human brain, so perhaps that is the key - to imitate life. All that is out in the future. Meanwhile, the forces of economics ensure that we stay busy building transistors and using them to serve HTTP requests.

Now for the more mundane task of calculating the power consumption. As indicated, we will approach the heat calculation via the energy storage required for logic. During a switching event a transistor charges (or discharges) the capacitance C of the next gate. A logic ONE means the capacitance is charged to the supply voltage V, while a logic ZERO is a discharge to zero volts. Hence the energy per switching cycle is (1/2).C.V^2. The heat rate per transistor is simply the rate of flow of energy with time, hence it is (1/2).f.C.V^2, or proportional to the switching frequency, f.

If we have N transistors the heat scales linearly with N, i.e. the heat rate is (1/2).f.N.C.V^2. But note that the product N.C is roughly independent of scaling (both being dependent on gate area), so we expect power to increase linearly with the clock frequency. But we already saw that electrical inertia and timing closure lead to a nearly linear dependence of clock frequency on supply voltage. Therefore the heat rate could vary as the CUBE of the clock frequency: one power of f from the f term, and since V is proportional to f, the V^2 term contributes two more - the heat rate varies as f.V^2, and hence as f^3.
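Here's that heat-rate arithmetic in a few lines of Python, with V assumed proportional to f as discussed. The transistor count, capacitance, activity factor (a fraction-switching assumption beyond the text's formula) and voltage-per-frequency ratio are all illustrative:

```python
# Dynamic power sketch: P = (1/2) * alpha * f * N * C * V^2, with the
# supply voltage V assumed proportional to the clock frequency f.
# All constants below are illustrative, not real chip parameters.
N = 1e9          # transistor count (assumed)
C = 1e-15        # load capacitance per gate, farads (assumed)
ALPHA = 0.05     # fraction of transistors switching per cycle (assumed)
V_PER_GHZ = 0.5  # assumed proportionality: V = 0.5 volts per GHz

def heat_rate_w(f_ghz):
    f = f_ghz * 1e9          # clock frequency in Hz
    v = V_PER_GHZ * f_ghz    # supply voltage tracks frequency
    return 0.5 * ALPHA * f * N * C * v ** 2

p2, p4 = heat_rate_w(2.0), heat_rate_w(4.0)
print(f"Heat at 2 GHz: {p2:.0f} W, at 4 GHz: {p4:.0f} W")
print(f"Ratio: {p4 / p2:.0f}x")  # doubling f gives 2^3 = 8x the heat
```

An 8X jump in heat for a mere doubling of the clock - that, in a nutshell, is the power wall.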


Putting it all together

Now we have all the essential secrets that go into making Moore's Law work:

  • Channel length scaling
  • Gate oxide scaling and high-K
  • Reducing leakage using finFET
  • Clock frequency and timing closure
  • Dynamic power scaling with frequency

Our next step is to apply these secrets to understand the transition to multi-core processors, and from there to concurrent clouds. All that forms Part II of this article.