On addressing #Meltdown and #Spectre in future silicon…

[ This is going to be part of a much longer blog post, with diagrams, soon, consider this a draft that might be useful but is going to be pages long later, and I am possibly going to edit as I re-read and improve the language ]

Last week, the world learned more broadly about a class of hardware security vulnerability known as a “side channel attack”. Three main variants (with some sub-variants, and other ongoing research) were disclosed by Google Project Zero, each relying fundamentally on the shared cache hierarchy within modern microprocessors, as well as a range of common performance optimizations.

One of those optimizations, called “speculative execution” has to do with extending “Out-of-Order” execution (invented by Robert Tomasulo in 1967, for which he ultimately won awards) to attempt to predict what an algorithm will do prior to having all of its data dependencies fully resolved. The key innovation relies upon turning in-order sequential machines as seen by programmers into “dataflow model” machines, as implemented by major silicon manufacturers. None of the modern high performance CPUs out there execute your program in any way resembling the order you actually wrote it.

Take branch prediction. When we hit an “if” statement in our program, we’re going to go in one of two possible different directions. One is “fall through”, the other is a branch to a different part of the program. We often don’t know ahead of time which way a branch is going to go because it hasn’t been fully “resolved” yet. Resolution of a branch involves evaluating both its condition, as well as loading any dependent data to perform that resolution.

Waiting for this resolution takes some time, especially if the branch condition variables aren’t in the fast CPU caches and need to be pulled in from memory (as is the case when I intentionally exploit this by performing a clflush against that condition variable in a “Meltdown” attack to force the machine to have to go perform this slow load from memory). This is time we don’t want to be sitting around twiddling our thumbs. So we don’t. We guess which way the branch is going to go and we begin to execute in that direction. If we’re right, great, we saved some time. But if we’re wrong, we need to do fixing — undo whatever it is that we performed “speculatively” in predicting the branch.

The good news for implementations is that Out-of-Order machines give us the mechanisms we need for such Speculative Execution. We have re-order buffers (and RATs, and other constructs depending upon your heritage) that separate the “architectural state” from the microarchitectural state of the machine. Until instructions complete (retire), and become part of the architectural state, they are recorded in internal structures. We can do some fancy optimizations involving swapping of internal structures that allow us to quickly commit results into architectural state, or undo speculated ops.

Thus we will store the speculative state of the machine without the conventional side effects that expose the speculation to the outside world. For example, all writes (stores) to memory are buffered and not actually written out to memory during speculation (which complicates load logic a little). This is something every high performance machine does. It works great, provided that the speculation remains a black box that is unobservable to the outside world, and provided that we can completely undo speculation.

Let’s see how this leads to “Meltdown”…

The problem with “Meltdown” is that this speculation black box is violated. It becomes violated because of an additional performance optimization that some companies added to their speculation process. In the case of those affected processors, loads being performed within a speculation “window” (a period of up to 192–224 instructions or even more) will have the usual permission checks to ensure that such access is allowed. But that permission check will happen in parallel with an attempt to load that data from the level 1 data cache (L1D). If the permission check later fails, this is recorded in the re-order buffer such that at any later attempt to retire the instruction, an exception is thrown. But the process of the data load is no longer fully atomic.

If we can disambiguate the load process and sneak in during the speculation window, performing some other activity based upon it that is observable to the outside, then we can break the black box. This is what “Meltdown” does. It exploits the fact that side channels allow us to infer whether an address is contained within the L1D cache of a processor simply by timing an access to that location. The longer the access, the further away in the cache hiearchy the address. And if we intentionally evict that address (clflush, carefully constructed array access, etc.) prior to reading it, we can guarantee it’s not in the cache. So, in Meltdown, we perform cache accesses based upon the value that we read speculatively out of the L1D, and then use those accesses to infer what the sensitive data was. Importantly, we don’t read that data directly.

Meltdown works because we do the privilege check in parallel, and also because the L1D cache is VIPT (Virtually Indexed, Physically Tagged) and thus can become populated with virtual addresses for which we have active live translations in our TLBs (Translation Lookaside Buffers) based upon the currently active page tables. We know, for example, that “prefetch” on x86 class cores will cause the TLB population (this is something Linus has ranted about before, when discussing NULL pointer prefetch) and on some cores it will also cause the load of the privileged data into the L1D cache as well.

To address “Meltdown” on affected cores is relatively trivial. Do the missing permission check. And don’t wait for instruction retirement. It’s the only way. And it’s the way that’s been in the academic literature for many years. The only reason not to do it this way is if, in the interests of “performance” you didn’t consider a cache side channel attack likely or plausible during that small (but ever increasing with CPU generations) speculation window.

The software mitigation is to ensure that addresses we don’t want userspace to read (such as kernel virtual addresses) can never be in the L1D at the same time that userspace is running, or that we don’t have live TLB entries for those addresses. Both parts are needed in order for the speculated load to occur because otherwise the core has to do a much slower load. It is possible on some cores that they will try to perform loads even for data not in the L1D, but few designers have yet gotten that crazy because of the amount of time required to wait for that load to come in. It could happen though if we ever get reorder windows that are large enough to keep that much state around.

Now, let’s think about “Spectre” like attacks against branch predictors…I’m focused here on “variant 2” since the speculated load attack is most easily addressed using software (at low impact to performance) for the moment…

Over the years, there have been attempts to teach compilers to re-order programs such that the branches contained within them are setup according to some understanding of how the CPU will favor them. For example, very simple branch predictors might do things like “BTFN” (Backward Taken, Forward Not) — modern CPUs have dedicated loop predictors that only do that kind of thing when really seeing actual loops. Other attempts at hacking up compilers to help out have sought to add “likely” and “unlikely” macros to programs (including the Linux kernel) that encode hints into the instructions emitted by the compiler. All of these are mostly nonsense and should be removed since they are ignored by modern microprocessors. Indeed, Intel explicitly ignore all of the hints such as “likely” and “unlikely” in modern cores.

Instead, modern microprocessors include very sophisticated branch predictors. For example, there might be three (or more) different predictors in a particular processor, and a voting algorithm arbitrating between them. The CPU will see an upcoming branch as early as instruction decode time (AMD processors do a pre-decode into the instruction cache, annotating branches in the instruction cache and tracking them there, Intel do some similar things), while other parts of the machine will track what those branches did in history. Various global and regional history of branches is built up and tracked using such things as a PHT, GHT, BTB, and other fancy data structures that ultimately are intended to help guess in the future what a branch will do.

And there are different kinds of branch. Direct branches are your classical “if” statement condition in which we go one way or another. Indirect branches are “far calls” indirecting through a function pointer to some other place in the program. When we hit an indirect branch we may literally have no idea where that function pointer is actually going to lead us, but often the same branch is going to be to the same pointer each time we hit the same branch. If we can predict where this branch is going to lead us, we can absorb some of the cost of pointer chasing to find the ultimate resolution of a given branch point.

For example, it might be that every time we hit a branch at a particular code instruction address in our program, we store that address in a cache/buffer internal to the microprocessor. When we hit that branch next time around, we see that we have an entry for it in our predictor tables, and we know which way it went the last time around, and where that branch took us, so we guess that it’s going to do the same thing this time around. For “performance” tradeoff (to avoid having huge content addressable tables) we index this table of (virtual — it’s indexing based off VAs from the icache) addresses using just a few low order bits of the code address of the branch.

Now it’s worth noting that modern CPU pipelines are 15+ stages long (many of them have variable sizes, while a few outliers over the years have gone totally nuts with the length in an ill fated attempt to get to 5GHz and beyond). When we get our prediction wrong, we need to flush this pipeline (possibly not entirely, there are some cute hacks) and proceed down the right path. So we generally like to get our predictions right. This means that a /lot/ of work has gone into making branch predictors both incredibly fast (this is critical path during instruction fetch and decode) as well as highly accurate.

Tradeoffs are necessary, including the size of the CAM (Content Addressable Memory) that is needed to lookup virtually indexed branches in our branch target buffers. Some processors will limit this to the low order 31 bits (just an example), while the high order bits of the address can merrily be from any application, or from the kernel, and we’ll think of them as homonyms (same branch entry). To resolve this is “trivial” in that we need to factor in the full virtual address of the branch, or the address space that contains it.

Future silicon is trivially fixed to address these problems by adding the PCID (x86), ASID, or other address space identifier to the branch target. Then we don’t allow the branch target to become poisoned by attacker code.

Temporary mitigation is simply to disable indirect branch prediction within the kernel (or hypervisor) on entry. Sure, it’s possible we could train a regular direct predictor, but the risk is low there (and if even possible in some implementations would only let us speculate very locally), while if we can train the indirect predictor to jump to gadget code in the kernel over which we have influence, we can then cause speculated loads that give us data we want to read. The expensive part of this mitigation comes from the cost of disabling the predictors on entry. Few designs do this directly without also disabling something else, and then there is the cost of context serializing MSR (x86) or other register writes during that entry code.

I’m still working on this and will turn it into a much longer post with some nice graphics. If there’s interest, I’ll try to get it on our corporate blog with some nicely produced animations of how cache attacks work.

Jon.

Computer Architect