Debugging a 32-bit Fedora Arm builder issue

Fedora supports many different architectures, among them both the 64-bit and the 32-bit Arm architectures. In Fedora, we call these “aarch64” and “armv7hl” respectively. This post serves as a mini-writeup of a hard-to-debug problem with the 32-bit builders used to build packages, which in the case of Fedora run as guest virtual machines on 64-bit Armv8 hosts.

The Fedora Arm project began many years ago with efforts to support 32-bit generations of Arm hardware. In the early days, there were builds targeting the v5 32-bit architecture; later this was replaced by the current minimum of Armv7 (technically the variant requiring the optional VFPv3 floating point hardware, and the updated associated ABI, collectively known as “armv7hl”). Armv8 introduces the nuanced notion of an execution state, of which it defines two: “AArch32” and “AArch64”. For the purposes of this discussion, you can think of AArch32 as being a continuation of Armv7.

For quite some time, 32-bit was the official target for Arm (and even before that, it was a “secondary” architecture thanks to the hard work of countless others who did the heavy lifting). Indeed, even long after 64-bit hardware became widespread, 32-bit was the officially supported target (64-bit is now also long since officially supported), and for many reasons it remains an important and necessary target for a variety of users, not the least of which is Fedora itself.

Builds in Fedora are always performed “on target”, meaning that Arm builds happen on Arm hardware. In the case of 32-bit, this used to mean only early-generation 32-bit servers, but builds later moved to 64-bit hosts running 32-bit guest virtual machines. One of the benefits of moving to 64-bit Arm server hosts was that one could have much more capable server-grade hardware, with large numbers of cores and plenty of memory available to the 32-bit virtual machine builders, which can then be managed almost identically to those of other architectures (including “aarch64”, which also uses VMs for builders).

In the case of the current generation of builders, the guest environment runs in the AArch32 (backward compatible with Armv7) execution state, using the LPAE (Large Physical Address Extension) support added in Armv7 in order to support builders with more than 32 bits of physical address space. As with the similar “PAE” approach used on 32-bit x86 systems to address more than 4GB of physical memory, individual programs are still capped at using at most 4GB, but you can run many such programs on an LPAE system. This is useful when running builders that want to maximize available memory for builds.
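As an aside, it is easy to check whether a given guest is actually running an LPAE-enabled kernel. A minimal sketch, assuming a Fedora guest whose kernel was built with LPAE support and whose CPU advertises the corresponding hwcap:

    $ grep ARM_LPAE /boot/config-$(uname -r)
    CONFIG_ARM_LPAE=y
    $ grep -o lpae /proc/cpuinfo | sort -u
    lpae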

Fast forward to May 2018 and the opening of this bug impacting the 32-bit Fedora Arm builders, which happened to be running a different host kernel:

The problem seemed to be divided into two different kinds of error:

  1. Guest processes would sometimes crash with an obscure error message (“Unhandled prefetch abort: implementation fault (lockdown abort) (0x234)”). The guest would continue running, aside from this.
  2. The host hypervisor would occasionally “pause” a guest with a log in the host kernel (“load/store instruction decoding not implemented”).

In both cases, builds were impacted because the processes running them would be stopped and have to be restarted. Over the past year, we’ve had various on-and-off discussions about this problem, but thanks to the hard work of various others (who have carefully managed what is running where, including precise kernel versions), it never really came to a head. Still, it was ultimately going to become a problem that needed to be addressed. Recently, I decided I really wanted to understand what was causing this to happen.

The first thing you want to do when debugging such an obscure bug is to find a way to reproduce it as reliably as possible. Eventually, thanks to the work of others, it was possible to determine that a simple loop running part of a kernel build would be enough to trigger whatever was happening:
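A rough sketch of the kind of loop involved (the build target and job count here are illustrative, not the exact reproducer), run inside the guest against an unpacked kernel source tree:

    $ while true; do
        make clean
        make -j$(nproc) vmlinux
      done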

The more interesting situation appeared (to me) to be the case in which the hypervisor (KVM) ended up “pause”ing the guest as a result of a “load/store instruction decoding not implemented” error. When this happens, we are fortunate that the guest VM is left in a state to which we can attach various debug and diagnostic tooling in order to understand what it was trying to do at the time.

Running “virsh list” on the host hypervisor, one would see output similar to:
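The exact layout depends on the libvirt version, but it looks something like this (the domain Id is illustrative):

     Id   Name             State
    --------------------------------
     7    fedora30-arm-1   paused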

This means that “fedora30-arm-1” is no longer executing but hasn’t quite been killed either. We can thus run a dump of its memory, again on the host:
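A sketch of capturing the dump with virsh (the output path is just an example); the “--memory-only” flag asks libvirt to write a plain ELF-format dump of guest memory rather than its own save format:

    # virsh dump fedora30-arm-1 /var/tmp/builder.dump --memory-only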

The resulting dump file will contain both the full “physical” memory of the guest, as well as its active register file (although not nearly enough of that to be actually useful for debugging what comes next). Now, the host is a 64-bit environment, and to make life easier, I copied this dump over to another 32-bit guest for analysis. This way, I could run various 32-bit native tooling. Thus, we now have two guests and one host. I’ll call the guests “builder” (the one that was experiencing problems) and “debugger” (the one I run tools on).

On the debugger guest, I first dumped out the state of the machine registers used by the guest at the time that it was paused. This can be done with gdb:
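A sketch of the session (the dump path carries over from the earlier step, and with a multi-vCPU guest each vCPU shows up as a separate thread):

    $ gdb -c /var/tmp/builder.dump
    (gdb) info threads
    (gdb) thread 1
    (gdb) info registers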

Here, we can see that the guest was running in userspace at the time (the pc is lower than 3GB). Note that this address is not the same as would be reported within the guest for the process running at the time, because the guest kernel never saw a fault; it went straight to the hypervisor (but more on that detail later on).

Next up, I installed the exact matching debuginfo package for the kernel that was running on the builder. I did this by manually downloading the associated RPMs, but there are other ways to do it if it’s a recent kernel matching what you already have on board.

Either way, I installed the debuginfo and its “common” dependency:
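For illustration (the version string is a placeholder for whatever the builder was running), either installing the downloaded RPMs directly or using the dnf debuginfo plugin would look like:

    # dnf install ./kernel-debuginfo-<version>.rpm \
                  ./kernel-debuginfo-common-armv7hl-<version>.rpm
    # dnf debuginfo-install kernel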

Really all I wanted was the “vmlinux”. With that available, I could use Dave Anderson’s excellent “crash” utility to extract useful information:
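With the debuginfo installed on the debugger guest, crash can be pointed at the builder kernel’s vmlinux and the dump; the paths below follow Fedora’s debuginfo layout and the version is again a placeholder:

    $ crash /usr/lib/debug/lib/modules/<builder-kernel-version>/vmlinux \
            /var/tmp/builder.dump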

Once crash starts up, it drops you at a prompt from which you can use one of various useful built-in commands to display information about what the guest was doing when it rolled over. In this case, we want to see what the running set of tasks was. The output included:
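crash’s “ps” built-in lists the tasks known to the guest kernel at the time of the dump. The PID, task address, and sizes below are purely illustrative, but a “make” task was among them:

    crash> ps
    ...
     12345   12001   1   c4a8e600   RU   0.1   12820   6012  make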

Now I can tell crash to switch to “make” as the current task and dump some useful information about what it was doing. Among that information are all of the current VM maps that were in place for the running binary:
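With crash, switching context and listing the virtual memory areas is done with the “set” and “vm” commands (the PID is the illustrative one from above):

    crash> set 12345
    crash> vm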

Included in the output is the mapping for the dynamic linker:

The first address is the “VMA” (Virtual Memory Area) for the memory defined by the region between the second and third entries. This means that we know the guest was running code located within the dynamic linker binary, and by subtracting the start of that mapping from the pc, we know the specific offset within the binary that was executing. Now we can of course load this library under a variety of tools, including gdb.
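Purely as an illustration of that subtraction (these numbers are made up): if the mapping started at 0xb6f00000 and the pc was 0xb6f2c504, the offset would be:

    $ python3 -c 'print(hex(0xb6f2c504 - 0xb6f00000))'
    0x2c504

With that offset in hand, we can disassemble the binary at the right spot.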

My favorite is to use (I’m weird):

We see that it falls within the following function:

Perhaps more interestingly, we can look at the preceding instructions to see what happened immediately before the offending load. The previous load read a value into r1 from an address 4 bytes offset from the value held in r3, and that load succeeded. The subsequent load, from the virtual memory location we saw in the fault, is the one that fell over.

Now, we remember earlier that the reason the guest had “pause”d was that the host (hypervisor) was complaining about not being able to decode a load or store operation. There had been some speculation (prior to investigating the actual dump) that some kind of complex load/store instruction sequence was starting to appear in certain 32-bit binaries, and that this was causing the problem. Without getting too far into the weeds, suffice it to say that 32-bit Arm has load/store instructions that are pretty fancy and nearly impossible to emulate when needed in hypervisors such as KVM.

But looking at the above load, it’s a very simple one. In fact, it’s even simpler than the successful load above it. And looking at the kernel code which outputs the “load/store instruction decoding not implemented” message, we see that it is reached pretty much as a fall-through from KVM’s guest abort handling path. We’ll come back to this later, but foreshadowing a little, this fall-through is unfortunate and results in a useless error message based upon assumptions that turn out to be incorrect.

Ok. So it’s not a “weird load/store instructions are hard to emulate” issue. But somehow the guest was nonetheless unable to perform a load from that location. Let’s try to read the value from within crash. Simply trying to do the load directly won’t work:

Probably because of the presence of LPAE, crash struggles to translate the guest virtual address into the right physical address, and to read that from the crash dump file. Never fear! We can do this manually!

Crash provides us with a useful accessor command, “vtop”, which will perform a page table walk through a process’ virtual memory. Using “vtop”, we can try to translate the faulting virtual address into the guest physical memory address contained in the dump file, which we can then read directly. “vtop” makes an unfortunate mess of the LPAE translation, but it does dump out the correct page table entry (PTE), from which we can read the physical address for the page that contains the data that we want:
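As a sketch, with the context still set to the “make” task, vtop takes the faulting guest virtual address (shown as a placeholder here) and walks the page tables for it:

    crash> vtop <faulting-virtual-address>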

The guest physical address is thus formed by taking the page address from that PTE and adding in the lower 12 bits from the virtual address above. We know (from reading the guest virtual machine’s memory map) where it starts its view of physical addresses. We also know at which offset the crash dump file begins its memory dump. We can find that out by running objdump:
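objdump (or readelf -l) against the dump file will show where the guest memory image begins within the file; exact names and offsets will of course differ:

    $ objdump -h /var/tmp/builder.dump | head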

So for any guest physical address, we can calculate the corresponding offset in the dump file using the following calculation (in a python terminal window):
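With illustrative values only (a guest RAM base of 0x40000000, a dump data offset of 0x1000, and a guest physical address of 0x48c3d504, none of which are the real numbers from this dump), the calculation looks like:

    $ python3 -c 'print(hex(0x48c3d504 - 0x40000000 + 0x1000))'
    0x8c3e504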

Thus the value we wanted to read in the original load was:

The guest should have been able to load this value, but instead it triggered an abort into the hypervisor. At this point I started to look at the page tables in use by the guest, to see if there might be some kind of protection issue that would mean I could read the memory dump but the guest could not. Perhaps it was as simple as a strange access protection issue, or perhaps a bad sequence of invalidation instructions in the guest had failed to properly invalidate a page table update, such that the guest page table walker had an incoherent view.

I performed various “vtop” translations for succeeding and failing loads close to the one that aborted and looked at the pages. Nothing seemed too out of whack. My trusty copy of the 7,500-page Arm Architecture Reference Manual came in handy more than once, especially section G5.5, “The VMSAv8-32 Long-descriptor translation table format”, which describes the page tables used in the 32-bit AArch32 execution state when LPAE addressing is in effect.

Then I began to wonder whether it was the hypervisor’s own memory translations that were at fault. In modern virtualization environments, guest virtual machines have their own “physical” view of memory, but it is quite disjoint from the host’s view of the real underlying physical memory. Instead, the guest translates its virtual addresses into what it thinks are “physical” addresses, but which are in fact IPAs (Intermediate Physical Addresses). These are translated a second time by the hypervisor using its own page tables.

Think of it this way: the guest’s “physical” memory is simply a mapped region in the virtual memory of a host (qemu-kvm) process that is backing the KVM virtual machine instance. With this in mind, we can start to investigate the host hypervisor’s view of memory and its secondary translation tables.

As I mentioned before, whenever a KVM guest is running, it is backed by a “qemu-kvm” process running on the host. By taking a look at the running processes on the host with “ps fax” and a simple grep, we can see which one is backing our “pause”d virtual machine:
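Something along these lines, with the output heavily trimmed, the PID illustrative, and the emulator path depending on the domain configuration:

    # ps fax | grep fedora30-arm-1
     4321 ?  Sl  512:33  /usr/bin/qemu-kvm -name guest=fedora30-arm-1 ...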

The output is certainly not as brief in reality. Next up, we can look at the mapped virtual memory for this qemu-kvm process. KVM virtual machine “RAM” is implemented in the form of one or more “ramblock” regions of virtual memory allocated by qemu using a regular “malloc” allocation. We can see what regions are mapped by using the /proc/PID/maps interface:
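Using the illustrative PID from above:

    # less /proc/4321/maps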

Scanning through the output, we see a large mapping equal in size to 16GB (the size of the virtual machine’s “physical” memory):

Thus we can also read guest physical memory by accessing appropriate offsets within the qemu-kvm process’ ramblock region. For example, the guest physical address from earlier can be translated into the virtual memory mapping used by the hypervisor prior to its stage 2 translation:
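As a sketch, if the ramblock were mapped at host virtual address 0xffff40000000 (illustrative) and the guest’s view of physical memory began at 0x40000000, the earlier (also illustrative) guest physical address would live at:

    $ python3 -c 'print(hex(0xffff40000000 + (0x48c3d504 - 0x40000000)))'
    0xffff48c3d504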

We can run a second (“live”) crash process on the host hypervisor. If the appropriate debuginfo RPMs are installed, this is as simple as:
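Running crash against the live host kernel, with the matching host kernel-debuginfo installed:

    # crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux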

Setting the current task to the “pause”d qemu-kvm process, we can begin to examine the state of the process from the point of view of the host. Sure, we could do the same thing with gdb, but we’ll want some of crash’s more advanced features in a few moments. An example of these is the ability to read the guest memory from the hypervisor’s viewpoint:
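For example (the PID and address are again the illustrative ones from earlier), setting the context to the qemu-kvm process and reading a user-space virtual address with “rd -u”:

    crash> set 4321
    crash> rd -u 0xffff48c3d504 4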

We can use the same “vtop” command to leverage the host’s page tables in order to translate this host virtual address into the physical address on the host side. This will show us the page table entries the host is using to perform the second address translation into the real physical address on the host.

At this point, it would be useful to understand more about KVM’s view of the world. The qemu-kvm process does provide a “monitor”, and you can talk to this using “virsh” on the command line, but the interface is pretty primitive and doesn’t provide any of the information that I wanted. It is, however, a useful thing to have such a monitor, and it is growing in capability all the time. It’s also possible that I missed the way one is “supposed to do this” :)

Anyway. We can gather still more information by walking through KVM’s internal hypervisor state. To do this, we must first remind ourselves that KVM uses a relatively simple control interface in order to communicate between qemu-kvm (the process that runs on the host) and the KVM kernel module that manages the actual virtual machine state. The API is defined in the kernel documentation, which says that we perform ioctls on a set of special file descriptors.

Thus, in order to manage a running KVM virtual machine, the qemu-kvm process must have an active file handle upon which it is making ioctls. We can find this from within crash:
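crash’s “files” command lists a task’s open file descriptors along with the addresses of the underlying kernel file structures:

    crash> files 4321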

The output includes all files that are currently open by the qemu-kvm host process. Among them is the file that it is using to make requests into the KVM kernel module to start hardware virtualization, among other purposes:

We can use crash to walk further into this open file pointer being used by qemu-kvm to talk to its hypervisor:

Reading the KVM source code tells us that KVM stashes a pointer to a structure containing VCPU state in the private_data pointer of this open file. We can find the private_data pointer in the above output and dereference it:
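In crash, this can be done with the “struct” command. The addresses are placeholders, and the assumption (based on the upstream KVM sources, not confirmed here) is that the structure in question is a struct kvm_vcpu; if KVM happens to be built as a module, its debuginfo may first need loading with “mod -s kvm”:

    crash> struct file.private_data <struct-file-address>
    crash> struct kvm_vcpu <private_data-pointer>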

We can use this to reconstruct and obtain any state we want from the KVM instance by reading that structure, paying particular attention to the architecture-specific sub-structure and its fault information sub-structure. The latter contains a copy of various registers used by the hypervisor to identify the reason for a fault. Of these registers, a few are particularly interesting:
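A sketch of dumping that state; the member names (an “arch” sub-structure with embedded fault information holding fields such as esr_el2, far_el2 and hpfar_el2 on an arm64 host) are assumptions based on upstream sources of the era:

    crash> struct kvm_vcpu.arch <kvm_vcpu-address>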

The first is the Exception Syndrome Register (ESR). It tells us information about why we aborted into the hypervisor (more on that later on). The second, the Faulting Address Register (FAR), tells us what guest address led to the trap. Unsurprisingly, this is the offending address we looked at previously. Lastly, we have the Hypervisor (IPA) Faulting Address Register (HPFAR). This one contains the address in guest physical memory that caused the trap.

I spent many more hours than might be inferred from the above manually walking through various parts of the guest and host memory. One of the things I did was to walk through the guest’s page tables for the process that was running within the guest at the time of the fault. I did a lot of this manually, by taking the page table base register for guest userspace, known as the Translation Table Base Register (TTBR), and performing the dozen or so operations needed to walk down to the individual PTEs (Page Table Entries) that describe the guest physical address for a given virtual one.

Eventually, I realized that whenever we received these traps into the hypervisor, the HPFAR contained an address close to the guest physical address of the last-level PTE leaf needed to perform the translation from the guest FAR (the faulting virtual address) into a guest IPA (a guest physical address). The only difference was that the value within the register was truncated, missing the upper bits of the faulting guest IPA: the leading “4” of the address had been dropped.

Finally, I hypothesized that guests performing a stage 1 page table walk were hitting a faulting condition during the walk (for example, the host page backing the PTE needed an access bit update, which is software managed in Armv8.0) that required a trap into the hypervisor in order to manage the page table update. In that case only, there were occasions in which the value contained within the HPFAR might have been truncated, missing the top address bits added by Arm LPAE.

My solution? In the case that we know we received a hypervisor abort due to a stage 1 translation table walk (which is helpfully communicated to us in the ESR), we can reconstruct the value that should have been in the HPFAR by manually walking the guest page tables and comparing the leaf nodes. If a leaf matches the faulting address minus the top bits, inject the correct address instead and allow the guest to continue its page walk, successfully this time.

We haven’t fully run this to ground yet, so it’s not clear whether there is a hardware erratum, or some software issue we haven’t yet identified in Linux, that is causing this odd truncation in certain unusual cases. Even if there is a hardware issue, there is hopefully a fairly straightforward workaround that doesn’t require reconstructing the faulting address as above.

The purpose of this blog isn’t to point the blame in any particular direction, especially since we don’t know the exact circumstances of any bug or exactly how it is triggered. But I did want to write this up while it was still fresh, to aid those who may follow, since it was a particularly interesting problem to my mind.
