Debugging a 32-bit Fedora Arm builder issue

Fedora supports many different architectures, among them both the 64-bit, as well as the 32-bit Arm architectures. In Fedora, we call these by the names “aarch64” and “armv7hl” respectively. This post serves as a mini-writeup of a hard to debug problem with 32-bit builders, used to build packages, which in the case of Fedora run as guest virtual machines on 64-bit Armv8 hosts.

The Fedora Arm project began many years ago with efforts to support 32-bit generations of Arm hardware. In the early days, there were builds targeting the v5 32-bit architecture, then later this was replaced by the current minimum of Armv7 (technically the variant requiring the optional VFPv3 floating point hardware, and the updated associated ABI, collectively known as “armv7hl”). Armv8 introduces the nuanced notion of an execution state, of which it defines both “AArch32”, as well as “AArch64”. For the purposes of this discussion, you can think of AArch32 as being a continuation of Armv7.

For quite some time, 32-bit was the official target for Arm (and even before that, it was a “secondary” architecture thanks to the hard work of countless others who did the lifting). Indeed, even long after 64-bit hardware became widespread, 32-bit was the officially supported target (64-bit is now also long since officially supported), and for many reasons it remains an important and necessary target for a variety of users. Not the least of which is Fedora itself.

Builds in Fedora are always performed “on target”, meaning that Arm builds happen on Arm hardware. In the case of 32-bit, this used to be only early generation 32-bit servers, but later moved to 64-bit hosts running 32-bit guest virtual machines. One of the benefits of moving to 64-bit Arm server hosts was that one could have much more capable server-grade hardware, with large numbers of cores and memory available to 32-bit virtual machine builders, which can then be managed almost identically to those of other architectures (including “aarch64”, which also leverages VMs for builders).

In the case of the current generation of builders, the guest environment is running in AArch32 (backward compatible with Armv7) execution state, using the LPAE (Large Physical Address Extensions) added in Armv7 in order to support builders with greater than 32-bits of physical memory. As with the similar “PAE” approach used on 32-bit x86_64 systems to address greater than 4GB of physical memory, individual programs are still capped to using at most 4GB, but you can run many such programs on an LPAE system. This is useful when running builders that want to maximize available memory for builds.

Fast forward to May 2018 and the opening of this bug impacting the 32-bit Fedora Arm builders, which happened to be running a different host kernel:

The problem seemed to be divided into two different kinds of error:

  1. Guest processes would sometimes crash with an obscure error message (“Unhandled prefetch abort: implementation fault (lockdown abort) (0x234)”). The guest would continue running, aside from this.
  2. The host hypervisor would occasionally “pause” a guest with a log in the host kernel (“load/store instruction decoding not implemented”).

In both cases, builds were impacted because the processes running them would be stopped and have to be restarted. Over the past year, we’ve had various discussions on and off about this problem, but thanks to the hard work of various others (who have carefully managed what is running where, including precise kernel versions), it’s never really come to a head. But it was going to ultimately become a problem that needed to be addressed. Recently, I decided I really wanted to understand what was causing this to happen.

The first thing you want to do when debugging such an obscure bug is to find a way to reproduce it as reliably as possible. Eventually, thanks to the work of others, it was possible to determine that a simple loop running part of a kernel build would be enough to eventually trigger whatever was happening:

$ while true; do make clean; make ARCH=arm KCONFIG_CONFIG=../../kernel-armv7hl-debug.config olddefconfig; done

The more interesting situation appeared (to me) to be the case that the hypervisor (KVM) ended up “pause”ing the guest as a result of a “load/store instruction decoding not implemented” error. When this happens, we are fortunate that the guest VM is left in a state that we can attach various debug and diagnostic tooling to understand what it was trying to do at the time.

Running “virsh list” on the host hypervisor, one would see output similar to:

1 fedora30-arm-1 paused

This means that “fedora30-arm-1” is no longer executing but hasn’t quite been killed either. We can thus run a dump of its memory, again on the host:

virsh dump — memory-only — format elf fedora30-arm-1 fedora30-arm-1.core

The resulting dump file will contain both the full “physical” memory of the guest, as well as its active register file (although not nearly enough of that to be actually useful for debugging what comes next). Now, the host is a 64-bit environment, and to make life easier, I copied this dump over to another 32-bit guest for analysis. This way, I could run various 32-bit native tooling. Thus, we now have two guests and one host. I’ll call the guests “builder” (the one that was experiencing problems) and “debugger” (the one I run tools on).

On the debugger guest, I first dumped out the state of the machine registers used by the guest at the time that it was paused. This can be done with gdb:

# gdb fedora30-arm-1.core

(gdb) info registers
r0 0xb6f2f640 3069376064
r1 0xb69b4138 3063628088
r2 0xb6f31000 3069382656
r3 0xb6a21958 3064076632
r4 0xb6f2f640 3069376064
r5 0x6ffffdff 1879047679
r6 0x6ffffeff 1879047935
r7 0xb69b4000 3063627776
r8 0xb6a22040 3064078400
r9 0xb6f2f79c 3069376412
r10 0xbec20574 3200386420
r11 0xbec204f4 3200386292
r12 0x6fffffff 1879048191
sp 0xbec203e0 0xbec203e0
lr 0xb6f0670c 3069208332
pc 0xb6f0b6f4 0xb6f0b6f4
cpsr 0xa0000010 2684354576

Here, we can see that the guest was running in userspace (“pc” is lower than 3GB KERNEL_OFFSET) at an address of 0xb6f0b6f4. Note that this address is not the same as would be reported in task_struct for the process running at the time because the guest kernel never saw a fault, it would have gone straight to the hypervisor (but more on that detail later on).

Next up, I installed the exact matching “kernel-debuginfo” package for the kernel that was running on the builder. I did this by manually downloading the associated RPMs, but you can also use yum-downloader or just a “dnf” command if it’s a recent kernel matching what you already have on board.

Either way, I installed the debuginfo, and its “common” dependency:

rpm -ivh kernel-lpae-debuginfo-5.1.9–300.fc30.armv7hl.rpm

Really all I wanted was the “vmlinux”. With that available, I could use Dave Anderson’s excellentcrash” utility to extract useful information:

# crash /usr/lib/debug/lib/modules/5.1.9–300.fc30.armv7hl+lpae/vmlinux fedora30-arm-1.core

Once crash starts up, it drops you at a prompt from which you can use one of the various useful built-ins to display information about what the guest was doing when it rolled over. In this case, we’re going to use “ps” to see what the running set of tasks was. The output included:

> 725 716 0 da876900 RU 0.1 5968 972 make

Now I can tell crash to switch to “make” (set -x 725) as the current task and dump some useful information about what it was doing. Among those are all of the current VM maps that were in place for the running binary:

crash> vm

Included in the output is the mapping for the dynamic linker:

da814000 b6f00000 b6f21000 875 /usr/lib/ld-2.29.so

The first address is the “VMA” (Virtual Memory Area) for the memory defined by the region between the second and third entries (0xb6f00000–0xb6f21000). This means that we know the guest was running code located within the ld-2.29.so binary, and from subtracting the VMA from the pc, we can know that specifically it was at offset 0xb6f4 of the ld-2.29.so binary. Now we can of course load this library under a variety of tools, including gdb.

My favorite is to use objdump (I’m weird):

# objdump -x -D ld-2.29.so | less

We see that 0xb6f4 is located within the _dl_setup_hash function:

0000b6e0 <_dl_setup_hash>:
b6e0: e5903150 ldr r3, [r0, #336] ; 0x150
b6e4: e3530000 cmp r3, #0
b6e8: 0a000013 beq b73c <_dl_setup_hash+0x5c>
b6ec: e5931004 ldr r1, [r3, #4]
b6f0: e92d4010 push {r4, lr}
b6f4: e5912000 ldr r2, [r1]

Perhaps more interesting, we can look at the preceding instructions to see what happened immediately before the offending load (ldr). We previously loaded a value into r1 that was 4 bytes offset from the value in r3. We can see that r3 contains 0xb6a21958, so the previously successful load performed was from virtual memory location 0xb6a2195c. Subsequently, we tried to load from virtual memory location 0xb69b4138 into r2 and fell over.

Now, we remember earlier that the reason the guest had “pause”d was that the host (hypervisor) was complaining about not being able to decode a load or store operation. There had been some speculation (prior to investigating the actual dump) that some kind of complex load/store instruction sequence was starting to appear in certain 32-bit binaries, and that this was causing the problem. Without getting too off into the weeds, suffice it to say that you can have load/store instructions in 32-bit Arm that are pretty fancy and nearly impossible to emulate when needed in hypervisors such as KVM.

But looking at the above load, it’s a very simple one. In fact, it’s even simpler than the succeeding load above it. And looking at the kernel code which outputs the “load/store instruction decoding not implemented” message, we see that it comes from virt/kvm/arm/mmio.c. Specifically from io_mem_abort, which is called by kvm_handle_guest_abort (in virt/kvm/arm/mmu.c) pretty much as a fall-through. We’ll come back to this later, but foreshadowing a little, this fall-through is unfortunate and results in a useless error message based upon assumptions that turn out to be incorrect.

Ok. So it’s not a weird load/store emulation is hard issue. But somehow we weren’t able to perform a load from the location nonetheless. Let’s try to read the value from within crash. Just trying to do the load with “rd” won’t work:

> rd 0xb69b4138
rd: seek error: user virtual address: b69b4138 type: “32-bit UVADDR”

Probably because of the presence of LPAE, crash struggles to translate the guest virtual address into the right physical address, and to read that from the crash dump file. Never fear! We can do this manually!

Crash provides us with a useful accessor macro, “vtop”, which will perform a page table walk through a process’ virtual memory. Using vtop, we can try to translate 0xb69b4138 into the guest physical memory address contained in the dump file, which we can then directly read using the “rd” command. “vtop” makes an unfortunate mess of the LPAE translation, but it does dump out the correct page table entry (PTE), from which we can read the physical address for the page that contains the data that we want:

4200004387c1fdf 4387c1000 (PRESENT|DIRTY|YOUNG)

The guest physical address is thus 0x4387c1138 (adding in the lower 12 bits by taking them from the virtual address above). We know (from reading the guest virtual machine’s memory map) that it starts its view of physical addresses at 0x40000000. We also know that the crash dump file begins its memory dump at offset 0x234. We can find that out by running objdump:

# objdump -x fedora30-arm-1.core

So for every guest virtual address, we can calculate the offset in the dump file using the following calculation (in a python terminal window):

>>> print hex(guest_physical_address–0x40000000+0x234)

Thus the value we wanted to read in the original load was:

crash> rd -f 0x3f87c136c
3f87c136c: 00000209

The guest should have loaded 0x209 into r2 but instead it triggered an abort into the hypervisor instead. At this point I started look at the page tables in use by the guest, to see if there might be some kind of protection issue that would mean I can read the memory dump, but the guest could not. Perhaps it could be as simple as a strange access protection issue, or a bad sequence of invalidation instructions in the guest had failed to properly invalidate a page table update, such that the guest page walker had an incoherent view.

I performed various “vtop” translations for succeeding and failing loads close to the one that aborted and looked at the pages. Nothing seemed too out of whack. My trusty copy of the 7500 page Arm Architecture Reference Manual came in handy more than once, especially section G5.5 “The VMSAv8–32 Long-descriptor translation table format”, which describes the paging tables used in 32-bit AArch32 execution state when LPAE addressing is in effect.

Then I began to wonder whether it was the hypervisor’s own memory translations that were at fault. In modern virtualization environments, guest virtual machines have their own “physical” view of memory, but it is quite disjoint from the host view of its real underlying physical memory. Instead, the guest translates its virtual addresses into what it thinks are “physical” addresses, but are in fact IPAs (Intermediate Physical Addresses). These are translated a second time by the hypervisor using its own page tables.

Think of it this way: the guest’s “physical” memory is simply a mapped region in the virtual memory of a host (qemu-kvm) process that is backing the KVM virtual machine instance. With this in mind, we can start to investigate the host hypervisor’s view of memory and its secondary translation tables.

As I mentioned before, whenever a KVM guest is running, it is backed by a “qemu-kvm” process running on the host. By taking a look at the running processes on the host (“ps fax”), we can see which one is backing our “pause”d virtual machine, using the ps command and a simple grep:

$ ps faxwww | grep fedora30-arm-1

6458 ? Sl 180:45 /usr/libexec/qemu-kvm -name guest=fedora30-arm-1,…

The output is certainly not as brief in reality. Next up, we can look at the mapped virtual memory for this qemu-kvm process. KVM virtual machine “RAM” is implemented in the form of one or more “ramblock” regions of virtual memory allocated by qemu using a regular “malloc” allocation. We can see what regions are mapped by using the /proc/PID/maps interface:

$ cat /proc/6458/maps

Scanning through the output, we see a large mapping at 0xfffba7e00000 equal in size to 16GB (or the size of the virtual machine’s physical memory):

fffba7e00000-ffffa7e00000 rw-p 00000000 00:00 0

Thus we can also read guest physical memory by accessing appropriate offsets within the qemu-kvm process’ ramblock region. For example, the previous guest physical address of 0x4387c1138 can be translated into the virtual memory mapping used by the hypervisor prior to its stage2 translation:

>>> print hex(0x4387c1138–0x40000000+0xfffba7e00000)

We can run a second (“live”) “crash” process on the host hypervisor. If the appropriate debuginfo RPMs are installed, this is as simple as:

# crash

Setting the current task to the “pause”d qemu-kvm process (with “set 6458”) we can begin to examine the state of the process from the point of view of the host. Sure, we could do the same thing with gdb but we’ll want some of the more advanced features in a few moments. An example of these is the ability to read (with “rd”) the guest memory from the hypervisor viewpoint:

crash> rd 0xffffa05c1138

We can use the same “vtop” command to leverage the host’s page tables in order to translate this host virtual address into the physical address on the host side. This will show us the page table entries the host is using to perform the second address translation into the real physical address on the host.

At this point, it would be useful to understand more about KVM’s view of the world. The qemu-kvm process does provide a “monitor”, and you can talk to this using “virsh” on the command line, but the interface is pretty primitive and doesn’t provide any of the information that I wanted. It is, however, a useful thing to have such a monitor, and it is growing in capability all the time. It’s also possible that I missed the way one is “supposed to do this” :)

Anyway. We can gather still yet more information by walking through KVM’s internal hypervisor state. To do this, we must first remind ourselves that KVM uses a relatively simple control interface in order to communicate between qemu-kvm (the process that runs on the host) and the “kvm” kernel module that manages the actual virtual machine state. The API is defined in the kernel documentation, which says that we perform ioctl’s on /dev/kvm.

Thus, in order to manage a running KVM virtual machine, the qemu-kvm process must have an active file handle upon which it is making ioctls. We can find this by using the “files” command in crash:

crash> files

The output includes all files that are currently open by the qemu-kvm host process. Among them, is this file that it is using to make requests into the KVM kernel module to start hardware virtualization, and other purposes:

21 ffffeb0db06abf00 ffffeb0de549b6d0 ffffeb0de076b880 UNKN kvm-vcpu:0

We can use the “struct” command in crash to walk further into this open file pointer being using by qemu-kvm to talk to its hypervisor:

crash> struct file ffffeb0db06abf00

<lots of other output here>

private_data = 0xffffeb0db06ef160,

Reading the KVM source code, specifically kvm_arch_vcpu_ioctl (virt/kvm/arm/arm.c) tells us that KVM stashes a pointer to a “struct kvm_vcpu” containing VCPU state in the private_data pointer of this open file. We can find the private_data pointer in the above output and dereference it:

struct kvm_vcpu -x 0xffffeb0db06ef160

We can use this to reconstruct and obtain any state we want from the KVM instance by reading the structure, paying particular attention to the “arch” sub-structure, and such sub-sub-structures as “fault”. It contains a copy of various registers used by the hypervisor to identify the reason for a fault. Of these registers, a few of them are particularly interesting:

esr_el2 = 0x92000086,
far_el2 = 0xb69b4138,
hpfar_el2 = 0x3deb90,

The first is the Exception Syndrome Register (ESR_EL2). It tells us information about why we aborted into the hypervisor (more on that later on). The second Faulting Address Register (FAR_EL2) tells us what guest address lead to the trap. Unsurprisingly, this is the offending address we looked at previously. Lastly, we have the Hypervisor (IPA) Faulting Address Register (HPFAR_EL2). This one contains the address in guest physical memory that caused the trap.

I spent many more hours than even might be inferred from the above manually walking through various parts of the guest and host memory. One of the things I did was to walk through the guest’s page tables for the currently running process within the guest at the time of the fault. I did a lot of this manually, by taking the page table base register for guest userspace, known as the Translation Table Base Register (TTBR0_EL1) and performing the dozen or so operations needed to walk down to the individual PTEs (Page Table Entries) that describe the guest physical address for a given virtual one.

Eventually, I realized that when we received these traps into the hypervisor, the HPFAR_EL2 register always contains an address that is close to that of the guest physical address of the last level PTE leaf needed to perform the translation from the guest FAR (faulting virtual address) into a guest IPA (guest physical address). The only difference was that the value within the HPFAR_EL2 register was truncated, missing the upper bits of the guest faulting IPA. 0x43deb9000 became 0x3deb9000, dropping the “4”.

Finally, I hypothesized that what was happening was that guests performing a stage 1 page table walk were hitting a faulting condition (e.g. host page backing the PTE needed an access bit update that is software managed in Armv8.0) during the walk which required a trap into the hypervisor in order to manage the page table update (e.g. access bit update). In that case only, there were occasions in which the value contained within HPFAR_EL2 might have been truncated, missing the top address bits added by the Arm LPAE.

My solution? In the case that we know we received a hypervisor abort due to a stage 1 translation (which is helpfully communicated to us in the ESR_EL2), we can reconstruct the value that should have been in the HPFAR_EL2 by manually walking the guest page tables and comparing the leaf nodes. If the leaf matches the faulting address minus the top bits, inject the correct address instead and allow the guest to continue its page walk, successfully this time.

We haven’t fully run this to ground yet, so it’s not clear whether there is a hardware errata, or some software issue we haven’t identified in Linux somewhere that is causing this odd truncation in certain unusual cases, and even if there were a hardware issue, there is hopefully a fairly straightforward workaround without needing to reconstruct the faulting address as above.

The purpose of this blog isn’t to point blame in any particular direction. Especially since we don’t know the exact circumstances of any bug or exactly how it is triggered. But I did want to write this up while it was still fresh, to aid those who may follow since it was a particularly interesting one in my mind.

Computer Architect