Cool debugging story time!
A few months ago, I was writing an integration test for the Nextcloud integration with OnlyOffice, to be run in a NixOS QEMU virtual machine. Curiously, I saw corrupt filesystem reads interfering with the test. That was the start of a journey that eventually led me into the core internals of the Linux kernel.
The issue seemed to occur under very particular circumstances, and I was convinced it likely only affected a very limited set of users - but I was determined to get to the bottom of it. I added diagnostic logging, which confirmed the issue appeared to be inside the Linux kernel. Sadly, when I attached a debugger, all it showed was gibberish.
With this challenge top-of-mind I joined the @why2025camp Dutch hacker camp, a great place to meet interesting people. Here I bumped into @raito (Raito Bezarius), who pointed me to a thread on the Linux kernel 'regressions' mailing list where several people were running into a suspiciously similar problem.
Encouraged by the fact that the impact might be wider than I originally thought, I persevered and managed to get accurate insight from the debugger - revealing suspicious circumstances in the memory management data structures.
With the problem pinned down like this, I felt comfortable sharing my findings on the 'regressions' list. What’s more, I could share not only my description of the situation, but an actually-working Nix configuration demonstrating it. With that, maintainer Dominique Martinet could reproduce the issue within hours and write a preliminary fix overnight.
What looked like a bug in an obscure corner of QEMU turned out to be a subtle issue in the Linux kernel's memory subsystem that, had we not found it, might have caused countless hard-to-diagnose problems in the future. Happy we could nip it in the bud!
You can read the full story (with many more technical details!) on my blog at https://arnout.engelen.eu/blog/linux-kernel-adventure/
#linux #kernel #qemu #debugging #programming #nixos #OpenSource