It wasn’t an ordinary week at Linux creator Linus Torvalds’ house in Portland, OR. A snowstorm had knocked out power to his home for the better part of a week. Despite that, Torvalds still got the first release candidate of the latest Linux kernel, 5.12, out the door. That turned out to be a real mistake. The release, which was meant only for people who test the Linux kernel for bugs, had a bug for the ages, one that could wreck test systems. Now it’s been fixed.
What happened, as Torvalds explained on the Linux Kernel Mailing List (LKML), was a “double ungood” mistake, which could wipe out a computer’s filesystem.
This blunder, Torvalds said, started with “a very innocuous code cleanup and simplification that raised no red flags at all, but [it] had a subtle and very nasty bug in it: Swap files stopped working right. And they stopped working in a particularly bad way: the offset of the start of the swap file was lost. Swapping still happened, but it happened to the wrong part of the filesystem, with the obvious catastrophic end results.”
In other words, if you ran the release candidate code and your system ran out of memory, your computer would do what it was supposed to do and write idle data and programs out to swap. So far, so good. That happens on busy Linux systems every second of the day. Here, though, instead of being written safely to the swap file, the data was written on top of your existing files. Thus, with this bug, your computer could shortly come to a complete and utter stop.
Or, as Torvalds put it, “you can end up with a filesystem that is essentially overwritten by random swap data. This is what we in the industry call ‘double ungood.’” That’s for sure!
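To make the failure mode concrete, here is a toy model in Python. It is only a sketch of the idea Torvalds describes, not the kernel’s code: the “disk” is a flat byte array, and every name in it (`SWAP_START`, `swap_out`, and so on) is made up for illustration. The correct path adds the swap file’s starting offset to each write; the buggy path behaves as if that offset were lost and treats it as zero.

```python
# Toy model only. Nothing here is taken from the Linux kernel; all names
# and sizes are hypothetical and chosen to keep the example readable.

DISK_SIZE = 64
SWAP_START = 48   # assumed offset where the swap file begins on the "disk"
PAGE_SIZE = 4     # tiny "pages" so the result is easy to inspect


def make_disk() -> bytearray:
    """Return a disk whose first 16 bytes hold ordinary file data."""
    disk = bytearray(b"." * DISK_SIZE)
    disk[0:16] = b"IMPORTANT-FILES!"
    return disk


def swap_out(disk: bytearray, slot: int, page: bytes, swap_start: int) -> None:
    """Write one page into swap slot `slot`, relative to `swap_start`."""
    pos = swap_start + slot * PAGE_SIZE
    disk[pos:pos + PAGE_SIZE] = page


def swap_out_correct(disk: bytearray, slot: int, page: bytes) -> None:
    # The write lands inside the swap file's own region.
    swap_out(disk, slot, page, SWAP_START)


def swap_out_buggy(disk: bytearray, slot: int, page: bytes) -> None:
    # The start offset has been "lost", so the write lands at the
    # beginning of the disk, on top of unrelated file data.
    swap_out(disk, slot, page, 0)


good = make_disk()
swap_out_correct(good, 0, b"SWAP")

bad = make_disk()
swap_out_buggy(bad, 0, b"SWAP")

print(bytes(good[0:16]))  # file data untouched
print(bytes(bad[0:16]))   # file data partly overwritten by swap data
```

In both cases the swap write itself “succeeds,” which is why the problem is so nasty: nothing fails at the moment of the write, and the damage only shows up when the overwritten files are read back.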
Torvalds continued, “It really wasn’t a very obvious bug, and it didn’t even show up in normal testing, exactly because swapfiles just aren’t normal. So I’m not blaming the developers in question, and it also wasn’t due to the odd timing of the merge window, it was just simply an unusually nasty bug.”
The master Linux developer wanted everyone to know about this bug because while “rc1 tends to be buggier than later rc’s, we are all used to that, but honestly, most of the time the bugs are much smaller annoyances than this time.”
Torvalds warned, “most of our rc1 releases have been so solid over the years that people may have forgotten that ‘yeah, this is all the new code that can have nasty bugs in it.’”
Most troubling of all, some people have gotten so used to rc1 being mostly reliable that they say to themselves, “‘Ok, rc1 is out, I got all my development work into this merge window, I will now fast-forward to rc1 and use that as a base for the next release.’ Don’t do it this time. It may work perfectly well for you because you have the common partition setup, but it can end up being a horrible base for anybody else.”
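Since the bug was in swap files, a tester might want to know whether a machine swaps to a file at all. On Linux, `/proc/swaps` lists each active swap area with a Type column that reads `partition` or `file`. The following is a minimal sketch that picks out the swap files from that listing; the function name `swap_files` and the sample text are illustrative, not part of any official tool.

```python
from typing import List

# Sample text in the layout of /proc/swaps; the device and file names
# below are made up for the example.
SAMPLE = """\
Filename                                Type            Size            Used            Priority
/dev/sda2                               partition       2097148         0               -2
/swapfile                               file            1048572         0               -3
"""


def swap_files(proc_swaps_text: str) -> List[str]:
    """Return the names of swap areas whose Type column is 'file'."""
    found = []
    # The first line is the column header, so skip it.
    for line in proc_swaps_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "file":
            found.append(fields[0])
    return found


print(swap_files(SAMPLE))
```

On a real system you would pass in the contents of `/proc/swaps` instead of `SAMPLE`. An empty result means no swap file is active, which is presumably the “common partition setup” (or no swap at all) where rc1 appeared to work fine.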
That turned out to be the case with Intel, which had been using rc1 in its graphics continuous integration (CI) systems. The result? Trashed filesystems. Ouch!
Now, to fix all this, Torvalds has pushed out the next release candidate, Linux 5.12-rc2, early, primarily because of the swap problem. Moving forward, all should be well.
But take this as fair warning. A release candidate is, well, a release candidate. You shouldn’t be using it on production systems. It’s very rare for things to go this wrong with a Linux rc, but, as this case shows all too painfully, it can happen.