There’s an old lightning talk from RubyConf 2019 that I love, about how there’s no such thing as knowing your computer “all the way to the bottom.” At no point does it stop being a stack of leaky abstractions.
An unsuspecting fellow emailed me in response, asking, “yes, but isn’t there one, final machine language at the bottom?” I don’t even think he was serious, he was just being snarky. But I’ve gotten that response from people who mean it too many times, and I wrote back a mini-novel.
There is, most certainly, a machine language. And I meet people who would argue precisely that. Here’s the thing: for a given processor type, there is an ‘absolute’ such language. But there are leaks in the abstraction. The spec-version isn’t exactly how it’s implemented for a given model of processor.
So we go deeper. Instead of the AARCH64v2 spec, which is inexact, we move to the spec for the AWS Graviton2 processor, which has weird quirks and differences in speed and errors in implementation. Unfortunately, there are still leaks in that abstraction, even if our knowledge of that model were perfect (and it isn’t): for instance, an individual processor is made by an excellent, but imperfect, process. As a result a processor, especially a large processor, contains tiny physical deviations from the ideal. Those deviations (tiny errors in the manufacture of transistors) can cause tiny deviations in how they act. There is an extensive verification process (proprietary to the manufacturer, not described in full detail to the public, but we can assume it imperfectly based on how other manufacturers do it and what they publish), and it catches a very high percentage (but not 100%) of all errors.
And of course, even the published spec is full of instances of “we’re leaving that dusty corner of madness alone.” How precisely do you need to maintain a specific voltage, to be sure there are only as many bit-flip errors as usual? Eh, let’s publish a number and assume nobody will ever try, and if they get there accidentally we’ll say, “well, we never said you could do that.” And if it happens in a certain percentage of “valid” cases, how likely is it that anybody will ever prove, with certainty, that that can happen? If it does, we’ll have the PR department issue a bland non-apology.
But this process, in addition to the leaks we can account for, has other leaks: the individual processor also ages, which means every individual transistor has a continuing risk of hardware failure, always. Processors are well-shielded, but not perfectly shielded, from cosmic and magnetic radiation. The designed paths of our processor silicon are resistant, but not 100% resistant, to all the various sorts of environmental noise. We have a variety of hardware checks (processor checksums, error-correcting RAM) meant to catch this problem a very high percentage of the time where it is hardware-detectable. A very high percentage is, alas, not 100%, so larger-scale software processes are often written to account for the more pernicious kinds of hardware failure (wrong results rather than simple no-result cases) because, at larger scales, one actually hits those cases. One of my employers, a video analytics company, hit bad packets that had been mangled but nonetheless passed the TCP/IP checksum, very frequently because at large enough scale that will happen many times a day. Google is in a similar situation with error-correcting RAM, and sees (shipped, production) processor bugs with remarkable frequency. They are different from the rest of us in that they have so much scale they can track down the specific bug instead of assuming, as the rest of us do, that it’s our fault and/or the computer is haunted. Of course, for the parts we can see, it’s our fault or somebody’s fault or the computer is haunted. But if you could get rid of all those problems, there are others that remain, and at Google’s size you can often see the pattern of them in the ‘random’ waves of failures that other people never notice.
It is not merely that we have human errors from hardware designers down at that level. It is that every level, from the relatively controllable (in the Pentium FDIV bug, one bit was wrong in a giant lookup table, so certain multiplications were off by one bit) to the utterly mind-bending (carefully shield the ‘unused’ bits of your processor, because patterns in electricity will create random bit-flips at a distance) has things we genuinely cannot fix in the mixture. It is merely that, as you descend a level deeper into the madness you can see a different set of them.
At no point, not all the way down to quantum physics, does it become fully regular with no strange parts that defy human understanding, not even if you use raw machine code, not even if you know all published specs of your processor, not even if you do a full set of manufacturer-style verification tests on your specific and individual processor. At every level, you apply a process to make it more likely, but not fully likely, that you’ll catch most of what’s going on. And you do it by zooming in until the set of things under your supposed control now includes a lot of parts that are weird, or wrong, or designed incorrectly, or a complex tradeoff with no right answer that was hidden from you, or literally beyond the collective human ability to understand or (most likely of all) a blend of these which, until you zoom in even further, is simply marked “it’s complicated” or (far more likely) marked incorrectly as something simple.
Isn’t “raw machine code” the bottom? Or maybe “raw machine code for one specific processor model” or “raw machine code for my unique individual processor?”
Well, those are places where a person might choose to stop thinking about the problem, if they wished. All three of them, and many similar places besides.
I have this debate periodically with people. Inevitably, the ones who want to keep arguing, give some variant of “yes, but that’s the level where I want to stop thinking about it” or “beyond that it’s too hard for my use case,” with the unspoken addendum that I’m rude to point out the leaks in their abstractions.
Great. They need to stop acting like that makes them more noble than a programmer who gets down to C, or PHP, or Ruby, or shellscripts and says “yes, but that’s the level where I want to stop thinking about it.” There is no true, correct, bottom level where the abstractions stop leaking.
Or if there is, theoretical physicists are far from reaching it.