A professor with whiskers and a tall hat bows to you.
Now Class, Let's Talk about Incentives!

In the age of using internet sites for important things – communication, say, or banking – we’ve grown status pages, so companies can let you know whether their service is currently working.

In fact, a lot of companies first had external status pages grow up around them, run by other people who could tell you whether the site was working. That’s a lot of what made official status pages happen at all. If you don’t provide it, somebody else will.

Why is it so bad for somebody else to provide that? What’s different about a company’s official status page?

As I write this there’s an ugly Discord outage which is barely acknowledged on their status page, so it’s a great time for me to talk about that.

Experts Agree, Everything is Fine

Here’s a weird thing: company status pages tend to heavily downplay an outage. If people are showing up because they think something is wrong, what advantage do you get from lying to them? For instance, here’s what Discord is showing as of a moment ago:

a status page showing a recent incident with API error rate, and below it a graph showing no problems, now or past, with APIs

Notice anything odd? They’re showing an incident with API error rates, but no change in the API uptime. Apparently “error rates are bad, people can’t use it” doesn’t count as affecting the uptime.
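I don't know how Discord computes its uptime graph, so treat this as a guess at one way a gap like this can happen, not a description of their system. The sketch below assumes two made-up metrics: a ping-style "did the health check answer?" number and a "what fraction of real requests succeeded?" number. The `Minute` record, the function names and the traffic figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Minute:
    reachable: bool  # did a health-check ping get any answer at all?
    requests: int    # real API requests received in this minute
    errors: int      # how many of those requests failed


def ping_uptime(minutes):
    """Fraction of minutes in which the service answered a health check."""
    return sum(m.reachable for m in minutes) / len(minutes)


def request_availability(minutes):
    """Fraction of real requests that succeeded over the whole period."""
    total = sum(m.requests for m in minutes)
    failed = sum(m.errors for m in minutes)
    return (total - failed) / total


# Hypothetical hour: 40 healthy minutes, then 20 minutes where half of all
# requests fail but the health check still answers.
hour = [Minute(True, 1000, 0)] * 40 + [Minute(True, 1000, 500)] * 20

print(ping_uptime(hour))                     # 1.0
print(round(request_availability(hour), 3))  # 0.833
```

Both numbers are "true" in their own way. Only one of them matches what the people staring at error messages experienced, and the other one is the wall of green.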

I’m not trying to pick on Discord here. If you look at nearly any company during an outage they do this. They tell you everything is fine, and has always been fine, even mid-outage. They also put up the “we found a problem” notice long after the problem is found – usually after an article has been published and passed around on social media – and put up a “we solved it, everything is fine” message before it’s solved. Here’s Discord’s “everything is fine” notice, from lower on the exact same page at the exact same time, while at the top it (correctly) says people are having problems:

a status page showing the API error incident as being fully resolved 20 minutes after it started, while it's still ongoing

As I wrote this they (finally) put up something acknowledging (some of the) widespread problems with this new update, in a similar downplaying tone.

Here’s the thing: people go to this page because they already know Discord is screwing up. They can instantly communicate with a whole world of people to verify that, yes, Discord is screwing up. Third-party status pages like DownForEveryone say that, yes, Discord is screwing up.

Is Discord that oblivious to problems, such that external status pages are that far ahead of their own assessment? And so is every company with a status page?

Of course not.

Down For Everyone Or Just Me?

The first popular status pages, things like Down For Everyone Or Just Me?, did what the name sounds like: you want to know if the problem is just you. It’s still decent, and things like it often work well. But there aren’t a ton of ways to make money providing this service, so mostly you have to rely on the company for this kind of information.

SREs are the other big cause of status pages. If you’re dealing with a vendor, and you’re reporting problems, it can be frustrating that they don’t respond to you, saying yes, they know about the problem and are working on it. And of course they’re not going to give you regular updates – why would they? You’re not their boss. And saying to a customer “oh yeah, we’re not working right now” is something the company can be sued over. It’s much safer to not say anything. In fact, most companies will tell their employees absolutely not to say anything like that, and instead to direct queries to the PR team, media team or similar.

A status page is a way around that, somewhat. If your operations folks can put up a status page that updates customers, it tells everybody at once, and it doesn’t require extra communication during a crisis. You can just leave the status page up and everything should be fine.

Ops folks, in my experience, are famous for being tactless, direct and pessimistic. Their internal dashboards are often accurate and unadorned. They pride themselves on knowing when something isn’t working. Their history graphs are full of blips and ups and downs, showing when there are minor delays, or normal failovers.

Nothing like a public status page, in other words.

So what’s going wrong? Why is the public information so bad? Why is the only permissible status page a wall of green for all time, even during outages?

Liability and Public Relations

I’m checking the status page again now, a few minutes later, and the problem with IP addresses being widely blocked has been removed as though it never existed. It’s a real and current problem, so they’ve replaced it with the “all fixed” problem with API errors, which doesn’t show on their graphs and was fully corrected hours ago, yet has somehow miraculously re-opened, though only on one part of the page and not the other.

It’s still down, of course. Has been for hours. This is not a case of slightly different points of view. By any reasonable metric they’re outright lying. The various other third-party downtime detectors are all well aware that they’re having problems, and Discord clearly is as well.

Here’s the problem: a status page, being public, becomes a weapon of public relations. The company wants to convince you that they’re reliable. They believe that a graph, from them, acknowledging problems, is worse for them than lying constantly through every outage and error.

Presumably they’re right. If every company does it, that’s normally a sign that it’s the right – or at least profitable – thing to do. A “best practice,” as they say in the business.

You may be certain, looking at these pages, that the PR teams have won over the operations teams when it comes to informing the public about problems.

Here’s an interesting piece of fallout from that: you can’t trust any of these services to be reliable.

(“These services?” At a minimum, I mean any company with an all-green status page history. But really it’s nearly any tech company. I could mince words about some companies publishing some percentage of the postmortem reports that they promise, but I won’t bother.)

You can’t trust them to be reliable, because they clearly believe that simply lying to you about being reliable, and pointing you at the page full of lies, is sufficient to project the image of reliability. And again, this is a nearly-universal practice, so the Occam’s Razor assumption is that they’re right.

But the Efficient Market Hypothesis?

You might reasonably say, “how could a whole industry lie about this? Wouldn’t they get caught?” I tend to think of Kyle Kingsbury, a.k.a. Aphyr, and his many-year exposure of how bad the distributed storage industry is. Dan Luu has a great writeup here if you want the details. I was paying attention as it happened – many of us were – and while we cheered from the sidelines, it was obvious to many people before him how bad most of those products were.

Fundamentally, in cases where the company has all the incentives and it’s hard for a consumer to be sure about things, the company will lie just exactly as much as they can get away with. And that’s a lot. If it’s hard to disprove them, and there’s very little penalty for lying, why wouldn’t they?

The problem with, say, putting up your internal operations metrics for how well things are doing is that it’s worrying. You look much more reliable if you just put up a standard wall-of-green status page.

Which means that unreliability is treated as a problem only to the extent that non-expert, constantly-lied-to customers catch you at it.

If you ever need more reliability than that… mostly you’re out of luck.

But surely you can buy special, more-reliable products from companies where that matters? In theory, yes. For instance, you could imagine buying distributed storage products from companies whose entire business is providing that safety at scale. But Aphyr’s work reminds us that, if there aren’t third parties keeping them honest, what they produce is mostly complete garbage for anyone who actually needs reliability.

Lying is cheaper than telling the truth. The Efficient Market Hypothesis tells us that, if nobody is keeping them honest, companies that bother to really do the work should be out-competed by companies that take the cheaper “lie on our status page” approach.

An efficient market or company will be efficient at following its incentives, not necessarily at doing something you want.

the status page shows that, okay, some users are temporarily blocked, while still not showing any useful acknowledgement that things are down

But Why Do We Care?

You can easily say, “yeah, sure, companies talking in corporate spin, news at 11.” Which is reasonable, certainly.

But here’s the thing: your value as a software developer depends on keeping things working. If companies – if vendors – can’t be relied on to do that because their incentives put them in an ugly spot, that’s important for you to know. Yes, companies talking in spin, news at 11. Yes, companies being far less reliable than advertised, and lying to you about it, news at 11.

But that means you need to plan as though that were true.

Since your value depends on keeping things working, and keeping things working is harder than advertised, it’s important for you to think through that. How can you do more for yourself? How can you do more with components you can validate for yourself?
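One small, concrete version of "validate it for yourself" is to keep your own record of whether a vendor answered you, instead of trusting their graph. Here's a minimal sketch using only Python's standard library. The URL, the function names and the "any status below 400 counts as working" rule are my assumptions for illustration. A real check should exercise whatever you actually depend on, not just a front page.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if nothing answered."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with a 4xx/5xx error
    except OSError:
        return None      # DNS failure, refused connection, timeout, etc.


def probe(url, fetch=http_status, clock=time.time):
    """Take one sample: when we asked, what came back, and whether it counts."""
    status = fetch(url)
    ok = status is not None and status < 400  # assumed definition of "working"
    return {"at": clock(), "url": url, "status": status, "ok": ok}


def availability(samples):
    """Fraction of our own samples in which the service worked for us."""
    return sum(s["ok"] for s in samples) / len(samples)


# Offline demo with a stand-in fetcher, so this runs without a network.
# "vendor.example" is a placeholder, not a real endpoint.
fake_statuses = iter([200, 503, None, 200])
samples = [
    probe("https://vendor.example/health", fetch=lambda url: next(fake_statuses))
    for _ in range(4)
]
print(availability(samples))  # 0.5
```

Drop the `fetch=` argument and run `probe` on a schedule to get real samples. It's crude, and it only measures from wherever you happen to run it, but it's a number nobody's PR team got to edit.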

I plan to keep writing about keeping things working. There’s a lot to think through.

And here’s an important one: you can’t trust companies, or even industries, to tell the truth here. The Efficient Market Hypothesis is not coming to save you. You’re going to have to actually think through this for yourself.

Because otherwise, your stuff doesn’t keep working. And the value you promised doesn’t materialise.

I don’t want to be in the business of fooling suckers into paying me more than I’m worth. I want to be in the business of selling value. And in a world like this, that takes understanding. Let’s keep looking for understanding together.

As I finish writing this, they’re back to acknowledging the other problem (see above). All the bars still show 100% reliability for the last 90 days, of course. Technically push notifications show only 99.99% uptime. I’m not sure what was so bad on that one, down near the bottom of the graphs, that it was allowed to show imperfection.
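For a sense of scale, here's the arithmetic behind a number like 99.99% over a 90-day window. I'm assuming the percentage is plain time-based uptime across the whole window, which the page doesn't actually say. The function name is mine.

```python
def allowed_downtime_minutes(uptime_percent, days):
    """Minutes of downtime implied by an uptime percentage over a window."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


print(round(allowed_downtime_minutes(99.99, 90), 2))  # 12.96
print(allowed_downtime_minutes(100, 90))              # 0.0
```

So the one imperfect bar admits to about thirteen minutes in three months, and every other bar admits to none, during an outage measured in hours.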

Am I picking on Discord? Not really. I’m taking out my frustration on our entire industry as they “just follow best practices.” They’re not worse than the rest. And that’s the problem.