Root Causes 325: Certificate Error Causes SharePoint Outage
A recent outage in Microsoft SharePoint was caused by an error in certificate installation. We explain what happened and the lessons to be learned.
- Original Broadcast Date: August 11, 2023
Episode Transcript
Lightly edited for flow and brevity.
-
Tim Callan
I am looking at an article. This was written by Lawrence Abrams. It's in Bleeping Computer. The date is July 24, 2023 and the headline says it all - Microsoft SharePoint Outage Caused by Use of Wrong TLS Certificate. So what happened here, Jay?
-
Jason Soroko
According to the article, and this is what we're going by, it looks like there was a short outage. I think it was only 10 minutes.
-
Tim Callan
Microsoft claims 10 minutes. Other people think it was longer, but that might have had to do with systems catching up and propagating. So go ahead.
-
Jason Soroko
It looks like a certificate issued for the .de top-level-domain version of the Microsoft site, you know, a certificate protecting the .de domain, was added to the .com servers.
-
Tim Callan
Yep. Just plain put the wrong certificate on the wrong machine.
-
Jason Soroko
That is the long and short of it, and it will cause exactly the error that was posted in the article, which is, hey, you know, a bad guy might be trying to intercept your traffic, because the domain you're browsing to is not covered by the certificate that is on the web server.
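The check a browser performs here can be sketched roughly as follows. This is a simplified illustration, not a browser's actual implementation: it only handles exact names and a single left-most wildcard label, and the function names and the example.de / example.com domains are hypothetical stand-ins.

```python
def name_matches(cert_name: str, hostname: str) -> bool:
    """Return True if one certificate name covers the hostname.

    Simplified rule: an exact match, or a leading "*." wildcard that
    stands in for exactly one left-most label.
    """
    cert_name = cert_name.lower().rstrip(".")
    hostname = hostname.lower().rstrip(".")
    if cert_name.startswith("*."):
        head, sep, parent = hostname.partition(".")
        # The wildcard must replace one non-empty label, nothing more.
        return bool(head) and sep == "." and parent == cert_name[2:]
    return cert_name == hostname


def cert_covers(cert_names: list[str], hostname: str) -> bool:
    """Return True if any name listed in the certificate covers the hostname."""
    return any(name_matches(name, hostname) for name in cert_names)


if __name__ == "__main__":
    # Hypothetical names: a .de certificate served from a .com host.
    print(cert_covers(["*.example.de"], "files.example.com"))
    print(cert_covers(["*.example.com"], "files.example.com"))
```

When the first kind of check fails, the browser cannot tell an honest installation mistake from an interception attempt, so it shows the warning either way.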
-
Tim Callan
This is TLS working correctly, right?
-
Jason Soroko
This is.
-
Tim Callan
The certificate installed on the wrong machine is exactly the same from a TLS perspective as the person impersonating you using the wrong machine. It's the exact same thing and under those circumstances, this is exactly what TLS is supposed to do.
-
Jason Soroko
It is exactly what it is supposed to do. The browser did what it was supposed to do, and obviously some people saw an outage. And an outage on those kinds of systems, we're talking about OneDrive and SharePoint, and so many enterprises are using those, Tim.
-
Tim Callan
How many human beings are using SharePoint at any given time? Exactly.
-
Jason Soroko
It's a huge amount of humanity. A surprisingly huge amount of humanity, I think. It’s an incredible thing. But, I think the heart of this, Tim, comes down to, well, how could a mistake like this be made? Well, I dare say, because I don't have the facts and figures, but this looks like human error. Meaning that of the certificate change steps, the registration, the issuance, the renewal, all the things that are necessary here, basically the installation of the cert on that web server was probably done manually.
And, we're talking about just human error here, meaning automation could have helped you, especially if that automation configuration went through some sort of double check procedure. But, certainly, this looks to be human error without automation.
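One way to picture that kind of double check is a gate that runs before any certificate is installed. The sketch below is an assumption about what such a gate might look like, not a description of Microsoft's or any vendor's tooling: `CertInfo`, `preinstall_problems`, the 7-day minimum, and the example domains are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class CertInfo:
    names: list[str]     # names the certificate is valid for
    not_after: datetime  # expiry time, timezone-aware UTC


def covers(cert_name: str, host: str) -> bool:
    """Exact match, or a leading '*.' wildcard covering one label."""
    cert_name, host = cert_name.lower(), host.lower()
    if cert_name.startswith("*."):
        head, sep, parent = host.partition(".")
        return bool(head) and sep == "." and parent == cert_name[2:]
    return cert_name == host


def preinstall_problems(cert: CertInfo, target_host: str,
                        now: Optional[datetime] = None,
                        min_days_left: int = 7) -> list[str]:
    """Return reasons NOT to install this certificate on target_host."""
    now = now or datetime.now(timezone.utc)
    problems = []
    if not any(covers(name, target_host) for name in cert.names):
        problems.append(f"names {cert.names} do not cover {target_host}")
    if cert.not_after - now < timedelta(days=min_days_left):
        problems.append(f"expires too soon: {cert.not_after.isoformat()}")
    return problems


if __name__ == "__main__":
    de_cert = CertInfo(["*.example.de"],
                       datetime(2030, 1, 1, tzinfo=timezone.utc))
    for problem in preinstall_problems(de_cert, "files.example.com"):
        print("refusing to install:", problem)
```

An automated pipeline would stop the rollout whenever the returned list is non-empty, which is the step a manual installation tends to skip.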
-
Tim Callan
And we don't know for sure, and we'll never know, but why don't we make that the working assumption for the rest of this episode? Because that seems very likely, like, that's the explanation.
So then you kind of come to well, like, this is just a very tidy example of all the stuff we're always talking about. Like, if you want to talk about a company that knows how to get IT, you can't do much better than Microsoft. And yet, here we are.
-
Jason Soroko
Especially on one of their most important sites. I mean, Microsoft has a lot of sites, and this is probably one of their key ones. And for a mistake like this to be made, I would say to any of you who are in a company that has a web presence, which is probably anybody who has electricity: it shows you what's coming as we potentially shorten certificate lifespans in the future, with Google's 90-day announcement, which is something we've talked about. I think a lot of people might think, you know, I'm going to suck this up and we're going to continue installing manually, and instead of once a year, we're gonna do that four, five, six times a year. Oh, my goodness, if Microsoft can't get it right sometimes, then what's it gonna look like for the rest of us if we're not automating and we're not using Certificate Lifecycle Management?
-
Tim Callan
Right and to pick up something you just threw down. Not only just that this is Microsoft, but this is Microsoft on an insanely important part of the business. If this was some little obscure thing off in a corner somewhere that nobody really looks at, then you might rationalize that. I mean it still is embarrassing, but you might say, well, the risk/reward math is different. But what's the consequence of a 10 minute outage on SharePoint and OneDrive? Oh my God. And so in that regard, the stakes are extremely high, and this was an extremely high stakes portion of the business where this occurred.
-
Jason Soroko
Extremely high stakes. And I would say, whether you're a small or medium-sized business or a big-market company, it doesn't matter. You don't have to be one of the big guys to be impacted in a big way by an outage caused by a mistake like this. And it's so easy to make mistakes like this. I mean, I've installed certificates a multitude of times before, and I wish I hadn't. It all comes down to this: I now know deep in my heart, Tim, and in my intellect, that if you automate this, you let computers do what they do best. Your job now is to configure it properly and test it, and once that's done, you can have some level of confidence that you won't fat finger it, and you won't make the mistake of using the wrong certificate, or letting a certificate expire, or any of the other mistakes that you might make.
-
Tim Callan
I think this is precisely right. And the other thing, again, that you just said that I just want to make sure we put a little bit of spotlight on is, this is a mistake that anyone can make. So it's not even like yeah, somebody made an error, but we can't even turn around and attribute this to incompetence. What you attribute this to is, it's a task that is fundamentally error prone, that must be repeated a whole lot and where any error, even a tiny error, has vast consequences. That's basically the situation with certificates.
-
Jason Soroko
And it's interesting to me that in so many areas of IT and other aspects of computing, anytime you have a problem that's exactly like you just described, Tim, a high-risk task that has to be repeated, where fat fingering is just so incredibly easy, you let a computer do it. You automate away the human error factor, and it's not like the industry hasn't come up with ways to do that. So, this is a theme we've had over and over and over again on this podcast, and we have to repeat it. And now, with the shortening of certificate lifespans, that matters even more. Imagine, Tim, going from only having to deal with this problem once a year to now doing it five, six times a year.
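The "once a year" versus "five, six times a year" arithmetic can be made concrete with a small sketch. The 398-day figure is the current maximum lifetime for publicly trusted TLS certificates and 90 days is the lifetime discussed above; the 30-day renewal lead time and the function name are assumptions for illustration only.

```python
def installs_per_year(validity_days: int, renew_before_days: int = 0) -> float:
    """Average installs per year if each certificate is replaced
    renew_before_days before it would expire."""
    interval = validity_days - renew_before_days
    if interval <= 0:
        raise ValueError("renewal lead time must be shorter than validity")
    return 365 / interval


if __name__ == "__main__":
    for validity in (398, 90):
        rate = installs_per_year(validity, renew_before_days=30)
        print(f"{validity}-day certificates: about {rate:.1f} installs per year")
```

Every one of those installs is another chance for the kind of manual slip described in this episode, multiplied by the number of servers involved.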
-
Tim Callan
Absolutely. I'm pretty sure that at our 2022 lookback, you and I predicted that we were going to have more stories about high visibility outages due to certificate errors. I'm also pretty sure that when we first introduced the idea of Google 90-day certs, we also predicted that even prior to 90-day certs, we were going to be having more stories about outages due to certificate errors, and indeed, both of those predictions have proven true.
-
Jason Soroko
Tim, I think the last one before this was an Elon Musk tweet about an outage of Starlink, which actually affected me personally. And he tweeted straight up, it was a certificate expiry. A certificate had expired. It was not dealt with manually properly. There was no automation, and they were going to review that. There it was. Just another small non-technical company. Starlink.
-
Tim Callan
What did they know about IT? Anyway, there we go. And I think we've promised the listeners that when these things happen in high visibility circumstances with companies that definitely have the technical chops to get this right, that we're going to bring it to your attention, not to shame those companies, but to point out how hard it is. It's hard. It's hard. And here we've done it again.
-
Jason Soroko
Seek a partner and understand that the technologies to help you to get away from this problem do exist. We know what those things are. Come to us. We'll help you out.