Root Causes 383: Delayed Revocation Events by the Numbers
An epidemic of delayed revocations has infected the public CA community. We track delayed revocations since the beginning of 2021, examine the trend line, and discuss root causes.
- Original Broadcast Date: May 2, 2024
Episode Transcript
Lightly edited for flow and brevity.
-
Tim Callan
Okay, so we have been discussing quite a lot over the last eight episodes or so, the incredible goings on in the world of public CAs. And, you know, there's been a lot of attention right now, a lot of CAs are being found to have a lot of compliance problems, and there's been a lot of public dialogue about it. It's really kind of a big deal. I don't want to rehash all the past stuff. But listeners, if you haven't followed all of this, we had our Episode 378 Drama on Bugzilla, our Episode 372 Bugzilla Bloodbath, our Episode 377 Is CPS Issuance Misalignment a Revocation Event and our Episode 378 Why Are Forced Revocations so Difficult? And you may see, especially out of those last two, that there's been a lot of focus in this whole saga - which I encourage you to go back and listen to to get the full story of - there's been a lot of focus on the idea of revoking misissued certificates or not revoking misissued certificates or revoking them but taking a very long time, essentially taking more time than is allowed. And that has led to a syndrome of delayed revocations, and a lot of issues posted as Bugzilla bugs around the concept of delayed revocations and a whole lot of debate and dialogue around that. So that's kind of the baseline.
And what we wanted to do was, we wanted to do a little bit of research and really put some numbers on this. And so I want to thank my colleague, Martijn Katerbarg. Martijn is pretty well known inside of the web PKI industry. He did this research. So I don't want to take credit for Martijn’s research, but I do want to share it because it's really, really interesting. And what we have here, Jason is, what we did is we went back and we tried to track, we tried to put numbers on how has the frequency of these failure to revoke episodes, as have been publicly reported and acknowledged in the Bugzilla platform, how is that different now from kind of the normal baseline? That's what we're interested in understanding. Does that make sense?
-
Jason Soroko
Makes sense, Tim. Look, my experience with this is, it has happened from time to time but we seem to be hearing more about it, Tim. We've certainly podcasted more about it. You seem to have some facts and figures that really substantiate that.
-
Tim Callan
And I've been walking around saying to people, oh, well, this is out of the norm. This isn’t what it is. And we said, well, let's put some numbers on that and test that idea. And just spoiler alert, very much this is out of the norm. So here's how the methodology works. Every Bugzilla bug has the opportunity to be tagged. And one of the tags is a delayed revocation tag. And so if every bug in Bugzilla was tagged correctly, which I can't vouch for, because I didn't go look at all of them, but assuming they were, then these numbers would be perfectly accurate. I think these numbers are mostly accurate. They may not be perfectly accurate. But when you see what the trend looks like, I'm very comfortable saying even if there's a little bit of not quite reporting things right in Bugzilla, it's not going to change the trend. So that's the basic methodology. So we created a script, and we just got some outputs and what I want to do is I just want to read - - this goes quarter by quarter and we're going back to the start of quarter one of 2021 and I want to go quarter by quarter and I want to tell you how many new delayed revocation Bugzilla bugs were opened against CAs in the quarter. Okay. That's when it was opened. Does this make sense, Jason?
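The quarter-by-quarter tally described here can be sketched roughly as follows. This is not the script used for the research; it is a minimal illustration that assumes the bugs have already been exported as records with an opening date and a list of tags. The field names `opened` and `tags`, the tag string `delayed-revocation`, and the sample records are all placeholders made up for this example.

```python
from collections import Counter
from datetime import date

# Hypothetical tag string; the label actually used in Bugzilla may be spelled differently.
DELAYED_TAG = "delayed-revocation"

def quarter_of(d: date) -> str:
    """Return a label such as '2021-Q1' for a calendar date."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"

def count_new_bugs_by_quarter(bugs):
    """Count bugs carrying the delayed revocation tag, grouped by the quarter they were opened."""
    counts = Counter()
    for bug in bugs:
        if DELAYED_TAG in bug["tags"]:
            counts[quarter_of(bug["opened"])] += 1
    return dict(counts)

# Made-up sample records, for illustration only.
sample = [
    {"opened": date(2024, 1, 15), "tags": ["delayed-revocation"]},
    {"opened": date(2024, 2, 3), "tags": ["delayed-revocation"]},
    {"opened": date(2024, 2, 20), "tags": ["misissuance"]},
    {"opened": date(2024, 4, 9), "tags": ["delayed-revocation"]},
]
print(count_new_bugs_by_quarter(sample))  # {'2024-Q1': 2, '2024-Q2': 1}
```

Grouping by the date the bug was opened, rather than the date it was closed, matches the "new bugs opened in the quarter" framing used in the episode.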
-
Jason Soroko
Yeah. So number of bugs that have been labeled – -
-
Tim Callan
New.
-
Jason Soroko
New bugs that have been labeled as the misissuance - -
-
Tim Callan
As the failure to revoke on time. Delayed revocation is the name of it. It’s failure to revoke on time, which by the way, according to the definition, includes failure to revoke ever, which is important because we have two of those open right now, where the CA has just said, no we're just not going to revoke them. We're just not gonna. Right. So those are all lumped together in the delayed revocation tag. They all get the same tag, right? They're categorized the same way. So with that, shall we begin?
-
Jason Soroko
Let's do it.
-
Tim Callan
Okay. So starting in Quarter one of 2021. So this is for the year 2021. Quarter one. Seven new bugs opened. And Quarter two, four new bugs opened. Quarter three, two new bugs opened. Quarter four, four new bugs opened.
Now we move on to 2022. Quarter one, three new bugs opened. Quarter two, zero new bugs opened. Quarter three, zero new bugs opened. Quarter four, six new bugs opened.
Now we move on to 2023. Quarter one, two new bugs opened. Quarter two, zero new bugs opened. Quarter three, four new bugs opened. Quarter four, six new bugs opened. Okay. So we see kind of, that's averaging like three a quarter.
-
Jason Soroko
One a month.
-
Tim Callan
Yeah. Three or four a quarter, let's say. Now, Quarter one, 2024. 14 new bugs opened.
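For reference, the baseline arithmetic for 2021 through 2023 works out as below. The counts are the ones read out in this episode; the variable names are only for illustration.

```python
# New delayed revocation bugs per quarter, as read out in the episode.
counts = {
    2021: [7, 4, 2, 4],
    2022: [3, 0, 0, 6],
    2023: [2, 0, 4, 6],
}

yearly_totals = {year: sum(quarters) for year, quarters in counts.items()}
total = sum(yearly_totals.values())
num_quarters = sum(len(quarters) for quarters in counts.values())

per_quarter = total / num_quarters        # average new bugs per quarter
per_month = total / (num_quarters * 3)    # average new bugs per month

print(yearly_totals)                      # {2021: 17, 2022: 9, 2023: 12}
print(total, round(per_quarter, 2), round(per_month, 2))  # 38 3.17 1.06
```

That is consistent with the "one a month" and "three or four a quarter" estimates given in the conversation.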
-
Jason Soroko
Okay.
-
Tim Callan
Quarter two 2024. Now, I will note that it is still April, when you and I are recording this. So this is less than one-third of the quarter. We’re three weeks in, right? Quarter two, seven new bugs opened. So if you were to multiply that by three, we'd be trending to 21. So it's growing. And basically in the first four months, not quite four months of the year, we have the same, more or less the same number of bugs, as were opened in the entire two previous years. So that puts us at about 6x the normal pace right now.
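The comparison can be checked with a few lines of arithmetic. This sketch uses only the figures quoted in the episode; the "multiply by three" step is the rough extrapolation used in the conversation, not a statistical forecast.

```python
# 2021-2023 baseline: 38 new bugs over 12 quarters (counts quoted in the episode).
baseline_per_quarter = 38 / 12

q1_2024 = 14
q2_2024_so_far = 7          # roughly three weeks into the quarter at recording time

projected_q2 = q2_2024_so_far * 3                     # rough extrapolation from the episode
year_to_date = q1_2024 + q2_2024_so_far
prior_two_years = (3 + 0 + 0 + 6) + (2 + 0 + 4 + 6)   # 2022 plus 2023

pace_multiple = projected_q2 / baseline_per_quarter

print(projected_q2, year_to_date, prior_two_years)    # 21 21 21
print(round(pace_multiple, 1))                        # 6.6
```

So 2024 to date equals the combined total for 2022 and 2023, and the projected quarter is a bit over six times the historical quarterly average.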
-
Jason Soroko
Thank you for that, Tim.
-
Tim Callan
And trending up.
-
Jason Soroko
Yeah. Like really trending up too? It's hockey sticking in its own way. So I'm going to ask the obvious question, Tim, which is, what the heck is going on here? Because there's obviously something going on. Now, you said something in a previous podcast that I'm going to hook onto and I'm going to ask you whether or not this is related or if it's just something you said. And that is one of the active members in the Bugzilla forum ran a linter and a whole pile of cheese fell out of that.
-
Tim Callan
Heck yeah. Absolutely it is, Jason. So, gonna give you - - I'm gonna say - - I'm gonna give you a two part answer on the root cause. And the first part is what you just said. So a technically astute individual in the desire to get a sense for how much perhaps unreported non-compliance was going on out there, started running linters against just random samples out of CT logs. And as this individual began finding results of misissued certs, he just sent reports to those CAs and said, hey, by the way, here's a misissued cert. There you go. Go do your thing. And that led to what we've been calling the Bugzilla Bloodbath.
That led to this giant surge in bugs in Bugzilla, in general. And I think that's a valuable and important point that we may want to return to. But the first thing that happened was, there were a bunch of new kinds of misissuances being discovered, and they were incidents being discovered against a group of CAs who weren't used to having incidents reported against them. And that's a valuable point, too. We're seeing names we don't usually see. These are very specific, I'll call them niche CAs. They're serving some kind of niche market, usually a geographic segment, but not necessarily and they're very specialized. And they're just not the sort of people who are undergoing daily scrutiny the way a very large CA like Sectigo or Let's Encrypt might be. Right? And so that was the first part because there's just more bugs and because there's more bugs out there, to some degree, if the failure-to-revoke rate stayed consistent, then you would expect it to grow. But there's not six times more bugs and you wouldn't expect it to grow six-fold, right?
So the second piece of that is what I just got to. It's the set of CAs that got these bugs written against them. And it's not a group of CAs that is typically doing this kind of thing and I think as a consequence, we've seen a lot, not all of them, not all of them by any means. Some of them responded very well. But we've seen a lot of these CAs have dysfunctional responses to getting this kind of report. We saw another rash of bugs in people who didn't respond to their certificate report correctly because they didn't know how, because they didn't have an established practice. They kind of weren't expecting this. They were sort of, you know, living their life out of the public eye. And they weren't really expecting to get a certificate report and when it came in, they didn't realize they got it, or they didn't know how to deal with it. So we saw a big bunch of failure to respond to certificate reports correctly. And then many of those CAs also turned around and failed to do the revocation correctly. And so that’s a valuable and important point but I don't think that's the whole story.
And I think if we just left it at that we'd be missing a very important part of the story, which is the nature of the reason that they failed to revoke on time. And what I mean by this is, there's a few reasons a CA might fail to revoke on time. They might fail to revoke on time because they have some kind of technical failure, or procedural failure where they don't get it right. They don't know what to do. They run their software, but their software has a bug, and the certs don't get revoked, or something like that. We've actually had that in the past. I remember a revocation event within my span at my job where there was a bug in our revocation engine and it was supposed to revoke the certs and it didn't. And we were late because we realized it got stuck and we went and we did them manually, and we wrote ourselves up. So that kind of thing can happen. But look at the open bugs. At present, there are 22 open delayed revocation bugs. So of those 22 open delayed revocation bugs, 18 of them were a deliberate decision by the CA not to revoke on time, as opposed to a mistake of a technical or procedural nature. It wasn't a CA who intended to revoke on time, and it didn't work out. These are CAs, who just plain decided to be late.
-
Jason Soroko
Tim, that is a really important point. There is late because, oh geez, our systems just didn't do it. There was a real technical problem, but we have every intention to. Level two, which is, oh, my God, we are an insufficient CA. We're a real small player. We've never dealt with this before. We'd love to, but we just don't know how to do it in time.
-
Tim Callan
We just didn’t get our act together for whatever reason. We made stupid mistakes. And we don't usually do this and shame on us. And that's not good either, but it's not good in a different way. Right? You and I have talked about this in the past. We've talked about it with some of the CA distrust episodes. I remember we talked about this a lot when we talked about TrustCor being distrusted. There are two kinds of potential problems with a CA and one is competence problems. They don't have the technical capability to do the things they need to do, and the other ones are integrity problems. They don't choose to do the things that they're supposed to do. And when a CA just decides to willfully fail to revoke on time, I categorize that as an integrity problem, not a competence problem.
-
Jason Soroko
Got it. And then there's this third category, Tim, right? And I gotta tell you, I am more than a little surprised - and I'm trying to be polite here – about CAs, who have the capability and, you know, weren't the originators of self-reporting, didn't find it themselves and also are saying, No. We're not going to revoke.
-
Tim Callan
And then yeah, so, and then the two open incidents right now of we're not going to revoke at all, have been categorized in my numbers that I'm giving you today as late revocation, right. And if revocation is, you know, not until 399 days have passed, and they're all expired anyway, then that's still late. And so, you know, either way, those are in there, but those count among the 18 of the 22 that have just willfully decided not to revoke on time. Now, I can go one step further in terms of the numbers. So of those 18 that decided willfully not to revoke on time, a majority of them, 13 out of the 18, have said that there is no security impact due to the late revocation event.
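Expressed as percentages, the breakdown of the 22 open bugs looks like this. The three counts are the ones quoted in the episode; the helper name `pct` is just for this sketch.

```python
open_bugs = 22                    # open delayed revocation bugs
deliberate = 18                   # CA chose not to revoke on time
no_security_impact_claimed = 13   # of those, CA stated there was no security impact

def pct(part, whole):
    """Percentage of whole represented by part, rounded to one decimal place."""
    return round(100 * part / whole, 1)

print(pct(deliberate, open_bugs))                   # 81.8
print(pct(no_security_impact_claimed, deliberate))  # 72.2
print(pct(no_security_impact_claimed, open_bugs))   # 59.1
```

The last figure shows that the "deliberately late, no security impact" combination accounts for more than half of all open delayed revocation bugs.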
So, again, we're seeing this trend, right? We're seeing this extreme homogeneity in terms of the responses to these events to the point where we have a very typical - - More than half the time, we see a typical pattern and the typical pattern goes as follows. CA has a certificate report made against them about misissuance that they didn't self-report. CA goes and looks into it, writes up a Bugzilla bug after the reports have been made against them, and determines that there is some number, a positive number of misissued certificates - and by the way, sometimes these numbers are in the tens of thousands, Jason, just to be clear - there is a positive number of misissued certificates and then the CA decides that they will not meet the revocation timeline, which in every one of these 22 is a five day revocation by the way. There's no one day revocations in this mix. So will not get it done within 120 hours, even though they technically could and then the CA turns around and their justification for that is that this does not have any security impact. And then again, the common thread that you see throughout this is they will turn around and say the consequences to the ecosystem of doing this revocation before these certificates are replaced, are greater than the consequences of sticking with the misissued certificates. So this is a very clear pattern.
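To make the five day timeline concrete, here is a small sketch of how a 120 hour deadline and the resulting lateness could be computed. It assumes the clock starts at a timestamp called `clock_start`; which event actually starts the clock is defined by the applicable requirements, not by this example, and the dates below are made up.

```python
from datetime import datetime, timedelta, timezone

# Five day revocation window, i.e. 120 hours.
FIVE_DAY_WINDOW = timedelta(hours=120)

def revocation_deadline(clock_start: datetime) -> datetime:
    """Latest time a certificate can be revoked and still be on time."""
    return clock_start + FIVE_DAY_WINDOW

def hours_late(clock_start: datetime, revoked_at: datetime) -> float:
    """Hours past the deadline, or 0.0 if the revocation was on time."""
    overrun = revoked_at - revocation_deadline(clock_start)
    return max(overrun.total_seconds() / 3600, 0.0)

# Made-up timestamps, for illustration only.
start = datetime(2024, 4, 1, 9, 0, tzinfo=timezone.utc)
print(revocation_deadline(start))                                            # 2024-04-06 09:00:00+00:00
print(hours_late(start, datetime(2024, 4, 10, 9, 0, tzinfo=timezone.utc)))   # 96.0
print(hours_late(start, datetime(2024, 4, 5, 9, 0, tzinfo=timezone.utc)))    # 0.0
```

Under this framing, a certificate that is never revoked and simply expires counts as late for every hour after the deadline, which is why those cases carry the same tag.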
-
Jason Soroko
Wow. Okay. So, Tim, in terms of that, that's not usual. Is it? I mean, you're probably closer to it than I am but I'd love your opinion just on, okay, so you’ve given us the numbers. Those numbers were clearly trending towards something different right now. You've broken down what the meaning of it is. You’ve broken down what the intentions of the CAs are. That more, you know, I’m trying to choose the word carefully here - - the CAs that are kind of bucking the system a bit. In terms of seeing that before or what it might lead to, can you give us some more color on what's going on there?
-
Tim Callan
Yeah. Well I think - - so part of the reason that I think that we see this growth trend, and this clear pattern, it's not just that a whole bunch of new bugs were written up against CAs. I think it's also that CAs watch what other CAs do. They're supposed to. It's actually one of the rules. One of the rules is you're supposed to watch what happens in the whole ecosystem and every time there's an incident anywhere, you're supposed to ask them, how does this inform me to make me more knowledgeable as a CA. So any CA that isn't reading all these bugs and digesting all these bugs is being irresponsible. And you can see clearly that they are because everybody is singing from the same hymn book. So I think what you're seeing is the CAs that come to the - - early on, one or two CAs decided, we're going to try this we're not going to revoke thing because we don't want to do that. For whatever reason. It's going to make our customers unhappy, and etc. And then that happened. And other CAs looked at it and said, that's a good idea. I think I'll do that, too. So we are having CAs learning from CAs, which is the point, which is why we have the mechanisms we have. Unfortunately, what's happening is they're teaching each other bad behaviors.
-
Jason Soroko
Interesting. So here's a question then. The long tail of the smaller CAs, do you think some of them are seeing some of the larger CA responses and saying, maybe I can hide behind some of that?
-
Tim Callan
Yeah, and I think this has been identified a lot in the public dialogue is two things. One is absolutely. There's one high visibility CA that's drawing a lot of fire right now and to some degree, it's serving as a useful smokescreen for other CAs, who they say I'm really small, I don't have that many active certificates. Nobody really cares. As long as I don't call attention to myself, I'll probably get away with it. Right. I think there's absolutely an aspect of that. I also think there's an aspect of just more generally saying, I don't believe that there will be a negative consequence because there's not enough granularity and enforcement.
So one of the things that's been discussed a lot at the public level is that traditionally, browsers have taken one of two actions, which is they continue to trust you, or they distrust you. So it's the equivalent of the death penalty. And the problem is if the only punitive measure you have is the death penalty, and if you don't want to be some kind of genocidal sociopath, then all kinds of crimes go unpunished, right. And if you live in a sensible society that is not going to punish you for running a stop sign, then at that point, there's no consequence to running stop signs. And as a result, a lot of people start running stop signs. Right? And this is what you're seeing now. You're seeing people looking here and they're saying, you know, I don't think I'm gonna get distrusted over this one incident and there's no percentage, there's no upside, in revoking the certs on time and pissing off my customers, so I'm just not gonna because I will get away with it. And sadly, unless something changes they will. And so there's, again, been a lot of public dialogue to say, perhaps what we need is more granularity in the response. Perhaps we need a way for the misissued certificates that are not being revoked on time for the damage of those certificates to be mitigated inside of the ecosystem on a decision that can be taken on the browser side without requiring the revocation from the CA and at the same time, this might have a demotivating effect on CAs for future misbehavior. If that makes sense.
-
Jason Soroko
Yeah. Tim, I'm going to sum up my final thought just here by repeating a lot of what was in podcast Episode 380. And really what we meant in that is my same exact thoughts right now, which is, we probably wouldn't even be having this conversation if most certificates, most or all publicly trusted certificates, were managed certificates, meaning managed by a certificate lifecycle management system, including automation.
-
Tim Callan
Absolutely. Absolutely.
-
Jason Soroko
Number two, if we had lifespans of certificates that were in the range of 90 days and 10 days and as well, if we had the thinking amongst ourselves that misissuance is inevitable - - like I love the numbers you quoted, because it shows, yes, there were quarters that were zero, but that wasn't the majority.
So in other words, misissuance is inevitable, and it is somewhat constant. And so therefore, with that in mind, the pain of mass revocation, which was the whole point of Episode 380, you can mitigate it right now.
And I hope that that's what the future is, you know, regardless of how this all falls out with the browser trust of the different CAs that are playing these games, I think that the end customer of publicly trusted certificates needs to employ certificate lifecycle management, employ automation, and embrace shorter certificate lifespans.
-
Tim Callan
Yeah. I agree with all of that above and I'm gonna make one more prediction, which is, I do think something is going to change in the web PKI because if nothing does change, then this is just going to become the new normal, because people are going to look and say, hey, well, now I have no good reason to stick with this revocation timeline so I'm just not gonna. And, yes, maybe someone could get distrusted here or there, but, you know, that's an extreme response. And so I do think there will be some form of updated mitigating responses that are in the browsers’ toolchest that are more than what we have today. I do think that's coming. I don't know what form they will take. I don't know when, but I do think that's coming and when they do come, I do think that the web PKI will be stronger for it and that will be a net improvement to the whole system. So you know, when that happens, obviously, we will let you know. That is my prediction, and we'll see if I'm right.
-
Jason Soroko
Awesome, Tim. We will stay tuned on this very closely.
-
Tim Callan
Okay. Thank you very much, Jason.
-
Jason Soroko
Thank you.
-
Tim Callan
This has been Root Causes.