Root Causes 370: Drama on Bugzilla
An evolving incident on Bugzilla has garnered a lot of attention and touches several important issues in the WebPKI ecosystem. We report what went on and unpack the issues involved.
- Original Broadcast Date: March 19, 2024
Episode Transcript
Lightly edited for flow and brevity.
-
Tim Callan
Okay, so we're gonna do something that's a little unusual. We're going to talk about a breaking news item. And we are not really a real time news service, right? By virtue of the fact that it's you and I, and we do this podcast, and we have to edit and get it up and things like that. So we're going to try to turn this around pretty quickly but let me just state that this is being recorded on Tuesday, March 19, 2024. Because it is a point in time and as I said, this news is evolving. When you hear this, there may have been additional updates beyond what we cover here but we're going to try to turn this around pretty quickly to keep it pretty current. Does that sound fair?
-
Jason Soroko
Yes, Tim. That's super fair. What is going on?
-
Tim Callan
Okay. So, um, we're talking about an episode that's occurring right now on Bugzilla. Bugzilla, as we've discussed on previous episodes, is something that's maintained by Mozilla for its public CA community. So the public CAs with trusted roots in Mozilla are expected to monitor and use this online area called Bugzilla, where incidents that affect public CAs, let's call them compliance incidents, are recorded and discussed. And there's a whole process and set of expectations around this. And there are many, many, many incidents on Bugzilla. And I might actually call those bugs, just because Bugzilla bugs make sense, right, but the official word for them is incidents, because they're not necessarily a bug. It could be a bug, like a software error, but it could be just something where the CA didn't manage to exactly meet the entire set of compliance requirements that are laid out, that are expected of a public CA.
-
Jason Soroko
Tim, what if people want to go off and do a Google search to have a look at this incident chat that's going on? Is there somewhere you can point them to, just in search terms?
-
Tim Callan
Yeah. So we're going to be talking about a real specific incident and you could search for - - if you searched for Entrust EV TLS certificate, CPSURI, that's one word C-P-S-U-R-I, missing, you'll find it for sure. That's the name of this bug. And you'll drop into this incident in Bugzilla. And Bugzilla is available for the entire public to read. You have to have an account in order to post but anybody can go read this. So you can go read this bug, you know, that we're discussing right now. This bug was opened about two weeks ago as of recording time. There are a total of 30 comments on it so far. And it's really as of the last week that the interesting developments have been occurring and have been occurring pretty fast.
So you know, Bugzilla is where people record these incidents and these incidents are really for purposes of the public CA community maintaining quality and getting better. And it's important to understand that there are a lot of extremely specific rules for CAs to follow. Those rules are evolving over time and that reported incidents in Bugzilla are very common things. There are new ones every week, and CAs self-report incidents or someone else discovers an incident and writes it up and it gets dealt with on Bugzilla. And this is common. It's normal. It's not a big deal for CAs to have incidents. There are many incidents that are open all at once at any given time and many CAs will have multiple incidents per year. If they are any kind of decent volume CA they'll have multiple incidents per year. This is just part of being a public CA. Part of being part of the web PKI is you work in Bugzilla and you deal with this stuff. So that's the first thing to understand.
And so this particular incident was posted by Entrust. Right there in the name and you know a little under two weeks ago as of the time of recording, it was opened on March 6, 2024, and it starts with an incident report, which is how these things often start, which is a specific, codified format that an incident report is supposed to follow with different things like a timeline, affected certs, lessons learned, etc. And these things are there in the incident and you're supposed to go through it, and it's published and the actual incident, the actual thing that occurs is very banal. Um, and what it is, is, there's - - in Extended Validation certificates - - according to this incident report, in Extended Validation certificates, there has been a failure to include a specific reference. It's a reference to the CPS and that's something that is required, and it's not there. And, in particular, it's required in the EV guidelines. The EV guidelines say that EV certs must do this. There is no equivalent of that in the standard baseline requirements. So there's no requirement to do that for an OV cert. There is a requirement to do that for an EV cert. And so that's the bug. And it's unambiguously a bug. And it's definitely not the way that the guidelines say and all of that starts out very clear.
And this bug report gets written up and it sits there and for almost a week, nothing happens with it. There's no response. There's nothing in the community, anything. And then finally, six days later, a couple comments show up and the first one says are you going to be including - - they're basically asking for things that are supposed to be in a complete report that aren't there. By the way, this also is common. Sometimes all the information isn't available, CA puts up a report and they augment it later, and they augment it later and that's also considered to be normal and acceptable.
So in this case, these questions come when it's almost a week later saying, hey, do you plan on including this stuff? It is certificate data for the mis-issued certs. It is number of certs mis-issued. Things along these lines. And then it follows pretty shortly with a question from another observer, which basically says, hey, did you guys ever stop mis-issuance? And so one of the things to understand here is there's an expectation that when the CA is mis-issuing certificates, what they're supposed to do is they're supposed to stop and fix the problem and then continue issuing certificates correctly. You're not supposed to just keep pouring non-compliant certificates out into the world. So these questions start coming: did you stop issuance? And hey, do you know that you're still actively issuing mis-issued certificates to this day? This is now almost a week after the original bug post.
And then where things start to heat up is there's a response from the CA, a response from Entrust, which basically says we have not stopped the issuance, and we are not going to revoke the affected certificates and the reason for that is that there's a conflict between the BRs and the EVGs. And the gist of this is, I said that the EV guidelines say that this particular field, this reference to the certificate practices statement is required, right. But the BRs have a line in them that says that it is not recommended that you include the same reference. So you could argue in that way that the two of them are in conflict with each other. Does that make sense, Jason?
-
Jason Soroko
Yes, it does, Tim, and I can see how this conflict between the EVGs and the BRs could cause an interpretation difficulty or whatnot. So that's not hard to imagine.
-
Tim Callan
Yeah, but here's the thing. It's not a true conflict. It's not like the EVG said, you must, and the BR said you shall not. That would be a true conflict. That could occur. But what this was is the EVG said you shall, and the BRs say we recommend that you do not and this is an important difference. You and I did an episode a couple years back about the difference between must and should in the BRs and the EVG requirement is a must requirement and the BRs requirement is should or in this case, a should not requirement and must is must. You must do it. And should is optional. So in the event that there's a must requirement versus an optional requirement, one of them is a hard requirement and the other one is optional. Right? And you could even rationalize real easily to say, well, the EV guidelines have lots of things in them that are guidelines for EV that are not included in the Baseline Requirements because EV is a more specialized form of cert and therefore it is actually perfectly sensible that there'll be a requirement for an EV cert that is not the same as a standard, as a non EV SSL certificate. And so the community pushes back on this viewpoint. And they push back on a few ideas.
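[Editor's sketch, not part of the spoken episode.] The must-versus-should reasoning above can be written down in a few lines. This is only a minimal illustration, assuming RFC 2119 style requirement keywords; the names `Level`, `is_true_conflict`, and `effective_rule` are hypothetical and do not come from the BRs, the EVGs, or any CA tooling.

```python
from enum import Enum


class Level(Enum):
    """RFC 2119 style requirement levels (hypothetical model for illustration)."""
    MUST = "MUST"
    MUST_NOT = "MUST NOT"
    SHOULD = "SHOULD"
    SHOULD_NOT = "SHOULD NOT"


# Hard requirements leave no discretion; SHOULD / SHOULD NOT are recommendations.
HARD = {Level.MUST, Level.MUST_NOT}


def is_true_conflict(a: Level, b: Level) -> bool:
    """Two rules on the same field truly conflict only if one says MUST and the other MUST NOT."""
    return {a, b} == {Level.MUST, Level.MUST_NOT}


def effective_rule(a: Level, b: Level):
    """Return the level that governs, or None when no single level can be derived."""
    if is_true_conflict(a, b):
        return None  # genuinely irreconcilable
    for level in (a, b):
        if level in HARD:
            return level  # a hard requirement overrides a recommendation
    return a if a == b else None  # two recommendations: only decidable if they agree


# The situation described in the episode: EVG says MUST, BR says SHOULD NOT.
evg, br = Level.MUST, Level.SHOULD_NOT
print(is_true_conflict(evg, br))  # False
print(effective_rule(evg, br))    # Level.MUST
```

Under this toy model, the EVG "must" paired with the BR "should not" is not a true conflict, and the hard requirement is the one that governs.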
One is the idea that this rationale actually holds water. The other thing the community wants to push back on is this idea that the mis-issuance is still going on a week later, essentially, right? That a week passes and it's still going on, and that there's been a declaration that these certificates are not going to be revoked, even though they're mis-issued. And this picks up a lot of heat and we get a lot of people weighing in. As I said, there's something like 30 comments on this bug. A lot of bugs might live their whole life with three comments, right? There's something like 30 comments on this bug. It picks up a lot of heat in a very short time period. And everybody was very professional but, you know, it feels like people who feel very passionately about the viewpoint that this is not what a public CA should do.
And so, at this point, the narrative changes a little, which is that rather than arguing whether or not it's mis-issuance, Entrust shifts gears to say that we don't think it's in the best interest of the web PKI to revoke these certs. That the damage will be more than the benefit. And again, this is a debate we can have, right? Maybe it should. Maybe it shouldn't. But at this point, this debate is going strong. And it keeps right on going. And the response essentially is, well, number one, you don't - - this feels very convenient, right, that because it's inconvenient for you and your customers, you don't want to revoke the certificates. And on the other hand, when it's somebody else, you have a different perspective, and you think you should want to revoke the certificates. And then, you know, another point, again, is that people keep saying, but look, the number of mis-issued certs is growing. You still haven't fixed the problem.
And so, you know, a little bit of side activity that's going on adjacent to this, that probably matters, or parallel to this I should say, number one is that Entrust very quickly introduces a potential new ballot for discussion about changing the wording of EV guidelines to match what the Baseline Requirements say, which is fine and that's a good thing to do. But it's being held up as a remedy for this problem. And of course, then the community comes back and says, wait a minute, hold on. You changing the rules in the future doesn't change the fact that something was non-compliant in the past, right? Time only goes one direction. So you can't go and change the regulation tomorrow and say, okay, the fact that I was non-compliant yesterday is now erased. That is not how it works. You were still non-compliant yesterday, and a cert that was issued yesterday was still non-compliant and needs to be dealt with. And, you know, this isn't nitpicking. Rules change all the time and it's important to understand that the rules that were in force when the cert was issued are the rules that apply to that cert. So that's not actually a picayune point. It's a valid and important point.
Another thing that goes on is one of these commentators who's very active on this thread actually goes off and creates his own blog. The very first entry of the blog is a description of what's going on here on this particular thread. So you can see again, people feel passionately here and there's no real progress being made. And this goes on for almost two weeks. Mozilla itself weighs in with what its expectations are. And then finally, yesterday, the 18th of March, we get a message from the leader of the Google Chromium project that is very long and has a number of detailed comments about what Chromium’s expectations are and has a number of very detailed questions. And so at this point - - then that night, there is now an announcement from Entrust that says, and I'm gonna read this verbatim:
“We have stopped issuing mis-issued certificates and fixed the EV certificate profile. All impacted customers will be advised that their certificates will be revoked. We will create a delayed revocation bug and will follow up on other questions in the next few days.”
So like two weeks after the bug was created, and one week after this firestorm blows up, we get this declaration from Entrust and that more or less is where we stand right now.
-
Jason Soroko
Tim, thank you very much for that. Just a few thoughts that led to, and even a couple of questions for you. So number one, this retroactive nature is understood. Right?
-
Tim Callan
Yeah.
-
Jason Soroko
The mis-issued certs are out there. What’s the scope of the - - Just out of pure curiosity for everybody?
-
Tim Callan
Yeah. So, there's a lot of meat here and a lot to unpack. There are several ways we can go. I think this idea of retroactively changing the rules certainly is one of the things that came up here. This difference between shoulds and musts, and what is the definition of a conflict in the rules versus not really a conflict, is another thing.
The scope of this is certainly part of - - surely part of the factor. So Entrust reports that there's something north of 24,000, in the ballpark of 25,000-ish certs and the number is moving around a little - It's not quite nailed down but it's in that ballpark - that are affected. That basically are mis-issued according to this incident. And they're all Extended Validation certificates and there's an implication that isn't made for sure that many or most of them are in the hands of large enterprises. And this could be part of the trouble. And some of the dialogue here is around this is a large number of certs, these enterprises can't swap out the certs in the time that they have to do so. And not doing so is disruptive to the relying parties that ultimately depend on these services. And it's bad. It's bad for consumers. Not good for consumers.
And so that's also part of this dialogue, because then again, the commentators online come back and say, well, wait a minute. What do you mean you can't revoke these certificates? As a public CA, you may have to. Right? What if there is a giant private key compromise or a zero day or the equivalent of Heartbleed, or the equivalent of some of these other things that have gone on? What if you do need to revoke hundreds of thousands or millions of certificates? What if - - and these enterprises, right? To say that, oh, well, these enterprises, they can't swap things out. They're not able to get it done in five days. The response to that is, well, you know what, this is a public facing PKI that governs the crown jewels that your organization has. Your large bank or your large enterprise. You better be able to swap your certs out. If you can't swap your certs out, that's a pretty serious problem. So that's in the dialogue as well and that's part of what's going on here, too.
-
Jason Soroko
Tim, in the past, we've done podcasts on a term I think you coined, certificate agility.
The need to be agile for cryptographic reasons, cryptographic agility. It all basically points to that, in my mind. Look, you said at the very top of this podcast, mis-issuances happen. There's all kinds of things on Mozilla that are discussed all the time. This isn't it, you know, we're talking about the story and how the response to this particular Bugzilla event went. But I think for those of you who are just thinking, what are the implications here, I think there's maybe two, at least, that come to mind.
-
Tim Callan
Go ahead.
-
Jason Soroko
One is the importance of certificate lifecycle management. To reduce the friction of just that darn difficulty of a whole lot of certs need to be - - that were mis-issued for whatever reason need to be swapped. The pain of doing it is brutal and one of the reasons why there still is a lot of pain is because certificate lifecycle management and automation are not everywhere at this point in time.
And then number two, Tim, just because it's staring me right in the face. I'm assuming these are mostly one year certificates?
-
Tim Callan
Well, they can't be longer than that.
-
Jason Soroko
That's right. That's right. And what I'm implying is that potentially they're shorter, right, just from the reality of how they could be issued. But 398 days, is what I'm assuming most of them or all of them are. And so to me, it's between certificate lifecycle management and automation and shorter certificate lifespans, problems like this, which are inevitable, and they do happen from time to time, become less and less painful.
-
Tim Callan
Right. Yeah. So imagine in a shorter certificate lifespan scenario, you know, fewer of these certs would be affected, right? If every one of these certificates was 90 days, and this dialogue has been going on for two weeks, then in that period of time, roughly 1/6 of those would have already aged out, right?
-
Jason Soroko
That’s the point.
-
Tim Callan
If the EV problem had been fixed on day one, and two weeks had passed, then 1/6 of them would have been reissued and the new certs would be fine, and those wouldn't require revocation. So, shorter - - or imagine a 10 day. You know, we talk sometimes about a future where certs are 10 days long, and in the 10 day timeframe, you fix the problem and by the time this dialogue, by the time we're recording this podcast, they were all replaced, and there's nothing to revoke. So that's a real case study in some of the arguments in favor of shorter certificates.
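[Editor's sketch, not part of the spoken episode.] The aging-out arithmetic can be checked with a small calculation. This assumes, purely for illustration, that the affected certificates were issued at an even rate across one full lifetime and that the defect was fixed on day one; the function name `fraction_aged_out` is hypothetical.

```python
def fraction_aged_out(lifetime_days: float, elapsed_days: float) -> float:
    """Share of a population of certs that has expired after `elapsed_days`.

    Assumes issuance dates are spread uniformly over one certificate lifetime,
    so each passing day retires 1/lifetime_days of the population.
    """
    if lifetime_days <= 0:
        raise ValueError("lifetime_days must be positive")
    return min(max(elapsed_days, 0) / lifetime_days, 1.0)


two_weeks = 14
print(round(fraction_aged_out(90, two_weeks), 3))   # 0.156, a bit under 1/6
print(round(fraction_aged_out(398, two_weeks), 3))  # 0.035 for 398-day certs
print(fraction_aged_out(10, two_weeks))             # 1.0, nothing left to revoke
```

With 90-day certificates about 16 percent would have expired on their own in two weeks, in line with the "1/6" figure; with 10-day certificates the whole population would have turned over, while with 398-day certificates almost all of them would still be live.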
Now, you're also seeing here someone arguing on the other side, which is to say, well, this is difficult for these companies to do. And, you know, one of the commentators threw up - - this is just a commentator's list. So I'm not gonna - - I didn't validate this myself, but this is what the person put on Bugzilla by looking at the CT logs, and I'm just gonna rattle off some of the names on this list. JP Morgan Chase, Delta Airlines, Bank of America, Tesco, Fidelity Investments, American Airlines, Westpac Banking Corporation, ING Group, Experian, PricewaterhouseCoopers, Toronto Dominion Bank, M&T Bank, Citizens Financial Group, and it goes on. There’s a lot more. And so, you know, these are large organizations with a lot at stake and these are some of the ones who have affected EV certs. Now, this doesn't necessarily mean that these organizations that are listed here are incapable of doing a rapid swap out but there is a representation being made by the CA that their enterprise customers on the whole, their EV customers on the whole, are not going to be able to deal with this.
So if that applies to the people I just rattled off, then you've got to say, you know, wait a minute, guys, you people in the enterprise, like, this is risk. This is vulnerability. It's risk of outage. It's risk of essential services not working, and we see that here, because this kind of thing happens. In this case, all these guys are going to be revoked. Right? The final declaration that came out Monday night is that all of these certificates, 24,000+, are going to be revoked in a five day time period and that means that these organizations, if they're not able to deal with that, they're going to have a bad day.
-
Jason Soroko
Tim, they're going to have a really bad day and there are many reasons why they could. You and I have talked about - - quantum apocalypse comes up, but we don't have to go that far. We can wake up tomorrow morning and RSA is broken. We've had at least two to three podcasts just on somebody saying they cracked RSA. Turned out not to be true. What happens if one day it is?
-
Tim Callan
Right. Exactly. 100%. For sure. RSA falls and we all have to switch over to ECC right away. Like, it's been - - you know, people have talked about that scenario and what would occur in that scenario. And so that's another big one.
So this lack of certificate agility is definitely a theme that comes up here. There's this theme about doing things in a certain way and in the right order and then, you know, it's also illustrative I think of the rules around reporting and discussing these episodes, and following what happens when those rules don't get completely filled in, right, because there were these bits of information that were missing and that became part of the story here.
So part of the thing to emphasize is the actual error, I think, is very understandable. And CAs make errors, and this is an error where you see why it occurred. Somebody looked at, you know, some guidelines in the BRs, they didn't connect them back to what the EVG said and that resulted in an error. That's kind of a mundane story. That's not an interesting story. It was how this bug was reported and discussed and dealt with, and the timeline around dealing with it, that made this bug particularly interesting and illustrative and, dare I say, a little bit dramatic. And, it's still not done. There's still dialogue. The bug is not closed. We still don't have an exact list of affected certificates and so this might not even be entirely over.
-
Jason Soroko
There you go. And I think we're at the point in the podcast where we can say we've been keeping track of this for everybody. We wanted to talk about it very factually here. You did a great job, Tim, and we'll stay on top of it and if anything comes up, we'll be back.
-
Tim Callan
Yeah. If there's anything interesting and dramatic that's worthy of a follow up podcast, we'll give you one. This might just kind of be the end of it. In which case, this will probably tell the story, but we'll keep tracking it and see where it goes. If you follow the web PKI, it's worth following the story because it is so rich and there's so much that went on here and it's, you know, it's really something that's worth some scrutiny and some attention. Thank you, Jason.
-
Jason Soroko
No, thank you, Tim. See you real soon.
-
Tim Callan
See you soon. This has been Root Causes.