Podcast Mar 26, 2024

Root Causes 372: Bugzilla Bloodbath

It's a bloodbath on Bugzilla. Since March 9, more than 25 new Bugzilla bugs have been written up, which is 10x the typical pace. And it's not over. In this episode we explain what is going on and why.

  • Original Broadcast Date: March 26, 2024

Episode Transcript

Lightly edited for flow and brevity.

  • Tim Callan

    Okay. So today we need to talk about what I am calling the Bugzilla bloodbath.

  • Jason Soroko

    Bloodbath. Goodness gracious, Tim.

  • Tim Callan

    So, Bugzilla, as our listeners know, if you've listened to our recent episode on goings-on on Bugzilla, or some of our episodes in the past where we explain it, Bugzilla is where the web PKI community goes to report and discuss incidents of non-compliance by CAs. A certain amount of non-compliance is expected because this is complicated stuff that everybody's doing, and there are a lot of CAs. It's a complicated ecosystem and the rules change, so things routinely get written up and worked out and discussed on Bugzilla, and these things usually happen at a fairly slow, fairly quiet, fairly non-alarmist pace. The last two weeks have been crazy.

  • Jason Soroko

    I've heard this. I can't wait to get an update from you, Tim. Now we, you know, we did have a podcast, episode 370, talking about some of what had been going on but you've got a lot more to talk about apparently.

  • Tim Callan

    In episode 370, and I think we mentioned at that time that we're not a real time news reporting agency and things were probably going to keep going, and they have kept going. And so in particular, what we're seeing is that there's this flurry of new bugs. So to put things in perspective, typically, kind of a normal run rate for Bugzilla is that for the whole ecosystem, one or two bugs will be opened per week. Like we're probably averaging one per week. Some weeks there are none. Some weeks there are two, but think of one to one-and-a-half as an average and you're probably in the right ballpark. Since March 9 of this year, and we are recording this episode on March 25, so in very slightly over two weeks, we have seen 26 new bugs opened up in Bugzilla. So that is 10x the expected rate.
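As a rough check on the "10x" figure, the numbers quoted above (26 bugs between March 9 and March 25, against a baseline of one to one-and-a-half bugs per week) can be worked through directly. This is only a back-of-the-envelope sketch; the function name `weekly_rate` is made up for illustration.

```python
from datetime import date

def weekly_rate(bug_count, start, end):
    """Average number of bugs opened per week between two dates."""
    days = (end - start).days
    return bug_count / (days / 7)

# Figures quoted in the episode.
rate = weekly_rate(26, date(2024, 3, 9), date(2024, 3, 25))

# Compare against the stated baseline of 1 to 1.5 bugs per week.
multiple_vs_high_baseline = rate / 1.5
multiple_vs_low_baseline = rate / 1.0

print(f"{rate:.1f} bugs/week, "
      f"{multiple_vs_high_baseline:.1f}x to {multiple_vs_low_baseline:.1f}x the baseline")
```

Depending on which end of the baseline is used, the multiple comes out somewhere between roughly 7.5x and 11x, so "10x" is a fair ballpark.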

  • Jason Soroko

    Yeah, exactly, Tim. I was gonna say, even I know that's not just a lot, but that's a whole heck of a lot.

  • Tim Callan

    Yeah. And one or two of them are just kind of whatever they happen to be and those are sort of the background noise bugs that we would probably be expecting anyway. But all the rest of them, except for maybe three of them, all the rest of them, are on a very distinct theme.

  • Jason Soroko

    Yeah. Okay. So Tim, you know, what I'm thinking is some kind of a rule must have changed somewhere that people weren't hip to. Is this part of what's going on?

  • Tim Callan

    Not really. I think this is more uncovering a certain baseline failure to meet the requirements that has been around for a long time. So there is a precipitating incident behind all of this. And the precipitating incident is that a whole bunch of CAs received inbound bug reports in a fairly short period of time. I didn't count the number, but let's say 10 or so that have reported so far, and there may be more that haven't, have received inbound reports and they have all come from the same email address. And that email address, which this is public on Bugzilla, you can go read it. Or at least some people have said what the email address is. More than one CA has referenced this email address and I'll bet you it's the same email address for all of them. And that email address - you want to hear what it is Jason?

  • Jason Soroko

    Let’s hear it.

  • Tim Callan

    It is [email protected]. Now, that may not be meaningful to the average listener, but let's unpack this. I think we know what experiment means. Linting is a process whereby automated tests are made that look at code, in this case certs, and look for very specific, discrete, unambiguous errors. Linting is a common practice in general, going through and catching the stuff that could be just automatically caught by a script, and linting is a big practice in the world of certificates, where you go through and you look at certificates and you say, are they right or are they wrong?
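To make the idea of linting concrete, here is a minimal sketch of how a certificate linter can be structured: a list of small, independent checks, each of which either passes or reports one specific finding. Everything here is an assumption for illustration. The `Cert` class is a toy stand-in for a parsed certificate, and the two rules are simplified examples, not the checks run by any real linter or by the experiment discussed in this episode.

```python
from dataclasses import dataclass, field
from datetime import date

# Illustrative limit only; a real linter encodes many rules with effective dates.
MAX_VALIDITY_DAYS = 398

@dataclass
class Cert:
    """Toy stand-in for a parsed certificate (hypothetical fields)."""
    serial: str
    not_before: date
    not_after: date
    san_dns_names: list = field(default_factory=list)

def lint_validity(cert):
    """Flag certificates whose validity period exceeds the limit."""
    days = (cert.not_after - cert.not_before).days
    if days > MAX_VALIDITY_DAYS:
        return f"validity of {days} days exceeds {MAX_VALIDITY_DAYS}"
    return None

def lint_san_present(cert):
    """Flag certificates with no subject alternative names."""
    if not cert.san_dns_names:
        return "no subject alternative names present"
    return None

LINTS = [lint_validity, lint_san_present]

def lint(cert):
    """Run every check and collect the findings; an empty list means clean."""
    findings = []
    for check in LINTS:
        message = check(cert)
        if message is not None:
            findings.append(message)
    return findings

good = Cert("01", date(2024, 1, 1), date(2025, 1, 1), ["example.com"])
bad = Cert("02", date(2024, 1, 1), date(2026, 1, 1))

print(lint(good))
print(lint(bad))
```

Each check is unambiguous and mechanical, which is what makes this kind of error something a script can catch across a large number of certificates.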

  • Jason Soroko

    Root Causes episode 175. We explain what a linter is in great detail.

  • Tim Callan

    God, we're so good. We've got it all worked out. So let's take that first word, Dixon, and let's connect that with the last word, gmail.com, because the name of the person who runs the Chromium root store is Ryan Dixon. So while I haven't gotten confirmation directly from Ryan, I’ll betcha if I asked him he would say, yep, that's me. And I haven't gotten confirmation just because it's not that important. But my very clear belief about what's going on is that the Dixon linting experiment was some kind of project run by the Chromium group, or by Ryan Dixon as a private individual, one or the other, to try to get a sense for the level of compliance with a number of specific requirements that are in the Baseline Requirements, the BRs. And a number of CAs failed at this, and probably have been failing at this on an ongoing basis, and the Dixon linting experiment, either as part of an automated script or somebody composing individual emails, went out to these CAs and said, hey, one or more certs have been identified with this problem, you know, what are you going to do about it? And then at that point, CAs are supposed to kick in and go into a process from there, and that's where problem number two comes in.

  • Jason Soroko

    So, Tim, you know, in oversimplified English, for folks who aren't deep in the certificate world, certificates have to conform to a whole lot of rules. And this letter, basically - -

  • Tim Callan

    They are there for very good reasons.

  • Jason Soroko

    Very good reasons.

  • Tim Callan

    Some of it is about interoperability so that software can pick them up and work with them correctly. Some of it is about identification so you can know that something really is what it is. Some of it is about communicating the certificate practices so that the web PKI and the relying parties as a whole can look at any individual cert and decide what's the level of trust I have for it. There's a bunch of reasons why specific things have to be done in a specific way and if they're not, it's deemed to be non-compliant.

  • Jason Soroko

    So Tim, it is the CA's responsibility to self-police.

  • Tim Callan

    Yes.

  • Jason Soroko

    And self-report on Bugzilla.

  • Tim Callan

    Uh-huh.

  • Jason Soroko

    And what you're saying is that Mr. Dixon was running an experiment.

  • Tim Callan

    We think. I think.

  • Jason Soroko

    We think. We think. It’s not a difficult one to piece together for sure the way that you've explained it. But he's got a tool just like we all have tools. The CAs and other people have tools to go and check the conformity of certificates to these rules and this linter presumably found a bunch of things that were non-compliant, non-conforming, and therefore a bunch of emails got sent by this and here we are.

  • Tim Callan

    That is my belief about what happened. And as I said, I could have counted them. I didn’t before, but it's in the ballpark of 10 CAs who have written these things up. And again, let me emphasize, we don't know about any communications that went to CAs who haven't written anything up. So we don't know that there aren't more. These are the ones that we know about, and we have seen.

    Now, of course, if a CA didn't self-report, depending on when they got that communication, that might be another failure in the process. But let's deal with that later if that turns out to be the case. So that's about 10 CAs. Each of them writes up an individual bug about what they got wrong. That would be 10 bugs. Right, Jay?

  • Jason Soroko

    Sounds like it.

  • Tim Callan

    But there's 26 bugs, right, Jay?

  • Jason Soroko

    So that suggests that there might be follow up issues to issues.

  • Tim Callan

    So what are the others?

  • Jason Soroko

    Yeah.

  • Tim Callan

    So what are the others? So the next theme is that most or all of these CAs had one or both of the following errors, each of which requires its own individual bug to be written up. And the first error is failure to respond to the inbound report correctly.

    So if an inbound certificate mis-issuance report is filed, and it is filed in the correct way, right, I can't just tie a note to a rock and throw it through your window and claim that I filed an inbound report. But if it's filed and it's filed the correct way, then the CA has a requirement to respond to it very specifically within 24 hours and address what's been said. And most of them, not most of them, many of them didn't do that. Just didn't reply within 24 hours, didn't reply at all, didn't realize they had gotten a note. So there were occurrences like that and those are another set of bugs. And then if these certs are indeed mis-issued, because the report could be wrong, right, a report could be incorrect and that happens sometimes. But if the certs are indeed mis-issued, then the CAs have an obligation to remediate it. To fix it. And if they fail, in particular, in the process of fixing it, if they fail to revoke the certificates within the required timeframe, that is another failure. That is another non-compliance that is separate from the original issuance problem, and it requires a separate Bugzilla incident.

    So we literally have seen CAs write up all three of these and post them all more or less at the same time. We had an issuance problem; we also didn't address the inbound; we also didn't revoke the certificates on time. And bang, bang, bang. So you know, you start to add those up, right? If there are, let's say, 10 CAs that have this problem, there's like 13 problems in addition to that, you know, that's, you know, 230% of the original amount, right? So most of them are having this problem in one form or another and those bugs are showing up as well.

  • Jason Soroko

    Geez, Tim, though, I don't want to let you gloss over and you don't have to tackle this right now but I'm not gonna let you get away with not talking about the non-conforming time to revocation. That's a serious one.

  • Tim Callan

    So the non-conformance to time to revocation is a really interesting one. And so let me tackle this in a couple parts.

    Part number one is the rules are unambiguous. They are unambiguous. The rules are clear. They're clearly written. There's not a lot of room to debate what the rules actually say. And the rules say that if there's a mis-issuance, it's got to be revoked in a certain time period and the time period depends on the nature of the mis-issuance, but it's either 24 hours or five days. And all of that is really easy, like there's not any real wiggle room there. But what we've seen is the CAs have been saying that the revocation would be unduly disruptive to the ecosystem, right? That the consequences of the revocation would be worse than the consequences of the mis-issuance, and there is actually a little bit of wiggle room for that. There's some wording that says that in the event that the revocation would be unduly harmful to the internet ecosystem, then the CA has the option of not revoking within this timeframe. And so every CA is, not every CA, maybe every CA. Most or all of these CAs are grabbing that. And from what I've seen, every CA that is not revoking on time is going back to that.
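The two revocation windows mentioned here, 24 hours or five days, lend themselves to a simple deadline calculation. The sketch below is hedged on several points: the window labels `"24h"` and `"5d"` are placeholders, the episode does not say which kinds of mis-issuance fall into which window, and the assumption that the clock starts when the CA learns of the problem is ours, not something stated above.

```python
from datetime import datetime, timedelta

# The two windows named in the episode; the keys are placeholder labels.
REVOCATION_WINDOWS = {
    "24h": timedelta(hours=24),
    "5d": timedelta(days=5),
}

def revocation_deadline(clock_start, window):
    """Latest compliant revocation time, counted from clock_start (assumed start)."""
    return clock_start + REVOCATION_WINDOWS[window]

def revoked_on_time(clock_start, revoked_at, window):
    """True if revocation happened at or before the deadline."""
    return revoked_at <= revocation_deadline(clock_start, window)

# Hypothetical timestamps for illustration only.
start = datetime(2024, 3, 9, 12, 0)
deadline = revocation_deadline(start, "5d")

print(deadline)
print(revoked_on_time(start, datetime(2024, 3, 13, 9, 0), "5d"))
print(revoked_on_time(start, datetime(2024, 4, 9, 12, 0), "5d"))
```

A revocation that slips by a month, as in the last call, misses the five-day window by a wide margin, which is the situation that generates the separate delayed-revocation bug.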

    So the basic rationale goes, these certificates are doing very important things and if we were to revoke them and the subscribers cannot get them swapped out in time, and if we were to revoke them anyway, then there would be outages that would be very disruptive to various important services that people rely on and therefore, we gave them extra time. And some of them gave them a bit of extra time. Some of them have gone into their bugs and said they're giving them like, an extra month, or two months, like just huge amounts of time. And so that, of course, gets into some real, I would say difficult and tricky conversations. Like if any subscriber can just have as much time as they want because they can't swap the certs out, then why do we have a mandatory revocation at all? Right? What good is it doing?

  • Jason Soroko

    Tim, exactly. Exactly. We should take it seriously. And I understand the pain. I think I understand to a point. The problem is, Tim, it goes counter to something we've mentioned at least 100 times within almost 400 episodes now. And that is certificate lifecycle management. In other words, the sheer number of unmanaged certificates that are out there, the lack of automation and the need for shortened certificate lifespans. If you just take those three things together, I don't think we'd be having this conversation as much. We wouldn't have CAs saying the pain of revocation is greater than the breaking of the rules or whatever it is. I think, Tim, we got to address this head on. I propose a very near term future podcast about why are mass revocations so painful still to this day?

  • Tim Callan

    I think we should, and you know, there have been multiple responses, all kind of a flavor of the same thing, that have gone on multiple of these bugs. And I think there will be more of these responses coming, where commentators have said, well hold on, if these organizations are incapable of dealing with a five day revocation event, what would happen if there was something catastrophic? What would happen if there were a Heartbleed style, zero day problem? What would happen if there were a large scale security flaw like the 63-bit entropy problem that was discovered maybe four years ago? What would happen if there was a private key theft? Like there's a number of things that could occur where we would not be able to make the argument of, well, this isn't really a security problem. And under those circumstances, could they deal with it? And I think this is a bit of a conundrum, right? Because if the CA says yes, then the response is, well, then they should deal with this. And if the CA says no, then the response is, really? That's alarming. And so there isn't really a good answer to that question. Every answer looks bad. And that's kind of the fundamental problem with that excuse for failure to revoke on time.

    And you know, I said this in the earlier episode that you talked about, which is these rules are incredibly byzantine, and they change and CAs have issuance errors. It's part of our world and if you were outside this world, you'd say, oh, well, how's that possible, just follow the rules. But it is possible. It does happen and it happens to every CA. That is not the failure. But the failure to revoke, the failure to remediate, those are failures, right.

    There also have been debates about ceasing issuance of certs. And that's another area where, you know, CAs have action they can take, and they don't all take it. And so, that becomes a lot of the debate, which is, look, you have been entrusted with the public trust model, right, with the wheel of the web PKI and you are not doing the things that you signed up to do. Right?

    I used this analogy with someone not too long ago, which is, if you go to work for the fire department a lot of the time if you're a firefighter, right, a lot of the time seems like it's a good life, right? You hang around the fire station and polish the brass and, you know, do pushups and cook big spaghetti meals. But every once in a while, you have to run into a burning building. Now, if you decide you're not willing to run into the burning building, then you don't get to have the rest of that stuff. Right? And CAs who don't do their revocation are like firemen who decide not to run into the burning building. And I think this is a problem. Because if you're going to be a public CA, and you're going to be entrusted with all of this, and you're going to sell your certificates and all this stuff you're gonna do, then somewhere along the line, when the time comes, and you have to do a forced revocation, you have to do it. And that is not the time for people whose stomachs are too weak to disappoint their paying customers. And that is something that we've definitely identified as a real problem in the current CA community.

  • Jason Soroko

    Tim, it's so true. Look, I think you've really done a great job here explaining what's going on in the Bugzilla. I’d almost like to title this series, We Read Bugzilla So You Don’t Have To.

  • Tim Callan

    Yeah. You don't want to read Bugzilla. Sometimes it's dramatic and fun. A lot of the time, it's just way down in the weeds but, um, you know, this has been crazy. This has just been crazy. I have not witnessed anything like the last two weeks in my entire history with this world.

  • Jason Soroko

    I'd like to add just one more thing to what you just were talking about, Tim, and that is the responsibility of the CAs, and you're so right in saying mistakes, they're just a normal part of the noise. It's the response to what happens. It's the response to the bugs that is what's important. Now, the reasoning for some CAs being reticent is where I'd like to elevate the conversation, so that we're not down in the weeds of the esoteric and, you know, the folks who just play in this game, people who even know what a linter is. Those are not the folks who - - I want to elevate the conversation to everybody else so you understand what is going on here. And I think that's what part of that next podcast will be.

  • Tim Callan

    Yeah. I think so. And let me just leave this, just like I said, the last time. I don't believe the story is fully played out. Among other things, most or all of these bugs are open. All the bugs I've talked about are still open. So you know, there's a whole bunch of bugs that are going to go through their own resolution. There are active commentators in the community who I think are going to continue to be active commentators. There's going to be back and forth. We don't know if we've seen all of the new bugs. We had them being opened all week last week. There may be more coming this week. We also don't know if there are some who are never going to open a bug on themselves and a bug will get opened on them. I can tell you from a little bit of searching that I'm aware of two CAs that have non-compliance problems in the form of these other ones who have not written up bugs yet. So either they have bugs that are coming, or they weren't found in this project or something. So, this story is probably not over. There will probably be another return to this one. But, you know, this is where it stands right now. So let's see how all this continues to develop. It is crazy. I do think if you're a follower of the web PKI, which we all depend on whether we realize it or not, these are interesting times and it's worth paying attention.

  • Jason Soroko

    Let's pay attention and let's elevate the conversation, Tim. Great reporting.

  • Tim Callan

    Thank you.

  • Jason Soroko

    And thank you for that.

  • Tim Callan

    All right. This has been Root Causes.