Podcast Nov 22, 2019

Root Causes 52: New TLS Certificate Incident Research

New research out of Indiana University Bloomington reviews nearly 400 "incidents" with public SSL certificates over the course of more than a decade. Join us as we go through the main findings from this piece of original research, including methodology, incident types and causes, and rogue certificates.

  • Original Broadcast Date: November 22, 2019

Episode Transcript

Lightly edited for flow and brevity.

  • Tim Callan

    So, today we are talking about some new research. It's interesting. It's been a rich year for research where universities - - what we've seen in the past is university researchers attempting to find a new exploit against PKI infrastructure or digital certificates, and then that's what they publish. What we've seen this year is the advent of researchers saying, I'm going to look at the body of what's been done with digital certificates to find trends and lessons. So, for instance, we talked about the researcher from RWTH Aachen and the research they did. We also spoke about some research that came out of Georgia Tech, and things along those lines. Right? So, this has been a little bit of a theme that we've seen. Recently, three researchers from Indiana University in Bloomington put out something they call a complete study of PKI's known incidents. And I might as well give them credit: this is Nicholas Serrano, Hilda Hadan, and L. Jean Camp, and excuse me if I pronounce your names incorrectly. These three researchers from the School of Informatics, Computing, and Engineering at IU Bloomington went in and tried to do a comprehensive study of reported incidents from public CAs, so really this winds up being TLS. I think it might be exclusively TLS. There could be some code signing and S/MIME in there, but I wouldn't be surprised if there weren't. So, they went in and looked for incidents - - I think they more or less tried to get a broad, wide variety of incidents that had enough detail that they could actually study them and draw conclusions about the trends, the percentages, and the likelihood of the incidents that happen in the world of TLS.

  • Jason Soroko

    Yeah. Thanks, Tim. I've been looking at some of their tables, quite interesting. I'd like to know more about, you know, they say the word incident, but incident could mean a few different things. So - -

  • Tim Callan

    It could.

  • Jason Soroko

    I’d love to hear more about what they are.

  • Tim Callan

    It could, and I think that's part of it, and they get into this in some of the methodology. By the way, this paper is 45 pages long. It is just a monster. I read every word, but I'm not going to try to summarize all of it in this podcast because it would make our podcast far too long. The gist of it is, I think they were looking for credible public accounts of something that could be interpreted as some degree of failure, sometimes very minor by the way, of the public certificate system to do precisely what it was advertised to do, where there was enough information and that information was credible or impartial enough that they felt they could add it into their statistics. They wind up zeroing in on 379. They originally identified, I think, several thousand, and that winnowing process, the criteria I just told you about, knocked them down to not quite 400 incidents where they felt they had all that information. Then each of those was individually studied. Someone went in, read the information, and chose how to categorize and classify that incident.

    A couple more facts about it. The date range for the incidents, with one exception, ran from 2008 to 2019, because 2019 is when they cut off. They started their research in February, so that's where they cut it off. And they really had four sources: Bugzilla, the security blogs from Google and Mozilla, and the Mozilla Security Policy Forum. As you and I have talked about in previous podcasts, Mozilla has an outsized influence in terms of how the forum runs and what people discuss there, compared to just Mozilla's overall market share. So, things like the Mozilla Security Forum are very important as a collecting place, a gathering place where people talk about episodes like these.

    So, in response to your question, Jay, yeah, some of them seem minor. For instance, there are a couple of tables: Table 10 is incident types and Table 11 is causes. And among the incident types, the most common one, which represents 38.5% of the incidents, is that fields in certificates are not compliant with the CA/Browser Forum Baseline Requirements. That could be trivial. That could be a misspelling. That could be a purely syntactic error that has no bearing on anybody's ability to tell if a certificate really belongs to a certain entity, but it would still qualify as an SSL incident, right? It would still qualify as a mis-issued certificate, and therefore it shows up, and that's nearly 40% of them. So, that's a good thing to look at, I think, and maybe something for us to dig into is the relative importance of these incidents and the main breakdown of what they are.

  • Jason Soroko

    Yeah, when I'm looking at this, I'm even questioning some of the - - I wish we had the researchers on this podcast, Tim, because these were the right people to ask.

  • Tim Callan

    Yeah, I would love to have them on.

  • Jason Soroko

    So, you know, because I see things at the bottom like rogue certificate.

  • Tim Callan

    Right.

  • Jason Soroko

    I certainly know what that has meant for some CAs. It's meant the end of their business. So, when I see, you know, a total of 11, or sorry, 12 on Table 10.

  • Tim Callan

    Yeah.

  • Jason Soroko

    I'm wondering, for those certificates in question, what they really meant by a rogue certificate and what the full impact of that was, because you can see tables of hundreds or thousands or millions of certificates where something was wonky, who knows, but sometimes there's just one that will sink a CA. That's all it takes.

  • Tim Callan

    Yeah.

  • Jason Soroko

    It's just that the nature of the incident means a lot more than the actual numbers.

  • Tim Callan

    Well, the original problem that ultimately led to Google Chrome deprecating its trust for the Symantec roots was indeed a small number of mis-issued certs against a Google domain name, I believe it was two. That's what started the whole thing. And at the end of the day, Google says that there were tens of thousands of suspect certs, but the only reason it got on their radar and the only reason that whole investigation commenced was because of those mis-issued Google domain certs. So yeah, absolutely.

    So, let's break down what they are real quick. I mentioned the first one, the most popular: fields in certificates not compliant with the Baseline Requirements, 38.5%. Next down, at 10.3%, is non-BR-compliant or problematic OCSP responder or CRL. So, OCSP and CRL problems constitute 10.3% of the reported incidents, right? That's high, but maybe not - - again, it's not a mis-issued cert, right?

  • Jason Soroko

    Yeah.

  • Tim Callan

    And revocation checking matters. But it's not a mis-issued cert. Erroneous/misleading/late/lacking audit report, 6.6%. So again, audits are important. That's how we know that they're doing the right thing. But the fact that there was an error in the audit report, or that the audit report was late, does not actually mean that anything was wrong with the certs themselves. Now, it's problematic because we don't know if something is wrong, right? But it doesn't mean there is something wrong.
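
    To make the OCSP point above concrete, here is a minimal sketch of a revocation check against a certificate's own OCSP responder, assuming Python's cryptography library; the helper name and the idea of passing in already-loaded certificate and issuer objects are ours for illustration.

```python
import urllib.request

from cryptography import x509
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.serialization import Encoding
from cryptography.x509 import ocsp
from cryptography.x509.oid import AuthorityInformationAccessOID, ExtensionOID


def ocsp_status(cert: x509.Certificate, issuer: x509.Certificate) -> str:
    """Ask the certificate's own OCSP responder whether it has been revoked."""
    # The responder URL is published in the Authority Information Access extension.
    aia = cert.extensions.get_extension_for_oid(
        ExtensionOID.AUTHORITY_INFORMATION_ACCESS
    ).value
    url = next(
        desc.access_location.value
        for desc in aia
        if desc.access_method == AuthorityInformationAccessOID.OCSP
    )

    # Build a DER-encoded OCSP request for this certificate/issuer pair.
    der_request = (
        ocsp.OCSPRequestBuilder()
        .add_certificate(cert, issuer, hashes.SHA1())
        .build()
        .public_bytes(Encoding.DER)
    )
    http_request = urllib.request.Request(
        url, data=der_request, headers={"Content-Type": "application/ocsp-request"}
    )
    with urllib.request.urlopen(http_request, timeout=10) as resp:
        answer = ocsp.load_der_ocsp_response(resp.read())

    # A responder can answer with an error status before any cert status exists.
    if answer.response_status != ocsp.OCSPResponseStatus.SUCCESSFUL:
        return f"responder error: {answer.response_status.name}"
    return answer.certificate_status.name  # GOOD, REVOKED, or UNKNOWN
```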

    Repeated/lacking appropriate entropy serial numbers, 5.8%. So, we talked about a major entropy problem that occurred earlier this year with serial numbers, where a lot of CAs had 63 bits rather than 64 bits of entropy in their serial numbers. We did a whole episode on that, and I wonder if that boosted the number on this one.

  • Jason Soroko

    That number seems low. They're probably not picking up the incident that you and I podcasted on?

  • Tim Callan

    Yeah, you know, you're right, because their cutoff was February 2019 and the incident had not happened yet. If they had just waited until May or June to do their cutoff, that number would be higher, because a bunch of CAs were affected.
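
    For reference on the serial number point, here is a minimal sketch of generation that keeps at least 64 bits of CSPRNG entropy, assuming Python's cryptography library and the standard secrets module; the helper names are ours, and the comments describe the library behavior as of the time of writing.

```python
import secrets

from cryptography import x509


def new_serial_number() -> int:
    # The library helper currently draws 20 random octets (the RFC 5280
    # maximum) and drops one bit to keep the DER integer positive, which
    # leaves far more than the BR's 64-bit entropy floor.
    return x509.random_serial_number()


def new_serial_number_by_hand(entropy_bits: int = 64) -> int:
    # Hand-rolled equivalent: draw 8 bits more than required, then set a
    # marker bit above them so the serial is guaranteed non-zero and
    # positive while every one of the 72 random bits survives. Drawing
    # exactly 64 bits and then clearing the top one to keep the encoded
    # integer positive is how a CA ends up with 63 bits of entropy.
    extra = entropy_bits + 8
    return secrets.randbits(extra) | (1 << extra)
```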

    Undisclosed sub-CA, 5%. Surprisingly high to me. I didn't perceive undisclosed sub-CAs to be a significant problem, but they found a total of 19 incidents, which is more than I would have thought.

    512/1024-bit keys were 4.75%. So, people using keys that are not strong enough. It's the kind of thing you'd expect to be seeing.

    Possible issuance of rogue certificates, 4.75%, as opposed to rogue certificate, which you referenced earlier, Jay, which is 3.17%. So, they broke out known rogue certificates and possible rogue certificates as different kinds of incidents.

    Then we've got use of SHA-1 or MD5 hashing algorithm at almost 4%, 3.96%. And presumably, I would bet you, if we were looking at the details, that those would be older incidents, right? Probably happened shortly after the deprecation of those algorithms. I bet you nobody's doing that today.
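
    As a rough illustration of how the weak-key and deprecated-hash categories can be flagged today, here is a minimal sketch assuming Python's cryptography library; the function name is ours, and the 2048-bit RSA and P-256 thresholds reflect the current BR minimums.

```python
from cryptography import x509
from cryptography.hazmat.primitives.asymmetric import ec, rsa

# Hash algorithms long since deprecated for certificate signatures.
DEPRECATED_HASHES = {"md5", "sha1"}


def weak_crypto_findings(pem_data: bytes) -> list[str]:
    """Return a list of weak-crypto findings for a single certificate."""
    cert = x509.load_pem_x509_certificate(pem_data)
    findings = []

    # Signature hash: this attribute can be None for unusual algorithms,
    # so guard before reading its name.
    algo = cert.signature_hash_algorithm
    if algo is not None and algo.name in DEPRECATED_HASHES:
        findings.append(f"signed with deprecated hash {algo.name.upper()}")

    # Key size: the BRs require at least 2048-bit RSA or a P-256-class curve.
    key = cert.public_key()
    if isinstance(key, rsa.RSAPublicKey) and key.key_size < 2048:
        findings.append(f"RSA key too small ({key.key_size} bits)")
    elif isinstance(key, ec.EllipticCurvePublicKey) and key.curve.key_size < 256:
        findings.append(f"EC key too small ({key.curve.key_size} bits)")

    return findings
```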

  • Jason Soroko

    Yes.

  • Tim Callan

    Yeah. CAA mis-issuance, on the other hand, we know has to be newer, because CAA has only been around for a few years. But CAA mis-issuance is almost the same amount, 3.7%. And then we've got a couple of others left: CA/RA/sub-CA/reseller hacked, 2.9%.
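
    On the CAA point, here is a minimal sketch of the DNS lookup involved, assuming the dnspython package; the helper name and the ca.example identifier are placeholders, and real RFC 8659 processing also climbs the DNS tree toward the root and handles issuewild, which this deliberately skips.

```python
import dns.resolver  # provided by the dnspython package


def caa_allows(domain: str, ca_identifier: str = "ca.example") -> bool:
    """Rough check of whether a domain's CAA record set permits a given CA."""
    try:
        answers = dns.resolver.resolve(domain, "CAA")
    except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
        return True  # no CAA records published at this label: issuance is not restricted

    # Only "issue" property tags constrain ordinary certificate issuance.
    issue_records = [r for r in answers if r.tag.decode() == "issue"]
    if not issue_records:
        return True
    return any(ca_identifier in r.value.decode() for r in issue_records)
```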

  • Jason Soroko

    Wow.

  • Tim Callan

    So yeah, I can think of one or two of those that have happened over the years. They count a total of 11 incidents in that category. And then they have a category called other, and that represents 10.5%. There's an explanation of what the other is. I don't remember it, but I read it, and it's just all kinds of stuff. It's all over the place. So, you know, they had to make a grab bag and that was other. So that's the breakdown of the incident types. Thoughts on that before I move on to the causes?

  • Jason Soroko

    Yes. As you said, Tim, some of these are egregious, right? Some of these are probably just, you know, things that operationally, you know, could be questioned as, jeez, is that a problem at all?

  • Tim Callan

    Right.

  • Jason Soroko

    So, this incident typing here really spans the complete spectrum from, you know, very bad to question mark.

  • Tim Callan

    And I think that's what the Bloomington team was after. And I don't want to sound like I'm criticizing the decision to include these minor things. I think including these minor things is probably right. Furthermore, we're big believers in the BRs here at Sectigo. We think they're there for a reason, and you've got to follow them. Even if the error is fundamentally syntactic rather than content-based or qualitative, it still is an error and it counts. So, we're completely on board with that viewpoint. Nonetheless, I think it's valuable to understand the difference between the two. Right?

  • Jason Soroko

    Yeah. So, Tim, here's a thought I had on a previous podcast, and it's coming up again, which is that this is perhaps a positive thing, in the sense that self-policing in the CA industry is good, and therefore things like Bugzilla are watched very carefully. A lot of good input is going into there. Things that aren't compliant with the BRs are found and checked and dealt with. But when I see things like 512- and 1024-bit keys, when I see SHA-1 and MD5 hashing algorithms, those are things where - - thankfully, these numbers are low, but I wish they were zero.

  • Tim Callan

    Right.

  • Jason Soroko

    Obviously, somebody issued these, otherwise they probably wouldn't be showing up in this report. But the fact that they're as low as they are - - I'd like to see this report in a timeframe. In other words, I'd like to see this report redone in a year, or perhaps every year, just to see if some of these numbers go down, which would be great because it would show that the self-policing of the industry actually does work.

  • Tim Callan

    Well, it's an interesting point you bring up. One of the other things they did is look at these incidents over time, and as I said, it's a huge paper because they sliced it all kinds of ways, but you see there's a giant spike in recent years in the number of incidents. From maybe about 2015 on, you see a huge spike in the number of reported incidents, and you could say, oh geez, it looks like quality is getting worse, but I am deeply skeptical of that explanation. I think what you're saying is right. Visibility is getting better. We have CT logs, which allow people to go and look at certificates in a way they never could previously. We have kind of a continuous, ongoing focus on audits and WebTrust and things along those lines. There's certainly much more participation now in the CA/Browser Forum, in terms of the number of CAs, the number of browsers, and the number of outside industry parties, than there was just a few years ago. So with more activity, more attention, and more tools, it's more likely that something erroneous is going to be discovered, or internally discovered and self-reported, because some of that happens too.

    I think that's probably why you see the spike in numbers. And of course, overall certificate usage goes up, and as overall certificate usage goes up, if the percentage of errors remains the same, you're going to get more errors. So, putting all that together, I think that does explain this big spike we see, and I think visibility is a real important part of it.
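
    As a concrete example of the visibility CT provides, here is a minimal sketch that pulls a domain's logged certificates through crt.sh's informal JSON front end over the public CT logs; the query parameters and field names reflect its behavior at the time of writing rather than a formal API.

```python
import json
import urllib.parse
import urllib.request


def ct_logged_certs(domain: str) -> list:
    """Fetch the CT-logged certificates crt.sh reports for a domain."""
    query = urllib.parse.urlencode({"q": f"%.{domain}", "output": "json"})
    with urllib.request.urlopen(f"https://crt.sh/?{query}", timeout=30) as resp:
        return json.load(resp)


# Example: a quick sense of how visible a domain's certificates are.
# for entry in ct_logged_certs("example.com")[:5]:
#     print(entry.get("issuer_name"), entry.get("not_before"))
```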

  • Jason Soroko

    It has made a big difference. That's why we support CT logs. It's why we have even supported shorter certificate lifespans.

  • Tim Callan

    Yeah.

  • Jason Soroko

    These are all things that, as a responsible CA, we respond to. This is an interesting report. Like I say, some of the not-so-important ones on the list are things we definitely need to dig into, but it's the egregious ones, like the SHA-1 certificates, where I really wish as an industry that number was zero, and we need to look into what the heck happened there.

  • Tim Callan

    And we could investigate; maybe the report has that information. We could maybe go figure out what it is, I just don't have it off the top of my head. Or maybe we also want to talk to these researchers, and they might be able to give us more insight on this.

    I think there's one other thing I'd love to hit before we leave, which is the causes. So, again, they have a chart. This is their Table 11, and again, I'm just going to run down the chart. It's a little shorter than the other one. And we'll talk about how they categorize the causes.

    So, the first cause is software bugs and the percentage of incidents that they attribute to software bugs is 24%. So that, you know, that makes sense to me. These are automated systems; they're running with a lot of software and if there's a software error, you know, where it's inputting the wrong value then that could be perpetuated across certificates before it gets discovered and fixed. And you know, software has bugs, it happens.

    The second one is interesting. Believed to be compliant/misinterpretations/unaware, at 18.2%. So, this is just CAs not interpreting requirements correctly: doing something that they believe is compliant, other people disagree, and presumably, if it's on this list, the other opinion won out in the long run.

    The third one, business model, is interesting. Business model/CA decision/testing, 13.7%, and they go into details on this. They talk about how a CA's business model could be in opposition to the overall public trust, right? If you're in the business of selling certificates to skeevy people, then skeeviness is rewarded. And these researchers feel that nearly 14%, nearly one in seven, of the incidents belongs to that cause. So, wow, that's kind of high. That's higher than I wanted it to be and higher than I imagined it would have been. So that was a takeaway for me.

    Human error, 9.8%. And of course, especially if you're going back to 2008, that kind of timeframe, these were very human-intensive processes. I think over the years they have become much more software-automated, and I would hope we would see the human error number go down, just because we can solve those problems with computers.

    Operational error. I am not sure exactly what the definition of that is, but 7.6%.

    Non-optimal request check, 6.3%. So, I think that means the actual authentication, the process of authenticating the identity, is maybe not performed as well as it could have been. Someone makes a mistake. They've got that at 6.3%.

    Improper security controls, 4%. So again, it would be interesting to dig into that and see what that is.

    Change in baseline requirements, 1.85%. So, BRs get changed and CAs don't get the memo, or don't change successfully, or don't change quickly enough. And that winds up accounting for nearly 2% of the incidents on the list. Again, it's important to stay compliant and current, but that's probably more forgivable than a business model decision.

    Infrastructure problem, 1.6%.

    Organizational constraints, 1.6%. What's that? Not enough resourcing? Not the right language skills? Not the right cultural knowledge? Something like that.

    Other, 2.1%, and no data, 9.2%. So, for 9.2% of them, they didn't feel like they could answer why the incident happened. But that's the breakdown. Again, software bugs is the biggest and differing interpretations of the BRs is the second biggest, and between those two, that accounts for more than 40% of the incidents.

  • Jason Soroko

    I'd love to see a mapping between these two tables.

  • Tim Callan

    Yeah. There actually is one. I just don't know how to communicate it over a podcast. But it's interesting. Some of the hotspots, some of the things you see the most of: the intersection of human error with BR noncompliance is very high. The intersection of believed to be compliant/misinterpretations/unaware with BR noncompliance is very high. Those wouldn't necessarily surprise you, right? The intersection of software bugs and BR noncompliance is very high. So those are, you know, things you'd expect, and again, that category is nearly 40% of the incidents. So, you get a lot of that.

  • Jason Soroko

    Just for curiosity, Tim, the SHA-1 incidents, what were they caused by?

  • Tim Callan

    For the SHA-1 incidents, the biggest cause was human error.

  • Jason Soroko

    Interesting.

  • Tim Callan

    Second behind that was believed to be compliant/misinterpretation and tied with that is business model/CA decision.

  • Jason Soroko

    Wow.

  • Tim Callan

    So, you know, oh, you want that? Sure. I'll give you that. I don't mind. Your money is still green. Right? That might be what that was. So again, these guys have taken nearly 400 things and crunched the data in various ways. There are details sitting behind all 400 of those, but all of that didn't make it into the paper. So, to some degree, we're looking at the statistics and interpreting. But, you know, I thought this was just interesting. I thought it was worth sharing and thinking about what these researchers discovered in terms of the root causes of incidents over the last 11 years.

  • Jason Soroko

    Yeah, thanks, Tim. It's interesting. And, you know, I think we should endeavor to reach out to these researchers and have them on. I think the audience of this podcast would probably like to hear a little bit more directly from them especially in terms of their methodology and some of the findings they may have found interesting as well.

  • Tim Callan

    So, let me leave you with one more thing, because you were asking about rogue certificates, and then we can close it out. They do have a table. Their Table 20 is the causes of rogue certificates, so I'm just going to read the raw numbers here. They add up to 33. So, of 33 incidents, here's how it goes. Believed to be compliant/misinterpretation, 1. Software bugs, 9. Business model/CA decision, 3. Improper security controls, 5. Non-optimal request check, 15. So that's the big one, right? A rogue certificate just because the authentication was done poorly. That's nearly half.

  • Jason Soroko

    Yeah. So, you know, it's funny; human error, as you brought up earlier - - it used to be such a manual effort to do a lot of these things. I think automation and double checking are the answer. In other words, that last point, non-optimal request check, really highlights the fact that automation needs to be brought to bear to bring these numbers to zero.

  • Tim Callan

    Absolutely. But of course, that is a two-edged sword, because the danger with automation is you get something dumb, right? You're replacing every numeral one with a lowercase L and you pound it out across 10,000 certificates before anybody realizes. So, yes, I concur. Automation is key, and I'm a big fan of automation, but we need to be conscious of the danger when automation and software error collide.

  • Jason Soroko

    Oh, heck yeah. And therefore double checking, you know, triple checking needs to be part of this. It's just - -

  • Tim Callan

    Yes. Internal audits, self-policing, these sorts of things are critically important for a CA to be able to maintain quality, especially a volume CA.

  • Jason Soroko

    Interesting report, Tim. Thank you.

  • Tim Callan

    Super interesting, and I'll work on it. I'll reach out to these guys, and we'll see if maybe we can get a guest. But either way, I just thought this was good to share with the listeners.

  • Jason Soroko

    Wonderful.

  • Tim Callan

    Thank you, everybody. This has been Root Causes.