Machine learning has infiltrated the world of security tooling over the last five years. That’s part of a broader shift in the overall software market, where seemingly every product is claiming to have some level of machine learning. You almost have to if you want your product to be considered a modern software solution.
This is particularly true in the security industry, where snake-oil salesmanship is pervasive and vendors typically aren't asked to rigorously defend their claims. Many vendors just say, "it's machine learning," wave their hands, and expect you not to ask further questions.
I tested this theory at the recent Black Hat conference in Las Vegas, where I asked every vendor what they did, and more often than not, the answer included some version of “our machine learning…” When I tried to dig into what that meant, there was really no answer.
With this in mind, I want to debunk a number of myths about machine learning in the security industry, and show how you can avoid pitfalls in your own journey to keep your critical business applications and infrastructure secure.
In the field of computer science, "machine learning" is fairly well defined: it's a particular set of algorithms, generally classified into four or five families, that aim to improve a computer's ability to perform a certain task. When a vendor says "machine learning," they may be referring to something else entirely. In a lot of cases, what they're doing is advanced statistics: not machine learning algorithms, just well-chosen mathematics.
Don't get me wrong, those things are useful. Concepts like ratio-based alerting let you go beyond a simple average by using different types of distributions, or thresholds set in standard deviations. There's a variant of the average called a "winsorized mean" that's pretty useful: instead of letting outliers skew the result, you clip the most extreme values back to a cutoff before averaging. There are all these techniques that aren't ML, but can produce ML-like effects.
They are certainly more convenient to develop, because they're just math, and programmers know math. They don't require the training data sets that machine learning does. So, oftentimes, "machine learning" can just mean "really good math."
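To make the "really good math" point concrete, here is a minimal sketch of the two ideas above: a winsorized mean for a robust baseline, and a standard-deviation threshold for alerting. The login counts are invented for illustration; this is just well-chosen statistics, no training data required.

```python
import statistics

def winsorized_mean(values, k=1):
    """Winsorized mean: clip the k smallest and k largest values back to
    the nearest remaining value before averaging, so a single wild
    outlier can't drag the baseline around."""
    s = sorted(values)
    s[:k] = [s[k]] * k          # clip the low tail
    s[-k:] = [s[-k - 1]] * k    # clip the high tail
    return statistics.mean(s)

def should_alert(observed, baseline, z=3.0):
    """Flag a reading more than z standard deviations from the baseline."""
    mu = statistics.mean(baseline)
    sigma = statistics.pstdev(baseline)
    return sigma > 0 and abs(observed - mu) > z * sigma

# Hypothetical daily login counts with one freak spike.
logins = [102, 98, 110, 95, 104, 5000]
print(round(winsorized_mean(logins), 2))  # 103.67: the spike no longer dominates
print(should_alert(4800, logins[:-1]))    # True: far outside the normal range
```

Nothing here learns from data over time, which is exactly the point: a few lines of robust statistics can produce ML-like alerting effects.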
There’s been a whole debate among technocrats, and particularly people who are trained in comp-sci, about the use of the term “artificial intelligence” versus the term “machine learning.”
In computer science circles, those two things are very distinct from one another and very well-defined: machine learning is a subset of the field of artificial intelligence research. A lot of times, when vendors say, “AI-powered cyberthreat hunting,” from a technical perspective, what they actually mean is machine learning, or statistics, and they’re just calling it AI.
We've reached the point now where the popular connotation may overtake the technical definition: "AI" means whatever the person using the term thinks it means. The myth here, for comp-sci people, is that the public imagination draws any meaningful distinction between artificial intelligence and machine learning, and that has certainly begun to apply to vendor claims too.
Machine learning is a set of algorithms, as many as a few hundred. If you Google “machine learning algorithm family tree,” you’ll find massive graphs of families of algorithms that have grown up over the last 10 years or so, and actually entered into the security space probably within the last five years.
Machine learning is not a monolith, and broadly speaking, one way to categorize those algorithms is “supervised learning” versus “unsupervised learning.” Supervised learning is essentially where you have a data set that has already been labeled. In other words, someone has gone through structured data and said, for example: “This image is of a horse. This image is not of a horse.”
Supervised learning requires a lot more grunt work, in terms of labeling the data, unless you’re taking a data set off the shelf. The problem: in security, it’s really only good at spotting very specific types of threats, or very specific types of behavior, because it relies on that behavior being present in the training data set.
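A toy sketch of what "labeled data" buys you in supervised learning: a nearest-centroid classifier that averages the feature vectors for each label, then assigns new traffic to the closest average. The traffic features and labels are made up for illustration, and real systems use far richer models, but the dependence on labels is the same.

```python
import statistics

# Labeled training data: (requests_per_minute, distinct_paths) -> label.
# Someone had to do the grunt work of labeling each row. All numbers invented.
training = [
    ((5, 3), "benign"), ((8, 4), "benign"), ((6, 2), "benign"),
    ((300, 80), "scan"), ((250, 95), "scan"), ((280, 70), "scan"),
]

def centroids(samples):
    """The 'learning' step: average the feature vectors for each label."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {label: tuple(statistics.mean(dim) for dim in zip(*vecs))
            for label, vecs in by_label.items()}

def classify(model, features):
    """Assign the label of the nearest centroid (squared Euclidean distance)."""
    return min(model, key=lambda lbl: sum((a - b) ** 2
                                          for a, b in zip(features, model[lbl])))

model = centroids(training)
print(classify(model, (7, 3)))     # "benign": looks like normal traffic
print(classify(model, (260, 85)))  # "scan": looks like the labeled scans
```

Note the limitation the article describes: this model can only ever recognize behaviors that resemble something in its labeled training set.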
Unsupervised learning tries to capture patterns in data that has not been labeled. It's easier in terms of human intervention, because you skip the grunt work of labeling, but it requires massive data sets and years of tuning. The problem with unsupervised learning is that you have to ship it to your customer base while it is essentially useless, before it knows anything, and it will remain of limited use for some number of years while it gets tuned.
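By contrast, here is a minimal unsupervised sketch: a rarity detector that flags any event value occurring in less than some fraction of the (unlabeled) data. The process names and the 1% threshold are invented for illustration; note how the data itself defines "normal," which is why this kind of approach needs a large sample and tuning before its output is worth anything.

```python
from collections import Counter

def rare_events(events, max_fraction=0.01):
    """Unsupervised rarity detection: no labels, the data defines normal.
    Flag any event value seen in less than max_fraction of observations."""
    counts = Counter(events)
    total = len(events)
    return sorted({e for e in counts if counts[e] / total < max_fraction})

# Hypothetical process names from endpoint telemetry (all made up).
procs = ["sshd"] * 500 + ["nginx"] * 480 + ["cron"] * 19 + ["xmrig"] * 1
print(rare_events(procs))  # ['xmrig']: too rare relative to everything else
```

On a small or unrepresentative sample, this flags plenty of benign one-offs, which is the "pile of false positives" problem in miniature.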
Depending on who a vendor is, what kinds of data sets they use, and how their machine learning works, they will claim that one approach is better than the other, and that their version is better than anybody else's.
Customers need to be practiced in the art of digging into the details of machine learning so they can ask the right questions. As I mentioned before, during my exercise at Black Hat, nobody really had any answers for me.
When you're looking for a vendor, ask if they conduct supervised or unsupervised machine learning. Ask if their data is anonymized (it can't be traced back to its owner) or merely de-identified (personal information is scrubbed). You need to ask these questions to evaluate the vendor's claims, and to gauge whether they know what they're talking about. If a vendor flat out can't tell you what their machine learning is, they're probably not worth talking to.
If, as a vendor, you don’t do machine learning very carefully, and you don’t get it exactly right before giving it to your customer, you’ve essentially just given them a new pile of false positives that they have to worry about. Vendors should be able to point to a track record of success to validate their claims.
Finding contemporary attacks using machine learning is very difficult, particularly as attackers "live off the land." They try to use the same tools that systems administrators and DevOps people are using, so as much of their traffic as possible looks legitimate. Your supervised machine learning was not trained to flag legitimate-looking traffic, and your unsupervised machine learning doesn't see it as anomalous. It's hard to catch them for that reason, and it's the classic "false positive, true positive" story all over again.
Either you’re missing stuff, or you’re introducing a lot more noise for people to investigate. In this case, you’re letting a machine do it, which is even riskier.
Stay tuned for more debunking of cybersecurity myths in the coming weeks!