I’ve Got 99 Cyber Security Problems and Accuracy is One
The successes of machine learning and artificial intelligence (AI) have spurred excitement and innovation in nearly every field imaginable. Part of the reason for this is the volume of data generated by various processes. Security vendors are no different; everyone wants to be the first to release the next “SUPER VULNERABILITY MACHINE 5000”.
Given the hype, it is incredibly important that people be informed not only about the limitations of machine learning but also about the necessity of domain expertise and human validation in the security space.
In our company we have deep domain expertise in cyber security and AI; we use AI for automation and early warning of weaponized vulnerabilities. In the last two years, we developed DABOMB and released “KOADIC C3 Hacking Tool” at DEFCON.
We continue to demonstrate how AI can assist with improving vulnerability and threat prioritization and proactively reduce cyber risk. As part of our software as a service (SaaS) solution for validating security controls and attack methods, we take our proprietary exploit development and weaponization capabilities and use the power of automation and AI to turbocharge known activities with precision.
Questions like “Does the quality and quantity of data justify using machine learning or AI?”, “How complete is the dataset? Is it balanced or sparse?”, and “How accurate is that algorithm, really?” are important for anyone considering integrating machine learning into their risk management toolbox.
As quoted in Domain Expertise and AI: Conquering the Next Generation of Cyber Threats, “…but AI cannot do it alone; it must be critically informed by domain expertise, or it could become just another buzzword.”
Let us consider the first question, “Does the quality and quantity of data justify using machine learning or AI?” Like most questions worth answering, it depends on the situation. Machine learning could conceivably provide meaningful results and insight to risk analysts, depending on the data. For example, large organizations might have millions or billions of data points consisting of heterogeneous data such as log files, network packet captures, malware samples, vulnerability scanner results, and so forth.
To the uninitiated, it sounds very impressive to tell a potential customer that your fancy, cloud-based machine learning engine was built using all of that data, but to a scrupulous scientist, especially in the security space, such claims sound more and more like a snake oil pitch.
As an example, imagine a machine learning engine that claimed to derive its information from millions or billions of vulnerabilities. This seems like a fantastic (bogus) claim because there are only around 100,000 Common Vulnerabilities and Exposures (CVEs) in the NVD. Taking this into consideration, it seems more likely that the “billions” of vulnerabilities are actually duplicates from different vulnerability scanners, duplicate vulnerabilities derived from client networks, or some similar factor.
This consideration could decrease the effective sample size for the algorithm by several orders of magnitude (e.g., turning 1,000,000,000 into 100,000). Differences like this can have a significant impact on models, especially those of high dimensionality.
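The deduplication effect is easy to sketch in a few lines. The scanner findings below are hypothetical, but they show how raw “data points” collapse once you key them on CVE identifiers:

```python
# Hypothetical scanner findings: multiple scanners and hosts report the
# same underlying vulnerability, inflating the raw data-point count.
findings = [
    {"scanner": "A", "host": "10.0.0.1", "cve": "CVE-2017-0144"},
    {"scanner": "B", "host": "10.0.0.1", "cve": "CVE-2017-0144"},
    {"scanner": "A", "host": "10.0.0.2", "cve": "CVE-2017-0144"},
    {"scanner": "A", "host": "10.0.0.3", "cve": "CVE-2016-6366"},
]

# For a model that learns about vulnerabilities (rather than individual
# findings), the effective sample size is the number of unique CVEs.
unique_cves = {f["cve"] for f in findings}
print(f"{len(findings)} findings -> {len(unique_cves)} unique CVEs")
```

Scale the same ratio up and a “billion-point” dataset can shrink to a far smaller effective sample.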
The previous analysis has implications for the second question, “How accurate is that algorithm, really?” For example, if a model requires 1,000,000 data points to train but the effective sample size ends up being significantly smaller, prediction accuracy will suffer. Similarly, imagine a problem where a machine learning algorithm needs to differentiate between English and Spanish webpages and is trained on data containing only 1% Spanish.
The algorithm might, in a controlled setting, achieve >90% accuracy. However, an algorithm could trivially label every example as English and technically still achieve 99% accuracy. This is problematic if the application is taking random samples of English and Spanish webpages on the Internet, since the real Spanish-to-English ratio is not likely to match the 1:99 ratio of the training data. Even worse, such a trivial algorithm could potentially outperform a true machine learning algorithm.
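The trivial-classifier trap is easy to demonstrate. The numbers below are made up to mirror the 1% Spanish training mix described above:

```python
# Hypothetical labels: 1% Spanish ("es"), 99% English ("en").
labels = ["es"] * 10 + ["en"] * 990

# A "trivial" classifier that ignores its input and always predicts English.
predictions = ["en"] * len(labels)

correct = sum(p == t for p, t in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall on the minority class: how many Spanish pages were actually caught?
spanish_caught = sum(p == "es" and t == "es"
                     for p, t in zip(predictions, labels))
spanish_recall = spanish_caught / labels.count("es")

print(f"accuracy: {accuracy:.0%}")              # prints "accuracy: 99%"
print(f"Spanish recall: {spanish_recall:.0%}")  # prints "Spanish recall: 0%"
```

Accuracy alone hides the fact that the classifier never finds a single Spanish page, which is why per-class metrics matter on imbalanced data.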
Clearly, these are serious issues, and not only because they could cost an organization millions of dollars in license fees, services fees, legal fees, and other expenses. The cost is not limited to money; society’s increased reliance on technology has led to the Internet of Things (IoT) revolution. Some of the world’s most critical infrastructure, such as power plants, waste treatment facilities, defense installations, and hospital equipment, is connected to the Internet. Attacks on these targets have the potential to cost millions of human lives.
Stakes like these justify, and morally and ethically obligate, anyone considering machine learning technologies to ask questions such as the two addressed in this post. This post is not meant to exhaustively explore these difficult questions but to remind people of the high stakes in this domain and what drawing the wrong conclusions from misrepresented data can cost.
All hope is not lost, however, as long as the organization in question is diligent in their data validation processes. At RiskSense, we draw on many industry data sources vetted by tens to hundreds of specialized scientists and engineers. Our team of threat analysts, malware reversers, and exploit writers validate data and corresponding statistics about the data feeds, from which we derive our RiskSense Security Score (RS3).
Sean Dillon and Dylan Davis were among the first to perform analysis on DoublePulsar and have also released exploits such as EternalBlue (MS17-010), ExtraBacon, and SMBLoris.
Sean Dillon, one of our top Security Analysts, demonstrated that it is incredibly easy to fool machine learning engines for malware detection with a simple “Hello World” program. The experiment by “zerosum0x0” gained attention because the products flagging the code are advanced defense systems whose vendors promote their use of machine learning.
The code in question can be seen in images on Twitter. Again, this is training code, something all novice coders write. Why, then, is such basic code flagged on VirusTotal as suspicious, harmful, or outright malicious by notable vendors such as Cylance, Sophos, McAfee, SentinelOne, CrowdStrike, and Endgame?
RiskSense uses several data feeds for threat analytics, and analysts like Dylan and Sean sanity-check threat feed statistics. Consider Offensive Security’s Exploit DB, which contains 40k+ exploit-to-CVE mappings but only 21,770 unique CVEs.
Similarly, our internal CVE-to-malware mappings number in the tens of thousands, but just 2,110 unique CVEs correlate with malware families. These numbers might seem surprisingly low considering all the talk from vendors about million- to billion-point data sets.
To a human analyst, though, these numbers seem more reasonable. Take the 2,110 unique CVEs correlated to malware families. This number becomes plausible in a few situations. First, assume that malware authors are lazy: they want the quickest return on investment. Why would they spend time finding a new exploit or writing new malware when they can easily clone a GitHub repo, modify some code, and have a totally “unique” piece of malware? Malware also does not necessarily need an exploit; attackers could just as easily social engineer a target into installing a trojan with administrator privileges.
In the exploit example, it is totally reasonable for multiple individuals to submit different exploits for the same CVE. This is the case for many of the exploits in Exploit DB. Additionally, people will update exploits or port them to another language, for instance rewriting a standalone exploit as a Ruby Metasploit module.
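The many-to-one shape of this data can be sketched with a few made-up Exploit DB-style entries (the titles and CVE assignments below are illustrative, not actual database records):

```python
from collections import Counter

# Hypothetical exploit entries: several exploits can target one CVE
# (different authors, updated versions, Metasploit ports, etc.).
exploits = [
    {"title": "EternalBlue SMB RCE (Python PoC)", "cve": "CVE-2017-0144"},
    {"title": "EternalBlue SMB RCE (Metasploit)", "cve": "CVE-2017-0144"},
    {"title": "EternalBlue SMB RCE (updated)",    "cve": "CVE-2017-0144"},
    {"title": "ExtraBacon ASA overflow",          "cve": "CVE-2016-6366"},
]

# Count how many exploit entries map to each unique CVE.
per_cve = Counter(e["cve"] for e in exploits)
print(f"{len(exploits)} exploit entries -> {len(per_cve)} unique CVEs")
for cve, n in per_cve.most_common():
    print(f"  {cve}: {n} exploits")
```

An analyst who knows this context expects the entry count to exceed the unique-CVE count; a model fed the raw entries does not.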
It is this context that a machine learning algorithm won’t capture (sorry, but we haven’t solved the problem of general intelligence). That is why it is vital not only to scrutinize incredible claims but also to understand that, at present, human analysis is crucial to building a 90%+ accuracy machine learning algorithm that reflects reality and a “common sense” notion of accuracy, not some trivialization of it. RiskSense is built on this idea, and it is why we are leaders in this industry.
Stay tuned for our Weaponization Analysis Series: How we at RiskSense use these insights to assist with Vulnerability and Threat Prioritization and Attack Surface Validation.