False-Negative Data Exploration in a Machine-Learning-Powered SOC
INTRODUCTION
Imagine yourself in the shoes of a Security Analyst working at a SOC. You get to work, fill your coffee mug and settle in when the SOC Manager comes by to hand you a 4,000-page PDF containing 1.6 million domain names. Your task: spot threats in the given data [and submit findings by end of day!]. SOC Analysts are expected to explore volumes of data and identify threats, but is data exploration at such scale, with accuracy and under such deadlines, plausible? The truth is that such questions are no longer up for debate; tasks like this are now viewed as routine for SOC Analysts.
To get under the hood of such tasks, I present a real-world Proof of Concept (PoC) using the infamous Necurs Botnet data to help readers understand where current tooling falls short and how parties involved in threat hunting across massive data sets should set expectations if they choose to employ predictive analytics. The entire Necurs Botnet data set comprises over 6 million domain names, so the task could just as well have been for one analyst to spot threats in 6 million domain names in a day.
Machine-learning-powered predictive analytics tools in a SOC environment promise to make such massive-scale threat detection tasks a reality. Such tasks are now indeed doable to a standard that satisfies real-world SOC requirements, albeit with a caveat: the task can be done, but with a small percentage of error. Some tools claim prediction accuracy of up to 97%. Sounds promising! Well, then why not 100%? And what happens to the 3% of inaccurately predicted data? That is precisely what this article is about.
MOTIVATION
I undertook this stoic mission after learning about Microsoft's takedown of the Necurs infrastructure. The Necurs Botnet, as we know, was the world's largest online crime network, its spam emails reaching every corner of the globe. This made planning the experiment all the more enthralling. This original writing follows a live experiment that I conducted from scratch using the data published in the court records after the court unsealed the order enabling Microsoft to take control of the botnet.
This writing is primarily intended for SOC Analysts and Detection and Response teams who may already be employing predictive analytics in their day-to-day operations without realizing some of the risks such tools may introduce. It may also interest those who currently use (or intend to use) any ML-powered predictive analytics tool for massive data exploration, threat hunting and threat prediction in a corporate setting. Several such products and services are already in use in SOCs, and this article may help consumers of ML-powered predictive analytics understand certain strengths and weaknesses of ML in security operations that were learned during the course of this experiment.
APPROACH
Data was purchased from Public Access to Court Electronic Records (PACER) and comprised solely of Necurs DGA botnet domain names that were taken down; an unusual place, indeed, to obtain open data that was not available on the open web. After parsing domain names out of more than 4,000 pages of PDF-format court records and converting them to CSV, there were just over 1.6 million Necurs Botnet DGA domain names to analyze.
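For readers who want to reproduce the parsing step, a minimal sketch of one way to do it is shown below. It assumes the court records have been saved as a local PDF (the file name necurs_court_records.pdf is illustrative) and uses pdfminer.six for text extraction; the regular expression is a rough "label.tld" pattern and would need tuning against the real documents.

```python
# Sketch: extract candidate domain names from the court-record PDF and save to CSV.
# Assumes pdfminer.six is installed (pip install pdfminer.six); the file name is illustrative.
import csv
import re

from pdfminer.high_level import extract_text

# Rough pattern for "label.tld" style domain names; tighten as needed for the real PDF.
DOMAIN_RE = re.compile(r"\b[a-z0-9][a-z0-9-]{1,62}\.[a-z]{2,10}\b", re.IGNORECASE)

text = extract_text("necurs_court_records.pdf")          # hypothetical file name
domains = sorted({m.group(0).lower() for m in DOMAIN_RE.finditer(text)})

with open("necurs_domains.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["domain"])
    writer.writerows([d] for d in domains)

print(f"Extracted {len(domains)} unique candidate domain names")
```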
These 1.6 million domain names were used as the starting point with one mission: to develop a predictive analytics tool that predicts whether a given domain name is DGA or non-DGA. For the curious, case number 1:20-cv-01217-LDH-RER is worth reading page to page. [The case file is available on my Git.]
After obtaining the data, several industry and academic papers were referenced, and a set of papers that disclosed their data and code was identified. Research-grade code and open data were used to conduct this experiment on AWS. Several papers and works made claims of varying prediction accuracy; however, no formal work was found whose research focused on false-negative reduction.
INITIAL DATA EXPLORATION STEPS
The SOC is equipped with a variety of commercial, open-source and internally developed tools that could assist the analyst in researching domain names. SIEMs would be a natural choice for importing data to explore. However, which tool or service can be applied to exploring data that is voluminous and of which the analyst has zero knowledge? How does a Security Analyst check whether a given domain name is suspect or not, assuming no external threat intel or metadata is available, and no data enrichment, DNS lookup data or reputation data either? How does one proceed?
Quick eyeballing of the parsed data in Excel showed 12 TLDs with occurrences ranging from 80K to 220K. I looked at the top TLDs: .mn (Mongolia), .cc (Cocos Islands), .sc (Seychelles), .co (Colombia) and .tv (Tuvalu). Except for Mongolia, these all appeared to be islands. Was the malware author planning a campaign to target islands? That was the first question that popped up during the first pass over the data summary. The remaining TLDs were popular ones, which was not surprising.
DGA domain names can be broadly categorized into those composed of random characters and those that embed dictionary words alongside other numeric or string data. The data obtained from the court records contained a mix: some domain names contained only random strings, and some contained English dictionary words mixed with strings.
This experiment also incorporated the openly available Alexa 1 Million list and other publicly available known DGA data to train and improve model accuracy, expanding model training/testing to 40+ DGA botnet families with over 97% accuracy [and 3% error]. The main focus was initially on reducing errors, but it later turned to exploring the false-negative data.
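The article does not publish the exact architecture used, but the general shape of such a classifier is well known: a character-level model trained on labelled benign (e.g. Alexa) and DGA domains. The sketch below is one plausible minimal version using Keras (an embedding layer feeding an LSTM); the layer sizes, epochs and the benign.csv / dga.csv file names are assumptions, not the experiment's actual configuration.

```python
# Sketch: a minimal character-level DGA classifier in Keras.
# Layer sizes, epochs, and the input CSV file names are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 63  # longest possible DNS label

def encode(domains):
    """Map each character to an integer id (capped at 128); pad/truncate to MAX_LEN."""
    seqs = [[min(ord(c), 128) for c in d.lower()] for d in domains]
    return pad_sequences(seqs, maxlen=MAX_LEN)

benign = pd.read_csv("benign.csv")["domain"]   # e.g. Alexa 1M second-level domains
dga = pd.read_csv("dga.csv")["domain"]         # e.g. parsed Necurs + other DGA feeds

X = encode(pd.concat([benign, dga]))
y = np.concatenate([np.zeros(len(benign)), np.ones(len(dga))])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=7)

model = models.Sequential([
    layers.Embedding(input_dim=129, output_dim=32),  # one embedding per character id
    layers.LSTM(64),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),           # probability that the domain is DGA
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_tr, y_tr, epochs=3, batch_size=512, validation_data=(X_te, y_te))
```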
During this experiment, it was humbling to learn that when classifying large numbers of domain names using machine learning or deep learning techniques, there will always be a small percentage of data that is incorrectly predicted and classified as good (non-DGA) when it is in fact bad (DGA). In other words, false-negative data is bound to be generated in predictive analytics, and this risk needs to be understood and managed.
The experiment revealed that the problem of exploring false-negative DGA domain data, data erroneously predicted as non-DGA when it is in fact DGA, is non-trivial. What is a false negative in the context of DGA domain names? Hopefully readers find the graphic below interesting and intuitive: the false negatives are the undetected 3% of real wolves (in the bottom-left quadrant). The SOC is quite familiar with the problem and burden of managing false positives, but what about false negatives?
The Deep Neural Network model used in this experiment initially yielded an accuracy of 97% with an error rate of 3%. An error of 3% means that 3 out of every 100 algorithmically generated domain names are false negatives and pose a real threat if left unaddressed. The data exploration problem, then, was to examine the false-negative data set and formulate techniques to understand why the deep learning techniques were unable to predict it accurately.
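To make the 3% concrete: once predicted labels are in hand, the false negatives are simply the DGA domains the model scored as benign. A minimal sketch, continuing the hypothetical Keras model and test split above, with a 0.5 decision threshold as an assumption:

```python
# Sketch: pull out the false negatives (true DGA, predicted benign) from the test set.
# Continues the model/test split from the earlier sketch; the 0.5 threshold is an assumption.
import numpy as np
from sklearn.metrics import confusion_matrix

probs = model.predict(X_te).ravel()
preds = (probs >= 0.5).astype(int)          # 1 = DGA, 0 = benign

tn, fp, fn, tp = confusion_matrix(y_te, preds).ravel()
print(f"accuracy={(tp + tn) / len(y_te):.3f}  false negatives={fn}")

# Indices of domains labelled DGA but predicted benign -- the data set to explore.
false_negative_idx = np.where((y_te == 1) & (preds == 0))[0]
```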
After reviewing several academic and industry papers on approaches to exploring false-negative data, it appeared that there is no clear, straightforward method to explore such data or to obtain near-zero (or zero) false negatives. The business need for zero error, however, is undisputed.
After all, just one false-negative C2 domain is enough to damage Victim-0 (an unknown true-positive DGA), and hence the quest for perfection (zero false negatives) appears justified. After browsing academic papers, I made attempts to pair models and chain models, but no combination could consistently yield zero or near-zero false negatives. [See image above.] Solution architecture ideation included humans in the loop to provide inputs to the model upon discovery of an inaccurate prediction.
False negatives are a bigger concern than false positives because they result in real threats going undetected. Reducing error is therefore not sufficient; the error needs to be eliminated. But how? What is it about this slice of the data set that prevents it from being predicted accurately?
During the course of this experiment, I realized that most measures and computations applied to the analysis of algorithmically generated domain names were developed to analyze numerical data. Surprisingly, they appeared to give good results, but the fact remains that the domain name problem lives in the natural language domain. Hence, natural language analysis techniques could be explored against the false-negative data set.
After fine-tuning the model, another 1,000 Necurs Botnet domain names published by Microsoft were tested. The model predicted 967 domain names accurately (96.7%) and 33 inaccurately (3.3% false negatives). A quick review of the true-positive and false-negative data sets appeared to indicate the presence of English dictionary words in the false-negative data.
Concerning the exploration of false-negative data, what options does the SOC Analyst have at this point?
- Option 1: go back to the ML drawing board and tweak the model. But what tweaks?
- Option 2: get more data to train the neural network, or "troubleshoot" the data used for training. But what kind of data can reduce false negatives?
- Option 3: perform data exploration on the 3% false negatives. But how?
- Option 4: submit the 97% accurate findings and the 3% error to the SOC Manager (a tardy job, passable in some environments; it covers "insurance needs").
- Option 5: manually test false negatives till the cows come home; 3% of the given 1.6 million domain names is 48,000 domain names to test manually.
- Option 6: commence a "data fusion" project, because the 3% false negatives carry several uncertainties and only additional data can reduce uncertainty.
Uncertainty, by definition, is a lack of knowledge about a value or outcome, most often expressed quantitatively. How, then, does a SOC Analyst characterize the uncertainty associated with false-negative data, zero-knowledge data produced downstream of ML?
Uncertainty is classified as aleatory or epistemic, depending on whether it is due to random variation or to unknown factors. In the context of predicted false-negative DGA domain data, we could not say with any certainty whether the uncertainty associated with the false negatives is aleatory or epistemic. This is a non-trivial problem for non-experts and, perhaps, for experts too. Nevertheless, as a SOC Analyst, one is expected to be no less than a magician: data must be explored and informed decisions must be made even in a zero-knowledge environment.
Information Uncertainty within False Negative Data
An attempt to extract information from the zero-knowledge false-negative data was made by computing the entropy of every false-negative domain. Entropy provided a measure of the information content of these unintelligible strings of characters called domain names. Now I had some knowledge of the false-negative data and some interesting observations too. Key observation: the entropy of false-negative DGA domains was consistently lower than that of true-positive DGA domains. The exercise, at this moment, felt like building an aircraft in flight and correcting course without an artificial horizon, even as new knowledge became available to correct the course. Challenge: where to steer with this new knowledge? Abandon the mission or continue data exploration?
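As a concrete illustration of the entropy computation, the sketch below calculates Shannon character entropy for a domain label; comparing the distribution of scores for false negatives against true positives is how the "lower entropy" observation above could be reproduced. This is the standard formula, not the experiment's exact code, and the two example domains are made up.

```python
# Sketch: Shannon entropy of the characters in a domain label.
# H = -sum(p_i * log2(p_i)) over the character frequencies p_i.
import math
from collections import Counter

def shannon_entropy(label: str) -> float:
    counts = Counter(label)
    total = len(label)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Dictionary-word DGA names tend to score lower than random-character ones.
print(shannon_entropy("qkfgkpxmrdtwvhyz"))   # random-looking: higher entropy
print(shannon_entropy("tabledistrict"))      # dictionary words: lower entropy
```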
Continuing the mission, the 3% of false negatives with lower relative entropy were examined closely. A pattern emerged when comparing them with the true positives: the falsely predicted domain names contained English dictionary words, whereas the true positives were strings of meaningless characters. [This observation was later confirmed while randomly sampling 44 DGA bot families.] Was this problem (of exploring the false-negative data) a problem in the realm of natural language, requiring the application of computational linguistics? Where does a Security Analyst turn to apply NLP to domain names? There was a need to examine the lexical features, i.e. the textual properties of the URL itself, which is clearly not possible manually by the human eye.
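One lightweight way to test the dictionary-word observation programmatically is to scan each label for embedded English words. The sketch below uses the Unix word list at /usr/share/dict/words as its dictionary (an assumption; the path varies by system and any word list would do) and returns the words it finds.

```python
# Sketch: rough check for embedded English dictionary words in a domain label.
# Uses the Unix word list at /usr/share/dict/words (path varies by system).

def load_words(path="/usr/share/dict/words", min_len=4):
    with open(path) as fh:
        return {w.strip().lower() for w in fh if len(w.strip()) >= min_len}

WORDS = load_words()

def embedded_words(label, words=WORDS, min_len=4):
    """Return dictionary words of length >= min_len found as substrings of the label."""
    found = set()
    for i in range(len(label)):
        for j in range(i + min_len, len(label) + 1):
            if label[i:j] in words:
                found.add(label[i:j])
    return found

print(embedded_words("tabledistrict"))    # e.g. {'table', 'district', ...}
print(embedded_words("qkfgkpxmrdtwvhyz")) # usually empty for random-character labels
```

Scoring false negatives and true positives with a check like this is one way to quantify how strongly the dictionary-word pattern separates the two groups.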
A quick survey of the academic literature on deep learning for DGA domains revealed the use of WHOIS and DNS query/response data as side information to further characterize whether a domain name is malicious or benign. Some researchers identified novel features such as internet connection speed to the DNS server, TTL and others as indicators of malicious domains, since bad actors are known to host DNS servers or C2 infrastructure on low-bandwidth residential ISP networks with irregular TTLs. Some researchers advised exploring character n-grams of the URL's second-level domain (SLD). Other research applied DNS2Vec and Word2Vec to classify DGA domain names. I hope to continue experimenting with these methods to understand which yields the fewest false negatives.
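Of those suggestions, character n-gramming of the SLD is the easiest to prototype with scikit-learn. The sketch below builds 2-4 character n-gram counts feeding a simple logistic regression; it illustrates the technique only (the toy domains and labels are made up) and is not the pipeline of any of the cited papers.

```python
# Sketch: character n-gram (2-4) features over second-level domains, fed to a
# simple classifier. Toy data and labels are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

slds = ["tabledistrict", "qkfgkpxmrdtwvhyz", "google", "rjbvmxlqwhz"]  # toy examples
labels = [1, 1, 0, 1]  # 1 = DGA, 0 = benign (toy labels)

pipeline = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 4)),  # counts of 2-4 char n-grams
    LogisticRegression(max_iter=1000),
)
pipeline.fit(slds, labels)
print(pipeline.predict(["facebook", "xkqjwzvbnmlp"]))
```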
The path of least resistance was the well-beaten path for a Security Analyst: develop a script to perform WHOIS and DNS lookups to obtain side information about the false-negative URLs and explore the false-negative data manually. This seemed a trivial way to increase knowledge of the false-negative data; however, absent domain registration, WHOIS and DNS lookups are meaningless, because the domain does not exist and there is nothing to look up or query.
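A lookup script of the kind described above can be approximated in a few lines. The sketch below uses the python-whois and dnspython packages (tooling assumptions on my part, not necessarily what the experiment used); as noted, for never-registered DGA names both lookups simply come back empty or raise NXDOMAIN.

```python
# Sketch: gather WHOIS and DNS side information for a list of domains.
# Assumes python-whois (pip install python-whois) and dnspython (pip install dnspython).
import dns.resolver
import whois

def lookup(domain: str) -> dict:
    info = {"domain": domain, "registrar": None, "a_records": [], "nxdomain": False}
    try:
        info["registrar"] = whois.whois(domain).registrar
    except Exception:
        pass  # unregistered or rate-limited: no WHOIS record to read
    try:
        answers = dns.resolver.resolve(domain, "A")
        info["a_records"] = [r.to_text() for r in answers]
    except dns.resolver.NXDOMAIN:
        info["nxdomain"] = True  # expected for never-registered DGA names
    except Exception:
        pass
    return info

for d in ["example.com", "qkfgkpxmrdtwvhyz.mn"]:   # illustrative sample names
    print(lookup(d))
```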
I proceeded to take a small sample of the false-negative data and manually test for domain registrant and DNS response, only to find that all of the Necurs DGA domains had been placed under the administrative control of Microsoft. This was not surprising, given the underlying legal action and the court's order authorizing Microsoft to assume control of the Necurs bot infrastructure.
Whilst the machine learning experiment thus far appeared useful in predicting whether a given domain is DGA or non-DGA based on previously learned data sets (training data), academic publications reviewed during the course of this experiment highlighted some areas of concern. Peer-reviewed, published researchers indicated that systems relying on prior training can be evaded by DGA bots if the underlying algorithm or the learned features change.
RESULTS
After training, testing, retraining and re-testing, I conducted broad-stroke testing of the model across all the DGA malware families I could lay my hands on, approximately 44 families. The results convinced me beyond any reasonable doubt that the technology is far too powerful not to be leveraged in day-to-day SOC operations.
I tested 44 families of DGA malware. Of those, 29 families were detected with 100% accuracy, while 15 families could not be detected with 100% accuracy. SuppoBox [see left] and Matsnu [below] showed 100% error: these domain names are formed using only dictionary words.
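A per-family breakdown like the one above can be produced by grouping test predictions by malware family. A minimal sketch follows; the DataFrame column names and the toy rows are assumptions about how a labelled feed might be organized, not the experiment's actual data.

```python
# Sketch: per-DGA-family detection rate from a labelled prediction table.
# Column names ('family', 'predicted_dga') and the rows are illustrative assumptions.
import pandas as pd

results = pd.DataFrame({
    "family":        ["necurs", "necurs", "suppobox", "matsnu", "suppobox"],
    "predicted_dga": [1,        1,        0,          0,        0],
})

per_family = results.groupby("family")["predicted_dga"].mean().sort_values()
print(per_family)   # detection rate per family; dictionary-word families sit at the bottom
```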
Further exploration, thoughts and ideas generated
At the conclusion of the PoC, I was perplexed. There are more questions, and several avenues to explore:
- There is no single, simple or straightforward way to choose an ML algorithm. If the end goal is to minimize false negatives to zero (ROC = 1), what algorithm(s) should be chosen?
- What role do domain experts play when machines can identify interesting features from data sets? Is it more efficient to use machines to perform automated feature engineering?
- What kind of tool chain, infrastructure and instrumentation should one prepare to execute such projects? Which packages are good for data exploration and analysis? What tech stack would be optimal? This experiment was run on AWS.
- How does one continually monitor ML model accuracy? Model accuracy drifted at various times during this experiment. What tools and approaches are needed to monitor ML models?
- Should such security/threat intelligence predictive analytics tools be threat modeled to detect/prevent model tampering? The Alexa 1M list from the open internet was used to train the model, and some open data was used to test it. What if someone co-mingles such open sources with garbage?
- At what point should such an exercise focus on improving the algorithms, and when should the focus shift to putting better data into the algorithms, given that the mission's success criterion is zero error?
- How does one separate data security (or privacy) from ML model secrecy? Data owners or ML model users may not wish to share the malware domain names, or the data may be sensitive. How can the data and the ML model be isolated from each other?
Real World Applications and Opportunity
This PoC highlights the false-negative data exploration challenges facing Information Security practitioners who apply predictive analytics to threat prediction. Some of the issues discussed and identified could be used to develop either a better algorithm or a better process for exploring false-negative data, perhaps with a domain expert in the loop.
Traditional cybersecurity teams may not have all the tools and instrumentation needed for conducting ad-hoc deep learning experiments. Consumers of products like the Splunk Machine Learning Toolkit (MLTK) report difficulty feeding data to MLTK and operationalizing it for DGA domain prediction. These are some of the barriers to operationalizing machine learning in a SOC environment.
To operationalize ML for cybersecurity, one idea that comes to mind is to develop a centrally maintained [and owned] custom ML-powered API for DGA prediction. Multiple models could be developed, updated and made available to enterprise users for predictive analytics via a REST API. For example, a custom Splunk lookup command could make it easier for Security Analysts to perform lookups against DNS/URL data. Additional reputation, WHOIS, NXDOMAIN and DNS response data could be fused into such an infrastructure to improve prediction accuracy and risk ranking.
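One way such a central prediction service could look is a thin REST wrapper around a saved model. The sketch below uses FastAPI (a tooling assumption, not a recommendation from the experiment); the saved model file name, the character encoding and the 0.5 threshold are illustrative and carried over from the earlier hypothetical training sketch.

```python
# Sketch: a thin REST endpoint for DGA prediction around a saved model.
# FastAPI/uvicorn, the model file name, and the 0.5 threshold are illustrative choices.
from fastapi import FastAPI
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 63
model = load_model("dga_classifier.h5")   # hypothetical model saved after training
app = FastAPI()

def encode(domain: str):
    return pad_sequences([[min(ord(c), 128) for c in domain.lower()]], maxlen=MAX_LEN)

@app.get("/predict")
def predict(domain: str):
    score = float(model.predict(encode(domain))[0][0])
    return {"domain": domain, "dga_score": score, "is_dga": score >= 0.5}

# Run with:  uvicorn dga_api:app --port 8000
# Then e.g.: curl "http://localhost:8000/predict?domain=qkfgkpxmrdtwvhyz.mn"
```

A SIEM lookup command or enrichment pipeline could then call this endpoint per domain, keeping the model and its training data behind the API boundary.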
Recommendations
Machine learning technology promises to transform threat detection and threat hunting at scale, but Security Analysts should not be dazzled by the predictive powers of machine learning gizmos. One caveat practitioners should be aware of is the complex problem of false-negative data.
Several dependencies affect the accuracy and reliability of an ML system, including the choice of ML algorithm, the training data and other aspects that may be beyond the understanding of the Security Analyst. Hence the need to apply human intelligence and knowledge to verify the predictions, or alternatively to design a system in which a human expert can provide inputs to correct and improve the machine intelligence.
Certain network or application security machine-data exploration problems may require interdisciplinary teams comprising applied mathematicians, statisticians, data scientists, computational linguists, information scientists and cybersecurity scientists to meet the precision and scientific rigor required for targeted threat hunting at scale with data involving natural language.
The complex problem of false-negative data exploration in an ML-powered SOC environment is exacerbated by the overheads of the ML technology itself. ML is only as good as the data fed to it and the level and state of model maintenance; some of these challenges can contribute to false-negative data and go unnoticed. Hence, to minimize false negatives, a human in the ML loop may be an option, and organizations could additionally create automated or semi-automated workflows focused on false-negative data exploration.
SOC Analysts and SOC Engineers are encouraged to get familiar with applied aspects of deep learning technologies for defensive cyber missions.
Acknowledgement
I thank John Bambenek for the malware DGA domain data and for engaging discussions.
References
The news article that prompted this stoic pursuit.