Big Data Algorithms in Cybersecurity
Dipto Chakravarty
Chief Product Officer at Cloudera | Ex Amazon | Board Member | CPO | CEO
This post discusses a few specific elements of supervised machine learning in 5 steps for categorizing threat vectors in 4 ways, using the 3 V’s of threat analytics against 2 types of common threats. This year, it is about Big Algorithms (less about Big Data) working in concert with the right machine learning models that train the myriad security systems to detect, correct, respond to and remediate threats.
Supervised machine learning is to network data as human learning is to the high school experience. It is about getting our computers to act instead of merely being programmed. We interpolate and extrapolate from past experiences to deal with unfamiliar situations. We use an ensemble of network-generated data (e.g., pcap), computer-generated data (e.g., syslog) and user-app-generated data (e.g., session logs) to make decisions that lead to taking actions. Our decisions are descriptive, diagnostic or predictive depending on whether the prior knowledge concerns “what happened”, “why did it happen” or “what will happen”. Machine learning (ML), combined with data analysis, models this behavior at massive scale, and it has significant applicability in the fields of digital forensics, cybersecurity and information assurance.
Human learning is what we understand best and it continues to be the best form of learning. One way to look at machine learning is human-level artificial intelligence that can be applied broadly.
5-Step Supervised Learning
Machine learning, like human learning, is a sequential process.
1. Define the problem – e.g., determine whether a domain is safe or malicious
2. Harvest the data set – e.g., curate the pertinent data set to build better models
3. Create a capability – e.g., build a new capability to detect novel data patterns
4. Validate the model – e.g., combine the capabilities created to form a model
5. Operationalize – e.g., use this model to make decisions; through continuous use the model gets trained (and becomes smarter at making predictions over time).
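The five steps above can be sketched end to end. This is a minimal toy illustration, not a production approach: the lexical features (name length, digit ratio, character entropy) and the nearest-centroid classifier are my own hypothetical choices, and the tiny labeled domain list is made up.

```python
import math

# Step 1 -- Define the problem: label a domain "safe" or "malicious".

def features(domain):
    """Step 2 -- Harvest the data: turn a raw domain into numeric features."""
    name = domain.split(".")[0]
    digit_ratio = sum(c.isdigit() for c in name) / len(name)
    counts = {c: name.count(c) for c in set(name)}
    entropy = -sum(n / len(name) * math.log2(n / len(name))
                   for n in counts.values())
    return (len(name) / 20, digit_ratio, entropy / 5)  # crudely scaled

def train(labeled):
    """Steps 3/4 -- Create and validate a capability: per-class centroids."""
    centroids = {}
    for label in {lbl for _, lbl in labeled}:
        rows = [features(d) for d, lbl in labeled if lbl == label]
        centroids[label] = tuple(sum(col) / len(col) for col in zip(*rows))
    return centroids

def predict(centroids, domain):
    """Step 5 -- Operationalize: classify new domains by nearest centroid."""
    f = features(domain)
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(f, c))
    return min(centroids, key=lambda lbl: dist(centroids[lbl]))

training_set = [
    ("google.com", "safe"), ("amazon.com", "safe"), ("github.com", "safe"),
    ("x9kq2f8z31vm.biz", "malicious"), ("a8b7c6d5e4f3g2.info", "malicious"),
]
model = train(training_set)
print(predict(model, "netflix.com"))         # safe
print(predict(model, "qq17xk93zp0aa4.top"))  # malicious
```

Retraining `model` as newly labeled domains arrive is what makes the loop in step 5 continuous.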
Supervised machine learning, together with data science and big data, has proven its success in applications ranging from recommendation systems (like Amazon and Netflix) to voice recognition systems (like Apple Siri and Microsoft Cortana), and is now commonly used in cybersecurity to hunt threats.
In cybersecurity and information assurance, we need data models to have predictive power to implicitly distinguish between normal benign network traffic and abnormal, potentially malicious traffic that can be an indicator of a cyber-attack vector. This is where machine learning is being extensively used to build classifiers, as the goal of the models is to provide a binary response (e.g., good or bad) to the network traffic being analyzed.
4 Phases of Learning
Supervised machine learning has gone through four phases of evolution: Collect / Analyze / Predict / Prescribe. These steps were initially in silos in cybersecurity because these ecosystems were built from the bottom up — experimenting with data, tools and choices — and building a set of practices and competencies around these disciplines.
The question is often asked: how much data is enough? To give an idea of how much data needs to be processed, a medium-size network with 20,000 devices (servers, laptops, phones) transmits more than 50 TB of data in a 24-hour period. That means nearly 600 MB of it must be analyzed every second to detect cyber-attacks, targeted threats and malware attributed to malicious users. Dealing with such volumes of data in real time poses difficult challenges, so one needs models that can detect cyber-attacks while minimizing both false positives (false alarms) and false negatives (failing to detect real threats).
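The back-of-the-envelope arithmetic behind that throughput figure, using the assumed numbers above (20,000 devices, 50 TB per 24 hours):

```python
# Sustained analysis rate implied by 50 TB of telemetry per day.
TB = 10 ** 12
bytes_per_day = 50 * TB
seconds_per_day = 24 * 60 * 60          # 86,400 s

rate_bytes_per_s = bytes_per_day / seconds_per_day
print(f"{rate_bytes_per_s / 10**6:.0f} MB/s")   # ~579 MB/s sustained
```

Even at roughly 0.6 GB/s, a pipeline that falls behind for a few minutes accumulates a backlog of hundreds of gigabytes, which is why streaming (rather than batch) analysis dominates here.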
3 V’s of Cybersecurity
Big Data is being created at the rate of two quintillion bytes daily. So it is hard enough to find the haystack, let alone the needle in the haystack. Big data in a cybersecurity context typically comes from the following ten common sensor sources:
1. alerts,
2. events,
3. logs,
4. pcaps,
5. network flow,
6. threat feeds,
7. DNS captures,
8. web page text,
9. social activity,
10. audit trails.
Finding the patterns that describe big data analytics in a cybersecurity context requires the three V's: Volume, Variability and Velocity.
Volume:
Large quantities of data are necessary to build and test the models. The question is when is "large" large enough? Sample sizes are never large. If N (the sample size) is too small to get a sufficiently precise estimate, you need to get more data (or make more assumptions). But once N is “large enough,” you can start subdividing the data to learn more. N is never enough because if it were “enough” you’d already be on to the next problem for which you need more data.
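One way to see why "large enough" is slippery: the precision of an estimate shrinks only as the square root of the sample size. Assuming, hypothetically, that the quantity being estimated is the fraction p of malicious flows in a sample of N flows, the standard error of that estimate behaves like this:

```python
import math

def standard_error(p, n):
    """Standard error of an estimated proportion p from n samples."""
    return math.sqrt(p * (1 - p) / n)

p = 0.01  # hypothetical 1% malicious-traffic base rate
for n in (2_500, 10_000, 40_000):
    print(n, round(standard_error(p, n), 5))
```

Halving the uncertainty requires quadrupling N, which is exactly why each answered question ("is N large enough yet?") immediately raises a finer-grained one that needs more data.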
Variability:
In big data applications there are two types of data available: structured and unstructured. For cybersecurity-specific data science models, Variability refers to the range of values that a given feature can take in a data set. The importance of having data with enough variability when building cybersecurity models is often underestimated. Network deployments in organizations – businesses, government agencies and private institutions – vary greatly. Commercial network applications are used differently across organizations, and custom applications are developed for specific purposes. If the data sample on which a given model is tested lacks variability, the risk of an incorrect assessment of the model's performance is high. If a given machine learning model has been built properly (i.e., without "overtraining", which happens when the model picks up very specific properties of the data on which it has been trained), it should be able to generalize to "unseen" data.
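A minimal sketch of the overtraining failure mode, using made-up session records keyed by source host: a "model" that memorizes identifiers from its training data scores perfectly on that data but falls apart on traffic it has never seen, which is precisely what a low-variability test sample would hide.

```python
# Hypothetical labeled sessions: train on three hosts, test on two unseen ones.
train_data = {"hostA": "benign", "hostB": "benign", "hostC": "malicious"}
test_data  = {"hostD": "benign", "hostE": "malicious"}

def memorizer(host):
    """Overtrained: keys on an identifier instead of generalizable features."""
    return train_data.get(host, "benign")   # unseen hosts default to benign

train_acc = sum(memorizer(h) == y for h, y in train_data.items()) / len(train_data)
test_acc  = sum(memorizer(h) == y for h, y in test_data.items()) / len(test_data)
print(train_acc, test_acc)   # 1.0 on training data, 0.5 on unseen data
```

Evaluating only on data that resembles the training set would report the flattering 1.0; a held-out sample with real variability exposes the 0.5.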
Velocity:
If one has to analyze hundreds of millions of records and every single query to the data set requires hours, building and testing models would be a cumbersome and tedious process. Being able to quickly iterate through the data, modify some parameters in a particular model and quickly assess its performance are all crucial aspects of the successful application of data science techniques to cyber security.
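The standard remedy for slow iteration is to pay an indexing cost once so that every subsequent model-tuning query is cheap. A sketch with made-up flow records (the field names are hypothetical):

```python
from collections import defaultdict

records = [
    {"src": "10.0.0.1", "bytes": 120}, {"src": "10.0.0.2", "bytes": 9800},
    {"src": "10.0.0.1", "bytes": 450}, {"src": "10.0.0.3", "bytes": 70},
]

# Build the index once, in a single O(n) pass over the data...
by_src = defaultdict(list)
for r in records:
    by_src[r["src"]].append(r)

# ...then each tuning iteration queries it directly instead of
# rescanning the full record set.
total = sum(r["bytes"] for r in by_src["10.0.0.1"])
print(total)   # 570
```

At hundreds of millions of records the same idea shows up as pre-aggregated columnar stores, but the principle is identical: move the expensive work out of the iterate-and-test loop.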
Thus, volume, variability and velocity are essential characteristics of big data sets that have high relevance for applying data science to cyber security. Together these characteristics increase the "value" of data in data science for cyber security.
2 Types of Cyber Battles
Threats evolve every single day. As attack surfaces increase in business infrastructures, so does the diversity of cyber-attacks. The two broadest types of threats are
1. outside-in attacks, and
2. inside-out attacks.
In both types of threats, a combination of machine-based and human-based inputs is required before making a decision and taking an action. This is why the bad guys tend to win while the good guys (the defenders) are still analyzing the threat vectors.
Analytical tools in widespread use today are categorized into three groups based on their sophistication and ability to emulate the reasoning of a trained infosec analyst.
- Basic-level descriptive analytics, i.e., “what happened” – 25% ML-based findings, 75% reliance on the human analyst.
- Intermediate-level diagnostic analytics, i.e., providing context for “why did it happen” – 50% ML-based findings, 50% reliance on the human analyst.
- Advanced predictive analytics, i.e., “what is likely to happen” – 75% ML-based findings, 25% reliance on the human analyst.
1 Way to Secure Your Assets
The network defense of the future will consist of analytics-enhanced human operators interacting with the network. However, until then, one has to rely on ML plus humans to combat the threats.
Supervised ML is rapidly training computers (much as we train humans) to create better mousetraps for advanced threat vectors. As attack surfaces increase, so will the diversity of cyber-attacks. This post discussed the cybersecurity-specific basics of supervised machine learning (in 5 steps) to categorize the threats (in 4 ways) by understanding the 3 V’s of threat analytics, to safeguard the business against 2 types of common threats. At the end of the day it is about big algorithms (less about big data) working in concert with the right machine learning models to train the system to detect, correct, respond to and remediate the most advanced threats.