Big Data Algorithms in Cybersecurity
Dipto Chakravarty
Chief Product Officer at Cloudera | Ex Amazon | Board Member | CPO | CEO
This post discusses a few specific elements of supervised machine learning in 5 steps for categorizing threat vectors in 4 ways, using the 3 V’s of threat analytics against 2 types of common threats. This year, it is about Big Algorithms (less about Big Data) working in concert with the right machine learning models that train the myriad security systems to detect, correct, respond to and remediate threats.
Supervised machine learning is to network data as human learning is to the high school experience. It is about getting our computers to act instead of merely being programmed. We interpolate and extrapolate from past experiences to deal with unfamiliar situations. We use an ensemble of network-generated data (e.g., pcap), computer-generated data (e.g., syslog) and user-app-generated data (e.g., session logs) to make decisions that lead to taking actions. Our decisions are descriptive, diagnostic or predictive depending on whether the prior knowledge concerns “what happened”, “why did it happen” or “what will happen”. Machine learning (ML), combined with data analysis, models this behavior at massive scale, and it has significant applicability in the fields of digital forensics, cybersecurity and information assurance.
Human learning is what we understand best and it continues to be the best form of learning. One way to look at machine learning is human-level artificial intelligence that can be applied broadly.
5-Step Supervised Learning
Machine learning, like human learning, is a sequential process.
1. Define the problem – e.g., determine whether a domain is safe or malicious
2. Harvest the data set – e.g., curate the pertinent data set to build better models
3. Create a capability – e.g., build a new capability to detect novel data patterns
4. Validate the model – e.g., combine the capabilities created to form a model
5. Operationalize – e.g., use this model to make decisions; through continuous use the model gets trained (and becomes smarter at making predictions over time).
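The five steps above can be sketched end to end. This is a minimal toy illustration, not a production approach: the lexical features (name length, digit ratio, character entropy) and the nearest-centroid classifier are my own hypothetical choices, and the tiny labeled domain list is made up.

```python
import math

# Step 1 -- Define the problem: label a domain "safe" or "malicious".

def features(domain):
    """Step 2 -- Harvest the data: turn a raw domain into numeric features."""
    name = domain.split(".")[0]
    digit_ratio = sum(c.isdigit() for c in name) / len(name)
    counts = {c: name.count(c) for c in set(name)}
    entropy = -sum(n / len(name) * math.log2(n / len(name))
                   for n in counts.values())
    return (len(name) / 20, digit_ratio, entropy / 5)  # crudely scaled

def train(labeled):
    """Steps 3/4 -- Create and validate a capability: per-class centroids."""
    centroids = {}
    for label in {lbl for _, lbl in labeled}:
        rows = [features(d) for d, lbl in labeled if lbl == label]
        centroids[label] = tuple(sum(col) / len(col) for col in zip(*rows))
    return centroids

def predict(centroids, domain):
    """Step 5 -- Operationalize: classify new domains by nearest centroid."""
    f = features(domain)
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(f, c))
    return min(centroids, key=lambda lbl: dist(centroids[lbl]))

training_set = [
    ("google.com", "safe"), ("amazon.com", "safe"), ("github.com", "safe"),
    ("x9kq2f8z31vm.biz", "malicious"), ("a8b7c6d5e4f3g2.info", "malicious"),
]
model = train(training_set)
print(predict(model, "netflix.com"))         # safe
print(predict(model, "qq17xk93zp0aa4.top"))  # malicious
```

Retraining `model` as newly labeled domains arrive is what makes the loop in step 5 continuous.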
Supervised machine learning, together with data science and big data, has proven its success in applications ranging from recommendation systems (like Amazon and Netflix) to voice recognition systems (like Apple Siri and Microsoft Cortana), and is now commonly used in cybersecurity to hunt threats.
In cybersecurity and information assurance, we need data models to have predictive power to implicitly distinguish between normal benign network traffic and abnormal, potentially malicious traffic that can be an indicator of a cyber-attack vector. This is where machine learning is being extensively used to build classifiers, as the goal of the models is to provide a binary response (e.g., good or bad) to the network traffic being analyzed.
4 Phases of Learning
Supervised machine learning has gone through four phases of evolution: Collect / Analyze / Predict / Prescribe. These steps were initially in silos in cybersecurity because these ecosystems were built from the bottom up — experimenting with data, tools and choices — and building a set of practices and competencies around these disciplines.
The question is often asked: how much data is enough? To give an idea of how much data needs to be processed, a medium-size network with 20,000 devices (servers, laptops, phones) transmits more than 50 TB of data in a 24-hour period. That means nearly 600 MB of it must be analyzed every second to detect cyber-attacks, targeted threats and malware attributed to malicious users. Dealing with such volumes of data in real time poses difficult challenges, so one needs models that can detect cyber-attacks while minimizing both false positives (false alarms) and false negatives (failing to detect real threats).
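The back-of-the-envelope arithmetic behind that throughput figure, using the assumed numbers above (20,000 devices, 50 TB per 24 hours):

```python
# Sustained analysis rate implied by 50 TB of telemetry per day.
TB = 10 ** 12
bytes_per_day = 50 * TB
seconds_per_day = 24 * 60 * 60          # 86,400 s

rate_bytes_per_s = bytes_per_day / seconds_per_day
print(f"{rate_bytes_per_s / 10**6:.0f} MB/s")   # ~579 MB/s sustained
```

Even at roughly 0.6 GB/s, a pipeline that falls behind for a few minutes accumulates a backlog of hundreds of gigabytes, which is why streaming (rather than batch) analysis dominates here.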
3 V’s of Cybersecurity
Big Data is being created at the rate of two quintillion bytes daily. So it is hard enough to find the haystack, let alone the needle in the haystack. Big data in a cybersecurity context typically comes from the following ten common sensor sources:
1. alerts,
2. events,
3. logs,
4. pcaps,
5. network flow,
6. threat feeds,
7. DNS captures,
8. web page text,
9. social activity,
10. audit trails.
Finding the patterns that describe big data analytics in a cybersecurity context requires the three V's: Volume, Variability and Velocity.
Volume:
Large quantities of data are necessary to build and test the models. The question is when is "large" large enough? Sample sizes are never large. If N (the sample size) is too small to get a sufficiently precise estimate, you need to get more data (or make more assumptions). But once N is “large enough,” you can start subdividing the data to learn more. N is never enough because if it were “enough” you’d already be on to the next problem for which you need more data.
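One way to see why "large enough" is slippery: the precision of an estimate shrinks only as the square root of the sample size. Assuming, hypothetically, that the quantity being estimated is the fraction p of malicious flows in a sample of N flows, the standard error of that estimate behaves like this:

```python
import math

def standard_error(p, n):
    """Standard error of an estimated proportion p from n samples."""
    return math.sqrt(p * (1 - p) / n)

p = 0.01  # hypothetical 1% malicious-traffic base rate
for n in (2_500, 10_000, 40_000):
    print(n, round(standard_error(p, n), 5))
```

Halving the uncertainty requires quadrupling N, which is exactly why each answered question ("is N large enough yet?") immediately raises a finer-grained one that needs more data.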
Variability:
In big data applications there are two types of data available: structured and unstructured. For cybersecurity-specific data science models, Variability refers to the range of values that a given feature can take in a data set. The importance of having data with enough variability when building cybersecurity models is often underestimated. Network deployments in organizations – businesses, government agencies and private institutions – vary greatly. Commercial network applications are used differently across organizations, and custom applications are developed for specific purposes. If the data sample on which a given model is tested lacks variability, the risk of an incorrect assessment of the model's performance is high. If a given machine learning model has been built properly (i.e., without "overtraining", which happens when the model picks up very specific properties of the data on which it has been trained), it should be able to generalize to "unseen" data.
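A minimal sketch of the overtraining failure mode, using made-up session records keyed by source host: a "model" that memorizes identifiers from its training data scores perfectly on that data but falls apart on traffic it has never seen, which is precisely what a low-variability test sample would hide.

```python
# Hypothetical labeled sessions: train on three hosts, test on two unseen ones.
train_data = {"hostA": "benign", "hostB": "benign", "hostC": "malicious"}
test_data  = {"hostD": "benign", "hostE": "malicious"}

def memorizer(host):
    """Overtrained: keys on an identifier instead of generalizable features."""
    return train_data.get(host, "benign")   # unseen hosts default to benign

train_acc = sum(memorizer(h) == y for h, y in train_data.items()) / len(train_data)
test_acc  = sum(memorizer(h) == y for h, y in test_data.items()) / len(test_data)
print(train_acc, test_acc)   # 1.0 on training data, 0.5 on unseen data
```

Evaluating only on data that resembles the training set would report the flattering 1.0; a held-out sample with real variability exposes the 0.5.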
Velocity:
If one has to analyze hundreds of millions of records and every single query to the data set requires hours, building and testing models would be a cumbersome and tedious process. Being able to quickly iterate through the data, modify some parameters in a particular model and quickly assess its performance are all crucial aspects of the successful application of data science techniques to cyber security.
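The standard remedy for slow iteration is to pay an indexing cost once so that every subsequent model-tuning query is cheap. A sketch with made-up flow records (the field names are hypothetical):

```python
from collections import defaultdict

records = [
    {"src": "10.0.0.1", "bytes": 120}, {"src": "10.0.0.2", "bytes": 9800},
    {"src": "10.0.0.1", "bytes": 450}, {"src": "10.0.0.3", "bytes": 70},
]

# Build the index once, in a single O(n) pass over the data...
by_src = defaultdict(list)
for r in records:
    by_src[r["src"]].append(r)

# ...then each tuning iteration queries it directly instead of
# rescanning the full record set.
total = sum(r["bytes"] for r in by_src["10.0.0.1"])
print(total)   # 570
```

At hundreds of millions of records the same idea shows up as pre-aggregated columnar stores, but the principle is identical: move the expensive work out of the iterate-and-test loop.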
Thus, volume, variability and velocity are essential characteristics of big data sets that have high relevance for applying data science to cyber security. Together these characteristics increase the "value" of data in data science for cyber security.
2 Types of Cyber Battles
Threats evolve every single day. As attack surfaces increase in business infrastructures, so does the diversity of cyber-attacks. The two broadest types of threats are
1. outside-in attacks, and
2. inside-out attacks.
In both types of threats, a combination of machine-based and human-based inputs is required before making a decision and taking an action. This is why the bad guys tend to win while the good guys (the defenders) are still analyzing the threat vectors.
Analytical tools in widespread use today are categorized into three groups based on their sophistication and ability to emulate the reasoning of a trained infosec analyst.
- Basic-level descriptive analytics, i.e., “what happened” – 25% ML-based findings, 75% reliance on the human analyst.
- Intermediate-level diagnostic analytics, i.e., providing context for “why did it happen” – 50% ML-based findings, 50% reliance on the human analyst.
- Advanced predictive analytics, i.e., “what is likely to happen” – 75% ML-based findings, 25% reliance on the human analyst.
1 Way to Secure Your Assets
The network defense of the future will consist of analytics-enhanced human operators interacting with the network. However, until then, one has to rely on ML plus humans to combat the threats.
Supervised ML is rapidly training computers (much as we train humans) to create better mousetraps for advanced threat vectors. As attack surfaces increase, so will the diversity of cyber-attacks. This post discussed the cybersecurity-specific basics of supervised machine learning (in 5 steps) to categorize the threats (in 4 ways) by understanding the 3 V’s of threat analytics, to safeguard the business against 2 types of common threats. At the end of the day it is about big algorithms (less about big data) working in concert with the right machine learning models to train the system to detect, correct, respond to and remediate the most advanced threats.