Clustering IOCs
There is something absolutely freeing about staring at the stars. The stars of the Milky Way can be seen without a telescope, yet most stars are not visible to the naked eye. Star clusters can be discovered by their gravitational influence: for example, the center of a cluster can be detected when stars circle around as if they were orbiting a very dense mass.
Honestly, indicators of compromise (IOCs) are not as enjoyable to stare at as stars. Like stars, though, the cyber threat intelligence (CTI) landscape offers millions of IOCs that may or may not influence each other. While grouping IOCs can seem like a pointless task at first glance, clustering automates that grouping and reveals valuable insights to CTI, SOC and IR teams.
Clustering IOCs was the subject of my research and is the topic of this article.
What's in it for me?
IOCs, especially those coming from OSINT, are noisy and unmanageable without some pre-processing. In addition to false positives, IOCs often lack context, which makes them difficult to act on. Finally, the priority of a CTI risk depends heavily on the likelihood and relevance of the IOCs to our organization and environment.
Background research
For my research, I used OSINT feeds, malware analysis reports and threat intelligence research papers.
To better understand what IOCs mean and how they correlate to an intrusion set, I looked at the most common tactics, techniques and procedures (TTPs) used by adversaries and seen across cyber campaigns.
IOC types and IOC enrichment
For my research I focused on three main IOC types - malicious hosts (IPv4), domains and malware samples (MD5).
Image 1: Example of a CTI report and its extracted IOCs. The main IOC types are malicious hosts (IPv4), domains and malware samples (MD5 hash).
Source - threatminer.org
To improve the feature selection process, I used the following data enrichments (sketched as records in the code below):
Malicious hosts (IPv4) - geolocation (country, region, city), service provider, domain, ASN
Domains - top-level domain, length, owner, application (e.g. php, java), target
Malware samples - malware type, hash value, file size, file name, target
Note - I used statistical techniques to handle missing data, since dropping samples can bias the results of the analysis.
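To make this concrete, below is a minimal sketch of how such enriched records might be structured. The field names are my own illustration, not a schema from any particular feed or vendor:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical enriched IOC records; field names are illustrative only.
@dataclass
class EnrichedHost:
    ip: str
    country: Optional[str] = None    # geolocation enrichment
    region: Optional[str] = None
    city: Optional[str] = None
    provider: Optional[str] = None   # service provider
    domain: Optional[str] = None     # reverse-lookup domain
    asn: Optional[int] = None

@dataclass
class EnrichedDomain:
    name: str
    tld: str = ""                    # top-level domain, e.g. "org"
    length: int = 0                  # character length (a DGA signal)
    owner: Optional[str] = None
    application: Optional[str] = None  # e.g. php, java
    target: Optional[str] = None

@dataclass
class EnrichedMalware:
    md5: str
    malware_type: Optional[str] = None  # e.g. trojan, ransomware
    file_size: Optional[int] = None
    file_name: Optional[str] = None
    target: Optional[str] = None
```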
Assembly
The true power of IOCs is realized when they are used in conjunction with one another to enable better contextual understanding for threat analysis. For the clustering algorithm, I defined the cluster object as an assembly.
Assembly (definition): An assembly is a group of IOCs and their extracted features that relate to each other either by association or by composition. For example, a link between IOCs and a specific APT would be defined as an assembly by association, while a link between IPs and their domain lookups would be defined as an assembly by composition. An assembly can include both association and composition relationships.
Image 2: Example of an assembly. An assembly can be illustrated as a node graph where the IOCs are the nodes and the relationship types are the edges.
Source: sqrrl.com
An IOC assembly could be the result of a malware analysis report. For example, linkage could be based on compilation timestamp, first-seen timestamp, C2 traffic or HTTP headers.
Image 3: Example of an assembly by association, based on a malware analysis report.
Source: cuckoosandbox.org
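As a minimal sketch, an assembly can be represented directly as a graph; I use networkx here as one convenient option, and the IOC values and relations below are made up for illustration:

```python
import networkx as nx

# Build a small assembly: IOCs are nodes, relationship types are edges.
assembly = nx.Graph()
assembly.add_node("a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6", kind="md5")
assembly.add_node("192.168.3.5", kind="ipv4")
assembly.add_node("example-c2[.]biz", kind="domain")

# Composition: the sample beacons to this host (seen in sandbox C2 traffic).
assembly.add_edge("a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6", "192.168.3.5",
                  relation="c2_traffic")
# Composition: the host resolves from this domain (DNS lookup).
assembly.add_edge("192.168.3.5", "example-c2[.]biz",
                  relation="domain_lookup")
# Association: the sample and the domain appear in the same report.
assembly.add_edge("a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6", "example-c2[.]biz",
                  relation="shared_report")

for u, v, data in assembly.edges(data=True):
    print(u, "--", data["relation"], "-->", v)
```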
Distance between assemblies
The appropriate distance function between assemblies depends on the individual data set and intended use of the results.
IPv4 distance - distance based on IPv4 geolocation (latitude and longitude) doesn't work well, due to the gap in size and scale between countries. A better distance measure should consider IP subnets. For example, 192[dot]168.3.5 is closer to 192[dot]168.7.9 than to 192[dot]166.3.5, based on the mutual prefix (from the left) between the IPs; a sketch of this prefix distance follows below.
Image 4: 3-D scatter plot where axes X and Y show the values of octets 3 and 4, and Z shows octets 1 and 2 combined. Octets 3 and 4 do not contribute to the IPv4 clusters.
Source: Nir Yosha
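Here is the prefix distance sketched in code: it counts the leading bits two addresses share, so a smaller value means a longer mutual prefix. This is my own formulation of the idea above, not a standard metric:

```python
import ipaddress

def ipv4_prefix_distance(a: str, b: str) -> int:
    """Distance = 32 minus the number of shared leading bits.
    0 means identical addresses; 32 means they differ in the first bit."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    # bit_length of the XOR is the position of the highest differing bit.
    return diff.bit_length()

# Mirrors the example in the text: the first pair shares a longer prefix.
print(ipv4_prefix_distance("192.168.3.5", "192.168.7.9"))  # 11: diverges in octet 3
print(ipv4_prefix_distance("192.168.3.5", "192.166.3.5"))  # 20: diverges in octet 2
```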
Domain distance - for my research, I looked at two types of malicious domains: domain generation algorithm (DGA) domains and misspelled domains. The distance of DGA domains from other domains can be defined by character length, since DGA domains are distinctively longer than other domains. For misspelled domains, a distance between strings can be used. There are a few ways to measure the distance between strings; in my research I used the Levenshtein distance, which counts the number of character deletions, insertions, or substitutions required to transform domain string A into domain string B (see the sketch below).
Image 5: DGA domain length is higher than that of other domains on average. Other domains are likely related based on Levenshtein distance. In addition, the top-level domain (TLD) can be used as a classifier (.org, .biz, .ru etc.)
Source: Nir Yosha
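The sketch below is a standard dynamic-programming implementation of the Levenshtein distance, written out here rather than pulled from a library:

```python
def levenshtein(a: str, b: str) -> int:
    """Number of character deletions, insertions, or substitutions
    needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# A misspelled domain sits close to its target; a DGA-looking string does not.
print(levenshtein("paypal", "paypa1"))         # 1
print(levenshtein("paypal", "xkqjzhgtrbwn"))   # 12
```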
Malware distance - since an MD5 hash has no fuzzy-matching characteristics, distance cannot be based on hash similarity. Based on the malware type and target enrichments, a dummy-coded distance can be used per category (same = 0, different = 1), as sketched below.
Image 6: An example of malware types by distribution. The categories have discrete distances. Likelihood could be used as a "risk distance" (see the next paragraph).
Source: pandasecurity.com
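A minimal sketch of the dummy-coded distance, assuming the enrichment step has already labeled malware type and target (the field names and hashes are illustrative):

```python
def category_distance(a: dict, b: dict, keys=("malware_type", "target")) -> float:
    """Discrete distance over categorical features:
    0 per matching category, 1 per differing one, averaged over the keys."""
    diffs = [0 if a.get(k) == b.get(k) else 1 for k in keys]
    return sum(diffs) / len(diffs)

sample_a = {"md5": "9e107d9d372bb6826bd81d3542a419d6",
            "malware_type": "trojan", "target": "windows"}
sample_b = {"md5": "e4d909c290d0fb1ca068ffaddf22cbd0",
            "malware_type": "trojan", "target": "android"}

print(category_distance(sample_a, sample_b))  # 0.5: same type, different target
```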
Risk-based clustering
To help prioritize the assembly clusters, we can look at two new dimensions that relate to risk: likelihood and relevance.
Likelihood - as an example, we can define likelihood over the entire dataset as a weighted sum of criterion scores, P(Assembly) = Σ W(i) × P(i), where:
P(Assembly) - a likelihood score (not an actual probability) based on frequency tables
W(i) - weights per criterion, based on domain knowledge or TAE
P(location) - likelihood score based on IP geolocation (country/city)
P(owner) - likelihood score based on ASN and service provider
P(malware) - likelihood score based on malware distribution
Relevance - as an example, we can define relevance based on the organization's priorities, again as a weighted sum, R(Assembly) = Σ W(i) × R(i), where:
R(Assembly) - a criteria score based on the organization's assets and vulnerabilities
W(i) - weights per IOC feature, based on domain knowledge or TAE
R(industry) - score by industry relevance (e.g. pharmacy, healthcare, finance)
R(platform) - score by platform relevance (e.g. Windows, Linux, Android)
R(application) - score by application relevance (e.g. Oracle, Office, Apache)
Image 7: Example of a likelihood score by owner, based on an IOC frequency table.
Source: Nir Yosha
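Tying the two dimensions together, here is a minimal sketch of a frequency-table likelihood and the weighted score. The dataset, weights and criteria are placeholders; only the owner criterion is shown, but location and malware tables would be built the same way:

```python
from collections import Counter

def likelihood_table(values):
    """Frequency table: how often each feature value appears in the dataset."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

# Illustrative owner (ASN) values observed across all IOCs; made-up data.
owner_likelihood = likelihood_table(["ASN-A", "ASN-A", "ASN-A", "ASN-B", "ASN-C"])

def assembly_score(features, tables, weights):
    """Weighted sum over criteria: Score = sum_i W(i) * P(feature_i)."""
    return sum(w * tables[k].get(features[k], 0.0) for k, w in weights.items())

score = assembly_score(features={"owner": "ASN-A"},
                       tables={"owner": owner_likelihood},
                       weights={"owner": 1.0})
print(round(score, 2))  # 0.6: "ASN-A" accounts for 3 of the 5 observations
```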
Conclusion
Unlike most other domains, threat intelligence datasets are a fast-moving target. Dataset poisoning, evasion techniques, obfuscation, jumping IPs and DGAs are just a few examples of what any machine learning model has to deal with.
Nevertheless, working on datasets that focus on a well-defined period of time and on clear business requirements can yield valuable insights:
Clusters by risk - based on likelihood and relevance scores, clusters will likely help with prioritization.
Courses of action - based on similar attack vectors and related malware families, clusters will likely require similar mitigations.
Attribution - based on similar source-based characteristics and motivations, clusters will likely help with attribution.
Machine learning for CTI is only in its infancy. CTI datasets keep growing exponentially, and clustering, along with other ML domains, will keep playing a role in understanding threats so that we can better protect our organizations.
Keep your eyes on the stars, and your feet on the ground.