Graphing the Future of Cybersecurity
Guarding Against the Surge
Cybercrime stands out as one of the most pressing challenges confronting enterprises, ranking among the fastest-growing crimes globally. An array of compelling facts and statistics underscores the urgency for organizational leaders to adopt a proactive approach to cybersecurity.
The cyber network encompasses diverse infrastructures, upper-level applications, endpoint clients, and users, forming a complex, interconnected environment. In the arms race between attackers and defenders, adversaries often bypass existing detection mechanisms by discovering new attack vectors, while defenders struggle to keep up with a ceaseless stream of vulnerabilities. Practitioners therefore have to reconsider traditional approaches and focus on devising more resilient strategies for countering both known and unforeseen threats.
If it were measured as a country, cybercrime would be the world’s third-largest economy after the U.S. and China. (Cybersecurity Ventures, 2022)
Many advancements in the realm of cybersecurity have gravitated towards graph technology, drawn by its innate ability to preserve relationships and deliver exceptional performance in terms of both efficiency and scalability. Although achieving 100% security remains challenging, the incorporation of flexible and powerful graph queries, algorithms, and eXplainable AI techniques has markedly bolstered the cybersecurity posture of organizations.
Addressing the Pain Points
Security data, such as logs and records, often arrives in large volumes and in unstructured formats. Handling such extensive unstructured data with tabular-oriented tools yields suboptimal results. While it is possible, it comes at a cost: integrating new data sources becomes challenging, data structuring and querying grow complex, and query performance suffers.
In short, the primary hurdle in cybersecurity analysis isn't a dearth of available data but rather the consolidation of diverse information from myriad sources into a cohesive model. Such a consolidated model enriches the comprehension of the cybersecurity landscape and supports real-time decision-making. Traditional relational databases prove inadequate for this task.
Graph Technology for Cybersecurity
Graph databases like Ultipa streamline the storage, querying, and analysis of unstructured data, even as data volumes quickly expand. In the context of cybersecurity, nodes represent entities like system objects, products, vulnerabilities, devices and users, while edges signify the relationships, interactions or the flows connecting these entities.
We will delve into several concrete scenarios to illustrate the transformative power of graph technology in bolstering cybersecurity measures.
Scenario 1: Cyber-attack Identification on Network Flow
Background
From a security perspective, the vast majority of devices within organizations are connected and communicate over a network. By monitoring network traffic, we can extract network flow records, typically collected by devices like routers. These records capture sequences of packets exchanged between internet hosts within specific time intervals.
In recent times, security analysts have recognized the significant potential of network flow data as a robust foundation for security frameworks. It empowers the detection of suspicious or malicious activities that could indicate a cyber-attack.
Let's use bi-directional network flows as an example. In a bi-directional flow, data packets are simultaneously exchanged between two devices — from the source device to the destination device and vice versa. Such bi-directional flows are common in various network communication settings like web browsing, email exchange, and real-time chat applications. For instance, when you request a webpage from a web server, the server responds by sending data back to your device, establishing a bi-directional flow.
Graph-based Modelling
Leveraging graph-based models for network flow data [1] offers a holistic view of the network (Figure 2). To begin, we model each bi-directional network flow record as follows: a flow edge connects the source and destination device nodes, directed from the source to the destination (Figure 1).
Some properties are associated with each node and edge:
Flow-based Intrusion Detections
Flow-based techniques offer several advantages and are successful at detecting various malicious network behaviors. The number of packets and bytes moved in each communication is sufficient for detecting many attacks, especially when backed by a graph database.
For example, a DDoS (Distributed Denial of Service) attack is a malicious attempt to disrupt the functioning of a targeted service by overwhelming it with a flood of internet traffic, making it unable to respond to legitimate requests. Thus, when a device receives an unusually high volume of packets or bytes within a specified short time period, it can be an indicator of potential DDoS attacks.
Setting up a DDoS alarm can be achieved with ease through the Degree Centrality algorithm. Below is the UQL to compute the in-degree of nodes weighted by the edge property srcPackets within a specific timeframe. It identifies devices that have received a sum of packets surpassing the predefined threshold of 10,000 during this period (Figure 3).
algo(degree).params({
  direction: 'in',
  edge_schema_property: 'srcPackets'
}).edge_filter({time <=> [5378315,5378530]}).stream() as re
where re.degree > 10000
find().nodes({_uuid == re._uuid}) as dev
return table(dev._id, re.degree)
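To make the logic of this alarm concrete outside of UQL, here is a minimal Python sketch of the same idea: sum the srcPackets of all incoming flow edges within a time window and flag devices whose total exceeds the threshold. The flow records, device names, and function names are hypothetical, and treating the time range as inclusive on both ends is an assumption.

```python
from collections import defaultdict

# Hypothetical flow records: (source, destination, time, srcPackets).
flows = [
    ("dev-A", "dev-X", 5378320, 6000),
    ("dev-B", "dev-X", 5378400, 5500),
    ("dev-C", "dev-Y", 5378410, 300),
    ("dev-A", "dev-X", 5379000, 9000),  # outside the window, ignored
]

def weighted_in_degree(flows, t_start, t_end):
    """Sum srcPackets over incoming edges whose time lies in [t_start, t_end]."""
    totals = defaultdict(int)
    for _src, dst, time, src_packets in flows:
        if t_start <= time <= t_end:  # inclusive bounds are an assumption
            totals[dst] += src_packets
    return dict(totals)

def ddos_candidates(flows, t_start, t_end, threshold=10_000):
    """Return devices whose weighted in-degree exceeds the threshold."""
    degrees = weighted_in_degree(flows, t_start, t_end)
    return {dev: total for dev, total in degrees.items() if total > threshold}

alerts = ddos_candidates(flows, 5378315, 5378530)
print(alerts)
```

With the sample records, only dev-X is flagged, because the two flows it receives inside the window add up to 11,500 packets.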
Port scanning is a network reconnaissance technique used by attackers to discover open ports and services on a target system. It’s an essential step in the early stages of many cyberattacks, as it helps attackers identify potential vulnerabilities. Port scanning activities can be detected by monitoring network traffic for suspicious patterns, such as a high rate of connection attempts to different ports of a device from a single source.
With UQL, port scanning detection can be accomplished through a very basic pathfinding template written as n().re().n(), where n() stands for nodes and re() signifies outgoing edges. We set a filter for the edges specifying a time range. After grouping the query results by the source and destination devices, we count the distinct dstPort values and return results where the number of distinct ports exceeds 100 (Figure 4). This approach allows for the identification of potentially malicious port scanning behavior in the network traffic data.
n(as src).re({time <=> [5380630,5380690]} as com).n(as dst)
group by src, dst
with count(distinct(com.dstPort)) as portNo
where portNo > 100
return table(src._id, dst._id, portNo)
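The grouping and distinct-count step of this query can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than a description of how Ultipa executes the query: the flow records, addresses, and function name are hypothetical, and the time range is treated as inclusive.

```python
from collections import defaultdict

# Hypothetical flow records: (source, destination, time, dstPort).
# One source probes 150 different ports of the same destination.
flows = [
    ("10.0.0.5", "10.0.0.9", 5380630 + i % 60, port)
    for i, port in enumerate(range(1, 151))
]
# A second source repeatedly contacts a single port, which is normal traffic.
flows += [
    ("10.0.0.7", "10.0.0.9", 5380640, 443),
    ("10.0.0.7", "10.0.0.9", 5380650, 443),
]

def port_scan_suspects(flows, t_start, t_end, min_ports=100):
    """Count distinct destination ports per (source, destination) pair."""
    ports = defaultdict(set)
    for src, dst, time, dst_port in flows:
        if t_start <= time <= t_end:  # inclusive bounds are an assumption
            ports[(src, dst)].add(dst_port)
    return {pair: len(p) for pair, p in ports.items() if len(p) > min_ports}

suspects = port_scan_suspects(flows, 5380630, 5380690)
print(suspects)
```

Using a set per device pair means repeated connections to the same port are counted once, which is what distinguishes a scan from ordinary heavy traffic to a single service.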
Scenario 2: User Abnormal Behavior Detection
Background
Abnormal user or account behaviors encompass a range of activities such as excessive login failures, anomalous access times, privilege escalation, abnormal data access and transfers. These deviations can serve as indicators of potential network attacks or unauthorized access.
Many organizations employ Red Teams to pinpoint vulnerabilities and scrutinize their security posture. Red teams utilize diverse tactics, techniques, and procedures (TTPs) to simulate real-world cyber threats. Their intent is not to cause harm but to furnish actionable feedback for the enhancement of detecting, countering, and mitigating security threats. Following a red team exercise, the organization's security team can leverage the insights gained to fortify the security measures. This proactive approach equips organizations to better withstand genuine cyber threats, diminishing the likelihood of successful cyberattacks.
Graph-based Modelling
By modeling a comprehensive, multi-source cybersecurity events dataset [2] into a heterogeneous graph, as illustrated in Figure 5, we can reveal the intricate connections among all the network entities:
Detections and Investigations of Cybersecurity
In the event that a user's account is compromised and used for malicious activities, like brute force attacks, it often involves multiple failed logons on various devices as the attacker tries to gain unauthorized access. The detection of such patterns (Figure 6) is crucial for identifying potential security breaches.
Here's the UQL query to uncover users with more than 10 failed logon attempts to multiple computers within a specific 600-second timeframe:
n({@user} as u).e({@authFrom}).n({orientation == 'LogOn' && result == 'Fail' && time <=> [150000,150600]}).e({@authDst}).n({@computer} as c) as p
group by u
where count(p) > 10 && count(c) > 1
return u{*}
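For readers who want to see the detection rule without the graph syntax, the following Python sketch applies the same two conditions to a flat list of authentication events: more than 10 failed logons in the window, spread over more than one computer. The event tuples, user and computer names, and the function name are hypothetical; counting distinct computers follows the prose description ("multiple computers") and is an assumption about the intended semantics.

```python
from collections import defaultdict

def suspicious_users(events, t_start, t_end, min_failures=10):
    """Users with more than min_failures failed logons on more than one computer."""
    failures = defaultdict(list)
    for user, computer, time, orientation, result in events:
        if orientation == "LogOn" and result == "Fail" and t_start <= time <= t_end:
            failures[user].append(computer)
    return sorted(
        user
        for user, computers in failures.items()
        if len(computers) > min_failures and len(set(computers)) > 1
    )

# Hypothetical events: (user, computer, time, orientation, result).
events = [("U1", f"C{i % 3}", 150000 + i * 10, "LogOn", "Fail") for i in range(12)]
# Many failures, but all on one computer: not matched by this rule.
events += [("U2", "C1", 150000 + i, "LogOn", "Fail") for i in range(20)]
events += [("U3", "C2", 150100, "LogOn", "Success")]

flagged = suspicious_users(events, 150000, 150600)
print(flagged)
```

In this sample, U1 is flagged (12 failures across three computers), while U2 is not, because all of its failures target a single machine.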
By executing this query, we identify potentially suspicious users who have exhibited such behavior (Figure 7). Further investigation can be conducted to understand the nature and extent of their login attempts. Intuitive visualization (Figure 8) can aid in assessing the scope and severity of the security incidents.
With the red team data revealing malicious behaviors or attacks, post-event investigations are essential for uncovering the attack paths and extracting valuable insights. However, investigation is a challenging undertaking that requires a deep understanding of the attack vectors and the capability to trace back through the network, reconstructing the attack timeline. Next, we’ll see an illustrative example of a graph-based model depicting user authentication behavior [3].
In the dataset, the logs of red team attacks are originally given as in Figure 9. We’ve chosen to focus the investigation on the malicious user U7394@DOM1.
First, filter the authentication logs to center on those having U7394@DOM1 as the source user. A total of 366 authentication logs are identified, spanning from time 1 to 767,320 (Figure 10).
Following the filtering of authentication logs, dependencies among logs are built based on the time to identify successive authentications. This enables the creation of a behavior graph for user U7394@DOM1, structured as follows:
To uncover what happened before a malicious event, we can extract paths from the behavior graph that lead to an attack-related authentication. We use the malicious authentication with ID "auth48" as an example, and we limit the length of paths to a maximum of 7:
n().re({_uuid <= 361})[:6].n({_id == "auth48"}) as p
return p{*}
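Conceptually, this query walks backwards from the malicious authentication along the dependency edges, up to a bounded depth. Below is a minimal Python sketch of that backward search on a toy behavior graph. Apart from "auth48", all node IDs are hypothetical, and restricting the search to simple paths (no repeated nodes) is an assumption of this sketch rather than a statement about UQL path semantics.

```python
def paths_into(edges, target, max_edges):
    """Enumerate simple paths with 1..max_edges edges that end at target."""
    predecessors = {}
    for src, dst in edges:
        predecessors.setdefault(dst, []).append(src)

    results = []

    def walk(path):
        # path is kept in forward order; path[0] is the earliest node so far
        if len(path) > 1:
            results.append(path)
        if len(path) - 1 == max_edges:
            return
        for prev in predecessors.get(path[0], []):
            if prev not in path:  # keep paths simple (assumption)
                walk([prev] + path)

    walk([target])
    return results

# Hypothetical dependency edges between successive authentications.
edges = [
    ("auth45", "auth46"),
    ("auth46", "auth47"),
    ("auth47", "auth48"),
    ("auth44", "auth47"),
]

paths = paths_into(edges, "auth48", max_edges=2)
for p in paths:
    print(" -> ".join(p))
```

Bounding the depth keeps the result set manageable; in a dense behavior graph the number of paths grows quickly with each additional hop.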
Further analysis can then be conducted by professionals, utilizing the results (Figure 11) to pinpoint the attack paths. Among the various attack paths, one is highlighted in Figure 12. The sequence begins with the user successfully obtaining a Ticket Granting Service (TGS) from machine C743 through C3352. Notably, a cyclic pattern emerges, visible in the graph through bi-directional edges: LogOff from C2106, followed by LogOn, and subsequently another LogOff on the same machine. This repeated LogOff and LogOn pattern on the same machine over a brief period could be indicative of malicious activity. The user then proceeds to C1618 to acquire another TGS, which is followed by obtaining a Ticket Granting Ticket (TGT). Finally, the user leverages NTLM to carry out the attack on C492.
Scenario 3: Malware Detection in PE Files
Background
Malware (short for "malicious software") is crafted to harm digital systems or their users. Based on its behavior, malware is classified as viruses, worms, trojans, bots, ransomware and so on. Cybercriminals use malware to launch attacks like accessing unauthorized systems, disrupting services, and stealing confidential information. In 2022 alone, a staggering 5.5 billion malware attacks were recorded worldwide (source: Statista).
The Portable Executable (PE) file stands out as one of the most popular vectors for malware. A PE file is used in Windows OS for the storage and execution of programs, dynamic-link libraries (DLLs), and others.
For a long time, the most significant line of defense against malware attacks has been anti-malware software products. Traditionally, these products relied heavily on signature-based detection. However, attackers easily evade this approach using encryption and code obfuscation. Some anti-malware products have adapted to monitor the kernel behaviors of software. Nonetheless, this solution inherently entails high costs and scalability challenges.
In response to the ever-evolving landscape of malware attacks, some studies have emerged that leverage both file content (e.g., the extracted API calls) and file relationships (e.g., whether extracted APIs belong to the same DLL) to characterize and detect malware [4].
Graph-based Modelling
A heterogeneous information network (HIN) can be constructed to model different types of entities and relationships (Figure 13). The HIN offers a structured approach for analyzing intricate relationships among diverse entities, aiming to make it more challenging for attackers to bypass.
There are 5 types of relationships between the 5 types of entities:
Furthermore, each node or edge schema is enriched with properties that provide diverse descriptions of the associated entity or relationship, adding greater semantic depth to the HIN. For instance, the file node has a label property with values "benign", "malware" or "unknown"; the archive node possesses a source property detailing its origin; the create edge includes a createdOn property to record the timestamp of when the file was created.
More Expressive Querying Techniques
Let’s determine whether the newly collected file, File-U, is malicious or not within the example graph (Figure 14).
To begin, we can compare files solely based on their extracted API calls. This can be accomplished by employing similarity graph algorithms like Jaccard Similarity. The Jaccard similarity algorithm considers the immediate neighborhood sets of nodes, and computes the similarity score of a pair of nodes by dividing the size of the intersection of their two neighborhood sets by the size of their union.
By applying the algorithm to the subgraph consisting of file and API nodes (along with their interconnections, i.e., include edges), we identify File-B1 as the closest match, which is labeled as "benign" (Figure 15). This inference suggests that File-U is likely benign as well.
algo(similarity).params({
  ids: 'File-U',
  type: 'jaccard',
  top_limit: 1
}).stream().node_filter({@file || @API}) as re
find().nodes({_uuid == re.node2}) as n
return table('File-U', n._id, n.label, re.similarity)
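As a quick illustration of the score itself, the sketch below computes Jaccard similarity (size of the intersection divided by size of the union) between the API sets of a few files and picks the closest match for File-U. The API sets are hypothetical and chosen only to mirror the situation described in the text; they are not taken from the example graph.

```python
def jaccard(a, b):
    """Jaccard similarity: |A intersect B| / |A union B| (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical API call sets per file (the file's neighborhood in the graph).
file_apis = {
    "File-U": {"SetDoubleClickTime", "GetMessage", "CreateWindow"},
    "File-B1": {"GetMessage", "CreateWindow", "ShowWindow"},
    "File-M1": {"SetTimer", "WriteFile"},
}

def most_similar(target, file_apis, top=1):
    """Rank all other files by Jaccard similarity to the target file."""
    scores = [
        (other, jaccard(file_apis[target], apis))
        for other, apis in file_apis.items()
        if other != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top]

best = most_similar("File-U", file_apis)
print(best)
```

Here File-B1 shares two of four APIs with File-U (score 0.5), while File-M1 shares none (score 0), so a purely neighborhood-based comparison points to the benign file.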
In fact, File-U is a malicious file associated with File-M1, which bears the label "malware". Although File-U and File-M1 have no common API calls, resulting in a score of 0 in the similarity analysis, their API calls share inherent connections. Specifically, the API SetTimer called by File-M1 and the API SetDoubleClickTime called by File-U belong to the same DLL, USER32.DLL, and these two APIs are called together by another "malware" file, File-M2. Catching sly malware like File-U calls for more complex queries.
Hence, we can establish that two files are related if (1) they have APIs belonging to the same DLL and (2) these APIs are jointly called by any other malicious file. Within Ultipa Graph, this type of query can be easily expressed by the subgraph template (Figure 16).
By executing this subgraph template query in a Shortcut in Ultipa Manager (Figure 17), File-M1 and File-M2 are successfully identified!
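The relatedness rule described above (APIs in the same DLL, jointly called by another malicious file) can also be sketched as a brute-force search in Python. This is an illustrative approximation, not the subgraph template itself: the data structures and function name are hypothetical, File-B1's API is made up for contrast, and only the facts about SetTimer, SetDoubleClickTime, and USER32.DLL come from the text.

```python
# Hypothetical miniature of the example graph.
file_apis = {
    "File-U": {"SetDoubleClickTime"},
    "File-M1": {"SetTimer"},
    "File-M2": {"SetTimer", "SetDoubleClickTime"},
    "File-B1": {"CreateFileW"},
}
api_dll = {
    "SetTimer": "USER32.DLL",
    "SetDoubleClickTime": "USER32.DLL",
    "CreateFileW": "KERNEL32.DLL",
}
labels = {
    "File-U": "unknown",
    "File-M1": "malware",
    "File-M2": "malware",
    "File-B1": "benign",
}

def related_files(target, file_apis, api_dll, labels):
    """Return (related_file, bridging_malware_file) pairs for the target file."""
    matches = set()
    for api_t in file_apis[target]:
        for other, other_apis in file_apis.items():
            if other == target:
                continue
            for api_o in other_apis:
                if api_dll[api_o] != api_dll[api_t]:
                    continue  # condition (1): same DLL
                for bridge, bridge_apis in file_apis.items():
                    if bridge in (target, other):
                        continue
                    # condition (2): a third, malicious file calls both APIs
                    if labels[bridge] == "malware" and {api_t, api_o} <= bridge_apis:
                        matches.add((other, bridge))
    return matches

result = related_files("File-U", file_apis, api_dll, labels)
print(result)
```

On this toy data, File-U is linked to File-M1 through File-M2, even though File-U and File-M1 share no API calls directly.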
The utilization of HINs and advanced querying techniques holds promise for improving malware detection, especially in the face of increasingly sophisticated threats, by considering not only individual file attributes but also their intricate relationships and behaviors within a broader network context.
Enhancing Malware Detection with Machine Learning
Machine Learning (ML) algorithms offer a powerful means to create automated malware detection systems capable of accurately distinguishing between benign and malicious files. These techniques also significantly boost detection speed and address the evolving threat landscape.
Data representation is a pivotal step involving the transformation of data into a machine-readable format suitable for ML models. The graph embedding algorithms are widely used for this purpose, and many innovative algorithms tailored for HIN have been developed. These graph embedding algorithms convert the entities (PE files) within a graph into lower-dimensional vectors.
The relatedness among PE files can be characterized through a series of such subgraph templates, each referred to as a meta-graph. Figure 18 shows an abstract workflow for embedding all file nodes within the HIN. Unlike conventional algorithms like Node2Vec and Struc2Vec, which perform better in homogeneous graphs, this approach targets heterogeneous graphs. Here, the training data (corpus) is generated through random walks guided by various meta-graphs. Subsequently, the Skip-gram model produces node embeddings that capture distinct structural features. Given that different meta-graphs offer diverse perspectives on the relatedness among PE files, a multi-view fusion approach is employed to combine node embeddings derived from various meta-graph schemes. This strategy effectively explores the complementary nature of these perspectives.
After converting the nodes into vector embeddings, a diverse set of ML models can be employed to enhance malware detection capabilities. One commonly used model is K-Means Clustering. It enables the grouping of similar PE files based on the features extracted from their embeddings, thereby assisting in the identification of potentially malicious file clusters (Figure 19). This fusion of embeddings and ML models constitutes a potent strategy for fortifying cybersecurity defenses and proactively mitigating the ever-evolving threats posed by malware.
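To give a feel for how a guided random walk produces a training corpus, here is a simplified Python sketch. It uses a single linear meta-path (file, API, DLL, API, file, ...) instead of a full meta-graph, which is a deliberate simplification, and it stops before the Skip-gram step. The toy graph, the pattern, and the function name are assumptions made for illustration.

```python
import random

# Hypothetical toy HIN: node types and undirected edges.
node_type = {
    "File-U": "file", "File-M1": "file", "File-M2": "file",
    "SetTimer": "API", "SetDoubleClickTime": "API",
    "USER32.DLL": "DLL",
}
edges = [
    ("File-U", "SetDoubleClickTime"), ("File-M1", "SetTimer"),
    ("File-M2", "SetTimer"), ("File-M2", "SetDoubleClickTime"),
    ("SetTimer", "USER32.DLL"), ("SetDoubleClickTime", "USER32.DLL"),
]
adj = {node: [] for node in node_type}
for a, b in edges:
    adj[a].append(b)
    adj[b].append(a)

# Cyclic type pattern: file -> API -> DLL -> API -> file -> ...
PATTERN = ["file", "API", "DLL", "API"]

def guided_walk(start, walk_len, rng):
    """Random walk that only steps to neighbors of the type the pattern expects."""
    walk = [start]
    while len(walk) < walk_len:
        wanted = PATTERN[len(walk) % len(PATTERN)]
        candidates = [n for n in adj[walk[-1]] if node_type[n] == wanted]
        if not candidates:
            break  # dead end: no neighbor of the required type
        walk.append(rng.choice(candidates))
    return walk

rng = random.Random(42)  # fixed seed so runs are repeatable
files = [n for n, t in node_type.items() if t == "file"]
corpus = [guided_walk(f, 9, rng) for f in files for _ in range(3)]
for walk in corpus[:3]:
    print(walk)
```

Each walk can then be treated like a sentence and fed to a Skip-gram model, so that files which co-occur on walks through shared DLLs and APIs end up with nearby embeddings.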
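As a minimal illustration of the clustering step, the sketch below runs a plain K-Means (Lloyd's algorithm) over a handful of two-dimensional vectors. The embedding values are invented for the example, File-B2 is a hypothetical extra file, and using the first k points as initial centroids is a simplification chosen to keep the run deterministic; real embeddings would have many more dimensions and a library implementation would normally be used.

```python
def squared_distance(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c, p=p: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels

# Hypothetical 2-D embeddings of PE files (values are made up).
embeddings = {
    "File-B1": (0.10, 0.20),
    "File-M1": (0.90, 0.80),
    "File-B2": (0.20, 0.10),
    "File-M2": (0.80, 0.90),
    "File-U": (0.85, 0.85),
}
names = list(embeddings)
labels = kmeans([embeddings[n] for n in names], k=2)
cluster_of = dict(zip(names, labels))
print(cluster_of)
```

In this toy setting File-U lands in the same cluster as the two files labeled "malware", which is the kind of signal an analyst would then investigate further.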
Ultipa for Cybersecurity
The rapid advancement of technology has ushered in both convenience and peril, placing businesses in an era that demands a stronger shield against cybercrimes than ever before. To navigate this challenging landscape, it has become imperative for organizations to fortify their defenses against these threats. Leveraging graph databases like Ultipa can be the key to establishing an exceptional data infrastructure that not only safeguards your network, information, and financial assets but also empowers you with the insights needed to proactively thwart cyberattacks and ensure the resilience of your digital ecosystem.
References
[1] Unified Host and Network Data Set, M. Turcotte, A. Kent, C. Hash, in Data Science for Cyber-Security (2018)
[2] Comprehensive, Multi-Source Cybersecurity Events, A. D. Kent, Los Alamos National Laboratory (2015)
[3] Graph-based malicious login events investigation, F. Amrouche, S. Lagraa, G. Kaiafas, R. State (2019)
[4] Gotcha - Sly Malware! Scorpion: A Metagraph2vec Based Malware Detection System, Y. Fan, S. Hou, Y. Zhang, Y. Ye, M. Abdulhayoglu (2018)
Written by: Pearl Cao
Download a PDF version of this article at https://www.ultipa.com/solutions/cybersecurity