Graphing the Future of Cybersecurity

Guarding Against the Surge

Cybercrime stands out as one of the most pressing challenges confronting enterprises, ranking among the fastest-growing crimes globally. An array of compelling facts and statistics underscores the urgency for organizational leaders to adopt a proactive approach to cybersecurity.

The cyber network encompasses diverse infrastructures, upper-level applications, endpoint clients and users, forming a complex, interconnected environment. In the arms race between attackers and defenders, adversaries often bypass existing detection mechanisms by discovering new attack vectors, while defenders are worn down by a ceaseless stream of vulnerabilities. Practitioners therefore have to reconsider traditional approaches and focus on devising more resilient strategies for countering both known and unforeseen threats.

If it were measured as a country, cybercrime would be the world’s third-largest economy after the U.S. and China. (Cybersecurity Ventures, 2022)

Many advancements in the realm of cybersecurity have gravitated towards graph technology, drawn by its innate ability to preserve relationships and deliver exceptional performance in terms of both efficiency and scalability. Although achieving 100% security remains challenging, the incorporation of flexible and powerful graph queries, algorithms, and eXplainable AI techniques has markedly bolstered the cybersecurity posture of organizations.

Addressing the Pain Points

Security data, such as logs and records, often arrives in large volumes and in unstructured formats. Tabular-oriented tools can handle such data, but with suboptimal results and at a cost: integrating new data sources becomes challenging, data structuring and querying grow complex, and query performance suffers.

In short, the primary hurdle in cybersecurity analysis isn't a dearth of available data but rather the consolidation of diverse information from myriad sources into a cohesive model. Such a model enriches the comprehension of the cybersecurity landscape and supports real-time decision-making. Traditional relational databases prove inadequate for this task.

Graph Technology for Cybersecurity

Graph databases like Ultipa streamline the storage, querying, and analysis of unstructured data, even as data volumes quickly expand. In the context of cybersecurity, nodes represent entities like system objects, products, vulnerabilities, devices and users, while edges signify the relationships, interactions or the flows connecting these entities.

We will delve into several concrete scenarios to illustrate the transformative power of graph technology in bolstering cybersecurity measures.


Scenario 1: Cyber-attack Identification on Network Flow

Background

From a security perspective, the vast majority of devices within organizations are connected and communicate over a network. By monitoring network traffic, we can extract network flow records, typically collected by devices like routers. These records capture sequences of packets exchanged between internet hosts within specific time intervals.

In recent times, security analysts have recognized the significant potential of network flow data as a robust foundation for security frameworks. It empowers the detection of suspicious or malicious activities that could indicate a cyber-attack.

Let's use bi-directional network flows as an example. In a bi-directional flow, data packets are simultaneously exchanged between two devices — from the source device to the destination device and vice versa. Such bi-directional flows are common in various network communication settings like web browsing, email exchange, and real-time chat applications. For instance, when you request a webpage from a web server, the server responds by sending data back to your device, establishing a bi-directional flow.

Graph-based Modelling

Leveraging graph-based models for network flow data [1] offers a holistic view of the network (Figure 2). To begin, we model each bi-directional network flow record as follows: a flow edge connects the source and destination device nodes, directed from the source to the destination (Figure 1).

Some properties are associated with each node and edge:

  • device node: To simplify, we only consider the ID of the device.
  • flow edge: The start time of the event in epoch time format; the duration of the event in seconds; the protocol number; the port used by the source and destination devices; the number of packets and the number of bytes the two devices sent during the event.

Figure 1. The path representing a bi-directional network flow
Figure 2. 3D comprehensive view of the network flows in Ultipa Manager

Flow-based Intrusion Detections

Flow-based techniques offer several advantages and are successful at detecting various malicious network behaviors. The numbers of packets and bytes exchanged during a communication are sufficient for detecting many attacks, especially when the analysis is backed by a graph database.

For example, a DDoS (Distributed Denial of Service) attack is a malicious attempt to disrupt the functioning of a targeted service by overwhelming it with a flood of internet traffic, making it unable to respond to legitimate requests. Thus, when a device receives an unusually high volume of packets or bytes within a specified short time period, it can be an indicator of potential DDoS attacks.

Setting up a DDoS alarm can be achieved with ease through the Degree Centrality algorithm. Below is the UQL to compute the in-degree of nodes weighted by the edge property srcPackets within a specific timeframe. It identifies devices that have received a sum of packets surpassing the predefined threshold of 10,000 during this period (Figure 3).

algo(degree).params({
    direction: 'in',
    edge_schema_property: 'srcPackets'
}).edge_filter({time <=> [5378315,5378530]}).stream() as re
where re.degree > 10000
find().nodes({_uuid == re._uuid}) as dev
return table(dev._id, re.degree)        
Figure 3. Detect potential DDoS-attack targets in Ultipa Manager
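
To make the logic behind the weighted in-degree check concrete, here is a minimal Python sketch that works on plain tuples instead of a graph database. The tuple layout, the function names and the sample device IDs are assumptions made for illustration, and the time window is assumed to be inclusive at both ends.

```python
from collections import defaultdict

def weighted_in_degree(flows, start, end):
    """Sum srcPackets per destination device for flows starting in [start, end]."""
    totals = defaultdict(int)
    for src, dst, time, src_packets in flows:
        if start <= time <= end:
            totals[dst] += src_packets
    return dict(totals)

def ddos_candidates(flows, start, end, threshold=10_000):
    """Devices whose received packet total exceeds the threshold."""
    totals = weighted_in_degree(flows, start, end)
    return {dev: total for dev, total in totals.items() if total > threshold}

# Synthetic flow records: (source, destination, start time, srcPackets)
flows = [
    ("Comp1", "Comp9", 5378320, 6000),
    ("Comp2", "Comp9", 5378400, 5000),
    ("Comp3", "Comp4", 5378410, 300),
    ("Comp5", "Comp9", 5379000, 9000),  # outside the window, ignored
]
print(ddos_candidates(flows, 5378315, 5378530))  # {'Comp9': 11000}
```

The sketch scans every record once, which is fine for a toy example; in a graph database the same aggregation runs over the stored flow edges.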

Port scanning is a network reconnaissance technique used by attackers to discover open ports and services on a target system. It’s an essential step in the early stages of many cyberattacks, as it helps attackers identify potential vulnerabilities. Port scanning activities can be detected by monitoring network traffic for suspicious patterns, such as a high rate of connection attempts to different ports of a device from a single source.

With UQL, port scanning detection can be accomplished through a very basic pathfinding template written as n().re().n(), where n() stands for nodes and re() signifies outgoing edges. We set a filter for the edges specifying a time range. After grouping the query results by the source and destination devices, we count the distinct dstPort values and return results where the number of distinct ports exceeds 100 (Figure 4). This approach allows for the identification of potentially malicious port scanning behavior in the network traffic data.

n(as src).re({time <=> [5380630,5380690]} as com).n(as dst)
group by src, dst
with count(distinct(com.dstPort)) as portNo
where portNo > 100
return table(src._id, dst._id, portNo)        
Figure 4. Detect potential port scanning attacks in Ultipa Manager
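
The grouping and distinct-count step of the port scanning query can be sketched in Python as follows. This is an illustration on synthetic tuples, not Ultipa's implementation; the function name, the tuple layout and the device IDs are assumptions.

```python
from collections import defaultdict

def port_scan_candidates(flows, start, end, min_ports=100):
    """Map (src, dst) to its number of distinct destination ports,
    keeping only pairs with more than min_ports distinct ports."""
    ports = defaultdict(set)
    for src, dst, time, dst_port in flows:
        if start <= time <= end:
            ports[(src, dst)].add(dst_port)
    return {pair: len(p) for pair, p in ports.items() if len(p) > min_ports}

# Synthetic data: CompA probes 150 distinct ports on CompB,
# while CompC opens 200 connections that all target port 443.
scan = [("CompA", "CompB", 5380630 + i % 60, 1000 + i) for i in range(150)]
normal = [("CompC", "CompB", 5380640, 443) for _ in range(200)]
print(port_scan_candidates(scan + normal, 5380630, 5380690))
# {('CompA', 'CompB'): 150}
```

Counting distinct ports rather than connections is what separates a scan from ordinary heavy traffic: CompC is far busier than the threshold but touches a single port.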

Scenario 2: User Abnormal Behavior Detection

Background

Abnormal user or account behaviors encompass a range of activities such as excessive login failures, anomalous access times, privilege escalation, and abnormal data access and transfers. These deviations can serve as indicators of potential network attacks or unauthorized access.

Many organizations employ Red Teams to pinpoint vulnerabilities and scrutinize their security posture. Red teams utilize diverse tactics, techniques, and procedures (TTPs) to simulate real-world cyber threats. Their intent is not to cause harm but to furnish actionable feedback for the enhancement of detecting, countering, and mitigating security threats. Following a red team exercise, the organization's security team can leverage the insights gained to fortify the security measures. This proactive approach equips organizations to better withstand genuine cyber threats, diminishing the likelihood of successful cyberattacks.

Graph-based Modelling

By modeling a comprehensive, multi-source cybersecurity events dataset [2] as a heterogeneous graph, as illustrated in Figure 5, we can reveal the intricate connections among all the network entities:

  • auth: Authentication events, each recording the time, source user, destination user, source computer, destination computer, authentication type, logon type, authentication orientation and success/failure.
  • process: Process start and stop events, each recording the time, user, computer, process name and start/end.
  • flow: Network flow events, each recording the time, duration, source computer, source port, destination computer, destination port, protocol, packet count and byte count.
  • dns: Domain Name Service (DNS) lookup events, each recording the time, source computer and computer resolved.
  • redTeam: Specific events taken from the authentication data that represent known red team compromise events, each recording the time, user, source computer and destination computer.

Figure 5. Schema overview of diverse network players and events, and interactions

Detections and Investigations of Cybersecurity

In the event that a user's account is compromised and used for malicious activities, like brute force attacks, it often involves multiple failed logons on various devices as the attacker tries to gain unauthorized access. The detection of such patterns (Figure 6) is crucial for identifying potential security breaches.

Figure 6. Pattern of failed logons to multiple devices from the same user

Here's the UQL query to uncover users with more than 10 failed logon attempts to multiple computers within a specific 600-second timeframe:

n({@user} as u).e({@authFrom}).n({orientation == 'LogOn' && result == 'Fail' && time <=> [150000,150600]}).e({@authDst}).n({@computer} as c) as p
group by u
where count(p) > 10 && count(c) > 1
return u{*}        
Figure 7. Detect suspicious users in Ultipa Manager

By executing this query, we identify potentially suspicious users who have exhibited such behavior (Figure 7). Further investigation can be conducted to understand the nature and extent of their login attempts. Intuitive visualization (Figure 8) can aid in assessing the scope and severity of the security incidents.

Figure 8. Visualize the failed logons of suspicious users in Ultipa Manager
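
The failed-logon pattern from Figure 6 can also be sketched in Python on flat event records. The dictionary keys, the function name and the sample users are hypothetical. The sketch counts distinct destination computers, which is one reading of "multiple computers".

```python
from collections import defaultdict

def suspicious_users(auth_events, start, end, min_failures=10):
    """Users with more than min_failures failed logons in [start, end]
    that are spread over more than one distinct computer."""
    failures = defaultdict(int)
    computers = defaultdict(set)
    for e in auth_events:
        if (e["orientation"] == "LogOn" and e["result"] == "Fail"
                and start <= e["time"] <= end):
            failures[e["user"]] += 1
            computers[e["user"]].add(e["dst_computer"])
    return sorted(u for u in failures
                  if failures[u] > min_failures and len(computers[u]) > 1)

def failed_logon(user, computer, time):
    """Helper that builds one synthetic failed-logon event."""
    return {"user": user, "dst_computer": computer,
            "orientation": "LogOn", "result": "Fail", "time": time}

events = (
    [failed_logon("U1", f"C{i % 4}", 150000 + i * 10) for i in range(12)]  # 4 computers
    + [failed_logon("U2", "C7", 150000 + i * 10) for i in range(12)]       # 1 computer
    + [failed_logon("U3", f"C{i}", 150100) for i in range(3)]              # too few
)
print(suspicious_users(events, 150000, 150600))  # ['U1']
```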

With the red team data revealing malicious behaviors or attacks, post-event investigations are essential for uncovering the attack paths and extracting valuable insights. However, investigation is a challenging undertaking that requires a deep understanding of the attack vectors and the capability to trace back through the network, reconstructing the attack timeline. Next, we’ll see an illustrative example of a graph-based model depicting user authentication behavior [3].

In the dataset, the logs of red team attacks are originally given as in Figure 9. We’ve chosen to focus the investigation on the malicious user U7394@DOM1.

Figure 9. Red team attacks performed by U7394@DOM1

First, filter the authentication logs to center on those having U7394@DOM1 as the source user. A total of 366 authentication logs is identified, spanning from time 1 to 767,320 (Figure 10).

Figure 10. Authentication logs with the source user

Following the filtering of authentication logs, dependencies among logs are built based on the time to identify successive authentications. This enables the creation of a behavior graph for user U7394@DOM1, structured as follows:

  • Authentication logs are represented as nodes, each with properties dstUser, srcComp, dstComp, authType, logonType, orientation and result. The 366 authentication logs are condensed into 49 distinct nodes within the graph.
  • Each directed edge (u,v) indicates that the authentication v occurs after the authentication u.
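
The construction above can be sketched in a few lines of Python. This is a simplified reading, not the exact procedure of [3]: it assumes one node per distinct property tuple and an edge between each pair of consecutive logs in time order. The `auth<N>` ID scheme and the sample values are hypothetical.

```python
def build_behavior_graph(logs):
    """logs: iterable of (time, props), where props is a hashable tuple of
    (dstUser, srcComp, dstComp, authType, logonType, orientation, result).
    Returns (node_ids, edges)."""
    node_ids = {}   # props -> node ID, one node per distinct property tuple
    edges = set()   # (u, v): authentication v directly follows authentication u
    prev = None
    for _, props in sorted(logs, key=lambda log: log[0]):
        node = node_ids.setdefault(props, f"auth{len(node_ids) + 1}")
        if prev is not None:
            edges.add((prev, node))  # identical consecutive logs give a self-loop
        prev = node
    return node_ids, edges

# Synthetic logs: four events that collapse into two distinct nodes.
log_on = ("U1", "C1", "C2", "Kerberos", "Network", "LogOn", "Success")
log_off = ("U1", "C2", "C2", "Kerberos", "Network", "LogOff", "Success")
logs = [(1, log_on), (2, log_off), (3, log_on), (4, log_off)]

node_ids, edges = build_behavior_graph(logs)
print(len(node_ids), sorted(edges))
# 2 [('auth1', 'auth2'), ('auth2', 'auth1')]
```

Collapsing identical logs into one node is what turns 366 logs into 49 nodes in the article's example, and it is also why repeated LogOn/LogOff sequences show up as bi-directional edges.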

To uncover what happened before a malicious event, we can extract paths from the behavior graph that lead to an attack-related authentication. We use the malicious authentication with ID "auth48" as an example and limit the paths to at most 7 nodes (up to 6 edges):

n().re({_uuid <= 361})[:6].n({_id == "auth48"}) as p 
return p{*}        
Figure 11. Query of paths leading to the malicious event "auth48" (in red)
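
For readers who want to see the traversal itself, the following Python sketch enumerates the paths that end at a target node by walking edges backwards, with a bound on the number of edges that mirrors the `[:6]` in the query. It keeps paths simple (no repeated nodes), which is an assumption rather than documented UQL behavior, and it omits the edge filter. The function name and the toy graph are hypothetical.

```python
def paths_to(edges, target, max_edges=6):
    """All simple paths with 1..max_edges edges that end at target."""
    preds = {}
    for u, v in edges:
        preds.setdefault(v, []).append(u)
    results = []

    def extend(path):
        # path runs from its current start node to target
        if len(path) - 1 >= max_edges:
            return
        for u in preds.get(path[0], []):
            if u not in path:  # keep paths simple (assumption)
                new_path = [u] + path
                results.append(new_path)
                extend(new_path)

    extend([target])
    return results

# Toy behavior graph: a -> b -> c -> d and x -> c
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "c")]
print(sorted(paths_to(toy_edges, "d", max_edges=2)))
# [['b', 'c', 'd'], ['c', 'd'], ['x', 'c', 'd']]
```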

Further analysis can then be conducted by professionals, utilizing the results (Figure 11) to pinpoint the attack paths. Among the various attack paths, one is highlighted in Figure 12. The sequence begins with the user successfully obtaining a Ticket Granting Service (TGS) from machine C743 through C3352. Notably, a cyclic pattern emerges, visible in the graph through bi-directional edges: LogOff from C2106, followed by LogOn, and subsequently another LogOff on the same machine. This repeated LogOff and LogOn pattern on the same machine over a brief period could be indicative of malicious activity. The user then proceeds to C1618 to acquire another TGS, which is followed by obtaining a Ticket Granting Ticket (TGT). Finally, the user leverages NTLM to carry out the attack on C492.

Figure 12. An attack path and the details of the authentications involved

Scenario 3: Malware Detection in PE Files

Background

Malware (short for "malicious software") is crafted to harm digital systems or their users. Based on its behavior, malware is classified into viruses, worms, trojans, bots, ransomware and so on. Cybercriminals use malware to launch attacks like accessing unauthorized systems, disrupting services, and stealing confidential information. In 2022 alone, a staggering 5.5 billion malware attacks were recorded worldwide (source: Statista).

The Portable Executable (PE) file stands out as one of the most popular vectors for malware. A PE file is used in Windows OS for the storage and execution of programs, dynamic-link libraries (DLLs), and others.

For a long time, the main line of defense against malware attacks has been anti-malware software. Traditionally, these products relied heavily on signature-based detection. However, attackers easily evade this approach using encryption and code obfuscation. Some anti-malware products have adapted to monitor the kernel behaviors of software. Nonetheless, this solution inherently entails high costs and scalability challenges.

In response to the ever-evolving landscape of malware attacks, some studies have emerged that leverage both file content (e.g., the extracted API calls) and file relationships (e.g., whether extracted APIs belong to the same DLL) to characterize and detect malware [4].

Graph-based Modelling

A heterogeneous information network (HIN) can be constructed to model different types of entities and relationships (Figure 13). The HIN offers a structured approach for analyzing intricate relationships among diverse entities, which makes detection more challenging for attackers to bypass.

Figure 13. Schema overview of the constructed HIN

There are 5 types of relationships between the 5 types of entities:

  • file-[exist]-machine: A PE file exists on a machine.
  • file-[replace]-archive: A PE file and the archive it replaces.
  • file-[create]-file: A PE file can create another PE file during execution or installation.
  • file-[include]-API: A PE file and its extracted API calls which reflect its behaviors.
  • API-[belongTo]-DLL: An API belongs to a DLL (Dynamic-Link Library).

Furthermore, each node or edge schema is enriched with properties that provide diverse descriptions of the associated entity or relationship, adding greater semantic depth to the HIN. For instance, the file node has a label property with values "benign", "malware" or "unknown"; the archive node possesses a source property detailing its origin; the create edge includes a createdOn property to record the timestamp of when the file was created.

More Expressive Querying Techniques

Let’s determine whether the newly collected file, File-U, is malicious or not within the example graph (Figure 14).

Figure 14. The example PE malware detection graph

To begin, we can compare files solely based on their extracted API calls. This can be accomplished by employing similarity graph algorithms like Jaccard Similarity. The Jaccard Similarity algorithm considers the immediate neighborhood sets of nodes and computes the similarity score of a pair of nodes by dividing the size of the intersection of the two neighborhood sets by the size of their union.

By applying the algorithm to the subgraph consisting of file and API nodes (along with their interconnections, i.e., include edges), we identify File-B1 as the closest match, which is labeled as "benign" (Figure 15). This inference suggests that File-U is likely benign as well.

algo(similarity).params({
    ids: 'File-U',
    type: 'jaccard',
    top_limit: 1
}).stream().node_filter({@file || @API}) as re
find().nodes({_uuid == re.node2}) as n
return table('File-U', n._id, n.label, re.similarity)        
Figure 15. Run the Jaccard Similarity algorithm in Ultipa Manager
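
As a worked illustration of the score, the Python sketch below computes Jaccard similarity as the size of the intersection of two API sets divided by the size of their union. The API sets are invented for illustration and are not the contents of the example graph; only SetTimer and SetDoubleClickTime are taken from the article, and `Api1` to `Api5` are placeholders.

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar(target, neighbors, top=1):
    """Rank all other nodes by Jaccard similarity to target."""
    scores = [(other, jaccard(neighbors[target], apis))
              for other, apis in neighbors.items() if other != target]
    scores.sort(key=lambda s: (-s[1], s[0]))
    return scores[:top]

# Hypothetical file -> API neighborhoods
neighbors = {
    "File-U": {"SetDoubleClickTime", "Api1", "Api2"},
    "File-B1": {"Api1", "Api2", "Api3"},
    "File-M1": {"SetTimer", "Api4"},
    "File-M2": {"SetTimer", "SetDoubleClickTime", "Api5"},
}
print(most_similar("File-U", neighbors))                     # [('File-B1', 0.5)]
print(jaccard(neighbors["File-U"], neighbors["File-M1"]))    # 0.0
```

With these sets, File-U shares two of four APIs with File-B1 (score 0.5) and nothing with File-M1 (score 0), which reproduces the kind of result a purely neighborhood-based comparison gives.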

In fact, File-U is a malicious file associated with File-M1, which bears the label "malware". Although File-U and File-M1 have no common API calls, resulting in a score of 0 in the similarity analysis, their API calls share inherent connections. Specifically, the API SetTimer called by File-M1 and the API SetDoubleClickTime called by File-U belong to the same DLL, USER32.DLL; and these two APIs are called together by another "malware" file, File-M2. Catching sly malware like File-U calls for more complex queries.

Hence, we can establish that two files are relevant if they (1) have APIs belonging to the same DLL and, (2) these APIs are jointly called by any other malicious file. Within Ultipa Graph, this type of query can be easily expressed by the subgraph template (Figure 16).

Figure 16. The subgraph template for identifying files relevant to File-U

Executing this subgraph template query as a Shortcut in Ultipa Manager (Figure 17) successfully identifies File-M1 and File-M2.

Figure 17. Run subgraph template query shortcut in Ultipa Manager
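
The two-part relevance rule can be sketched in Python on dictionaries rather than as a subgraph template. This is one reading of the rule, not the template from Figure 16: it requires two different APIs in the same DLL and a third, malicious file that calls both. All data except SetTimer, SetDoubleClickTime and USER32.DLL is hypothetical.

```python
def relevant_files(target, calls, dll_of, labels):
    """Return (file, witness) pairs where file has an API in the same DLL as
    one of target's APIs, and a malicious witness file calls both APIs."""
    hits = set()
    for other, other_apis in calls.items():
        if other == target:
            continue
        for a in calls[target]:
            for b in other_apis:
                # assumption: the rule is about two different APIs of one DLL
                if a == b or dll_of.get(a) != dll_of.get(b):
                    continue
                for witness, w_apis in calls.items():
                    if (witness not in (target, other)
                            and labels.get(witness) == "malware"
                            and a in w_apis and b in w_apis):
                        hits.add((other, witness))
    return hits

calls = {
    "File-U": {"SetDoubleClickTime", "Api1"},
    "File-M1": {"SetTimer", "Api2"},
    "File-M2": {"SetTimer", "SetDoubleClickTime"},
    "File-B1": {"Api1", "Api3"},
}
dll_of = {"SetTimer": "USER32.DLL", "SetDoubleClickTime": "USER32.DLL",
          "Api1": "DLL-X", "Api2": "DLL-Y", "Api3": "DLL-X"}
labels = {"File-U": "unknown", "File-M1": "malware",
          "File-M2": "malware", "File-B1": "benign"}
print(relevant_files("File-U", calls, dll_of, labels))
# {('File-M1', 'File-M2')}
```

File-B1 is not returned even though it shares an API with File-U, because no malicious file ties the two together. That is the additional context that plain neighborhood similarity cannot see.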

The utilization of HINs and advanced querying techniques holds promise for improving malware detection, especially in the face of increasingly sophisticated threats, by considering not only individual file attributes but also their intricate relationships and behaviors within a broader network context.

Enhancing Malware Detection with Machine Learning

Machine Learning (ML) algorithms offer a powerful means to create automated malware detection systems capable of accurately distinguishing between benign and malicious files. These techniques also significantly boost detection speed and address the evolving threat landscape.

Data representation is a pivotal step involving the transformation of data into a machine-readable format suitable for ML models. The graph embedding algorithms are widely used for this purpose, and many innovative algorithms tailored for HIN have been developed. These graph embedding algorithms convert the entities (PE files) within a graph into lower-dimensional vectors.

Figure 18. Data Representation Learning for HIN

The relatedness among PE files can be characterized through a series of such subgraph templates, each referred to as a meta-graph. Figure 18 shows an abstract workflow for embedding all file nodes within the HIN. Unlike conventional algorithms like Node2Vec and Struc2Vec, which perform better on homogeneous graphs, this approach targets heterogeneous graphs. Here, the training data (corpus) is generated through random walks guided by various meta-graphs. Subsequently, the Skip-gram model produces node embeddings that capture distinct structural features. Given that different meta-graphs offer diverse perspectives on the relatedness among PE files, a multi-view fusion approach is employed to combine node embeddings derived from various meta-graph schemes. This strategy effectively explores the complementary nature of these perspectives.
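
To give a feel for how guided random walks produce a training corpus, here is a minimal Python sketch. It uses a meta-path (a linear sequence of node types), which is a simplification of the meta-graphs described above, and it stops before the Skip-gram step. The function name, the tiny graph and the type sequence file-API-DLL-API-file are assumptions made for illustration.

```python
import random

def metapath_walk(start, neighbors, node_type, metapath, walk_length, rng):
    """Random walk whose node types follow metapath cyclically.
    The first and last types of metapath must match so it can repeat."""
    walk = [start]
    period = len(metapath) - 1
    step = 0
    while len(walk) < walk_length:
        wanted = metapath[(step % period) + 1]
        candidates = [n for n in sorted(neighbors[walk[-1]])
                      if node_type[n] == wanted]
        if not candidates:
            break  # dead end: no neighbor of the required type
        walk.append(rng.choice(candidates))
        step += 1
    return walk

# Tiny undirected HIN fragment (hypothetical)
neighbors = {
    "File-U": {"SetDoubleClickTime"},
    "File-M1": {"SetTimer"},
    "File-M2": {"SetTimer", "SetDoubleClickTime"},
    "SetTimer": {"File-M1", "File-M2", "USER32.DLL"},
    "SetDoubleClickTime": {"File-U", "File-M2", "USER32.DLL"},
    "USER32.DLL": {"SetTimer", "SetDoubleClickTime"},
}
node_type = {"File-U": "file", "File-M1": "file", "File-M2": "file",
             "SetTimer": "API", "SetDoubleClickTime": "API",
             "USER32.DLL": "DLL"}
metapath = ["file", "API", "DLL", "API", "file"]

walk = metapath_walk("File-U", neighbors, node_type, metapath, 9, random.Random(0))
print(walk)
```

Each walk plays the role of a sentence: collecting many walks per file node yields the corpus that a Skip-gram model would then turn into node embeddings.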

Figure 19. An illustration of K-Means Clustering for 3D Embeddings

After converting the nodes into vector embeddings, a diverse set of ML models can be employed to enhance malware detection capabilities. One commonly used model is K-Means Clustering. It enables the grouping of similar PE files based on the features extracted from their embeddings, thereby assisting in the identification of potentially malicious file clusters (Figure 19). This fusion of embeddings and ML models constitutes a potent strategy for fortifying cybersecurity defenses and proactively mitigating the ever-evolving threats posed by malware.
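
As an illustration of the clustering step, the following Python sketch runs plain K-Means (Lloyd's algorithm) on a handful of made-up 3D embeddings. Seeding the centroids with the first k points is a simplification chosen to keep the example deterministic, and the embedding values are synthetic.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # assign each point to its nearest centroid
        assignment = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        # move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centroids

# Synthetic 3D embeddings forming two well-separated groups
embeddings = [
    (0.1, 0.2, 0.1), (5.0, 5.1, 4.9), (0.2, 0.1, 0.0),
    (5.2, 4.8, 5.0), (0.0, 0.3, 0.2), (4.9, 5.0, 5.3),
]
assignment, centroids = kmeans(embeddings, k=2)
print(assignment)  # [0, 1, 0, 1, 0, 1]
```

In practice, files with known labels that fall into the same cluster as an unknown file give a first hint about whether that file deserves closer inspection.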


Ultipa for Cybersecurity

The rapid advancement of technology has ushered in both convenience and peril, placing businesses in an era that demands a stronger shield against cybercrimes than ever before. To navigate this challenging landscape, it has become imperative for organizations to fortify their defenses against these threats. Leveraging graph databases like Ultipa can be the key to establishing an exceptional data infrastructure that not only safeguards your network, information, and financial assets but also empowers you with the insights needed to proactively thwart cyberattacks and ensure the resilience of your digital ecosystem.


References

[1] Unified Host and Network Data Set, M. Turcotte, A. Kent, C. Hash, in Data Science for Cyber-Security (2018)

[2] Comprehensive, Multi-Source Cybersecurity Events, A. D. Kent, Los Alamos National Laboratory (2015)

[3] Graph-based malicious login events investigation, F. Amrouche, S. Lagraa, G. Kaiafas, R. State (2019)

[4] Gotcha: Sly Malware!- Scorpion A Metagraph2vec Based Malware Detection System, Y. Fan, S. Hou, Y. Zhang, Y. Ye, M. Abdulhayoglu (2018)


Written by: Pearl Cao

Download a PDF version of this article at https://www.ultipa.com/solutions/cybersecurity
