Security Data Lakes: A New Tool for Threat Hunting, Detection & Response, and GenAI-Powered Analysis

This article explores the key benefits of security data lakes, including advanced use cases for threat hunting, streamlined detection and response workflows, and their role in GenAI-powered analysis. It also addresses the challenges of managing such large-scale data environments and offers solutions for optimizing performance.

What is a Security Data Lake?

A security data lake is a centralized repository where you can store both structured and unstructured security data at large scale. It lets you store data in its raw form without needing to pre-structure it.

A security data lake typically stores its data in a cloud object storage service, such as AWS S3 or Azure Blob Storage. It can scale practically without limit and is designed to be cost-effective.

The total size of the data in a security data lake is typically measured in terabytes or petabytes, which means significantly more data is available for analysis than with traditional security analysis tools.

Thanks to the sheer scale of the data they can store and make accessible, security data lakes unlock dramatically new use cases for threat hunting, detection and response, and, increasingly, GenAI-powered analysis.

3 New Use Cases Unlocked by Security Data Lakes


1: Threat Hunting

Security data lakes allow analysts to search far more historical data than before for indicators of compromise, such as malicious IP addresses, domains, and ransomware file hashes, finally making threat hunting over long time ranges practical.

Here’s a scenario showing how a security data lake can help with threat hunting:

  • Let’s say that new research has just been released about an attack campaign that has been active for the past nine months and was only recently detected by the security community.
  • The research includes indicators of compromise, such as domains, IP addresses, and file hashes associated with the attack.
  • Your team needs to investigate the incident by answering the following questions: Are any of these indicators of compromise present in our historical logs going back nine months? If so, was our system successfully breached? Is there an advanced persistent threat still lurking?
  • Before the availability of data lake query tools, querying historical data going back nine months was a significant challenge. You would need to rehydrate logs from cold storage back to your original SIEM or log search tool, which could take days and be very expensive.
  • With your security data lake, you have 18 months of data readily accessible for analysis. Using data lake query tools like Amazon Athena or Azure Synapse Analytics, you can search for these indicators of compromise over a large time range, such as nine months (see the example query after this list).
  • Let’s say your team finds these indicators of compromise in logs from six months ago, a time range well outside what your traditional SIEM can see.
  • Digging further into the data from that period, you discover an advanced persistent threat using DNS for command-and-control. This gives you the information you need to shut down the threat.
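For instance, a minimal sketch of the kind of IOC search you could run in Amazon Athena might look like the following, assuming your CloudTrail logs are registered as a table named cloudtrail_logs and using placeholder IP addresses in place of the real indicators from the research:

SELECT
  eventTime,
  eventName,
  sourceIpAddress,
  userIdentity.userName AS userName
FROM cloudtrail_logs
-- Placeholder indicator IPs; substitute the IOCs published in the research
WHERE sourceIpAddress IN ('203.0.113.10', '198.51.100.24')
  -- Assumes eventTime is stored as a timestamp; adjust the parsing if it is a raw string
  AND DATE(eventTime) >= CURRENT_DATE - INTERVAL '9' MONTH
ORDER BY eventTime;

Similar queries against DNS logs and endpoint logs would cover the domain and file hash indicators.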

Since security data lakes can store far more data than traditional security log tools, it’s now feasible to keep your historical logs and search them, allowing you to find threats in the past much more easily.


2: Detection & Response

In detection and response, it is useful to correlate activity across a wide range of data sources to trace the steps of an incident and learn how to respond.

Since security data lakes are designed to handle both structured and unstructured data of various formats, they allow you to store a wide variety of data sources in one central location and correlate across them.

Here’s a scenario showing how a security data lake can help with detection and response:

  • Let’s say that you receive an alert indicating that an employee at your company clicked on a phishing link in an email.
  • Your team needs to investigate the incident by answering the following questions: What other sites did the user visit after clicking the link? This helps identify other potentially malicious domains. Did any malware get installed on the user’s laptop afterward? What internal systems did the user log in to afterward? Is there any evidence of compromise?
  • Thankfully, your data lake contains a wide variety of data that can help with this investigation, including DNS logs, endpoint logs, and identity provider logs.
  • First, you query your DNS logs in the data lake to see what other domains the user visited shortly after clicking the phishing link (a sketch of this query appears after this list).
  • Second, you query your endpoint logs to determine what activity occurred on the user’s laptop shortly after the click. You find a process that connected to a suspicious domain, then downloaded and executed a script, which could be malware or a command-and-control process.
  • Third, you query your Identity Provider logs to check for login events from this user into other internal systems to determine if any internal systems were compromised.
  • You respond by disabling the user’s access to the affected systems, blocking the phishing and malware domains, adding a detection rule that looks for that script command in endpoint logs, and checking the internal systems the user interacted with.
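To make the first step concrete, here is a minimal sketch of the DNS-log query, assuming a hypothetical dns_logs table with query_time, query_name, response_ip, and client_ip columns:

SELECT
  query_time,
  query_name,
  response_ip
FROM dns_logs
-- Placeholder IP for the affected user's workstation
WHERE client_ip = '10.20.30.40'
  -- Placeholder window: the two hours immediately after the phishing click
  AND query_time BETWEEN TIMESTAMP '2024-05-01 14:05:00'
                     AND TIMESTAMP '2024-05-01 16:05:00'
ORDER BY query_time;

The endpoint and identity provider queries follow the same pattern against their respective tables.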

Since the data lake serves as a centralized store for your security data, you can run all of these queries across many data sources in one place, tracing through all of the steps of an incident without needing to jump between many different tools.


3: GenAI-Powered Analysis

Security data lakes can be extremely helpful for GenAI-powered analysis, such as giving your team the ability to ask security questions in plain English (e.g., “Are we adequately monitoring and mitigating potential insider threats?”) and getting powerful insights as a result.

There are three main approaches that teams take when they embark on GenAI projects.

  • Retrieval-Augmented Generation (RAG)
  • Fine-tuning an existing foundational model
  • Training a foundational model from scratch

It’s almost always recommended to use RAG or fine-tuning on top of an existing foundational model rather than training your own model from scratch.

In this article, we’ll focus on an example that uses RAG.


RAG Example System

Retrieval-Augmented Generation (RAG) involves retrieving data and incorporating it into your prompt to enhance the response generated by a large language model (LLM). This approach brings relevant data into the context window of the model to improve the quality of its answers.

Let’s say you want to build an internal tool at your company that allows you to ask security questions in plain English.

Here is how the internal tool could work:

  • Let’s say you choose Claude 3.5 Sonnet via Amazon Bedrock to be your LLM. This model is reasonably capable of writing code and queries, so it’s a good choice for interacting with your data lake.
  • Before sending the question to the LLM, your internal tool retrieves all of the table schema definitions from your data lake catalogs (see the sketch after this list). It uses this information to augment the prompt so the model can generate an improved response.
  • Given the augmented prompt, the LLM responds by providing a few suggested SQL queries you can run against your data lake.
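As a sketch of the retrieval step, the tool could pull the schema definitions from the data lake catalog with a query like the one below, which uses Athena's information_schema and assumes the tables live in a hypothetical database named security_lake:

SELECT
  table_name,
  column_name,
  data_type
FROM information_schema.columns
-- Placeholder database name for the data lake catalog
WHERE table_schema = 'security_lake'
ORDER BY table_name, ordinal_position;

The results can then be serialized into the <DataLakeTables> section of the prompt.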


Here is an example of what the prompt could look like:

You are a senior security analyst specializing in insider threat detection and security data analysis.

Using the provided data schemas, generate specific SQL queries to answer the security-related question.

Please structure your response as follows:
1. Key risk areas to analyze based on available data
2. For each risk area:
   - SQL query to extract relevant data
   - Description of what the query reveals
   - Specific indicators or thresholds that may suggest insider threats
   - Any data gaps or additional logging recommendations

Consider the following data points in your analysis:
- Unusual access patterns or privilege escalations
- Off-hours activity
- Access from unexpected locations/IPs
- Failed authentication attempts
- Bulk data operations
- Account/role modifications
- Actions across multiple accounts/services

Additional context (adjust as needed):
- Business hours: 9am-5pm EST
- Office locations: [specify]
- High-risk systems/data: [specify]

Data Sources:
<DataLakeTables>
[your existing schema definitions]
</DataLakeTables>

<Question>
[your question]
</Question>        

Here is an example of the kind of response the LLM might generate:

I’ll analyze key insider threat risk areas using the available CloudTrail and Okta logs.

1. Privileged Access Misuse

WITH privileged_actions AS (
  SELECT
    userIdentity.userName AS userName,
    sourceIpAddress,
    eventName,
    eventTime,
    requestParameters
  FROM cloudtrail_logs
  -- Parentheses keep the date filter applying to all write-style events
  WHERE (eventName LIKE 'Create%'
      OR eventName LIKE 'Delete%'
      OR eventName LIKE 'Update%'
      OR eventName LIKE 'Put%')
    AND DATE(eventTime) >= CURRENT_DATE - INTERVAL '30' DAY
)
SELECT
  userName,
  COUNT(*) AS action_count,
  COUNT(DISTINCT sourceIpAddress) AS unique_ips
FROM privileged_actions
GROUP BY userName
HAVING COUNT(*) > 100
ORDER BY action_count DESC;

Indicators:

– High volume of privileged actions (>100/month)

– Actions from multiple IPs

– Sudden increase in privileged operations

2. Off-Hours Activity

SELECT
  c.userIdentity.userName AS userName,
  c.eventName,
  c.sourceIpAddress,
  COUNT(*) AS event_count
FROM cloudtrail_logs c
WHERE DATE(c.eventTime) >= CURRENT_DATE - INTERVAL '7' DAY
  -- Outside the 9am-5pm business hours window
  AND (HOUR(c.eventTime) < 9 OR HOUR(c.eventTime) > 17)
  AND c.eventName NOT LIKE '%Read%'
  AND c.eventName NOT LIKE '%List%'
  AND c.eventName NOT LIKE '%Describe%'
GROUP BY 1, 2, 3
HAVING COUNT(*) > 10;

Indicators:

– 5+ failures within 24 hours

– Attempts from unusual locations

– Pattern of failures followed by success

Recommendations:

1. Add data correlation between Okta and CloudTrail events

2. Implement session duration logging

3. Track data exfiltration events (S3 downloads, database exports)

4. Log privilege escalation attempts

5. Monitor service account usage

Additional metrics to consider:

– Resource access patterns

– Volume of data accessed

– Geographic access patterns

– Account sharing indicators

– Password reset frequencies

There is a good chance that you can run the queries the LLM suggests verbatim, but you may need to tweak them a little depending on your use case, data volume, and so on.

Also, you can imagine various ways to improve the system, like building additional prompt templates that can be filled with query results to go into more detailed investigations.

As this example illustrates, using RAG together with a security data lake can be highly useful for GenAI-powered analysis, providing a rich data source that can make an LLM much more capable of providing insights.

Challenges your team may face running a security data lake

Data lakes are powerful, but they come with various challenges due to their sheer scale and the variety of data they contain.

How do we route data into the data lake?

Many vendors offer features to automatically export data to cloud storage such as AWS S3, including Cloudflare DNS and HTTP logs, CrowdStrike Falcon Data Replicator for XDR logs, AWS cloud audit logs, WAF logs, and VPC flow logs. This kind of export feature is likely to become increasingly common.

Log pipeline tools like Cribl, Fluentd, and Logstash can help fetch data from different sources and route it into cloud storage.

At Scanner, we have a guide in our documentation about how to load a few dozen of the most common security log sources into a data lake.

The trend seems to be that eventually almost all security data sources will integrate directly with cloud storage like AWS S3 or Azure Blob Storage, making it easy to feed them into security data lakes.

How do we handle the wide variety of data formats?

In a data lake, data exists in a variety of formats, both structured and unstructured. This diversity can make querying difficult.

Some tools provide mechanisms for transforming common data sources into a standard schema, such as the Open Cybersecurity Schema Framework (OCSF). Amazon Security Lake is one such tool.

Tools like Presto or Amazon Athena can scan files in their raw format, but they require a defined schema, which can be challenging when dealing with diverse data formats.
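As an illustration of that schema burden, here is a simplified sketch of the kind of table definition Athena needs before it can query raw JSON logs in S3; the pared-down column list and the bucket path are placeholders, and AWS publishes a much fuller CloudTrail table definition:

CREATE EXTERNAL TABLE cloudtrail_logs (
  eventTime       STRING,
  eventName       STRING,
  sourceIpAddress STRING,
  userIdentity    STRUCT<userName: STRING>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
-- Placeholder S3 path to the raw log files
LOCATION 's3://your-log-bucket/AWSLogs/';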

A new category of data lake tools has emerged that can analyze both structured and unstructured data without requiring users to define schemas beforehand. The team I work with at Scanner.dev is building tools in this space, making it as easy as possible to run full-text search and detections on any data format.

How do we speed up queries now that our data set is very large?

Handling hundreds of terabytes or even petabytes of data can be slow unless you take steps to optimize or use specialized tools. To speed up SQL queries on tabular data, consider converting it to a columnar format like Parquet. To speed up full-text search on unstructured data, consider building a search index.
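As one illustration of the Parquet approach, an Athena CTAS statement can rewrite a raw table into partitioned Parquet; the table and column names here are placeholders:

CREATE TABLE dns_logs_parquet
WITH (
  format = 'PARQUET',
  -- Placeholder S3 path for the converted data
  external_location = 's3://your-data-lake/dns_logs_parquet/',
  partitioned_by = ARRAY['event_date']
) AS
SELECT
  query_time,
  query_name,
  client_ip,
  response_ip,
  event_date  -- partition columns must come last in the SELECT list
FROM dns_logs_raw;

Queries that filter on event_date then scan far fewer bytes, which reduces both latency and cost.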

Choosing the right tool for analyzing different kinds of data is crucial:

  • For structured, tabular data in formats like CSV and Parquet (e.g., network flow logs, DNS logs), use a SQL-based tool like Presto, Amazon Athena, or Azure Synapse Analytics.
  • For unstructured or complex data, such as deeply nested JSON (e.g., web application firewall logs, cloud audit logs, and identity logs), consider a newer category of full-text search tools, like Scanner.


The Road Ahead

As the security landscape evolves, security data lakes will remain essential for staying ahead of threats, responding effectively to detection alerts, and leveraging AI advancements to strengthen security.

I’m convinced that eventually almost all data relevant to security will make its way to data lakes in cloud storage. The tools here will continue to improve, and it will become easier and easier to analyze the wide diversity of data stored there.

Thanks to their ability to store and analyze massive data sets in a wide variety of formats, security data lakes can provide immense value for threat hunting, detection and response, and, increasingly, GenAI-powered analysis.

Learn More at Scanner.dev

