Curse of the security data lake monster

In my time I have seen security data lake projects turn into "a few good ideas and a smoking crater" more than once. There are several pitfalls that we can learn from, and a great success story we don't often celebrate as a security data-lake use case.

There comes a time in a security team's maturity when it decides that security has a data problem and the team needs to become more data-driven (like the rest of the org), and therefore that a security data lake is the answer (to a problem that's not fully fleshed out).

Yet... in many security organizations there is a bona fide success story that we do not call a security data lake, but which for all intents and purposes is one. I am talking about a SIEM. Despite there being many, many examples of failed SIEM rollouts, there are more successes than failures, and for the most part it's an established part of information security.

What are the major pitfalls of security data lakes?

1) The wrong organisational structure with the wrong data store/platform solving the wrong problem.

In general, the failures of many data lakes are an organisational structure and process problem. There is a great article by Zhamak Dehghani on the phenomenon (see link). Key to this failure is having a centralised team servicing multiple stakeholders with a centralised data lake (often nicknamed a data swamp).

Equally, some teams might go into using a data lake platform thinking it has the same capabilities as other data stores they are used to (like a relational database, e.g. MySQL) and discover its tradeoffs too late in the process.

This is especially true if you require a near real-time answer to a (large) query and have not factored in that you might want to perform the data processing on the data lake (daily/hourly/other, depending on the data refresh) and serve that data from a "lake shore mart" database to meet your SLAs for customers/use cases. It's the classic "when all you have is a hammer, every problem looks like a nail" mistake.
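To make that pattern concrete, here is a minimal sketch of "process on the lake, serve from a lake shore mart", assuming a daily-scheduled PySpark batch job; the lake path, column names and serving database details are hypothetical, not a recommendation of any particular stack.

```python
# Minimal sketch: heavy aggregation runs on the data lake on a schedule, and the
# pre-computed result is pushed to a small relational "lake shore mart" that can
# answer dashboard/API queries within tight SLAs.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily_findings_rollup").getOrCreate()

# Hypothetical lake location and schema.
findings = spark.read.parquet("s3://security-lake/findings/")

daily_summary = (
    findings
    .filter(F.col("status") == "open")
    .groupBy("owning_team", "severity")
    .agg(F.count("*").alias("open_findings"))
)

# Serve the small, pre-aggregated table from a relational store (hypothetical connection).
(daily_summary.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://mart-host:5432/security_mart")
    .option("dbtable", "daily_open_findings")
    .option("user", "etl_user")
    .mode("overwrite")
    .save())
```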

2) Engineers wanting to play with new toys, without a clue.

Another pitfall is a senior engineer discovering a shiny new capability (e.g. graph databases, data streaming, Apache Beam) and building a project/initiative around a piece of technology that has not gone through technical due diligence, and without enough training for the engineers prior to starting.

Seriously... Say "Graph the planet" one more time without knowing what betweenness centrality means... I dare you!

I remember passing by an engineer's desk, asking them if they needed help to get a deliverable out on time, and the answer was "no, just need the time to figure it out", with an O'Reilly book open on the desk. Cut to a few months later: the project was complete and the solution was not fit for purpose (returning results 24 hours outside our SLA). When another engineer with experience in the technology took over, they had to rewrite the whole thing, and recommended scrapping the approach altogether in the long term as the tech was not fit for purpose to begin with.

3) Too much work and resources for little to no reward

A security data lake can be an endless time & money sink if you are not careful. Be mindful of the work needed for data ingestion, data processing, data quality, governance, serving the data, and maintaining the data platform. Without the support/structure, the right use case, the data already in a good state, and the right platform - there is a high risk this will hurt.

4) A single point of failure for many security teams

Let's say you have delivered your first set of use cases, and now the security data lake is entrenched in the processes of several security teams. What happens when a team wishes to move on to a new security solution and needs the data ingestion/processing/serving to change in order to factor in this new solution? The answer is that the data engineering team becomes a bottleneck for security initiatives, and the more entrenched it becomes, the bigger the problem.

5) The data & data model is just not there - Use cases for immature processes

I cannot overstate this pitfall. One of the first things Google teaches in its "How Google Does Machine Learning" course is that you need to ensure the maturity of the process you are trying to provide a big data solution for.

Is the data available in a usable and accessible form?

To solve the data problem, do you have the right data model to link different datasets for your use case?

Imagine you want to map a set of security findings to the team responsible for resolving the issue. My favourite example of this is a bug bounty finding that reports a vulnerability in a specific parameter on a specific web page. To solve this mapping you need the ability to map (somewhere) the FQDN/URL/API/page to the relevant team, and the data from the bug bounty program needs to be consumable in a form that lets you do that mapping. Good luck with that.
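To illustrate what "having the data model" means in practice, here is a toy sketch of that mapping, assuming you already maintain an asset inventory keyed by FQDN and path prefix. Every name and record below is hypothetical, and building/maintaining that inventory is the genuinely hard part.

```python
# Toy sketch: resolve a bug bounty finding's affected URL to an owning team
# via a (hypothetical) FQDN/path-prefix ownership inventory.
from urllib.parse import urlparse

ASSET_OWNERS = {
    ("payments.example.com", "/checkout"): "payments-team",
    ("payments.example.com", ""): "payments-platform-team",
    ("www.example.com", ""): "web-team",
}

def owner_for_finding(affected_url: str):
    """Return the owning team for a finding's URL, matching the most specific prefix first."""
    parsed = urlparse(affected_url)
    fqdn, path = parsed.hostname or "", parsed.path or ""
    # Try the full path, then progressively shorter prefixes, then the bare FQDN.
    candidates = [(fqdn, path[:i]) for i in range(len(path), -1, -1)]
    for key in candidates:
        if key in ASSET_OWNERS:
            return ASSET_OWNERS[key]
    return None

print(owner_for_finding("https://payments.example.com/checkout?param=..."))  # payments-team
print(owner_for_finding("https://unknown.example.org/"))                     # None
```

If the bug bounty export does not give you a clean URL or FQDN field to start from, even this trivial lookup is out of reach, which is exactly the point.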

What made the SIEM a security data lake success story?

It is not all doom and gloom though. There are success stories like the SIEM, and it's important to understand what made them work and what we can do to limit the likelihood of failure.

1) Achievable & impactful use case.

Use Case: Collect the following monitoring logs from our systems, and if the following events are triggered (i.e. indicators of compromise), raise an alert for the security team to triage.

Simple. Achievable. Impactful.

Of course there is more to it than the example above (tuning the rules to avoid false positives, alerting fatigue, false negatives, etc.); that said, it has found its place in the security industry for good reason.
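As a rough illustration of why that core loop is so tractable, here is a toy sketch of it. The indicators and events are made up (apart from the well-known EICAR test file hash), and real SIEMs add correlation, enrichment and the tuning mentioned above.

```python
# Toy sketch of the core SIEM use case: take events, match them against
# indicators of compromise, and raise alerts for the security team to triage.
from dataclasses import dataclass
from typing import Optional

IOC_IPS = {"203.0.113.7", "198.51.100.23"}         # illustrative known-bad IPs
IOC_HASHES = {"44d88612fea8a8f36de82e1278abb02f"}   # EICAR test file MD5

@dataclass
class Event:
    source: str
    src_ip: str
    file_hash: Optional[str] = None

def triage_alerts(events):
    for event in events:
        if event.src_ip in IOC_IPS:
            yield f"ALERT: traffic involving known-bad IP {event.src_ip} seen on {event.source}"
        if event.file_hash in IOC_HASHES:
            yield f"ALERT: known-bad file hash observed on {event.source}"

events = [
    Event(source="fw-edge-01", src_ip="203.0.113.7"),
    Event(source="laptop-42", src_ip="10.0.0.5", file_hash="44d88612fea8a8f36de82e1278abb02f"),
]
for alert in triage_alerts(events):
    print(alert)
```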

2) Support and maintenance

While not insignificant, the effort to maintain such a platform is worth the risk/reward. Teams have enough support, documentation, training and community to be able to gain value from these solutions.

Not to mention a small cottage industry of professional services to support and make the most of that capability.

3) The data & data models are in a usable state

As mentioned above, I can't begin to tell you how crucial it is for a use case that the data exists and that it exists in a usable state (a little effort here and there notwithstanding). This is a key success factor for the SIEM: for the most part, the data is there to make the use case achievable.

One often-missed success factor for a use case is a good data model. An example is the Common Information Model in Splunk, which takes care of design activities that are otherwise easily overlooked.
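As a rough illustration of what a shared data model buys you, here is a toy normalisation step mapping two hypothetical firewall vendors' field names onto one common set (loosely in the spirit of CIM-style fields like src, dest and action; the mappings are invented, not Splunk's actual CIM definitions).

```python
# Toy sketch: normalise vendor-specific field names onto a shared data model
# so that a single detection rule or query works across both sources.
VENDOR_FIELD_MAPS = {
    "vendor_a": {"SourceIP": "src", "DestIP": "dest", "Verdict": "action"},
    "vendor_b": {"client_addr": "src", "server_addr": "dest", "decision": "action"},
}

def normalise(vendor: str, raw_event: dict) -> dict:
    """Rename a raw event's fields to the shared model, leaving unknown fields as-is."""
    mapping = VENDOR_FIELD_MAPS[vendor]
    return {mapping.get(key, key): value for key, value in raw_event.items()}

a = normalise("vendor_a", {"SourceIP": "10.0.0.1", "DestIP": "203.0.113.7", "Verdict": "blocked"})
b = normalise("vendor_b", {"client_addr": "10.0.0.2", "server_addr": "203.0.113.7", "decision": "allow"})

# One question, one query shape, regardless of vendor:
for event in (a, b):
    print(event["src"], "->", event["dest"], event["action"])
```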

4) Extensibility

Another key success is that you can build on top of what you have rather than having to rebuild existing capabilities: multiple use cases run on the existing data sets, and the community has built additional capabilities (SOAR, UEBA) to further improve the use cases.

5) The right data-store for the right job

Last but not least, the current successful SIEMs are built on top of data stores which meet the user requirements. I remember that over a decade ago this was not the case: there were SIEM products that, while good on paper, would hit technical limits based on the amount of data and data processing requirements thrown at them.

What can we learn from this to avoid common pitfalls?

1) Make sure you have a data product mindset going in

Have a focus on the value you are trying to get out of your data initiatives and start with simple & valuable use cases. Take a product approach (a continuous stream of work focused on the intended value) as opposed to a project approach (a scope- and time-bound effort focused on delivering work, not value).

2) The right team for the job

Have data engineering team members available and committed to the data security products your team wishes to have. Do not go into initiatives without the right technical due diligence and training ahead of time.

Ensure your internal users (e.g. other security teams) have been trained enough on the analytics tools to be able to use the capabilities and build on top of them. Examples include writing their own queries, knowing how to discover what data is available, and, where it makes sense, ingesting their own data.
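As a sketch of the level of self-service to aim for, an analyst on another security team should be able to answer their own question directly against the shared data, something like the following; the engine, export path and columns here are all hypothetical.

```python
# Hypothetical self-service query: an analyst points a local analytics engine
# (DuckDB here) at a parquet export of the findings data and answers their own question.
import duckdb

con = duckdb.connect()
result = con.execute(
    """
    SELECT owning_team, COUNT(*) AS open_findings
    FROM read_parquet('exports/findings/*.parquet')
    WHERE status = 'open'
    GROUP BY owning_team
    ORDER BY open_findings DESC
    """
).fetchall()

for team, open_findings in result:
    print(team, open_findings)
```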

3) Leverage common tooling, platforms and a self-service model

If the company has a data team with a reference architecture for platforms and support functions, it is strongly recommended that you leverage those, as you will be able to reach out to those teams for help if you are using a common set of tools and design paradigms.

Ensure you have the right self-service model for the data platform so that your security teams are not bottlenecked on one or two engineering resources.

Make the data discoverable in your company's data catalogue so that other internal teams can take advantage of it (with the right governance) - for example Internal Audit, Compliance, the Data Privacy Office and others.

4) Use federated querying where possible

While this is a tactical detail, remember that you do not have to copy the data into your own data store; one option is to query the datasets you need at source and have a virtual table/view of the data to process in your data store.

There are a few tools that can do that, such as Apache Drill and Trino.
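As a hedged sketch of what that looks like with Trino's Python client: one query joins a findings table that lives in the lake against an asset-ownership table that stays in its source PostgreSQL database, without copying either. The host, catalogs, schemas and table names below are hypothetical.

```python
# Federated query sketch: one SQL statement spanning two catalogs, so the data
# stays at source and only the joined result comes back to the caller.
import trino

conn = trino.dbapi.connect(
    host="trino.internal.example.com",  # hypothetical coordinator
    port=8080,
    user="security-analytics",
)
cur = conn.cursor()
cur.execute(
    """
    SELECT f.finding_id, f.severity, owners.team
    FROM hive.security_lake.findings AS f
    JOIN postgresql.cmdb.asset_owners AS owners
      ON f.fqdn = owners.fqdn
    WHERE f.status = 'open'
    """
)
for finding_id, severity, team in cur.fetchall():
    print(finding_id, severity, team)
```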

5) Define a definition of ready for use cases - Do your due diligence

There is a saying: "Before you attempt to beat the odds, make sure you can survive the odds beating you". In the case of a security data lake, make sure you set some clear guidelines on what needs to be in place for a security data use case to be viable. A definition of ready might work for your teams, and it limits the number of times you go down the wrong path.

Example:

  • Is the data for this use case generated from automated processes?
  • Is the data available, and is the data quality sufficient for the use case?
  • Have the maintainers of the systems of record committed to the data quality requirements?
  • Have you proven the data model for the use case, to ensure you have the necessary data points to solve it?

6) Self Service & Training - Build a community around it

Finally, make sure you have enough support for your security teams to make the most of these capabilities, and keep them and the customers front and centre. If the data can be queried over SQL, do not assume everyone is proficient; some could also use a refresher. The same goes for BI tools like Tableau: do not assume that if we build it, they will come. Every step of the way counts.


Jonathan Cran

Founder | Product & Engineering Leader

2 yr

good overview of challenges. have talked with a couple security leaders that launched initiatives and eventually just fell back to Splunk or a more traditional SIEM approach. Curious what you think of Amazon's new security lake push (https://docs.aws.amazon.com/security-lake/latest/userguide/what-is-security-lake.html)... does it move the bar? is the built-in normalization helpful?

Craig Saunderson

CISO Advisory | Cyber Security Strategy

2 yr

Great article. A common discussion point I have with customers is; define the right use-cases at the outset (for security, observability, IT Ops etc), this in-turn defines what data you need and helps you understand if you have the data in the right format (obviously Splunk Common Information model can help with that). Importantly, having a process to sustain this approach (i.e. regular review boards etc), is important to keep things under control. Similarly, for security specifically, using things like Splunk's free Security Essentials app can help you test, track and monitor how your use-cases will perform, do you have the right data and can you align to frameworks such as Mitre (https://www.splunk.com/en_us/blog/security/using-mitre-att-ck-in-splunk-security-essentials.html). You might also be interested in the OCSF (https://docs.aws.amazon.com/security-lake/latest/userguide/open-cybersecurity-schema-framework.html).

Marc Van-De-Cappelle

Security Misfit at AWS & ISO 3103 Specialist

2 yr

This covers so many topics which need careful consideration when investing in data. Great article.

Tony Turner

VP Product - Frenos | Security Architect to Critical Infrastructure | Cyber Informed Engineering | Author | SANS SEC547 Defending Product Supply Chains Instructor

2 yr

Usually the data lake winds up being a garbage heap. Or a swamp filled with sadness. I think I lost my horse in that swamp. Oh wait, that was a movie.

Steve Springett

Software Supply Chain, Security Leader, Community Builder, Chair of CycloneDX SBOM Standard, Chair Ecma TC54, OWASP Global Board of Directors

2 yr

I like to bring in all the data from various security tools, defect trackers, threat models, version control, application portfolio management, security champion directory, security training platforms, PSIRT, etc., into a self-service platform like DOMO that offers data lake, ETL, and dynamic dashboards. But before I do any of that, it starts with the questions I'd like answers to.
