How Not To Spend Half a Million Dollars on Logs


by Cliff Crosland


A month ago, we started playing with a fun data set to push the limits of our product some more. We decided to dial it up to 11 and indexed a data set of 100 billion synthetic AWS CloudTrail log events with a cumulative size of 250TB.

Why 250TB? This is the data set size at which standard log tools start to cost more than half a million dollars per year. Ouch. We wanted to see how much Scanner could beat that price, how easy it really was to onboard, and whether Scanner’s log search remained fast at that scale.

To get more detailed pricing estimates for standard log tools, let's assume we ingest the 250TB data set over one year, which works out to roughly 700GB/day on average. Given this volume, here are some estimates of how much the standard tools would cost:


Azure Sentinel

  • Committing to 700GB/day would put the unit cost somewhere between $2.87/GB and $2.93/GB, so the annual cost would be between $733,285 and $748,615.

IBM QRadar

  • Pricing starts at $2.14 per GB, so ingesting 700GB/day would cost $546,770 annually.

Splunk Cloud

  • It’s not very easy to find pricing for Splunk, but according to the AWS Marketplace, an ingestion volume of 100GB/day costs $80,000 per year. At this rate, 700GB/day would cost $560,000 per year. The unit price per GB might be a little less at this higher scale, but the total price is probably close to half a million dollars.

Datadog

  • Ingestion is priced at $0.10 per GB, and indexing is priced at $2.50 per 1 million events stored in the index. Ingestion isn't too expensive: 250TB costs $25,600. But indexing all 100 billion events would cost $250,000 per month, or $3M per year. That's a crazy high cost, so let's say we decide to keep only 20% of the logs in the index, sacrificing fast search and visibility for the other 80% of our logs. That would still cost $600,000 per year.


Across the board, you’re looking at an annual log bill above half a million dollars.
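As a sanity check on the arithmetic above, here's a small back-of-the-envelope sketch in Python using the list prices quoted in this post (public list prices, not negotiated quotes):

  # Back-of-the-envelope annual cost estimates for ~700GB/day (about 250TB/year).
  # Unit prices are the list prices quoted above; real negotiated quotes will vary.
  DAILY_GB = 700
  DAYS_PER_YEAR = 365

  per_gb_prices = {
      "Azure Sentinel (low)": 2.87,
      "Azure Sentinel (high)": 2.93,
      "IBM QRadar": 2.14,
  }
  for tool, price in per_gb_prices.items():
      print(f"{tool}: ~${DAILY_GB * DAYS_PER_YEAR * price:,.0f}/year")

  # Datadog prices ingestion and indexing separately:
  # $0.10/GB ingested plus $2.50 per million events stored in the index per month.
  ingest_cost = 250 * 1024 * 0.10              # ~$25,600 for 250TB ingested
  index_cost_per_month = (100e9 / 1e6) * 2.50  # $250,000/month for 100B indexed events
  print(f"Datadog: ~${ingest_cost:,.0f} ingest + ~${index_cost_per_month * 12:,.0f}/year indexing")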

How does Scanner’s cost compare? By indexing 250TB of logs into Scanner, we were able to answer these core questions:

  • Cost: Will it cost a fortune to use Scanner to index 250TB of logs? Or will it cost materially less than the industry standard – half a million dollars?
  • Onboarding ease: Will it be easy, or painful, to onboard 250TB of logs? Does it require additional data engineering work, or can Scanner just handle the raw files?
  • Speed: Will Scanner stay fast, or will it lag, if we search for indicators of compromise (like malicious IP addresses) on 250TB of logs?


Cost: 80-90% less than standard tools.

AWS CloudTrail logs are a useful data source for security teams because they capture essentially all of your organization’s AWS API calls. The challenge teams frequently face, though, is the extremely high volume: at enterprise scale, it’s fairly common to see CloudTrail log volume approach 1TB/day, and as mentioned above, supporting that in standard log tools will likely cost at least half a million dollars per year.

During our indexing experiment, we validated that our infrastructure usage costs stayed extremely low. Here are some data points:


Throughput

  • It took about 3 days to index 250TB, or roughly 83TB/day. We learned enough along the way to set things up so a future run would finish in 1 day, but 3 days was plenty fast for this experiment.

Storage

  • Reads: There were 60 million S3 read operations. The read volume was 21TB of compressed data, which was 250TB uncompressed.
  • Writes: There were roughly 100 million S3 write operations to create and merge Scanner index files. By the end, the number of index files converged to slightly less than 100 thousand, and they consumed about 36TB of S3 storage space.

Database

  • MySQL RDS Aurora Serverless V2: We ran less than a dozen ACUs of compute at peak. This database managed all of the metadata about the S3 files in play.

Compute

  • ECS Fargate and Spot compute: I’ll be a little vague here so as not to give away too much of our internal cost information. Let’s just say that, while competitors probably use thousands or tens of thousands of vCPUs to ingest 83TB of logs per day, we needed to use… fewer. A lot fewer. We were pleasantly surprised at how few vCPUs we needed.
  • Detection rules: From this and other experiments, we learned that running a large number of detection rules (1,000+) on incoming log data increases the amount of required compute somewhat, but not dramatically.


What is the upshot of this usage data? Scanner’s infrastructure remained cost-effective at a meaningfully large scale, especially the indexing compute that we’re being a little secretive about.

We determined that, depending on the features you need to enable and the scale of your detection rule set, Scanner’s customer-facing price can be comfortably in the range of $50k to $100k for this 250TB data set. In other words, 80-90% less than standard tools.

To me, this feels like the price that logs should always have had. A log tool should augment your team, not cost more than the team it’s meant to support. It also seems right that a log tool should charge a handful of dimes per gigabyte, not a handful of dollars.

Onboarding ease: Trivial if your logs are in S3. A little bit of work if they aren’t.

For this 250TB synthetic CloudTrail data set, we created the same sort of file structure that AWS CloudTrail does natively. In particular, we created about 60 million S3 objects in an S3 bucket, with the key structure looking like this:
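  AWSLogs/123456789012/CloudTrail/us-east-1/2023/06/15/123456789012_CloudTrail_us-east-1_20230615T0000Z_a1b2c3d4.json.gz

(The account ID, date, and file hash above are placeholders; the path otherwise follows the AWSLogs/<account-id>/CloudTrail/<region>/<year>/<month>/<day>/ layout that CloudTrail uses when it delivers logs to S3.)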

We also tried onboarding this data set into a few other tools that have native S3 integrations, and we experienced varying degrees of success and pain:

Amazon Athena

  • It required a few attempts to get the SQL schema definition right, but it was not too bad. It took maybe 30 minutes of reading docs, defining a schema, running some queries, discovering that the schema was slightly out of date and needed some tweaks, iterating, etc. Fairly straightforward to onboard.

Snowflake

  • This required a surprisingly large amount of data engineering work (about a day of one engineer’s time), and we decided to bail after getting just 15TB of the 250TB data set into Snowflake.
  • The JSON files were too large for Snowflake to parse into memory, so we had to write code to parse each CloudTrail file and re-write it as a newline-delimited JSON file, with one log event per row (a minimal sketch of that conversion appears after this list).
  • Also, the 60 million S3 keys overwhelmed Snowflake, which requires that the full list of S3 key strings consume no more than 1GB of memory. We had to configure Snowflake to import a small subset of the keys to work around this limitation.
  • Given these headaches, and the mounting cost of loading the data into Snowflake, we decided to import just 15TB of logs instead of the full 250TB. That still sufficed for the purposes of our experiment.
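The conversion itself is small. Here is a minimal sketch of the idea (file names are placeholders, and this runs on our side, where a single file fits comfortably in memory):

  # Minimal sketch: split a gzipped CloudTrail file (one large JSON object with a
  # "Records" array) into newline-delimited JSON, one log event per line, so that
  # Snowflake can load it row by row. File names below are placeholders.
  import gzip
  import json

  def cloudtrail_to_ndjson(src_path: str, dst_path: str) -> None:
      with gzip.open(src_path, "rt") as src:
          records = json.load(src)["Records"]  # CloudTrail wraps events in a "Records" array
      with gzip.open(dst_path, "wt") as dst:
          for event in records:
              dst.write(json.dumps(event) + "\n")

  cloudtrail_to_ndjson("cloudtrail_source_file.json.gz", "cloudtrail_ndjson_output.json.gz")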


Onboarding logs into Scanner was about as easy as onboarding logs into Athena, and far easier than onboarding logs into Snowflake. Here are the steps we took in Scanner:

  • We created an IAM role in the AWS account with the proper permissions and gave Scanner permission to assume the role. It’s straightforward to do with CloudFormation, Terraform, or Pulumi, and the process is described in our docs (for illustration, a minimal boto3 sketch of this step appears after this list).
  • Then, we configured an import match rule in Scanner to read from this bucket, use gzip decompression, and parse the JSON data. No need to set up any table schemas – Scanner could index the semi-structured data files in their raw form.
  • Finally, we kicked off indexing. Scanner listed all the files in the bucket and enqueued them for indexing, which ran for 3 days.
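Here is that boto3 sketch of the IAM role step. The Scanner account ID, external ID, bucket name, and the exact permissions are placeholders; the real values and the full policy come from Scanner’s docs and your own environment:

  # Minimal sketch of a cross-account IAM role that a third party (here, Scanner)
  # can assume to read logs from an S3 bucket. All IDs and names are placeholders.
  import json
  import boto3

  iam = boto3.client("iam")

  SCANNER_ACCOUNT_ID = "111122223333"      # placeholder
  EXTERNAL_ID = "example-external-id"      # placeholder
  LOG_BUCKET = "my-cloudtrail-log-bucket"  # placeholder

  trust_policy = {
      "Version": "2012-10-17",
      "Statement": [{
          "Effect": "Allow",
          "Principal": {"AWS": f"arn:aws:iam::{SCANNER_ACCOUNT_ID}:root"},
          "Action": "sts:AssumeRole",
          "Condition": {"StringEquals": {"sts:ExternalId": EXTERNAL_ID}},
      }],
  }
  read_policy = {
      "Version": "2012-10-17",
      "Statement": [{
          "Effect": "Allow",
          "Action": ["s3:GetObject", "s3:ListBucket"],
          "Resource": [f"arn:aws:s3:::{LOG_BUCKET}", f"arn:aws:s3:::{LOG_BUCKET}/*"],
      }],
  }

  role = iam.create_role(
      RoleName="scanner-log-reader",
      AssumeRolePolicyDocument=json.dumps(trust_policy),
  )
  iam.put_role_policy(
      RoleName="scanner-log-reader",
      PolicyName="scanner-s3-read",
      PolicyDocument=json.dumps(read_policy),
  )
  print(role["Role"]["Arn"])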


In all, setting up the integration and kicking it off took maybe 10 minutes, so for logs like CloudTrail that are already in S3, the onboarding process in Scanner is essentially trivial.

If you don’t have logs in S3 already, you will need to do more work. Thankfully, many security-related data sources already have integrations to push logs to S3. So at least for these tools, onboarding onto Scanner is easy. Here are some examples of log sources that have built-in integrations to push logs to S3 buckets:

  • AWS CloudTrail, S3 access, VPC flow
  • Cloudflare
  • Crowdstrike FDR
  • Fastly
  • GitHub Audit
  • Jamf
  • Okta
  • VMware Carbon Black Cloud
  • And more…


For logs that aren’t in S3 already, we recommend trying out Cribl and Vector.dev. Over time, we want to make it easier for Scanner to automatically pull from the most frequently used security log sources, which often have export API mechanisms. But for now, you will still need to do some work to get such logs into S3.

Speed: We started playing with our logs in a whole new way

When it comes to speed, it’s probably best to let the product speak for itself. In this video, we used Scanner to query the 250TB synthetic CloudTrail data set, and for the sake of comparison, we also ran queries in Amazon Athena and Snowflake.

Why only test the data set in Scanner, Athena, and Snowflake, and not also in those standard tools mentioned earlier, i.e. Azure Sentinel, QRadar, Splunk, and Datadog? It’s pretty simple – do you have a few million dollars lying around to burn?


Demo Video: Scanner (250TB) vs. Athena (250TB) and Snowflake (15TB)



Here are some of the highlights from the video:

We ran a query in Athena to look for indicators of compromise on 250TB of logs (a sketch of the general query shape appears after these notes).

  • The query did not finish before the end of the video.
  • Not shown in the video: the query eventually timed out after 60 minutes and cost a few hundred dollars. So don’t try this at home.
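For context, the Athena query was of roughly this shape, submitted via boto3. The database, table, column names, result bucket, and IP list below are hypothetical stand-ins, not the exact ones from the video:

  # Hypothetical sketch of the kind of IOC query we ran in Athena, submitted via boto3.
  # Database, table, columns, output bucket, and the IP list are placeholders.
  import boto3

  athena = boto3.client("athena")

  MALICIOUS_IPS = "'198.51.100.7', '203.0.113.42'"  # placeholder IOC list

  query = f"""
  SELECT eventtime, eventsource, eventname, sourceipaddress, useridentity.arn
  FROM cloudtrail_logs
  WHERE sourceipaddress IN ({MALICIOUS_IPS})
  """

  response = athena.start_query_execution(
      QueryString=query,
      QueryExecutionContext={"Database": "security_logs"},
      ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
  )
  print(response["QueryExecutionId"])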

We ran a query on Snowflake to find the same indicators of compromise in a much smaller data set: only 15TB.

  • The query did not finish before the end of the video.
  • Admittedly, the query would have finished faster if we had used a far larger warehouse, like a 3XL, but running that warehouse is 64 times more expensive: a 3XL burns 64 Snowflake credits per hour, which works out to roughly $128 to $192 per hour at typical credit prices.

Then in Scanner, we ran several queries, like:

  • Searching for indicators of compromise, specifically a set of malicious IP addresses, over all 250TB of logs. Finished in just 22.8 seconds.
  • Searching for all activity from a specific AWS access key over all 250TB of logs. Finished in just 16.8 seconds.
  • Searching for the top IAM users who made AWS API calls over a 12 hour time window.
  • Searching for all activity for an outlier IAM user, computing counts aggregated by event source and name. Found that the user was running many DeleteObject S3 operations.
  • Querying to see which S3 buckets an outlier IAM user was deleting from.


Since Scanner is so fast, we found ourselves playing with this data set in new ways.

Given that it’s actually feasible to look for indicators of compromise across 1 year of logs in Scanner, not just the last 90 days, we could explore the full data set without heavy up-front planning.

For example, doing a big historical log scan in Amazon Athena to find indicators of compromise would have probably required a few days of planning and maybe one week of execution, job babysitting, and fixing bugs.

In Scanner, scanning the full data set can be done in one sitting in an unplanned, ad-hoc way. While you’re in there, if some suspicious data catches your eye, you can keep playing around with the data and investigate, testing hypotheses rapidly.


This is how 250TB should feel. Light as a feather.

Typically, once you start to generate close to 1TB of logs per day, managing logs starts to feel… well, heavy. The financial burden starts to become serious, and annual costs reach $500k to $1M. You need to start twisting your log pipelines into knots to filter out data to keep costs under control. You likely lose visibility into time ranges older than 90 days.

This is actually a pretty silly state of affairs, and it’s caused by the fact that standard log search tools still use an architecture designed for the on-premises era, running indexing clusters that couple storage and compute together on each machine. That coupled design is an order of magnitude more expensive to run and scale.

By leveraging a better architecture with storage and compute decoupled, Scanner:

  • Reduces log costs by 80-90%.
  • Makes log onboarding easy, minimizing data engineering work and naturally supporting semi-structured data with flexible schemas.
  • Rapidly executes search queries, like finding a set of indicators of compromise in 250TB of logs in less than 30 seconds.


If you’re reading this and think that Scanner could help out with your logging projects, we’d love to chat and see if it works for your use cases. You can visit us at our website and reach out to book a demo. Thanks!

