How We Saved 20x Storage Costs by Fixing Our AWS S3 Glacier Implementation: A Deep Dive

Managing IoT data at scale is quite the adventure - from device registration to storage, processing, and historical analysis, each step brings its own set of challenges. Recently, our team dove into a storage optimization project that started as a simple cost-cutting exercise but ended up teaching us valuable lessons about AWS S3 storage management.

Quick disclaimer: This was very much a "fix-it" operation. Sure, we could have designed things better from the start, but let's be honest - sometimes you inherit decisions or make choices that need revisiting. Here's our story of turning a storage mess into a win for our infrastructure.

The Architecture

Our initial setup was straightforward:

  • Hundreds of IoT devices streaming sensor data (temperature, humidity, noise levels) to an AWS IoT endpoint over MQTT
  • An AWS IoT rule routing each payload (100-300 bytes) to:

a. A Kinesis Data Stream for real-time processing

b. An S3 bucket for long-term data retention, where each payload became a single S3 object

  • A bucket lifecycle policy that automatically transitioned objects older than 90 days to Glacier Flexible Retrieval - a decision that would later prove problematic

The Wake-Up Call

The true cost of our storage decisions remained hidden among our overall AWS spending until we needed to integrate historical data with our reporting system. Like many teams, we turned to Amazon Athena for these ad-hoc queries. That's when things got interesting.

A deep dive into AWS Cost Explorer revealed alarming spikes in S3 costs during Athena queries. The culprits? Request-Tier2 and Request-Tier3 charges. But the real shock came when we examined our bucket metrics:

  • GlacierObjectOverhead: ~57 GB
  • GlacierS3ObjectOverhead: ~13.3 GB
  • StandardStorage: ~2.7 GB
  • GlacierStorage: ~666 MB

Talk about bad storage management

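To put this in perspective: we were storing about 3.4 GB of actual data (2.7 GB in Standard plus 666 MB in Glacier) under roughly 70 GB of overhead, a staggering 20x. Here's why: every object archived to Glacier carries two extra charges: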

a. 32 KB of index/metadata storage (charged at Glacier rates)

b. 8 KB of S3 metadata storage (charged at Standard rates)

The math was painful: with approximately 10 million objects, we were:

  1. Paying for massive metadata overhead
  2. Incurring frequent Glacier retrieval costs
  3. Getting zero benefit from Glacier storage due to our small object sizes
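The arithmetic behind those numbers can be sketched in a few lines of Python. This is an illustration built only from the figures quoted above (the 32 KB and 8 KB per-object overheads and the bucket metrics); the function and variable names are ours, and no prices are assumed.

```python
KB = 1024

# Per-object overhead for archived objects, as listed above.
GLACIER_INDEX_BYTES = 32 * KB  # billed at Glacier rates
S3_METADATA_BYTES = 8 * KB     # billed at Standard rates


def billed_bytes(payload_bytes: int) -> dict:
    """Bytes billed for one object, depending on where it lives."""
    return {
        # Object left in Standard: only the payload is billed.
        "standard_only": payload_bytes,
        # Object archived: payload plus index, billed at Glacier rates...
        "archived_at_glacier_rate": payload_bytes + GLACIER_INDEX_BYTES,
        # ...plus S3 metadata, still billed at Standard rates.
        "archived_at_standard_rate": S3_METADATA_BYTES,
    }


def overhead_ratio(overhead_gb: float, data_gb: float) -> float:
    """How many times larger the overhead is than the real data."""
    return overhead_gb / data_gb


# Bucket metrics reported above, in GB.
overhead_gb = 57 + 13.3   # GlacierObjectOverhead + GlacierS3ObjectOverhead
data_gb = 2.7 + 0.666     # StandardStorage + GlacierStorage
ratio = overhead_ratio(overhead_gb, data_gb)

sample = billed_bytes(300)  # one payload at the top of our 100-300 byte range
print(f"overhead / data = {ratio:.1f}x")
print(sample)
```

The per-object view makes the problem obvious without any price list: for a 300-byte payload, the 8 KB billed at Standard rates after archival is already larger than the 300 bytes we would have paid for by leaving the object in Standard.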

Three key realizations emerged:

  1. Our transition to Glacier was unnecessary for our use case
  2. The overhead costs far outweighed any storage savings
  3. We needed to revert everything to Standard storage for our reporting needs

The Solution: A Three-Step Migration

Since AWS doesn't provide a direct path from Glacier to Standard storage, we developed a three-phase approach using several AWS services. Here's how we did it:

Phase 1: Object Discovery

  • Implemented S3 Inventory Reports to catalog our bucket

a. Generated daily reports of all objects and their storage classes

b. Configured CSV format output to a separate bucket

Puzzle piece number one - S3 inventory report

  • Used AWS Glue Crawler to create a table schema
  • Queried the inventory with Athena to identify Glacier objects:

SELECT bucket, key FROM "sensor_inventory"."data" WHERE storage_class = '"GLACIER"';
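An Athena result file is not necessarily ready to use as a Batch Operations manifest: Athena writes a header row, and a CSV manifest is expected to be plain bucket,key rows with URL-encoded keys. The sketch below shows one way to bridge that gap. It assumes the key column is already URL-encoded (as it is in S3 Inventory CSV reports) and that values may still carry literal quote characters, as the query above suggests; the function name and sample values are hypothetical.

```python
import csv
import io


def athena_csv_to_manifest(athena_csv: str) -> str:
    """Convert Athena result CSV text (header + bucket,key rows)
    into headerless bucket,key manifest text."""
    reader = csv.reader(io.StringIO(athena_csv))
    next(reader, None)  # skip the header row that Athena writes

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        # Drop literal quote characters left over from the crawled CSV.
        # Keys are assumed to be URL-encoded already, so they are not
        # encoded a second time here.
        bucket, key = (field.strip('"') for field in row[:2])
        writer.writerow([bucket, key])
    return out.getvalue()


# Hypothetical sample in the shape Athena produces for quoted values.
example = '"bucket","key"\n"""sensor-data""","""2023/01/01/a.json"""\n'
manifest = athena_csv_to_manifest(example)
print(manifest, end="")
```

If your keys are not URL-encoded at this point, encode them before writing the manifest rather than after.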

Phase 2: Glacier Restoration

  • Used the CSV output from our Athena query as input for S3 Batch Operations
  • Initiated bulk restore operations with a 14-day restoration period

a. The extended period gave us buffer time for unexpected issues

b. A temporary copy of each object becomes readable for the restoration period, while the object itself stays in the Glacier storage class

We chose 14 days to give us enough time in case something else popped up
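For readers who prefer scripting the job over the console, the restore job can be sketched as a request for the S3 Control `create_job` API. This is a minimal sketch, not the exact job we ran: the account ID, role ARN, manifest location, priority, and the choice of the `BULK` retrieval tier are all assumptions to adjust for your environment.

```python
import uuid


def build_restore_job(account_id: str, role_arn: str,
                      manifest_arn: str, manifest_etag: str,
                      days: int = 14, tier: str = "BULK") -> dict:
    """Build keyword arguments for s3control.create_job (restore)."""
    return {
        "AccountId": account_id,
        "ConfirmationRequired": True,  # job waits for manual confirmation
        "Operation": {
            "S3InitiateRestoreObject": {
                "ExpirationInDays": days,  # how long restored copies stay
                "GlacierJobTier": tier,    # "BULK" or "STANDARD"
            }
        },
        "Manifest": {
            "Spec": {
                "Format": "S3BatchOperations_CSV_20180820",
                "Fields": ["Bucket", "Key"],
            },
            "Location": {"ObjectArn": manifest_arn, "ETag": manifest_etag},
        },
        "Report": {"Enabled": False},  # enable if you want a completion report
        "Priority": 10,
        "RoleArn": role_arn,
        "ClientRequestToken": str(uuid.uuid4()),
        "Description": f"Restore Glacier objects for {days} days",
    }


def submit(job_request: dict) -> str:
    """Submit the job. Requires boto3 and AWS credentials."""
    import boto3  # third-party; imported lazily so the builder runs without it
    return boto3.client("s3control").create_job(**job_request)["JobId"]
```

Building the request separately from submitting it makes it easy to review the parameters, or to point the same builder at a small test manifest first.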

Phase 3: Storage Class Migration

  • Launched another S3 Batch Operation using the same CSV
  • Configured in-place copy operations:

I. Source: Restored objects

II. Destination: Same bucket/key

III. New storage class: Standard

  • Important note: This process updates the "Last modified date" of all objects. If your application logic depends on this timestamp, plan accordingly.
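To make "in-place copy" concrete, here is what the equivalent request looks like for a single object with the S3 `copy_object` API. We used Batch Operations rather than per-object calls, so treat this as an illustration of the operation, not our tooling; the helper names are hypothetical.

```python
def in_place_copy_args(bucket: str, key: str) -> dict:
    """Keyword arguments for s3.copy_object that rewrite an object
    onto itself with a new storage class."""
    return {
        "Bucket": bucket,  # destination bucket
        "Key": key,        # destination key (same as source)
        "CopySource": {"Bucket": bucket, "Key": key},
        "StorageClass": "STANDARD",   # the actual change being made
        "MetadataDirective": "COPY",  # keep the existing user metadata
    }


def copy_in_place(bucket: str, key: str) -> None:
    """Run the copy. Requires boto3 and AWS credentials, and only
    succeeds while the object's restored copy is still available."""
    import boto3  # third-party; imported lazily
    boto3.client("s3").copy_object(**in_place_copy_args(bucket, key))
```

S3 accepts a copy onto the same key only when something changes, which is why the new storage class is part of the request.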

The Plot Twist: Versioning Complications

Just when we thought we were done, the bucket metrics told a different story - our storage usage hadn't decreased. After some investigation, we discovered an overlooked detail: bucket versioning was enabled. Our copy operations had created new Standard versions while preserving the old Glacier versions as noncurrent. Back to the drawing board!

Fortunately, AWS had already thought of this scenario. The solution was straightforward: S3 Lifecycle Rules. We added a simple rule to clean up noncurrent versions:

Lifecycle rules can take up to 48 hours before they are executed
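A rule of this kind can also be expressed as a lifecycle configuration for the `put_bucket_lifecycle_configuration` API. The sketch below is an approximation of such a rule, not a copy of ours: the rule ID, the bucket-wide empty prefix filter, and the one-day threshold are assumptions.

```python
def noncurrent_cleanup_rule(days: int = 1) -> dict:
    """Lifecycle configuration that expires noncurrent object versions."""
    return {
        "Rules": [
            {
                "ID": "expire-noncurrent-versions",  # hypothetical name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: whole bucket
                "NoncurrentVersionExpiration": {
                    # Days after a version becomes noncurrent
                    # before S3 permanently deletes it.
                    "NoncurrentDays": days,
                },
            }
        ]
    }


def apply_rule(bucket: str, days: int = 1) -> None:
    """Apply the configuration. Requires boto3 and AWS credentials."""
    import boto3  # third-party; imported lazily
    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=noncurrent_cleanup_rule(days),
    )
```

One caveat if you script this: the call replaces the bucket's entire lifecycle configuration, so any existing rules you want to keep have to be included in the same `Rules` list.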

With this final piece in place, our storage optimization was truly complete. The noncurrent Glacier versions were cleaned up automatically, and our metrics finally showed the expected reduction in storage usage.

Lessons Learned

Our storage optimization journey revealed several key insights that might help others avoid similar challenges:

  • Size Matters

i. Small objects (under 128KB) are rarely cost-effective for Glacier storage

ii. AWS now enforces this best practice by default, preventing lifecycle transitions for objects under 128KB unless a rule explicitly overrides the size filter

iii. Always consider object size distribution when planning storage lifecycles

  • Hidden Costs

i. Storage class metadata overhead can dwarf actual storage costs

ii. Each Glacier object carries 40KB of overhead (32KB + 8KB)

iii. Regular cost analysis is crucial, especially as data volumes grow

  • Best Practices

i. Analyze your data patterns before implementing lifecycle rules

ii. Consider consolidating small objects before archival

iii. Document your storage decisions and their rationale

iv. Regularly review AWS's latest service updates and constraints

  • Process Insights

i. S3 Inventory Reports are invaluable for large-scale storage analysis

ii. Always check versioning settings before bulk operations

iii. Plan for sufficient restoration time when working with Glacier

iv. Test your migration process with a small subset first
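As an illustration of the "consolidate small objects before archival" advice, the sketch below groups individual payloads into newline-delimited JSON bundles of at least 128 KB each. It is not something we ran in production; the function name, the threshold default, and the payload fields used in testing are hypothetical.

```python
import json

MIN_BUNDLE_BYTES = 128 * 1024  # stay above the small-object threshold


def bundle_payloads(payloads, min_bytes: int = MIN_BUNDLE_BYTES) -> list:
    """Group payload dicts into newline-delimited JSON bundles.

    Every bundle except possibly the last holds at least min_bytes."""
    bundles, lines, size = [], [], 0
    for payload in payloads:
        line = json.dumps(payload, separators=(",", ":")) + "\n"
        lines.append(line)
        size += len(line.encode("utf-8"))
        if size >= min_bytes:
            # Bundle is large enough: close it and start a new one.
            bundles.append("".join(lines).encode("utf-8"))
            lines, size = [], 0
    if lines:
        # Leftover payloads form a final, possibly smaller, bundle.
        bundles.append("".join(lines).encode("utf-8"))
    return bundles
```

Each bundle would then be written as one S3 object instead of hundreds of tiny ones, which cuts both the per-object overhead and the number of requests a query has to make.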

This experience taught us that while AWS provides powerful storage options, their effective use requires understanding both the technical details and economic implications. What started as a costly oversight became a valuable learning opportunity, leading to better storage management practices across our organization.

Remember: sometimes the best way to learn cloud best practices is to clean up after not following them. As AWS's recent 128KB transition constraint shows, even cloud providers learn and adapt their services based on customer experiences like ours.



