How to optimize (long-term) storage of streamed time series data in the Cloud
Joeri Malmberg
Streamed time series data can come from a variety of sources. Wherever it comes from, it has the most value while it's still fresh. There's still value in storing non-current data, but that value should never be outweighed by the cost of storing it.
So, in this newsletter, I’m going to outline a simple approach for the long-term storage of your streamed time series data.
This way, it remains as cost-effective as possible throughout its lifecycle, without sacrificing performance or functionality.
Getting colder…
Streamed data is typically held in warm storage in AWS Kinesis or Azure Event Hubs as partitioned data. This keeps it available for Consumer APIs, in case of poor connectivity or other disruptions to the data flow.
However, these services only keep data in warm storage for a limited time. So, to retain your non-current data, it needs to be repackaged by an off-loader and sent to a cost-effective cold storage solution.
There are two options for this, depending on whether you’re using Kinesis or Azure.
The best solution for Kinesis: AWS S3
This will come as no great surprise, as S3 is a great solution for data lakes and other cold-storage needs.
Kinesis and AWS S3 work perfectly with each other, using Kinesis Data Firehose as the off-loader.
With this setup, data from Kinesis data streams can automatically flow into the highly scalable S3 storage.
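To make that concrete, here's a minimal boto3 sketch of what that wiring could look like. The stream, bucket, and role ARNs below are placeholders, not a reference setup:

```python
import boto3

firehose = boto3.client("firehose")

# Hypothetical names and ARNs; replace with your own stream, bucket, and IAM roles.
firehose.create_delivery_stream(
    DeliveryStreamName="timeseries-offloader",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:eu-west-1:123456789012:stream/timeseries",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-read-kinesis",
    },
    ExtendedS3DestinationConfiguration={
        "BucketARN": "arn:aws:s3:::my-timeseries-archive",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-write-s3",
        # Buffer incoming records before each object is written to S3.
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
        "CompressionFormat": "GZIP",
    },
)
```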
Encryption:
Encrypting data in S3 is easily done using a key from AWS Key Management Service (KMS), but you might also want to investigate options for client-side encryption.
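For instance, a rough boto3 sketch of setting a customer-managed KMS key as the bucket's default encryption; the bucket name and key alias are made up:

```python
import boto3

s3 = boto3.client("s3")

# Encrypt every new object in the bucket with a customer-managed KMS key by default.
s3.put_bucket_encryption(
    Bucket="my-timeseries-archive",  # hypothetical bucket
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "alias/timeseries-archive-key",  # hypothetical key alias
                },
                "BucketKeyEnabled": True,  # reduces the number of KMS requests (and their cost)
            }
        ]
    },
)
```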
Access management:
Access is easy to manage using S3 bucket policies and IAM, or centrally with AWS Lake Formation if the bucket backs a data lake.
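As an illustration, a sketch of a bucket policy that limits reads to a single analytics role, applied with boto3 (the account ID, role, and bucket names are placeholders):

```python
import json

import boto3

s3 = boto3.client("s3")

# Allow only a dedicated analytics role to read the archived objects.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowAnalyticsRead",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/analytics-reader"},
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::my-timeseries-archive/*",
        }
    ],
}

s3.put_bucket_policy(Bucket="my-timeseries-archive", Policy=json.dumps(policy))
```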
Lifecycle management:
You should use lifecycle policies that account for typical access patterns and archive old data to the most cost-effective solution.
For example, you can transition non-current data to S3 Standard-IA storage for ‘cool storage’, and then to S3 Glacier Flexible Retrieval for ‘cold storage’.
Lifecycle policies can be managed from the S3 console, AWS CLI, AWS SDKs, or with the REST API.
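Here's a rough boto3 sketch of the Standard-IA and Glacier Flexible Retrieval transitions described above; the bucket name and day thresholds are just examples to adapt:

```python
import boto3

s3 = boto3.client("s3")

# Transition non-current data to cheaper tiers as it ages, then expire it.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-timeseries-archive",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "timeseries-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # 'cool storage'
                    {"Days": 90, "StorageClass": "GLACIER"},      # Glacier Flexible Retrieval
                ],
                # Delete after two years; adjust to your own retention requirements.
                "Expiration": {"Days": 730},
            }
        ]
    },
)
```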
The best solution for Event Hubs: Azure blob storage
Event Hubs includes an integrated off-loader, called Event Hubs Capture. This gives you granular control over data capture intervals, and it works perfectly with Azure blob storage.
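As a rough sketch of enabling Capture with the azure-mgmt-eventhub SDK (all resource names below are placeholders, and exact model names may differ between SDK versions):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.eventhub import EventHubManagementClient
from azure.mgmt.eventhub.models import (
    CaptureDescription,
    Destination,
    EncodingCaptureDescription,
    Eventhub,
)

# Hypothetical subscription, resource group, namespace, and hub names.
client = EventHubManagementClient(DefaultAzureCredential(), "<subscription-id>")

client.event_hubs.create_or_update(
    resource_group_name="timeseries-rg",
    namespace_name="timeseries-ns",
    event_hub_name="sensor-readings",
    parameters=Eventhub(
        partition_count=4,
        capture_description=CaptureDescription(
            enabled=True,
            encoding=EncodingCaptureDescription.AVRO,
            interval_in_seconds=300,        # write a capture file at most every 5 minutes...
            size_limit_in_bytes=314572800,  # ...or once roughly 300 MB has accumulated
            destination=Destination(
                name="EventHubArchive.AzureBlockBlob",
                storage_account_resource_id=(
                    "/subscriptions/<subscription-id>/resourceGroups/timeseries-rg"
                    "/providers/Microsoft.Storage/storageAccounts/timeseriesarchive"
                ),
                blob_container="capture",
                archive_name_format=(
                    "{Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}"
                ),
            ),
        ),
    ),
)
```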
Blob storage offers a range of tiers, allowing you to automatically transition data to the most cost-effective solution for your needs. These range from hot to archive (icy cold).
One thing to be aware of: the colder the tier, the more expensive it is to retrieve. This must be balanced against the cost-savings of colder storage options.
Encryption:
Azure blob storage offers automated options for encrypting your data at rest, as well as client-side data encryption.
Access management:
There are many options for access management with blob storage: Shared Key, Shared access signature (SAS), Anonymous read access, Storage Local Users, and Microsoft Entra ID.
Of these, Microsoft encourages the use of Microsoft Entra ID as the most secure option. This can be easily managed with Azure role-based access control (RBAC).
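For example, a minimal sketch of reading archived blobs with Entra ID credentials via azure-identity; the account URL and container name are assumptions:

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Authenticate with Microsoft Entra ID; the RBAC roles assigned on the storage
# account (e.g. "Storage Blob Data Reader") decide what this identity may do.
service = BlobServiceClient(
    account_url="https://timeseriesarchive.blob.core.windows.net",  # hypothetical account
    credential=DefaultAzureCredential(),
)

container = service.get_container_client("capture")
for blob in container.list_blobs(name_starts_with="sensor-readings/"):
    print(blob.name, blob.size)
```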
Lifecycle management:
Blob storage enables you to use lifecycle management policies to transition your data from hot to cold storage, using a variety of rules.
These can transition data based on last access, time since ingestion, last modified, or index tags. You can also use it to delete data (or versions) based on these same criteria.
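A rough sketch of such a policy with the azure-mgmt-storage SDK might look like this (resource names, prefix, and day thresholds are all illustrative):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import (
    DateAfterModification,
    ManagementPolicy,
    ManagementPolicyAction,
    ManagementPolicyBaseBlob,
    ManagementPolicyDefinition,
    ManagementPolicyFilter,
    ManagementPolicyRule,
    ManagementPolicySchema,
)

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Tier captured blobs down as they age, then delete them; the account,
# resource group, prefix, and day thresholds are placeholders.
client.management_policies.create_or_update(
    resource_group_name="timeseries-rg",
    account_name="timeseriesarchive",
    management_policy_name="default",
    properties=ManagementPolicy(
        policy=ManagementPolicySchema(
            rules=[
                ManagementPolicyRule(
                    name="age-out-timeseries",
                    enabled=True,
                    type="Lifecycle",
                    definition=ManagementPolicyDefinition(
                        filters=ManagementPolicyFilter(
                            blob_types=["blockBlob"], prefix_match=["capture/"]
                        ),
                        actions=ManagementPolicyAction(
                            base_blob=ManagementPolicyBaseBlob(
                                tier_to_cool=DateAfterModification(days_after_modification_greater_than=30),
                                tier_to_archive=DateAfterModification(days_after_modification_greater_than=90),
                                delete=DateAfterModification(days_after_modification_greater_than=730),
                            )
                        ),
                    ),
                )
            ]
        )
    ),
)
```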
How to optimize these storage options
Lifecycle rules are the first and most important way of reducing the costs of long-term storage.
These ensure that data is always held in the most efficient tier and never kept longer than necessary. This is important to ensure that costs don't grow out of control.
However, lifecycle rules can be challenging to get right. They must match the use case and access patterns very closely; otherwise, you might be generating unnecessary costs.
Choosing the right file format for the off-loader is also quite impactful. Parquet, for example, is naturally more compact than CSV, and this trims off some of the extra cost too.
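As a quick illustration, here's a small sketch that writes the same records as compressed Parquet instead of CSV; the schema is invented for the example:

```python
import pandas as pd  # to_parquet needs pyarrow (or fastparquet) installed

# A small batch of time series records as they might arrive from the stream;
# the column names are made up for illustration.
records = pd.DataFrame(
    {
        "sensor_id": ["a1", "a1", "b2"],
        "timestamp": pd.to_datetime(
            ["2024-01-01T00:00Z", "2024-01-01T00:01Z", "2024-01-01T00:01Z"]
        ),
        "value": [21.4, 21.6, 19.8],
    }
)

# Columnar, typed, and compressed: typically much smaller than the CSV equivalent.
records.to_parquet("batch.parquet", compression="snappy")
records.to_csv("batch.csv", index=False)  # written only for size comparison
```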
And don’t forget about your analytics streams – these can reduce your storage needs by eliminating unnecessary data and summarizing data from multiple sources.
To get the most value from your analytics stream, you need to look at how your historical data is used, and if there are any opportunities to reduce storage needs in line with this.
For example, if Consumer APIs only use a particular data set to calculate an average value or specific insights, then this is all they need. You don’t need to store all the source data for this.
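For example, a sketch of pre-aggregating raw readings into hourly averages before they hit cold storage (again, the schema and file names are assumed):

```python
import pandas as pd

# Raw readings as delivered by the stream; schema assumed for illustration.
raw = pd.read_parquet("batch.parquet")

# Keep only what the consumers actually need: one average per sensor per hour.
hourly = (
    raw.set_index("timestamp")
    .groupby("sensor_id")["value"]
    .resample("1h")
    .mean()
    .reset_index()
)

# Archive the summary instead of (or alongside a shorter-lived copy of) the raw data.
hourly.to_parquet("hourly_averages.parquet", compression="snappy")
```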
Want to dive deeper?
All the above will help keep your data streams efficient and cost-effective.
However, depending on your use case, there may be other solutions that can greatly enhance the profitability and scalability of your cloud.
Join our upcoming ‘Tech Talks’ on LinkedIn Live this Thursday to learn more about storing streamed time series data: https://www.dhirubhai.net/events/linkedinlive-storingtimeseriesd7229019167423639553/about/