How to optimize (long-term) storage of streamed time series data in the Cloud
Joeri Malmberg
Streamed time series data can come from a variety of sources. Wherever it comes from, it has the most value while it's still fresh. There's still value in storing non-current data, but that value should never be outweighed by the cost of storing it.
So, in this newsletter, I’m going to outline a simple approach for the long-term storage of your streamed time series data.
This way, it remains as cost-effective as possible throughout its lifecycle, without sacrificing performance or functionality.
Getting colder…
Streamed data is typically held in warm storage in AWS Kinesis or Azure Event Hubs as partitioned data. This keeps it available for Consumer APIs, in case of poor connectivity or other disruptions to the data flow.
However, these services only keep data in warm storage for a limited time. So, to retain your non-current data, it needs to be repackaged by an off-loader and sent to a cost-effective cold storage solution.
There are two options for this, depending on whether you’re using Kinesis or Azure.
The best solution for Kinesis: AWS S3
This will come as no great surprise, as S3 is a great solution for data lakes and other cold-storage needs.
Kinesis and AWS S3 work perfectly with each other, using Kinesis Data Firehose as the off-loader.
With this setup, data from Kinesis data streams can automatically flow into the highly scalable S3 storage.
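To make that concrete, here's a minimal boto3 sketch of what that wiring could look like. The stream, bucket, and role ARNs below are placeholders, not a reference setup:

```python
import boto3

firehose = boto3.client("firehose")

# Hypothetical names and ARNs; replace with your own stream, bucket, and IAM roles.
firehose.create_delivery_stream(
    DeliveryStreamName="timeseries-offloader",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:eu-west-1:123456789012:stream/timeseries",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-read-kinesis",
    },
    ExtendedS3DestinationConfiguration={
        "BucketARN": "arn:aws:s3:::my-timeseries-archive",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-write-s3",
        # Buffer incoming records before each object is written to S3.
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
        "CompressionFormat": "GZIP",
    },
)
```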
Encryption:
Encrypting data in S3 is easily done using a key from AWS Key Management Service (KMS), but you might also want to investigate options for client-side encryption.
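For instance, a rough boto3 sketch of setting a customer-managed KMS key as the bucket's default encryption; the bucket name and key alias are made up:

```python
import boto3

s3 = boto3.client("s3")

# Encrypt every new object in the bucket with a customer-managed KMS key by default.
s3.put_bucket_encryption(
    Bucket="my-timeseries-archive",  # hypothetical bucket
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "alias/timeseries-archive-key",  # hypothetical key alias
                },
                "BucketKeyEnabled": True,  # reduces the number of KMS requests (and their cost)
            }
        ]
    },
)
```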
Access management:
Access is easy to manage using S3 bucket policies and IAM, or centrally with AWS Lake Formation if the bucket backs a data lake.
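As an illustration, a sketch of a bucket policy that limits reads to a single analytics role, applied with boto3 (the account ID, role, and bucket names are placeholders):

```python
import json

import boto3

s3 = boto3.client("s3")

# Allow only a dedicated analytics role to read the archived objects.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowAnalyticsRead",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/analytics-reader"},
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::my-timeseries-archive/*",
        }
    ],
}

s3.put_bucket_policy(Bucket="my-timeseries-archive", Policy=json.dumps(policy))
```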
Lifecycle management:
You should use lifecycle policies that account for typical access patterns and archive old data to the most cost-effective solution.
For example, you can transition non-current data to S3 Standard-IA storage for ‘cool storage’, and then to S3 Glacier Flexible Retrieval for ‘cold storage’.
Lifecycle policies can be managed from the S3 console, AWS CLI, AWS SDKs, or with the REST API.
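Here's a rough boto3 sketch of the Standard-IA and Glacier Flexible Retrieval transitions described above; the bucket name and day thresholds are just examples to adapt:

```python
import boto3

s3 = boto3.client("s3")

# Transition non-current data to cheaper tiers as it ages, then expire it.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-timeseries-archive",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "timeseries-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # 'cool storage'
                    {"Days": 90, "StorageClass": "GLACIER"},      # Glacier Flexible Retrieval
                ],
                # Delete after two years; adjust to your own retention requirements.
                "Expiration": {"Days": 730},
            }
        ]
    },
)
```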
The best solution for Event Hubs: Azure blob storage
Event Hubs includes an integrated off-loader, called Event Hubs Capture. This gives you granular control over data capture intervals, and it works perfectly with Azure blob storage.
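As a rough sketch of enabling Capture with the azure-mgmt-eventhub SDK (all resource names below are placeholders, and exact model names may differ between SDK versions):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.eventhub import EventHubManagementClient
from azure.mgmt.eventhub.models import (
    CaptureDescription,
    Destination,
    EncodingCaptureDescription,
    Eventhub,
)

# Hypothetical subscription, resource group, namespace, and hub names.
client = EventHubManagementClient(DefaultAzureCredential(), "<subscription-id>")

client.event_hubs.create_or_update(
    resource_group_name="timeseries-rg",
    namespace_name="timeseries-ns",
    event_hub_name="sensor-readings",
    parameters=Eventhub(
        partition_count=4,
        capture_description=CaptureDescription(
            enabled=True,
            encoding=EncodingCaptureDescription.AVRO,
            interval_in_seconds=300,        # write a capture file at most every 5 minutes...
            size_limit_in_bytes=314572800,  # ...or once roughly 300 MB has accumulated
            destination=Destination(
                name="EventHubArchive.AzureBlockBlob",
                storage_account_resource_id=(
                    "/subscriptions/<subscription-id>/resourceGroups/timeseries-rg"
                    "/providers/Microsoft.Storage/storageAccounts/timeseriesarchive"
                ),
                blob_container="capture",
                archive_name_format=(
                    "{Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}"
                ),
            ),
        ),
    ),
)
```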
Blob storage offers a range of tiers, allowing you to automatically transition data to the most cost-effective solution for your needs. These range from hot to archive (icy cold).
One thing to be aware of: the colder the tier, the more expensive it is to retrieve. This must be balanced against the cost-savings of colder storage options.
Encryption:
Azure blob storage offers automated options for encrypting your data at rest, as well as client-side data encryption.
Access management:
There are many options for access management with blob storage: Shared Key, Shared access signature (SAS), Anonymous read access, Storage Local Users, and Microsoft Entra ID.
Of these, Microsoft encourages the use of Microsoft Entra ID as the most secure option. This can be easily managed with Azure role-based access control (RBAC).
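For example, a minimal sketch of reading archived blobs with Entra ID credentials via azure-identity; the account URL and container name are assumptions:

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Authenticate with Microsoft Entra ID; the RBAC roles assigned on the storage
# account (e.g. "Storage Blob Data Reader") decide what this identity may do.
service = BlobServiceClient(
    account_url="https://timeseriesarchive.blob.core.windows.net",  # hypothetical account
    credential=DefaultAzureCredential(),
)

container = service.get_container_client("capture")
for blob in container.list_blobs(name_starts_with="sensor-readings/"):
    print(blob.name, blob.size)
```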
Lifecycle management:
Blob storage enables you to use lifecycle management policies to transition your data from hot to cold storage, using a variety of rules.
These can transition data based on last access, time since ingestion, last modified, or index tags. You can also use it to delete data (or versions) based on these same criteria.
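A rough sketch of such a policy with the azure-mgmt-storage SDK might look like this (resource names, prefix, and day thresholds are all illustrative):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import (
    DateAfterModification,
    ManagementPolicy,
    ManagementPolicyAction,
    ManagementPolicyBaseBlob,
    ManagementPolicyDefinition,
    ManagementPolicyFilter,
    ManagementPolicyRule,
    ManagementPolicySchema,
)

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Tier captured blobs down as they age, then delete them; the account,
# resource group, prefix, and day thresholds are placeholders.
client.management_policies.create_or_update(
    resource_group_name="timeseries-rg",
    account_name="timeseriesarchive",
    management_policy_name="default",
    properties=ManagementPolicy(
        policy=ManagementPolicySchema(
            rules=[
                ManagementPolicyRule(
                    name="age-out-timeseries",
                    enabled=True,
                    type="Lifecycle",
                    definition=ManagementPolicyDefinition(
                        filters=ManagementPolicyFilter(
                            blob_types=["blockBlob"], prefix_match=["capture/"]
                        ),
                        actions=ManagementPolicyAction(
                            base_blob=ManagementPolicyBaseBlob(
                                tier_to_cool=DateAfterModification(days_after_modification_greater_than=30),
                                tier_to_archive=DateAfterModification(days_after_modification_greater_than=90),
                                delete=DateAfterModification(days_after_modification_greater_than=730),
                            )
                        ),
                    ),
                )
            ]
        )
    ),
)
```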
How to optimize these storage options
Lifecycle rules are the first and most important way of reducing the costs of long-term storage.
These ensure that data is always held in the most efficient tier and never kept longer than necessary. This is important to ensure that costs don't grow out of control.
However, lifecycle rules can be challenging to get right. They must match the use case and access patterns very closely; otherwise, you might be generating unnecessary costs.
Choosing the right file format for the off-loader is also quite impactful. Parquet, for example, is naturally more compact than CSV, and this trims off some of the extra cost too.
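As a quick illustration, here's a small sketch that writes the same records as compressed Parquet instead of CSV; the schema is invented for the example:

```python
import pandas as pd  # to_parquet needs pyarrow (or fastparquet) installed

# A small batch of time series records as they might arrive from the stream;
# the column names are made up for illustration.
records = pd.DataFrame(
    {
        "sensor_id": ["a1", "a1", "b2"],
        "timestamp": pd.to_datetime(
            ["2024-01-01T00:00Z", "2024-01-01T00:01Z", "2024-01-01T00:01Z"]
        ),
        "value": [21.4, 21.6, 19.8],
    }
)

# Columnar, typed, and compressed: typically much smaller than the CSV equivalent.
records.to_parquet("batch.parquet", compression="snappy")
records.to_csv("batch.csv", index=False)  # written only for size comparison
```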
And don’t forget about your analytics streams – these can reduce your storage needs by eliminating unnecessary data and summarizing data from multiple sources.
To get the most value from your analytics stream, you need to look at how your historical data is used, and if there are any opportunities to reduce storage needs in line with this.
For example, if Consumer APIs only use a particular data set to calculate an average value or specific insights, then this is all they need. You don’t need to store all the source data for this.
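For example, a sketch of pre-aggregating raw readings into hourly averages before they hit cold storage (again, the schema and file names are assumed):

```python
import pandas as pd

# Raw readings as delivered by the stream; schema assumed for illustration.
raw = pd.read_parquet("batch.parquet")

# Keep only what the consumers actually need: one average per sensor per hour.
hourly = (
    raw.set_index("timestamp")
    .groupby("sensor_id")["value"]
    .resample("1h")
    .mean()
    .reset_index()
)

# Archive the summary instead of (or alongside a shorter-lived copy of) the raw data.
hourly.to_parquet("hourly_averages.parquet", compression="snappy")
```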
Want to dive deeper?
All the above will help keep your data streams efficient and cost-effective.
However, depending on your use case, there may be other solutions that can greatly enhance the profitability and scalability of your cloud.
Join our upcoming ‘Tech Talks’ on LinkedIn Live this Thursday to learn more about storing streamed time series data: https://www.dhirubhai.net/events/linkedinlive-storingtimeseriesd7229019167423639553/about/