Time-series databases for industry
David Degerfeldt
Principal Solutions Architect in Artificial Intelligence at Ductus
For a long time now, we've heard that "data is the new gold". This sounds a bit cheesy, and becomes downright funny when applied to the mining industry. Nevertheless, there is some truth to the statement. Today, it is generally accepted that accurate and accessible data is the foundation for building data-driven solutions, with AI as a particular driver. But how should it be realized?
In this post, I intend to apply my experience from the mining industry to the challenge of choosing time-series databases for industrial applications. My hope is to provide guidance on a strategy and what to consider when deciding which database to choose for your application. There will be no "this is the best database", as there is no easy answer to something that is complicated and specific to each use-case.
Background
Industries focus on their core business, which never includes operating distributed data centers with clustered database solutions. This is natural and probably well prioritized. IT solutions are thus viewed as support systems, and each department obtains its own: sales has their CRM, operations use a MES/SCADA, shipping relies on the LMS, maintenance uses a CMMS, and so on. This approach inevitably leads to a situation where business data is dispersed and only available within each system.
In the process industry, the MES/SCADA systems are central to the business, and it is common to find historians capable of storing large amounts of process data for analysis and reporting. Examples include AVEVA PI (formerly OSIsoft PI) and AspenTech InfoPlus.21.
The digitalization movement, which tries to realize the data-is-gold one-liner, is the dominant way industries currently try to improve efficiency and grow. This is often headlined Industry 4.0 or digital transformation. Essentially, the basic idea is to take data from several systems, combine it, and perform more advanced analytics in order to automate more tasks and make better decisions.
In practice, implementing this is not easy. The different systems come from different vendors, often lack APIs, and may not even store the data required for the new application. Furthermore, the systems often have drastically different data models. Process data is typically time-series data sampled at regular intervals, while economy and procurement data is transaction based. An obvious workaround is to extract the relevant data and store it in a separate database that can be controlled and used cross-functionally.
- But wait!, someone says, our process historian can already do that!
And it probably can. Most likely, it will also do it quite well! At least to begin with. Integrating the first couple of systems with the historian goes smoothly, but as more systems are included, the integration environment quickly becomes complex and expensive to maintain. (There are modern ways to avoid this integration soup, which I will probably cover in a separate post.)
Additionally, the license model for these historians typically includes an annual price per tag (data point), which can be as much as $1 per tag and year. This is acceptable when limited to process data, which typically is in the order of 100,000-1,000,000 signals. As data is sourced from other systems and more process data points are included, this number grows by several orders of magnitude: 500,000 process tags at $1 per tag and year is $500,000 annually, and two more orders of magnitude turns that into tens of millions. That quickly draws attention during budget negotiations.
Reasonable solutions
There is much that can be done about this situation, but unfortunately, none of it is a quick fix. Replacing the historian with a cheaper one might be tempting, as it obviously is associated with a huge recurring cost. The options for this approach are what I will discuss further in this article.
First of all, for the scenario described above, I would advise against replacing your historian with another one. It will lead to the same problems again. Instead, use the historian the way it was intended: get rid of those other integrations, and let the historian shine with its process data!
I would also advise against the idea of a grand central database that stores all of your business data in a single place. I know that the idea is appealing, but in reality it is not that easy, for a fundamental reason: different applications pose different requirements on the data format. If you rely on a single data store, you will find yourself transforming the data, and even storing it in a separate database tailored for your application. This means that you will end up creating additional databases with subsets of the data in slightly different formats anyway. Why not go there directly? So what if some data is stored in multiple databases? It is not the extra storage that will kill your business, it is failing at digitalization.
Time-series databases
Why do we need a time-series database (TSDB), and not a standard relational SQL database? What makes them different?
Well, you might come far with a "standard" SQL relational database, but there are some reasons why a TSDB is relevant:
- Time-series data is append-heavy: millions of timestamped rows are written continuously, and a TSDB is built to sustain that ingest rate.
- Queries are almost always over time ranges, with aggregation and downsampling, and a TSDB partitions and indexes data by time to make those queries fast.
- Time-series data compresses very well, and purpose-built storage engines exploit this to keep years of data affordable.
- Lifecycle features such as retention policies and automatic downsampling usually come built in, instead of being something you have to script yourself.
Choosing a time-series database
There are many aspects that go into choosing a database. Traditionally, these tend to focus on the technical aspects of the database, but here I propose a slightly different approach, based on the lifecycle and management of IT components in industry.
Purpose
Before you start looking at different options, it is essential that you are clear on what the data is for. What higher purpose than storing data does it serve? Who will consume the data, and how will the consumers use it? Are the intended users involved in the process? If those questions are tough to answer, do not proceed! Involving the end users in the process is the best way to ensure that value is generated.
Development
In the first phase, the key aspect is how to get things working. This means that your database should be easy to deploy. Here, fully managed cloud-based alternatives are attractive, if an in-house datacenter is not available.
Still, it is common to have operations that suffer from unstable internet connections and long periods of isolation, particularly in the mining industry. Here, an on-premises solution is preferable, and your database must support the infrastructure at hand.
At this phase, it is also crucial to consider the consumers, the end users, of the database. If the database should feed Power BI analytics, pick a database that already has a suitable connector. At this point, you can still avoid wasting time and money on complex integrations!
Authentication and security must also be considered early on, as they may become show-stoppers if addressed too late. This also depends on how the data is intended to be used. For databases that serve a single or a few specific applications, a simple authentication mechanism is usually enough. But if multiple users need access, you will need integration with a central identity and access management solution, e.g. an Active Directory.
Production
After the database is deployed, it has to be maintained. This part is often neglected during the development phase, which may lead to significant costs later on. One reason for this is that industries often execute these kinds of investments as projects, where a project implements the change and hands it over to a separate governance organization that had little or no influence on the decisions made during the development phase.
Here, the stability of the solution is critical. Are there frequent updates to the product? Is the supplier stable?
The performance of the database, in terms of ingest rate and data access, comes into play as more data consumers and producers are integrated. Industrial applications, however, rarely challenge modern time-series databases in these respects, as they are designed for even tougher requirements. In particular, databases that support horizontal scaling in clusters can grow with the challenge.
There are also aspects related to the organization that should not be neglected. Who will maintain the database? Do we have committed and competent staff? Are they familiar with SQL/NoSQL?
Agile approach
My final advice when deciding on a time-series database is not to make a hard decision. Begin with one approach and see how it turns out, evaluate, make a new decision, and repeat. This iterative approach is central to the agile ways of working that are standard in software development. With an agile team that is responsible for both development and governance, it is easier to arrive at solutions that hold over time. Furthermore, collaboration with end users is central to agile methods, which is crucial to ensure that value is generated.
Overview of some time-series databases
InfluxDB
This is one of the most well-known open source time-series databases out there, and it has undergone some big changes over time. The first version (1.0) was indeed a game changer for many, and became the de facto standard in the open source community. It uses a schema-less data model, which makes it easy to add new fields to datasets without modifying any schema or migrating data. With the release of 2.0, they moved away from the SQL-like InfluxQL query language to Flux, which was something completely different.
The new 3.0 version, or Influx Edge, is completely rewritten in Rust and sports yet another query language. This time, however, they employ Apache Arrow DataFusion as the core engine and provide a SQL interface, which will make life easier for all the users who are already familiar with SQL. It will also be available in a number of variants that come with different licensing and features. As of now, it is not clear how much these variants will actually cost, but Influx promises both open source, free, and clustered solutions.
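To make the query-language journey concrete, here is a rough sketch of how the same question (average temperature over the last hour) could be expressed in the three generations. The bucket and measurement names are made up, and the snippets are illustrative rather than exact:

```python
# Illustrative only: the same aggregation ("average temperature over the
# last hour") expressed in InfluxDB's three query languages. The bucket
# and measurement names ("sensors", "temperature") are hypothetical.

# InfluxDB 1.x: InfluxQL, SQL-like
INFLUXQL = '''
SELECT MEAN("value") FROM "temperature" WHERE time > now() - 1h
'''

# InfluxDB 2.x: Flux, a functional pipeline language
FLUX = '''
from(bucket: "sensors")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "temperature")
  |> mean()
'''

# InfluxDB 3.x: plain SQL on the Apache Arrow DataFusion engine
SQL = '''
SELECT avg("value") FROM "temperature"
WHERE time > now() - INTERVAL '1 hour'
'''

for name, query in [("InfluxQL", INFLUXQL), ("Flux", FLUX), ("SQL", SQL)]:
    print(f"--- {name} ---{query}")
```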
This new version is still in development, though, and even though it has the potential to be an excellent component in a data analytics stack, its value remains to be proven.
TimescaleDB
This is also one of the large players in time-series databases. It is an extension to the most popular open source database, PostgreSQL. As such, users familiar with traditional SQL databases will feel very much at home with Timescale. It claims to be significantly faster than Influx as the number of clients increases. Both solutions can easily handle more than 100,000 metrics per second with millions of simultaneous clients (high cardinality). This is far beyond what one would require for typical process industry applications.
It is the rich functionality that comes with Timescale+PostgreSQL that makes it an interesting choice. One particular feature is continuous aggregates, which incrementally refresh queries on long-term datasets in the background, so that when the query is run, only the changes, and not the entire dataset, need to be computed. Another is the ability to efficiently combine time-series data with relational data in the same database. This is beneficial for application/domain-specific databases that would otherwise require an additional database engine.
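As an illustration of what this looks like in practice, here is a minimal sketch using psycopg2 against a hypothetical sensor_data table; the connection string, names, bucket sizes, and refresh policy are assumptions, not recommendations:

```python
# Minimal TimescaleDB sketch: plain table -> hypertable -> continuous aggregate.
# Connection string, table and column names are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=plant user=postgres password=secret host=localhost")
conn.autocommit = True  # continuous aggregates cannot be created inside a transaction
cur = conn.cursor()

# Ordinary PostgreSQL table holding raw readings
cur.execute("""
    CREATE TABLE IF NOT EXISTS sensor_data (
        ts     TIMESTAMPTZ NOT NULL,
        tag_id TEXT        NOT NULL,
        value  DOUBLE PRECISION
    );
""")

# Convert it into a Timescale hypertable, partitioned on the time column
cur.execute("SELECT create_hypertable('sensor_data', 'ts', if_not_exists => TRUE);")

# Continuous aggregate: hourly averages, maintained incrementally by Timescale
cur.execute("""
    CREATE MATERIALIZED VIEW sensor_data_hourly
    WITH (timescaledb.continuous) AS
    SELECT time_bucket('1 hour', ts) AS bucket,
           tag_id,
           avg(value) AS avg_value
    FROM sensor_data
    GROUP BY bucket, tag_id;
""")

# Refresh policy so the aggregate keeps itself up to date in the background
cur.execute("""
    SELECT add_continuous_aggregate_policy('sensor_data_hourly',
        start_offset      => INTERVAL '7 days',
        end_offset        => INTERVAL '1 hour',
        schedule_interval => INTERVAL '1 hour');
""")

cur.close()
conn.close()
```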
Citus is an extension for PostgreSQL that implements horizontal scaling through sharding, where the database is distributed across multiple servers, enabling parallel processing of both ingest and queries. Unfortunately, this extension is not compatible with Timescale, which leaves Timescale limited to a single host. For most industrial applications, though, this is not an issue.
Microsoft Azure supports Timescale in their Azure Database for PostgreSQL Flexible Server offering.
PostgreSQL excels when there is a need for OLTP operations, i.e. row-based operations like INSERT and UPSERT, or triggers.
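A small sketch of what that looks like, keeping a hypothetical latest-value-per-tag table current with an UPSERT:

```python
# Sketch: a PostgreSQL UPSERT keeping a "latest value per tag" table current.
# Connection string, table and column names are hypothetical.
import psycopg2

with psycopg2.connect("dbname=plant user=postgres password=secret host=localhost") as conn:
    with conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS latest_values (
                tag_id TEXT PRIMARY KEY,
                ts     TIMESTAMPTZ NOT NULL,
                value  DOUBLE PRECISION
            );
        """)
        # Insert a new row, or overwrite the existing one for the same tag
        cur.execute("""
            INSERT INTO latest_values (tag_id, ts, value)
            VALUES (%s, now(), %s)
            ON CONFLICT (tag_id)
            DO UPDATE SET ts = EXCLUDED.ts, value = EXCLUDED.value;
        """, ("TI-4711", 82.5))
```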
QuestDB
This database has baffled me. It is written in Java, with extensive custom memory management to work around the garbage collection that often drains the performance of Java applications. Still, it manages ingestion performance beyond InfluxDB and TimescaleDB under many conditions.
Additionally, it has a rich set of time-series functions, including joining tables based on the nearest timestamp. At the same time, it lacks easy ways to manipulate data. It uses a JIT compiler to efficiently parallelize query filters. Another interesting detail is that data can be stored in Parquet files, a generic format that other tools can read directly.
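The nearest-timestamp join is exposed as ASOF JOIN in QuestDB's SQL dialect. A minimal sketch, connecting over QuestDB's PostgreSQL wire protocol with its default credentials and hypothetical table names:

```python
# Sketch of a QuestDB ASOF JOIN: pair each lab sample with the most recent
# sensor reading at or before its timestamp. Table names are hypothetical;
# port 8812 and the admin/quest credentials are QuestDB defaults.
import psycopg2

conn = psycopg2.connect(host="localhost", port=8812,
                        user="admin", password="quest", dbname="qdb")
cur = conn.cursor()
cur.execute("""
    SELECT l.ts, l.sensor_id, l.grade, s.value
    FROM lab_samples l
    ASOF JOIN sensor_readings s ON (sensor_id)
""")
for row in cur.fetchall():
    print(row)
cur.close()
conn.close()
```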
Currently, QuestDB has found its largest market in the financial sector, where these features are critical, but the same features are ideal for storing process and IoT sensor data.
The open source variant has seen strong growth since 2020 and development seems very active, with around 30 pull requests per month. As a comparison, InfluxDB has only seen about half as many since 2023.
The enterprise version can be deployed on-premises or in the cloud. One interesting upcoming feature that has been announced is cold storage in Parquet format. Essentially, this means that data will be available in an object storage as Parquet files, which are easily imported into big data analytics, e.g. for AI training and development.
TDengine
In this comparison, this is the product that most clearly targets the industrial sector, even claiming to be "the Next Generation Data Historian ... built for Industry 4.0 and Industrial IoT". Notably, the enterprise version (not open source) supports data ingestion directly from industrial protocols like MQTT and OPC UA, and can integrate with the AVEVA PI ecosystem, including third-party companions like the data analysis tool Seeq.
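As a flavour of how TDengine organizes historian-style data, here is a sketch of its super-table concept, where each device becomes its own sub-table carrying static tags. This is not covered in the text above, and all names below are made up:

```python
# Sketch of TDengine's "super table" data model, shown as SQL strings only.
# All table, column and tag names are made up for illustration.

CREATE_SUPER_TABLE = """
CREATE STABLE IF NOT EXISTS sensors (
    ts    TIMESTAMP,
    value DOUBLE
) TAGS (
    plant BINARY(32),
    unit  BINARY(32)
);
"""

# Each physical sensor becomes its own sub-table of the super table,
# created implicitly on first insert together with its static tags.
INSERT_READING = """
INSERT INTO ti_4711 USING sensors TAGS ('Kiruna', 'Mill-1')
VALUES (NOW, 82.5);
"""

print(CREATE_SUPER_TABLE)
print(INSERT_READING)
```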
Development seems to be very rapid, with more than 200 pull requests almost every month since 2020. It seems to be heavily used and supported by major Chinese companies like Alibaba, ByteDance, Tencent and Baidu. To some, this might pose a problem in itself.
Other databases to consider
Many applications involving time-series data may still be better suited for another kind of data storage. Apache Druid and ClickHouse, for instance, are OLAP databases, intended primarily for analytics, but still with excellent time-series support.
MongoDB
This is the most popular document database and has been around since 2009. With version 5.0, they introduced Time Series Collections, which provide time-series functionality in MongoDB. It has excellent integration with Apache Spark and Kafka, making it a strong player in AI applications.
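A minimal sketch of creating and writing to a time series collection with PyMongo; the connection string, collection, and field names are hypothetical:

```python
# Sketch: MongoDB time series collection (MongoDB 5.0+) via PyMongo.
# Connection string, database, collection and field names are hypothetical.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["plant"]

# "ts" is the time field; "sensor" holds per-series metadata such as the tag name
if "readings" not in db.list_collection_names():
    db.create_collection(
        "readings",
        timeseries={"timeField": "ts", "metaField": "sensor", "granularity": "minutes"},
    )

db["readings"].insert_one({
    "ts": datetime.now(timezone.utc),
    "sensor": {"tag": "TI-4711", "unit": "degC"},
    "value": 82.5,
})
```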
MongoDB is available as an open source Community Edition, as a hosted environment in MongoDB Atlas, and as an Enterprise Advanced edition intended for on-premises deployment.
Microsoft's cloud-native Cosmos DB has a wrapper that provides a MongoDB-compatible interface to Cosmos DB. This may be useful for applications that you wish to move to the Azure cloud environment.
Apache Druid
This solution is intended to power real-time analytics that require fast queries at scale and under load, on both streaming and batch data. It may be overkill for many applications, but it will shine when there are huge datasets that need analytics, which is often the case for production data. It is clustered by design, which makes it scalable almost without limit. Data can be ingested from several blob storages and file formats, and streamed directly from Apache Kafka.
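Druid exposes a SQL API over HTTP, so a typical hourly-average-per-sensor question can be asked directly. A sketch, assuming a local Druid router on its default port and a hypothetical datasource:

```python
# Sketch: query Apache Druid's SQL API over HTTP.
# The router URL and the "sensor_readings" datasource are hypothetical.
import requests

DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

query = """
SELECT TIME_FLOOR(__time, 'PT1H') AS hour,
       sensor_id,
       AVG("value") AS avg_value
FROM sensor_readings
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1, 2
ORDER BY 1, 2
"""

resp = requests.post(DRUID_SQL_URL, json={"query": query}, timeout=30)
resp.raise_for_status()
for row in resp.json():
    print(row)
```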
Development activity is on the rise, which indicates that the product is growing and will be around for a while.
ClickHouse
Among the databases listed here, this is the one that undergoes the most development, with over 450 pull requests per month. Like Apache Druid, it supports sharded processing and virtually unlimited horizontal scaling over multiple servers. It has specialized codecs that enable it to at least challenge the niche time-series databases. A powerful piece of software when applied to a challenging task.
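The codecs are declared per column. A sketch of a time-series table using Delta and Gorilla codecs, via the clickhouse-driver Python client, with made-up host, table and column names:

```python
# Sketch: ClickHouse table with time-series-oriented column codecs.
# Host, table and column names are hypothetical.
from clickhouse_driver import Client

client = Client(host="localhost")

client.execute("""
    CREATE TABLE IF NOT EXISTS sensor_readings (
        ts        DateTime64(3) CODEC(Delta, ZSTD),
        sensor_id LowCardinality(String),
        value     Float64 CODEC(Gorilla, ZSTD)
    )
    ENGINE = MergeTree
    ORDER BY (sensor_id, ts)
""")

# Typical downsampling query: hourly averages per sensor
rows = client.execute("""
    SELECT toStartOfHour(ts) AS hour, sensor_id, avg(value) AS avg_value
    FROM sensor_readings
    GROUP BY hour, sensor_id
    ORDER BY hour, sensor_id
""")
print(rows[:5])
```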
It uses SQL for queries and has a long list of supported data sources, like EMQX and Kafka, and integrations, including tools like Grafana, Power BI, and Tableau.
Conclusion
In this post, I've outlined some guidance on what to consider when investing in time-series databases for an industrial setting. I argue for an agile approach where solutions are refined iteratively in close cooperation with the end users. There should be no central master database, but rather several databases tailored for one or a few consumers (applications). No recommendation on a specific database is given, but a few interesting options are presented together with brief comments.
Section manager Process Control and Analysis
Great read! And I encourage you to write the next post faster! This sentence: "I know that the idea is appealing, but in reality, it is not that easy for a fundamental reason: Different applications pose different requirements on the data format." could have saved me like 6 months of headache trying to force multiple unsuitable sources through our time-series historian. Once I abandoned that idea, we made the same amount of progress in a week as we had in the previous six months.
Program Manager - Artificial Intelligence in Mining
I think you have captured a really nice short-list of things to consider when addressing this topic and this is something more companies, regardless of industry, need to understand. Thank you for a very insightful article, David!
Business developer at Polypodium AB
Really relevant and to-the-point content, David! It is so important to create that straightforward short list of parameters, the most important factors, when choosing the path forward.
Chemical Engineer, PhD in plant and process engineering. Global Solution Department at TrendMiner
Awesome summary. Thank you very much. I really learned a lot.
Principal Solutions Architect in Artificial Intelligence at Ductus
Ping Fredrik Hases, Magnus Gens, Rasmus Tammia!