登录查看更多内容

Apache IoTDB: a timeseries database

Ali Farahani

PhD candidate in Artificial Intelligence

发布日期: 2024年6月13日

Imagine you need to collect and store data from a medium-sized solar farm, generating approximately 7000 data points every second. How would you manage this in a relational database like MySQL, Oracle DB, PostgreSQL, or even in a NoSQL database like MongoDB? How powerful would your database server need to be to handle such a high write throughput?

Let's suppose you have solved the storage issue. You would accumulate over 25 million records per hour and more than 18 billion records each month. How would you query this data efficiently to retrieve the average power generation of a solar panel for each 1-hour interval over the past month?

What if data cannot be collected at precise time intervals? For example, what if sensor data is recorded at irregular intervals? How would you resample the data to accurately perform a specific time-based query, such as retrieving the DC voltage of each panel at exactly 12:00:00 each day for the past month?

What if you need to retrieve sensor values for a specific time period (e.g., from 2024-03-01 12:48:00 to 2024-03-01 13:00:00)? Can you time-index your records to make these kind of queries faster?

These examples illustrate how seemingly simple tasks can become extremely difficult if your database is not tailored for timeseries data.

The difficulties of handling time-based data have led to the development of specialized timeseries databases designed specifically to deal with these challenges. Timeseries databases have features that make them very useful when working with time-labeled data (timeseries). Here are some of the most important advantages of TSDBs:

High Write Throughput: timeseries databases are optimized for frequent writes, allowing for efficient ingestion of data points.
Storage Efficiency: timeseries databases utilize compression algorithms specifically designed for timeseries data, reducing storage requirements.
Sampling / Resampling / Aggregation: The process of selecting a subset of data points from the entire timeseries, Filling null values, and analyzing trends and patterns within the data.
Time indexing: in timeseries databases records can be indexed based on timestamps.

and many more advantages...

In this article, we will explore setting up and utilizing Apache IoTDB, a powerful timeseries database packed with advanced features.

Setting-up an Apache IoTDB server

The simplest way to have an Apache IoTDB server is to pull its official docker image and run it. The following command pulls and runs Apache IoTDB docker image:

$ docker run -d -p 6667:6667 --name my-iotdb apache/iotdb

6667 is the default port of IoTDB which we bind it to port 6667 of the host OS.

Another way to have Apache IoTDB server is to run it directly on the host OS. For doing so, the host OS need to have a java runtime environment. The latest version of Apache IoTDB executable files can be downloaded from the official website of Apache IoTDB.

After downloading and extracting the compressed file, you will find the necessary scripts for running IoTDB in the sbin directory. To start the IoTDB service, execute the start-standalone.sh script. Once the service is running, you can interact with it using the command line tool or any database tool, such as DBeaver.

In the sbin directory there is another script available for working with the databse. By executing start-cli.sh a command line interface is launched. allowing you to enter database commands.

DBeaver Integration

DBeaver is a powerful and user-friendly database tool that offers a wide range of features for database management, including query execution, data visualization, and schema browsing. These capabilities make it a good choice for interacting with Apache IoTDB.

To integrate DBeaver with Apache IoTDB, you need to install the DBeaver Ultimate version. After installing DBeaver Ultimate, obtain the IoTDB driver and add it to DBeaver's database drivers. You can download the appropriate Apache IoTDB driver from this link. The driver is simply a .jar file.

DBeaver's drivers can be managed from Database menu.

In Driver Manager window, click on the New button to setup a new driver. The settings are straightforward and simple. From the Libraries tab you need to click on Add File button and select the driver file (.jar file).

Once you've added the driver to DBeaver's driver manager, you can create a new connection to IoTDB just like you would with other databases.

Working with Apache IoTDB

Apache IoTDB uses an SQL-like syntax for interacting with timeseries data, making it familiar and accessible for users accustomed to other database systems. This allows users to leverage their existing SQL knowledge and tools to query and manage timeseries data efficiently. In this section, we will explore some of the most useful commands and queries for timeseries data. Before diving in, it's important to understand the concept of storage groups:

Storage Group is a hierarchical path or prefix defined before creating timeseries. This path acts as a category or namespace for grouping related timeseries. The use of storage groups lies in how data is physically stored and categorized. Storage groups are specified when defining timeseries.

Let's suppose we have a measuring device named 'dev1' connected to four sensors:

module_temp
ambient_temp
irradiation
power

Data from each of these sensors creates a timeseries. Therefore, we need to define four timeseries. To define a timeseries in IoTDB, you can use the 'Create Timeseries' command.

Blue texts denote storage group of timeseries.

In our case, all timeseries have values of type float. The ENCODING=RLE parameter specifies 'Run Length Encoding', a compression method for timeseries data.

For deleting a timeseries we can use 'DELETE TIMESERIES' command:

DELETE TIMESERIES root.analog.dev1.module_temp

We can also delete an entire storage group:

DELETE STORAGE GROUP root.analog

Adding a record to timeseries can be done using INSERT INTO statement. We are also able to add values to more than one timeseries with a single INSERT INTO command:

INSERT INTO storage_group (timestamp, timeseries_name1, timeseries_name2, ..., timeseries_n) VALUES (time, value1, value2, ..., value_n)

INSERT INTO root.analog.dev1(timestamp, module_temp, ambient_temp, irradiation, power) VALUES (1709637208822, 15.4, 12.6, 55.0, 11.3)

For showing all timeseries in the database:

SHOW TIMESERIES;

Counting the number of timeseries in a storage group:

COUNT NODES root.analog.* LEVEL=2

Query

SELECT statement is used to retrieve data from the database.

SELECT module_temp, ambient_temp FROM root.analog.dev1;

Sampling with GROUP BY:

suppose we have a timeseries with name 'root.test.dev1.module_temp'. To retrieve the average value of this timeseries with time period of exactly 5 seconds from 2023-11-01 to 2023-11-07:

GROUP BY statement has the following syntax:

领英推荐

Database Replication with Change Data Capture over…

Klarrio 2 年前

How to setup MongoDB on Windows: Step-by-step guide

The Scribe Co 5 个月前

Everything you need to know about the 'star…

S3 Group Asia 2 年前

In the time_range parameter, parentheses and brackets indicate the boundaries of the range. [a, b) means the range includes a but excludes b. Conversely, (a, b) signifies that the range excludes both a and b, including only the values strictly between them.

FILL parameter can specify how empty values should be filled. The "Previous" value indicates that the last available value in the timeseries should be used to fill each null value. The "Linear" value (i.e., writing the query as Fill(Linear)) uses interpolation to fill in the null values.

Example: Suppose the energy production data of a solar panel is stored in a timeseries named root.test.dev1.power. To calculate the hourly average production of this solar panel from 2024/03/01 to 2024/03/07, the following query can be used:

SELECT AVG(power) as hourly_power FROM root.test.dev1 GROUP BY ( [2024-03-01T00:00:00, 2024-03-07T00:00:00] , 1h) FILL(LINEAR)

Uniform sampling of a timeseries using the equal_size_bucket_agg_sample function. Using the equal_size_bucket_agg_sample function, you can sample an entire timeseries by a percentage. For example, you can sample 10% of the entire timeseries and use it to represent the whole timeseries:

SELECT equal_size_bucket_agg_sample(power, 'type'='avg', 'proportion'='0.1') AS agg_power FROM root.test.dev1

In addition to enabling the execution of SQL-based queries, IoTDB provides specific functions and features that allow for complex queries and timeseries computations through the UDF module in IoTDB. Some of the capabilities available in the UDF module are listed below:

Data profiling (Auto correlation factor, histogram, integral, moving average, ...)
Anomaly detection (IQR, KSigma, LOF, ...)
Data repairing (Timestamp repair, value fill, ...)
Data quality (Completeness, Validity, Consistency, ...)

To use UDF functionalities, you can add it as a plugin to IoTDB database. for doing this, you need to download UDF plugin .JAR file. Copy the JAR package to ext\udf under the directory of IoTDB system. Run register-UDF.sh script which is in sbin directory of IoTDB.

Below are some examples of UDF queries:

RESAMPLE

This function is used for resampling the input series based on a fixed frequency. It supports both up-sampling and down-sampling mechanisms.

Example:

SELECT RE_SAMPLE(temp, 'every'='5s', 'interp'='linear') FROM root.test.dev1;

2. Integral

This function can be used to calculate the integral of a timeseries.

Example:

SELECT INTEGRAL(temp, 'unit'='1m') FROM root.test.dev1;

3. StdDev

This function calculates the standard deviation of a timeseries:

SELECT STDDEV(temp) FROM root.test.dev1;

4. IQR (Anomaly Detection)

The IQR function can be used for anomaly detection. The IQR function uses the Interquartile Range method.

SELECT IQR(measurement) FROM [timeseries name];

Example: Suppose in a timeseries with values ranging from 0 to 20, a value of 123.11 is also included. The value 123, which is considered an anomaly in the timeseries data, can be detected using the IQR method.

5. Completeness

This function is used to calculate the completeness of a timeseries over specified time intervals. For example, suppose a timeseries includes the values of a signal at 1-second intervals. Using the Completeness function, you can calculate how complete this timeseries is. If values are available for every second, then the timeseries has a Completeness of 1. If values are available for half of the seconds, then the Completeness is 0.5.

SELECT COMPLETENESS(measurement, 'window'='[window size]') FROM [timeseries name];

Example: Calculate the completeness for 30-second time windows:

6. DTW (Dynamic Time Warping)

This function is used to measure how similar two timeseries are.

SELECT DTW(timeseries1,timeseries2) FROM [storage group];

SELECT DTW(temp, dc_power) FROM root.test.dev1;

Evaluating IoTDB Performance Across Various Query Types

In this section, the speed of creating, inserting and performing different queries is examined. Note that the following experiments were conducted on a VMWare virtual machine running Ubuntu 22.04, equipped with a 4-core vCPU and 8 GB of RAM.

Initially, 5 timeseries with float values are defined. It is assumed that the timeseries correspond to the values of module_temperature, ambient_temperature, irradiation, ac_power, and dc_power of a solar panel.

CREATE TIMESERIES  root.test.device1.module_temperature WITH DATATYPE=float, ENCODING=RLE, COPRESSION=LZ4

CREATE TIMESERIES  root.test.device1.ambient_temperature WITH DATATYPE=float, ENCODING=RLE, COPRESSION=LZ4

CREATE TIMESERIES  root.test.device1.irradiation WITH DATATYPE=float, ENCODING=RLE, COPRESSION=LZ4

CREATE TIMESERIES  root.test.device1.ac_power WITH DATATYPE=float, ENCODING=RLE, COPRESSION=LZ4

CREATE TIMESERIES  root.test.device1.dc_power WITH DATATYPE=float, ENCODING=RLE, COPRESSION=LZ4

After creating 5 timeseries, we populate them with 100 million randomly generated values (between -1 and +1), spanning from 2024-03-08 06:16:44 to 2024-03-09 10:03:24 at 1-millisecond intervals. The disk space required to store these five timeseries, each containing 100 million values (totaling 500 million values), is less than 500 megabytes, demonstrating the storage efficiency of IoTDB.

Retrieving maximum value of each hour:

Retrieving max value of each hour from a timeseries with 100 million values took 1.73 seconds.

Calculating the standard deviation of timeseries:

StdDev took 50 seconds for a timeseries with 100 million values.

Calculating the hourly integral of module_temperature :

In summary, IoTDB is a powerful database made for managing large amounts of time-based data. I explored its capabilities, highlighting its strong features for storing data, processing queries quickly, and performing real-time analytics. IoTDB integrates smoothly with IoT applications and excels in handling extensive time-based data, making it a dependable choice for businesses needing scalable and efficient data management solutions.

要查看或添加评论，请登录

Ali Farahani的更多文章

RAG: Combination of Data Retrieval and LLM Prompting

2024年8月27日

RAG: Combination of Data Retrieval and LLM Prompting

Imagine you want to build a custom language model for a specific task or you want to develop a special chat assistant…
GANs: simple in theory but very hard to Train, Here's Why.

2024年8月23日

GANs: simple in theory but very hard to Train, Here's Why.

Even though GANs are simple to understand in theory, training them is very challenging in practice. Unfortunately, most…
Customizing Llama-3: A Guide to Fine-Tuning on Your Dataset

2024年6月28日

Customizing Llama-3: A Guide to Fine-Tuning on Your Dataset

Llama-3 is one of the latest LLMs released by Meta under an open-source license. Its impressive performance in…

2 条评论
Understanding entropy vs. surprise

2024年6月16日

Understanding entropy vs. surprise

Entropy literally means disorder or confusion. Entropy is a highly applicable concept in fields such as mechanics…

Apache IoTDB: a timeseries database

Ali Farahani

PhD candidate in Artificial Intelligence

Setting-up an Apache IoTDB server

DBeaver Integration

Working with Apache IoTDB

Query

领英推荐

Evaluating IoTDB Performance Across Various Query Types

Ali Farahani的更多文章

社区洞察

其他会员也浏览了

Mongo DB: is it worth the switch?

SQLite vs PostgreSQL: A Detailed Comparison

Top 15 Emerging Databases to Use in 2022 and Beyond

Mongo VS Postgres

Understanding CAP and ACID in Database Design

A gentle introduction to Embedded Databases

Creating a highly available and fault-tolerant Postgresql database cluster

Distributed PostgreSQL without Sharding constraint for SQL joins

Integrating Multiple Databases Seamlessly using GoLang with PostgreSQL

Streaming MySQL tables into Kafka (Real Time)

Setting-up an Apache IoTDB server

DBeaver Integration

Working with Apache IoTDB

Query

领英推荐

Evaluating IoTDB Performance Across Various Query Types

Ali Farahani的更多文章

RAG: Combination of Data Retrieval and LLM Prompting

GANs: simple in theory but very hard to Train, Here's Why.

Customizing Llama-3: A Guide to Fine-Tuning on Your Dataset

Understanding entropy vs. surprise

社区洞察

其他会员也浏览了

Mongo DB: is it worth the switch?

SQLite vs PostgreSQL: A Detailed Comparison

Top 15 Emerging Databases to Use in 2022 and Beyond

Mongo VS Postgres

Understanding CAP and ACID in Database Design

A gentle introduction to Embedded Databases

Creating a highly available and fault-tolerant Postgresql database cluster

Distributed PostgreSQL without Sharding constraint for SQL joins

Integrating Multiple Databases Seamlessly using GoLang with PostgreSQL

Streaming MySQL tables into Kafka (Real Time)