Data Pipelines, The Heart of AIoT

Greetings from West Michigan! We’ve been dealing with an unseasonable heat wave around the Great Lakes, to the point where kids are staying home from school for “heat days”. That’s a far cry from the typical snow days we’ll be getting in January and February.

And it’s not just the air that’s warm. The big lakes are mighty warm, too. Here’s a screenshot of the water temperatures around the Great Lakes, courtesy of Seagull. The temperatures in red are in the mid-to-upper 70s °F. That’s the equivalent of hot tub weather for these big bodies of water.

Seagull is a data platform that aggregates real-time and historical data, operated by the Great Lakes Observing System, otherwise known as GLOS. SpinDance helped build Seagull a few years ago, and it’s one of the team’s favorite projects. The GLOS team was a joy to work with, and it’s not often you get to work on such an impactful project that serves your backyard.

Like all IoT-enabled data platforms, at the heart of Seagull is a robust data pipeline. It is the technology that captures and transforms data into actionable information. And it’s the focus of this issue of The Intelligent Device.

Data Pipelines: The Heart of AIoT

Data is central to Artificial Intelligence. We use data to train AI models. And we feed data through those models to make predictions. These predictions, in turn, are where the value of an AI model comes from, in the form of analysis and decision-making support. (For a recap, check out our previous issue about the CADA framework).

In a production setting, we use pipelines to manage our data at scale. In its simplest form, a data pipeline is composed of four steps (sketched in code after the list):

  1. Ingest: The data is ingested into the pipeline, either through a hardware sensor or an Application Programming Interface (API).
  2. Process: The data is processed. This typically involves validating the data against quality standards, as well as enriching it with additional data points. For example, we might add an identifier and timestamp to a reading. In this step, we might also downsample the data through aggregation and summarization.
  3. Store: The processed data is stored for later use. So-called “cold” data might be stored for days, weeks, months, or years. Real-time “hot” data might only be kept for seconds, minutes, or hours.
  4. Deliver: Finally, the data is delivered to an upstream consumer. A consumer might be a human, or another digital system.
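To make the four steps concrete, here’s a minimal sketch in Python. Everything in it is illustrative, not taken from any real platform: the sensor read is stubbed with a constant, the validation range is made up, and delivery is just a print statement. What matters is the shape of the ingest → process → store → deliver flow.

```python
import time
import uuid
from collections import deque

# In-memory "hot" store: keeps only the most recent processed readings.
hot_store = deque(maxlen=1000)


def ingest() -> dict:
    """Ingest: pull a raw reading from a sensor or API (stubbed here)."""
    return {"temperature_f": 72.4}  # stand-in for a real sensor driver


def process(raw: dict):
    """Process: validate against quality standards, then enrich."""
    temp = raw.get("temperature_f")
    if temp is None or not (-40.0 <= temp <= 150.0):
        return None  # reject readings that fail validation
    return {
        "id": str(uuid.uuid4()),   # enrich with an identifier...
        "timestamp": time.time(),  # ...and a timestamp
        "temperature_f": temp,
    }


def store(reading: dict) -> None:
    """Store: keep processed data for later use (in memory here)."""
    hot_store.append(reading)


def deliver(reading: dict) -> None:
    """Deliver: hand the data to an upstream consumer (stubbed as print)."""
    print(f"delivering {reading['id']}: {reading['temperature_f']} °F")


# One pass through the pipeline.
reading = process(ingest())
if reading is not None:
    store(reading)
    deliver(reading)
```

In a real system, each of these functions would be a separate service or process connected by queues or streams, but the division of responsibilities is the same.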

In the Internet of Things, there are typically multiple pipelines working in concert. For example, each device might act as a mini pipeline, delivering its results to the cloud or an on-prem data center.

“Edge” computing can add yet another layer: each edge node accepts data from nearby devices and delivers refined data to the cloud.
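Here’s a rough sketch of how that edge layer might look. The EdgeNode class and the send_to_cloud stub are hypothetical names for illustration, not any particular product’s API; each device’s mini pipeline calls accept(), and the edge node forwards only a compact summary upstream.

```python
from statistics import mean


def send_to_cloud(summary: dict) -> None:
    """Stand-in for a real cloud uplink (e.g., an HTTPS or MQTT call)."""
    print(f"to cloud: {summary}")


class EdgeNode:
    """Hypothetical edge layer: accepts readings from many devices,
    refines them, and forwards a smaller summary to the cloud."""

    def __init__(self, batch_size: int = 100):
        self.batch_size = batch_size
        self.buffer: list[float] = []

    def accept(self, device_reading: float) -> None:
        """Called by each device's mini pipeline with a processed reading."""
        self.buffer.append(device_reading)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Deliver refined (summarized) data upstream to the cloud."""
        summary = {
            "count": len(self.buffer),
            "mean": round(mean(self.buffer), 2),
            "min": min(self.buffer),
            "max": max(self.buffer),
        }
        send_to_cloud(summary)  # one message instead of batch_size messages
        self.buffer.clear()
```

The payoff is that one summary message replaces batch_size raw ones, which is exactly the kind of reduction the next section puts a dollar figure on.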

Well-Architected Pipelines Save Money

At first, this architecture might seem silly: why not just send all the source data directly to the final destination? Aren’t we just increasing the overall cost of the system by adding so many intermediate steps?

The answer is a resounding “no”. Seemingly simple systems can actually be more expensive. Here’s a real-world example to explain why:

About a decade ago, SpinDance inherited a first-generation IoT product that collected large amounts of environmental data. The devices were very simple: they collected temperature, humidity, and other data points every 20 seconds and sent them immediately to the cloud.

With about 120,000 devices in the field, each sending three readings a minute, this worked out to around 360,000 readings a minute.

The thing was, the underlying data didn’t change that fast, and therefore didn’t necessitate sending readings so frequently. A few times an hour would have been fine. And the cost of sending all that data was immense: a colossal waste of dollars in bandwidth and compute.

We redesigned the second-generation devices to be much more intelligent. Each device buffered its readings for 15 minutes, then sent an average on to the cloud. This reduced the number of messages sent to around 15,000 per minute, a 96% reduction in bandwidth and compute costs. In short order, these cost savings dwarfed the expense of the slightly more advanced devices.
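In pseudocode terms, that second-generation logic is a buffer-and-average loop. The sketch below is in Python for readability rather than real firmware code, and read_sensor / send_to_cloud are hypothetical stand-ins for the actual driver and uplink:

```python
import time
from statistics import mean

SAMPLE_SECONDS = 20        # still sample locally every 20 seconds
WINDOW_SECONDS = 15 * 60   # but only report once per 15 minutes


def run_device(read_sensor, send_to_cloud):
    """Buffer readings locally and send only a windowed average."""
    buffer = []
    window_start = time.monotonic()
    while True:
        buffer.append(read_sensor())
        if time.monotonic() - window_start >= WINDOW_SECONDS:
            # One message replaces ~45 raw readings (900 s / 20 s).
            send_to_cloud({"avg": mean(buffer), "count": len(buffer)})
            buffer.clear()
            window_start = time.monotonic()
        time.sleep(SAMPLE_SECONDS)
```

At a 20-second sample rate, each 15-minute window collapses roughly 45 raw readings into a single message, which is where the bandwidth and compute savings come from.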

Coming up: Connecting IoT Data to AI

In the long run, the true value of an AIoT system isn’t just in the devices or the data, but in how effectively that data is managed and utilized. Efficient pipelines not only cut costs but also enable faster, more accurate decision-making. Investing in robust pipeline design today is an investment in your product’s success tomorrow. In our next issue, we’ll connect an IoT data pipeline to AI models. If pipelines are the heart of IoT, models are its brain.

And don’t forget: early September is a great time to visit us in Michigan, and take a dip in the Big Lake. It’s salt-free heaven!

