Edge or Cloud? - Placing IIoT Workloads

Intent

This document is written to advise those planning an Industrial Internet of Things (IIoT) implementation on where to place workloads within the data communications pipeline. The expectation is that the scenario is “connected”: there is intended to be a permanent internet (or intranet) connection between the factory and the cloud. Loss of this connection is an unwanted event, but one that needs to be considered and accounted for. Other Edge scenarios exist where connectivity to the cloud is expected or designed to be sporadic; this document does not directly cover those scenarios.

Hybridisation of Cloud IoT

With the increasing hybridisation of what were, a few years ago, purely cloud-based IoT solutions, there is a growing tendency to keep some aspects of the IoT solution “on-prem” or “on the Edge”. Edge solutions are now far more than ingestion engines; capabilities exist for:

  • Data Processing
  • Persistence
  • User interaction
  • Advanced Analytics/AI

This document is intended to help determine the factors to consider when deciding where to place a workload.

Glossary

Edge – The environment where the data is initially acquired or created. This can be equated with the phrase “on-premise”, but in an IIoT world the Edge is generally seen as the factory floor, ISA-95 levels 2 and 3. However, as seen later in this document, the definition of Edge can stretch to cover all levels of the ISA-95 hierarchy.

Cloud - A term used to describe a global network of servers, each with a unique function. Specifically in relation to this document, the Azure cloud provided by Microsoft.

Hybrid – An architecture that uses a mix of Edge and Cloud technology to achieve its aims.

Protocol - A communication protocol is a system of rules that allows two or more entities of a communications system to transmit information via any kind of variation of a physical quantity. The protocol defines the rules, syntax, semantics and synchronization of communication and possible error recovery methods.

Protocol Translation – the means of taking data in a structure defined by a specific protocol and transforming it so that it meets the standards of another.

Data Ingestion – reception of data into a data collection device via a specific protocol.

Inference/Inferencing – the use of AI or ML models to determine if a particular condition has been met.

AI – Artificial Intelligence - the theory and development of computer systems able to perform tasks normally requiring human intelligence, such as visual perception, speech recognition, decision making, and translation between languages.

ML – Machine Learning - the use and development of computer systems that are able to learn and adapt without following explicit instructions, by using algorithms and statistical models to analyse and draw inferences from patterns in data.

Typical IIoT Architecture

In the IIoT environment, there are typically PLCs producing data in a variety of formats. Current best practice is to use translation software to convert all of this disparate data into a single format, usually OPC-UA. There are a number of off-the-shelf applications that can do this, and many factories already run one as a data interchange layer within the factory environment. For those that don’t, many vendors now also offer containerised versions of these applications that can run as modules within the Edge environment.

IoT Edge is Microsoft’s edge processing environment: software that can run on a variety of hardware and operating systems. Within IoT Edge, containerised workloads process data, and data can be routed between these containers and/or up to the cloud. Which containers run within IoT Edge, and the routes between them, are defined from the cloud using IoT Hub.

Once data is in the cloud, multiple applications can use it to provide the required insights and actions. Avoiding separate data acquisition strategies and data silos is a key aspect of efficient architecture design in these scenarios.

Data Acquisition and Ingestion

Factors to consider:

  • “Speeds and Feeds”, how many data points and how often
  • Where is the data needed, and by whom?
  • Is the data even required?
  • When is the data needed?
  • Required Reaction time, what is the necessary response?
  • Persistence Strategy
  • Network Bandwidth Availability
  • External Network Stability
  • Storage cost
  • Would aggregated data suffice? If so, at what level?

Discussion

Many early-stage IoT projects have unrealistic targets for real-time data acquisition; this expectation can be summarised as “everything, now!”. Whilst IoT solutions and the cloud have hyperscale capabilities, there are other implications to sending all data to the cloud in near real time:

  • Cost – Ingestion services typically charge by the message, so cost is directly linked to the volume of data ingested. The cost per message decreases as the volume of messages increases, but there are implications downstream: as the volume of data being ingested grows, the compute and storage requirements have to grow to accommodate it. It is quite easy to ingest large volumes of data into the cloud; scaling a back-end processing stream to cope with this flow can have its challenges.
  • Infrastructure – Sending large volumes of data also has an impact at the Edge: the server capturing and sending the data needs to be appropriately sized. However, the limiting factor is often network bandwidth. The outbound network capacity from a factory is finite and has to be shared with other users; an IIoT solution saturating the network can have a catastrophic impact on other areas of production.

Mitigations

If you do have to send large volumes of data, investigate how the messages are to be sent. Typical IoT demonstrations use plain-text JSON structures; these are easy to work with, and many back-end systems accept them. However, the message overhead makes such structures inefficient: the envelope surrounding the actual data point typically dwarfs it. Moving to a more efficient data structure is therefore one option, albeit at an increased compute cost in the cloud to transform the data back into a more easily usable format. Other options include:

  • Batching – building a large message containing multiple data points and sending it as a single message. IoT Hub bills in 4 KB increments but throttles on the number of distinct messages, so for maximum throughput send messages that are as close to the 256 KB maximum message size as possible (see the batching sketch after this list).
  • Sending only on change – do not send the same name/value pair every n seconds if the value is unchanged; only send when it changes. The vast majority of time-series applications can interpolate and fill in the gaps in the record for you.
  • Dead-banding – a variation on send-on-change that sets a limit on how much the value has to change before a message is sent.
  • Aggregation – many industrial control systems can emit data at very high frequency, anywhere between 400 Hz and 1 kHz. Data at these rates is difficult to send to the cloud in real time, and in many cases aggregations of the data are actually more usable for data science purposes. Capturing min, max and average for each data point (tag) over a period of a few minutes still gives an excellent indication of where values are drifting away from the ideal: outliers show up in the max and min, and movement of the average shows whether there is an overall drift (see the aggregation sketch after this list).
  • Inferencing – if analytics are performed on the data at the edge, does the source data used to generate the result need to be sent to the cloud immediately, or only the result of the inferencing?
  • Send data less often – see the next section.
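Below is a minimal sketch of send-on-change, dead-banding and batching combined in one place. The `send_to_cloud` callable, the dead-band width and the flush threshold are all assumptions for illustration; substitute the transport from your IoT SDK of choice.

```python
import json

MAX_MESSAGE_BYTES = 255 * 1024   # stay just under IoT Hub's 256 KB message limit
DEAD_BAND = 0.5                  # minimum change before a value is worth sending

class TelemetryBatcher:
    """Dead-bands each tag, then batches surviving readings into large messages."""

    def __init__(self, send_to_cloud):
        self.send_to_cloud = send_to_cloud   # hypothetical transport callable
        self.last_sent = {}                  # tag -> last transmitted value
        self.batch = []

    def offer(self, tag, value, timestamp):
        # Dead-banding (send-on-change with a tolerance): drop the reading
        # unless it has moved beyond the dead band since the last send.
        previous = self.last_sent.get(tag)
        if previous is not None and abs(value - previous) < DEAD_BAND:
            return
        self.last_sent[tag] = value
        self.batch.append({"tag": tag, "v": value, "ts": timestamp})

        # Flush once the serialised batch approaches the message size limit.
        # (Re-serialising on each call is simple rather than efficient.)
        if len(json.dumps(self.batch).encode()) >= MAX_MESSAGE_BYTES:
            self.flush()

    def flush(self):
        if self.batch:
            self.send_to_cloud(json.dumps(self.batch))
            self.batch = []
```

One large message costs the same throttling quota as one tiny message, so filling each message towards the 256 KB limit maximises throughput per quota unit.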
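And a sketch of windowed aggregation, collapsing a high-frequency feed into min/max/average summaries; the window length and example values are illustrative.

```python
from dataclasses import dataclass

@dataclass
class WindowAggregate:
    """Running min/max/avg for one tag over the current aggregation window."""
    minimum: float = float("inf")
    maximum: float = float("-inf")
    total: float = 0.0
    count: int = 0

    def add(self, value: float) -> None:
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)
        self.total += value
        self.count += 1

    def summary(self) -> dict:
        return {"min": self.minimum, "max": self.maximum, "avg": self.total / self.count}

# A 1 kHz feed aggregated over five minutes collapses 300,000 readings per tag
# into one three-value summary: outliers survive in min/max, drift in avg.
agg = WindowAggregate()
for reading in (20.1, 20.3, 19.8, 25.7):   # illustrative values
    agg.add(reading)
print(agg.summary())   # {'min': 19.8, 'max': 25.7, 'avg': 21.475}
```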

Alternatives to Telemetry

In the early stages of the project, determine the value of the data: what is it used for, why is it important, and how long can the business cope without it? There will usually be a sub-set of the data that is needed in the cloud as soon as possible, but large swathes of it can be sent “later”. The simplest way of doing this is to write the data to an efficient file format (not JSON or XML) and then use some form of file transfer mechanism to transmit the file in a way that does not saturate network bandwidth. The diagram below shows a dual strategy of sending aggregate data in near real time and the granular data as files via a batch process.

Block diagram showing source data being transmitted to an Edge device and processed in two flows: one performing real-time aggregations, the other writing to a local file and sending the file data to the cloud periodically.

Once in the cloud, the data in the files can be transformed and written into data stores, or even run through streaming analytics engines to look for patterns and events.
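A minimal sketch of the “send granular data later” path, assuming the pyarrow and azure-storage-blob packages; the connection string, container and file names are placeholders.

```python
import pyarrow as pa
import pyarrow.parquet as pq
from azure.storage.blob import BlobClient

def write_batch_file(readings: list[dict], path: str) -> None:
    # Columnar Parquet is far more compact than per-reading JSON messages.
    table = pa.table({
        "tag": [r["tag"] for r in readings],
        "value": [r["value"] for r in readings],
        "timestamp": [r["timestamp"] for r in readings],
    })
    pq.write_table(table, path, compression="snappy")

def upload_batch_file(path: str, blob_name: str) -> None:
    blob = BlobClient.from_connection_string(
        "<storage-connection-string>",      # placeholder
        container_name="factory-batches",   # placeholder
        blob_name=blob_name,
    )
    with open(path, "rb") as data:
        # Schedule this upload for quiet periods so it does not saturate
        # the factory's shared outbound bandwidth.
        blob.upload_blob(data, overwrite=True)
```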

User Interaction

The whole aim of an IoT solution is to get data or insights to their consumers at the time they are needed. For operational staff who depend on those insights, locating the solution in the cloud puts factory operations at risk, as connections to the cloud cannot be guaranteed. In these cases, locating this part of the solution “on-premise” gives better availability, at the cost of local infrastructure that needs to be acquired and maintained.

Factors influencing decision

  • Who needs access to the data?
  • Criticality of data/insight
  • Latency
  • Availability
  • Frequency of Network Outages
  • Typical length of Network Outages
  • Geography
  • Security (access to factory floor data)
  • Potential to alleviate via Nested Edge (see later)
  • Volumes of data being queried

Discussion

Solutions that enable operationally critical, data-driven decisions will need to be based on-premise. In some cases, what is operationally critical has only been determined once there has been a network outage and staff have been unable to access particular cloud-based information. The decision on what to place where typically comes down to two factors:

1) Impact on Production

2) Time to impact – i.e. how long can production survive without this data?

Anything where the absence of data will cause a high level of impact on production capability in a short space of time needs to be based on-premise. Given the criticality, HA solutions should be considered. As the impact decreases and the time to impact lengthens, workloads can increasingly be moved to the cloud.

In between these two fairly obvious choices lies the “dilemma zone”. Placing workloads that fall into this area is hardest, as there are arguments for both locations; be very aware of all of the risks and benefits of both options before taking any decision.

Four-box matrix with Impact on Production on the horizontal axis and Time to Impact on the vertical axis, filled with three coloured areas: mostly bottom-left is Edge, mostly top-right is Cloud, with the dilemma zone in the middle.
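One way to make the two factors explicit is a simple placement helper; the thresholds below are illustrative placeholders, not recommendations.

```python
def suggest_placement(impact: float, hours_to_impact: float) -> str:
    """Map the two decision factors (impact scored 0-1) onto a location."""
    if impact >= 0.7 and hours_to_impact <= 4:
        return "edge"            # high impact, little warning: on-premise, consider HA
    if impact <= 0.3 and hours_to_impact >= 24:
        return "cloud"           # low impact, long runway: cloud is fine
    return "dilemma zone"        # weigh the risks and benefits case by case
```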

Data Persistence

The decision as to where data is stored is usually closely tied to where it is being used; critical on-premise applications typically require on-premise data. However, there are other influences on where data should be stored.

Factors to consider

  • Data Volume
  • Legislation
  • Retention Period
      • On-prem – days or weeks
      • Cloud – years or decades
  • Query patterns & complexity
  • Who needs access

Discussion

In general terms the cloud has the advantage when it comes to bulk data storage, but there are times when data cannot reside only in the cloud. The main reasons why some data remains on-premise are:

1) Legislation – from discussions with customers in the pharma industry, there is a general interpretation of FDA regulations that the primary data used to maintain process validation has to remain on-premise, where the material was produced.

2) Critical interaction – as per the User Interaction section earlier, on-premise production-critical applications need on-prem data to support them.

A simple implementation of this could look like:

Block diagram showing source data being sent to Edge device and within that device being persisted in a database and served via a web server or BI tool

A worked example for setting up this type of solution can be found at:

Whilst the above example works, and the model has been used by a number of customers, there is also a need to discuss where in the on-premise network to store the data. Data acquisition for IoT usually occurs in ISA-95 network level 2 or 3, depending on the implementation choices made. However, if data storage and applications also reside at this level, it heavily restricts who can access the data and applications. This limitation can be overcome by using the nested capabilities of IoT Edge to locate the persistence service(s) and application(s) at level 5.

An implementation of this model would look like:

Block diagram showing the typical ISA-95/Purdue network levels, with data being generated at level 2 and acquired by IoT Edge at level 3. The data is then passed up level by level using nested IoT Edge to level 5, where it is persisted in a local SQL Edge instance and visualised using Grafana. Selected data is also sent to IoT Hub in the cloud from this level.
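As a minimal sketch of the persistence step above, an Edge module might write telemetry into the local Azure SQL Edge container via pyodbc. The server alias, database, table and credentials are all placeholders for illustration.

```python
import pyodbc

# Assumes an SQL Edge module reachable as "sqledge" on the Edge network.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=sqledge,1433;DATABASE=Telemetry;"          # placeholder names
    "UID=sa;PWD=<password>;TrustServerCertificate=yes;"
)

def persist_reading(tag: str, value: float, timestamp: str) -> None:
    # Parameterised insert into a pre-created telemetry table.
    cursor = conn.cursor()
    cursor.execute(
        "INSERT INTO dbo.Readings (Tag, Value, Ts) VALUES (?, ?, ?)",
        tag, value, timestamp,
    )
    conn.commit()
    cursor.close()
```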

A worked example for setting up this architecture can be found at:

There is also a video at:

Data Processing

This is an increasingly important part of Edge Capability, especially with AI or ML workloads. Being able to process high volumes of complex data close to the source and send the results to the cloud without having to send the source data is an increasingly common use-case.

Factors to consider

  • Volume of data to be processed
  • Is it feasible to send the data to the cloud?
  • Is the on-premise hardware capable of handling the volume and complexity?
  • Who or what is the consumer of the result, and how quickly is the result required?
  • Is the result of the processing production critical?

Discussion

The decision as to where to base these workloads will vary by use-case, as each has its own criteria.

Data Transformation

If the task is protocol translation, then the Edge is typically the answer, as in many cases the original raw data is unusable and of little value until translated. Doing translation or normalisation at the edge also gives a consistent feed of data into the cloud processing stream even if the data sources are not homogeneous. If there is any edge-based BI service for this data, then the data obviously has to be made usable before being persisted in whatever database the local BI service uses.

However, if the source data is usable and consistent (i.e. a homogeneous device/sensor estate) but needs enriching or converting, then doing the processing in the cloud makes sense: there is only one conversion/enrichment process, and the cloud gives the ability to scale with volume. In many such cases the raw data is persisted (sometimes short term) just to provide the ability to check back on what the original data was in the case of outlier values, i.e. is this a genuine outlier or is there a processing fault?
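A minimal sketch of edge-side normalisation: mapping two hypothetical vendor payload shapes onto one canonical structure before anything downstream sees the data. The field names are illustrative, not real vendor schemas.

```python
from datetime import datetime, timezone

def normalise(payload: dict, vendor: str) -> dict:
    """Translate a vendor-specific reading into the canonical tag/value shape."""
    if vendor == "vendor_a":          # e.g. {"TagName": "...", "Val": ...}
        tag, value = payload["TagName"], payload["Val"]
    elif vendor == "vendor_b":        # e.g. {"id": "...", "reading": ...}
        tag, value = payload["id"], payload["reading"]
    else:
        raise ValueError(f"unknown vendor: {vendor}")
    return {
        "tag": tag,
        "value": float(value),
        "ts": datetime.now(timezone.utc).isoformat(),
        "source": vendor,
    }
```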

Aggregation

As discussed earlier, aggregation is a good way to reduce the volume of data being sent in near real time as telemetry, whilst maintaining data usability/value. However, if the originating volumes are not excessive and both aggregations and raw data are required by the end process then performing and persisting the aggregations in the cloud makes sense.

Inference

Where complex AI/ML models are being run and the volume of source data massively exceeds the output of the inferencing process, edge processing makes a lot of sense. With video and sound analytics, sending the live data to the cloud could become cost prohibitive quite quickly and also lead to latency issues. A service such as Azure Video Analyzer is a great example of this form of edge processing; this model also uses a separate service (Azure Media Services) to send video snippets of interest to the cloud in a measured way.

Block architecture diagram of Azure Video Analyzer showing inferencing of a video feed being performed on an Edge device, with a feed of insights being passed to the cloud.

With some ML tasks that are purely data based, if the result is not time critical, then these can be run in the cloud. As soon as latency or criticality become issues, then the balance moves rapidly towards the Edge.

Training Data

This is a follow-on to the previous section on processing. AI and ML workloads typically require large volumes of training data in order to create and train the model, the exception being where an off-the-shelf model (e.g. object detection) is being used.

In general, the cloud is a better location for ML/AI training. The workload is typically sporadic and compute intensive, so doing it locally requires investment in infrequently used assets. The cloud allows the training capability to be stood up and torn down as required, saving on both capex and opex.

Factors to consider

  • Volume of data required to train model
  • Frequency of training
  • Compute infrastructure requirements for training
  • CI/CD pipeline for publishing trained models

Discussion

The decision logic here depends on the type of model being generated and the type of data being used to train it.

Video/Images

There are two options here:

1) Gather and categorise images locally and then upload them to storage (or AI service) in the cloud.

2) Stream images to the cloud via an Edge-based mechanism, e.g. Blob Storage on IoT Edge, and then classify them once they are there. Once a model is trained, slow the flow of data from Edge to cloud so that just enough images arrive to enable re-training of the model (see the sketch below).
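A minimal sketch of option 2: trickling training images up to Blob Storage at a deliberately throttled rate so the factory network is not saturated. It assumes the azure-storage-blob package; the connection string, container name and rate are placeholders.

```python
import time
from pathlib import Path
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    "<storage-connection-string>", "training-images"   # placeholders
)

def upload_slowly(image_dir: str, per_minute: int = 6) -> None:
    """Upload images one at a time, pacing the transfers."""
    for image in Path(image_dir).glob("*.jpg"):
        with image.open("rb") as data:
            container.upload_blob(image.name, data, overwrite=True)
        # Once the model is trained, reduce per_minute so only enough images
        # flow up to support periodic re-training.
        time.sleep(60 / per_minute)
```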

Data based ML

In cases such as predictive maintenance, you need lots of data, with plenty of examples of failure, in order to train the model; the cloud is the best option here. Depending on the frequency of failure, it may take some years to gather enough data to create a working model. At the point you start collecting data you may not know which data points are key to the model, so it is better to collect too much. Finding out that a particular data point is key when you have not been collecting it is highly frustrating, as you now have to turn on collection and wait for sufficient data again. This can cost years!

Whilst waiting for sufficient data to build predictive models, use simple anomaly detection models to pick up when activity deviates from “normal” (a minimal sketch follows the list below). The value is two-fold:

1) There’s a chance to prevent a failure

2) If the anomaly is related to an impending failure, you now have an example of the data leading up to the condition and can give this information as a starting point to data scientists to look for other instances in historical data.
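As a minimal sketch of such a stop-gap detector, flag any reading more than three standard deviations from a rolling baseline; the window size and threshold are illustrative placeholders.

```python
from collections import deque
from statistics import mean, stdev

class RollingAnomalyDetector:
    """Flags readings far outside the recent 'normal' band for one tag."""

    def __init__(self, window: int = 500, threshold: float = 3.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def is_anomaly(self, value: float) -> bool:
        anomalous = False
        # Only judge once the baseline window is full.
        if len(self.history) == self.history.maxlen:
            mu, sigma = mean(self.history), stdev(self.history)
            anomalous = sigma > 0 and abs(value - mu) > self.threshold * sigma
        self.history.append(value)
        return anomalous
```

Each flagged window can then be kept as labelled evidence for the data scientists, as described in point 2 above.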

If there is historical data located on-premise, this can be copied or migrated to the cloud, which in some cases can speed the development of predictive models.

Other considerations

The main non-data-related considerations in this discussion are capex for equipment and ongoing maintenance costs. Whilst some degree of Edge processing is always required, increasing the number of workloads at the Edge also increases:

  • The Capex cost of the hardware to run it
  • The amount of monitoring and maintenance of Edge equipment and software
  • OS Patching
  • Security updates
  • Gateway software updates
  • Edge component (module) updates
  • System health monitoring
  • If Edge workloads are business critical, there is a need for at least HA capability and potentially DR; Edge-based applications with databases require some form of back-up.

The good news is that there is increasing maturity to help with the workload side of this: Azure Device Update can help automate OS, security and Edge updates; well-written CI/CD pipelines can automate Edge component distribution; and IoT Edge now includes health monitoring that links to Azure Monitor, reducing the effort required to monitor the health of Edge devices and the processes running on them.
