Edge or Cloud? - Placing IIoT Workloads
Intent
This document is intended to advise those planning an Industrial Internet of Things (IIoT) implementation on where to place workloads within the data communications pipeline. The expectation is that the scenario is “connected”: there is intended to be a permanent internet (or intranet) connection between the factory and the cloud. Loss of this connection is an unwanted event, but one that needs to be considered and accounted for. Other Edge scenarios exist where connectivity to the cloud is expected or designed to be sporadic; this document does not directly cover those scenarios.
Hybridisation of Cloud IoT
With the increasing hybridisation of what were, a few years ago, purely cloud-based IoT solutions, there is a growing tendency to keep some aspects of the solution “on prem” or “on Edge”. Edge solutions are now far more than ingestion engines; they provide capabilities spanning data acquisition and ingestion, user interaction, data persistence, data processing and more.
The sections that follow set out the factors to consider when deciding where to place each of these workloads.
Glossary
Edge – The environment where the data is initially acquired or created. This can be equated with the phrase “on-premise”… but in an IIoT world, the Edge is generally seen as the factory floor, ISA-95 levels 2 and 3. However, as seen later in this document, the definition of Edge can stretch to cover all levels of the ISA-95 hierarchy.
Cloud - A term used to describe a global network of servers, each with a unique function. Specifically in relation to this document, the Azure cloud provided by Microsoft.
Hybrid – An architecture that uses a mix of Edge and Cloud technology to achieve its aims.
Protocol - A communication protocol is a system of rules that allows two or more entities of a communications system to transmit information via any kind of variation of a physical quantity. The protocol defines the rules, syntax, semantics and synchronization of communication and possible error recovery methods.
Protocol Translation – the means of taking data in a structure defined by a specific protocol and transforming it so that it meets the standards of another.
Data Ingestion – reception of data into a data collection device via a specific protocol.
Inference/Inferencing – the use of AI or ML models to determine if a particular condition has been met.
AI – Artificial Intelligence - the theory and development of computer systems able to perform tasks normally requiring human intelligence, such as visual perception, speech recognition, decision making, and translation between languages.
ML – Machine Learning - the use and development of computer systems that are able to learn and adapt without following explicit instructions, by using algorithms and statistical models to analyse and draw inferences from patterns in data.
Typical IIoT Architecture
In the IIoT environment there are typically PLCs producing data in a variety of formats. Current best practice is to use translation software to convert all of this disparate data into a single format, usually OPC-UA. There are a number of off-the-shelf applications that can do this, and many factories already run one of them as a data interchange layer within the factory environment. For those that don’t, many of the vendors now offer containerised versions of these applications that can run as modules within the Edge environment.
IoT Edge is Microsoft’s edge processing environment: software that can run on a variety of hardware and operating systems. Within IoT Edge, containerised workloads (modules) process data, and routes can be defined to move data between these modules and/or up to the cloud. Which modules run on an IoT Edge device, and the routes between them, are defined from the cloud using IoT Hub.
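As an illustration of how a workload slots into this pipeline, here is a minimal sketch of a Python IoT Edge module that receives messages on one input, enriches them and forwards them to an output. It assumes the azure-iot-device SDK; the input/output names and the enrichment are placeholders that would need to match the routes declared in the deployment manifest.

```python
# Minimal IoT Edge module sketch (assumes the azure-iot-device v2 SDK).
# It listens on the module input "telemetry_in", tags each message and
# forwards it to the output "upstream"; the route from "upstream" to
# IoT Hub (or to another module) lives in the deployment manifest.
import json
from azure.iot.device import IoTHubModuleClient, Message

def main():
    client = IoTHubModuleClient.create_from_edge_environment()

    def handle_message(message):
        if message.input_name != "telemetry_in":       # ignore other inputs
            return
        payload = json.loads(message.data)             # normalised JSON assumed
        payload["processedAtEdge"] = True              # illustrative enrichment
        client.send_message_to_output(Message(json.dumps(payload)), "upstream")

    client.on_message_received = handle_message
    client.connect()
    input("Module running, press Enter to stop.\n")
    client.shutdown()

if __name__ == "__main__":
    main()
```

The module itself knows nothing about where “upstream” leads; that mapping is defined from IoT Hub as part of the deployment, which is what makes the routing so flexible.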
Once data is in the cloud, multiple applications can use it to provide the required insights and actions. Avoiding separate data acquisition strategies and data silos is a key aspect of efficient architecture design in these scenarios.
Data Acquisition and Ingestion
Factors to consider:
Discussion
Many early-stage IoT projects have unrealistic targets for real-time data acquisition; this expectation can be summarised as “everything, now!”. Whilst IoT solutions and the cloud have hyperscale capabilities, there are other implications to sending all data to the cloud in near real time, most obviously the network bandwidth consumed and the cost of ingesting, processing and storing data that may never be used.
Mitigations
If you do have to send large volumes of data, investigate how the messages are to be sent. Typical IoT demonstrations use plain-text JSON structures; these are easy to work with, and many backend systems accept them. However, the message overhead makes such data structures inefficient: the structure surrounding the actual data point typically dwarfs it. Moving to a more efficient data structure is therefore one option, albeit at an increased compute cost in the cloud to transform the data back into a more easily usable format. Other options, such as aggregation and batch file transfer, are discussed below under Alternatives to Telemetry.
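To make the overhead point concrete, the snippet below compares a single reading encoded as JSON with a packed binary equivalent; the field names, values and packing scheme are purely illustrative.

```python
# Illustrative size comparison for one sensor reading (hypothetical fields).
import json
import struct

reading = {"deviceId": "press-07", "sensor": "hydraulic_pressure",
           "timestamp": 1700000000, "value": 182.4}

as_json = json.dumps(reading).encode("utf-8")

# Packed alternative: 4-byte device index, 4-byte sensor index,
# 8-byte epoch timestamp, 8-byte float (lookup tables resolved in the cloud).
as_binary = struct.pack("<IIqd", 7, 3, reading["timestamp"], reading["value"])

print(len(as_json), "bytes as JSON")    # roughly 100 bytes
print(len(as_binary), "bytes packed")   # 24 bytes
```

The trade-off is exactly as described above: the packed form needs extra work (and agreed lookup tables) in the cloud before it becomes usable.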
Alternatives to Telemetry
In the early stages of the project, determine the value of the data: what is it used for, why is it important, and how long can the business cope without it? There will usually be a subset of the data that is needed in the cloud as soon as possible, but large swathes of it can be sent “later”. The simplest way of doing this is to write the data to an efficient file format (not JSON or XML) and then use a file transfer mechanism that transmits the file without saturating network bandwidth. The diagram below shows a dual strategy: aggregate data is sent in near real time and the granular data is sent as files via a batch process.
Once in the cloud, the data in the files can be transformed and written into data stores, or even run through streaming analytics engines to look for patterns and events.
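A minimal sketch of the batch path, assuming the azure-storage-blob SDK, might look like this; the container name, blob naming scheme and connection string are placeholders.

```python
# Buffered granular readings are written as a gzip-compressed CSV and
# uploaded in a single transfer (assumes the azure-storage-blob SDK).
import csv
import gzip
import io
from azure.storage.blob import BlobClient

def upload_batch(readings, connection_string):
    """readings: iterable of (deviceId, timestamp, tag, value) tuples."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["deviceId", "timestamp", "tag", "value"])
    writer.writerows(readings)
    compressed = gzip.compress(buffer.getvalue().encode("utf-8"))

    blob = BlobClient.from_connection_string(
        conn_str=connection_string,
        container_name="granular-telemetry",         # hypothetical container
        blob_name="line1/2024-05-01T10.csv.gz")      # hypothetical naming scheme
    blob.upload_blob(compressed, overwrite=True)
```

Bandwidth can be further controlled by scheduling these uploads for quiet periods or rate-limiting the transfer.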
User Interaction
The whole aim of an IoT solution is to get data or insights to their consumers at the time they are needed. For operational staff who depend on those insights, locating the solution in the cloud puts factory operations at risk, as connections to the cloud cannot be guaranteed. In these cases, locating this part of the solution “on-premise” gives better availability, at the cost of local infrastructure that needs to be acquired and maintained.
Factors influencing decision
Discussion
Solutions that enable operationally critical, data-driven decisions will need to be based on-premise. In some cases, what counts as operationally critical has only been determined once there has been a network outage and staff have been unable to access particular cloud-based information. The decision process for what to place where typically comes down to two factors:
1) Impact on Production
2) Time to impact – i.e. how long can production survive without this data?
Anything where the absence of data will cause a high level of impact on production capability in a small space of time needs to be based on-premise; given the criticality, HA solutions should be considered. As the impact decreases and the time to impact lengthens, workloads can increasingly be moved to the cloud.
In between these two fairly obvious choices lies the “dilemma zone”. Placing workloads that fall into this area is hardest, as there are arguments for both locations; be very aware of all of the risks and benefits of both options before taking any decision.
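A deliberately simplified sketch of this placement logic is shown below; the thresholds and categories are illustrative only, and any real decision still needs the risk/benefit analysis described above.

```python
# Toy model of the two-factor decision: impact on production and time to impact.
def suggest_placement(production_impact: str, hours_until_impact: float) -> str:
    """production_impact: 'high', 'medium' or 'low' (illustrative categories)."""
    if production_impact == "high" and hours_until_impact < 4:
        return "on-premise (consider HA)"
    if production_impact == "low" and hours_until_impact > 24:
        return "cloud"
    return "dilemma zone - assess risks and benefits case by case"

print(suggest_placement("high", 1))     # on-premise (consider HA)
print(suggest_placement("medium", 8))   # dilemma zone - assess risks and benefits...
```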
Data Persistence
The decision as to where data is stored is usually closely tied to where it is being used, critical on-premise applications typically require on-premise data. However, there are other influences on where data should be stored.
Factors to consider
Discussion
In general terms, the cloud has the advantage when it comes to bulk data storage, but there are times when data cannot reside only in the cloud. The main reasons why some data remains on-premise are:
1) Legislation – from discussions with customers in the pharmaceutical industry, there is a general interpretation of FDA regulations that the primary data used to maintain process validation has to remain on-premise at the site where the material was produced.
2) Critical interaction – as per the User Interaction section earlier, production-critical on-premise applications need on-premise data to support them.
A simple implementation of this could look like:
A worked example for setting up this type of solution can be found at:
Whilst the above example works and the model has been used by a number of customers, there is also a need to consider where in the on-premise network to store the data. Data acquisition for IoT usually occurs at ISA-95 network level 2 or 3, depending on the implementation choices made; however, if data storage and applications also reside at this level, it heavily restricts who can access the data and applications. This limitation can be overcome by using the nested capabilities of IoT Edge to locate the persistence service(s) and application(s) at level 5.
An implementation of this model would look like:
A worked example for setting this architecture up can be found at:
There is also a video at:
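As a generic illustration of edge-side persistence, the sketch below uses SQLite as a stand-in for whatever containerised store (for example Azure SQL Edge or another database module) is actually deployed; the volume path and schema are assumptions.

```python
# Edge persistence sketch: messages arriving at the persistence module are
# written to a local store on a mounted volume (SQLite used as a stand-in).
import json
import sqlite3

conn = sqlite3.connect("/data/telemetry.db")       # path on a mounted volume (assumed)
conn.execute("""CREATE TABLE IF NOT EXISTS telemetry
                (device_id TEXT, ts INTEGER, tag TEXT, value REAL)""")

def persist(message_body: bytes):
    reading = json.loads(message_body)
    conn.execute("INSERT INTO telemetry VALUES (?, ?, ?, ?)",
                 (reading["deviceId"], reading["timestamp"],
                  reading["tag"], reading["value"]))
    conn.commit()
```

In a nested deployment the same persistence module can be placed at whichever level is appropriate; IoT Edge's parent/child configuration determines how it is reached from lower levels.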
Data Processing
This is an increasingly important part of Edge capability, especially for AI or ML workloads. Being able to process high volumes of complex data close to the source and send only the results to the cloud, without having to send the source data, is an increasingly common use case.
Factors to consider
Discussion
The decision as to where to base these workloads will vary by use case, as each has its own criteria.
Data Transformation
If the task is protocol translation, then the Edge is typically the answer, as in many cases the original raw data is unusable and of little value until translated. Doing translation or normalisation on the Edge also gives a consistent feed of data into the cloud processing stream even if the data sources are not homogeneous. If there is any Edge-based BI service for this data, then the data obviously has to be made usable before being persisted in whatever database the local BI service uses.
However, if the source data is usable and consistent (i.e. a homogeneous device/sensor estate) but needs enriching or converting, then doing the processing in the cloud makes sense: there is only one conversion/enrichment process, and the cloud gives the ability to scale with volume. In many such cases the raw data is persisted (sometimes only short term) just to provide the ability to check what the original values were when outliers appear – is this a genuine outlier, or is there a processing fault?
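The sketch below illustrates the kind of edge-side normalisation described above: two hypothetical vendor payload shapes are mapped onto a single common schema before the data is sent upstream.

```python
# Normalisation sketch: map heterogeneous (hypothetical) vendor payloads
# onto one common schema so the cloud pipeline sees a consistent feed.
from datetime import datetime, timezone

def normalise(vendor: str, payload: dict) -> dict:
    if vendor == "vendor_a":       # e.g. {"id": "plc1", "t": 1700000000, "val": 42.1}
        return {"deviceId": payload["id"],
                "timestamp": payload["t"],
                "value": payload["val"]}
    if vendor == "vendor_b":       # e.g. {"source": "plc2", "reading": 42.1}
        return {"deviceId": payload["source"],
                "timestamp": int(datetime.now(timezone.utc).timestamp()),
                "value": payload["reading"]}
    raise ValueError(f"Unknown vendor format: {vendor}")
```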
Aggregation
As discussed earlier, aggregation is a good way to reduce the volume of data being sent in near real time as telemetry whilst maintaining its usability and value. However, if the originating volumes are not excessive and both the aggregations and the raw data are required by the end process, then performing and persisting the aggregations in the cloud makes sense.
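As a simple illustration of edge-side aggregation, the sketch below reduces raw readings to one summary record per tag per minute before they are sent as telemetry; the one-minute window is an arbitrary choice.

```python
# Aggregation sketch: min/max/avg/count per tag per one-minute window.
from collections import defaultdict
from statistics import mean

def aggregate(readings):
    """readings: iterable of (tag, epoch_seconds, value) tuples."""
    buckets = defaultdict(list)
    for tag, ts, value in readings:
        buckets[(tag, ts // 60)].append(value)

    return [{"tag": tag,
             "windowStart": minute * 60,
             "min": min(values), "max": max(values),
             "avg": mean(values), "count": len(values)}
            for (tag, minute), values in buckets.items()]
```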
Inference
Where complex AI/ML models are being run and the volume of source data massively exceeds the output of the inferencing process, edge processing makes a lot of sense. With video and sound analytics, sending the live data to the cloud would quickly become cost prohibitive and would also introduce latency issues. Azure Video Analyzer is a great example of this form of Edge processing; it also uses a separate service (Azure Media Services) to send video snippets of interest to the cloud in a measured way.
Some ML tasks that are purely data based can be run in the cloud if the result is not time critical. As soon as latency or criticality become issues, the balance moves rapidly towards the Edge.
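As a sketch of what edge inferencing on purely data-based signals might look like, the snippet below runs a local ONNX model and produces only a small result message; the model path, input shape and label semantics are placeholders for whatever model is actually deployed.

```python
# Edge inferencing sketch using ONNX Runtime: the raw window of data never
# leaves the Edge, only the classification result does.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("/models/classifier.onnx")   # hypothetical model file
input_name = session.get_inputs()[0].name

def classify(window: np.ndarray) -> dict:
    """window: float32 array shaped to match the model's expected input."""
    scores = session.run(None, {input_name: window})[0]
    return {"label": int(np.argmax(scores)), "confidence": float(np.max(scores))}
```

The same pattern applies to video and audio analytics, where the ratio between source volume and result size is even more extreme.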
Training Data
This is a follow-on to the previous section on processing. AI and ML workloads typically require large volumes of training data in order to create and train the model, the exception being where an off-the-shelf model (e.g. object detection) is being used.
In general, the cloud is a better location for ML/AI training. The workload is typically sporadic and compute intensive, so doing it locally requires investment in infrequently used assets. The cloud allows the training capability to be stood up and torn down as required, saving on both capex and opex.
Factors to consider
Discussion
The decision logic here depends on the type of model being generated and the type of data being used to train the model.
Video/Images
There are two options here:
1) Gather and categorise images locally and then upload them to storage (or AI service) in the cloud.
2) Stream images to the cloud via an Edge-based mechanism, e.g. Blob Storage on IoT Edge, and then classify them once they are there. Once a model is trained, slow the flow of data from Edge to cloud so that just enough images are retained to enable re-training of the model.
Data based ML
In cases such as predictive maintenance, you need lots of data with plenty of examples of failure in order to train the model, and the cloud is the best option here. Depending on the frequency of failure, it may take some years to gather enough data to create a working model. At the point you start collecting data you may not know which data points are key to the model, so it is better to collect too much. Finding out that a particular data point is key when you have not been collecting it is highly frustrating, as you now have to turn on collection and wait for sufficient data all over again. This can cost years!
Whilst waiting for sufficient data to build predictive models, use simple anomaly detection models to pick up when activity deviates from “normal” (a minimal sketch follows the list below). The value is two-fold:
1) There’s a chance to prevent a failure
2) If the anomaly is related to an impending failure, you now have an example of the data leading up to the condition and can give this information as a starting point to data scientists to look for other instances in historical data.
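A minimal example of the kind of simple anomaly detection mentioned above is a rolling z-score check; the window size, warm-up length and threshold below are illustrative and would need tuning per signal.

```python
# Rolling z-score anomaly detection: flag values far from the recent mean.
from collections import deque
from statistics import mean, stdev

class RollingZScore:
    def __init__(self, window: int = 500, threshold: float = 4.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def is_anomaly(self, value: float) -> bool:
        anomalous = False
        if len(self.history) >= 30:                    # wait for a minimal baseline
            mu, sigma = mean(self.history), stdev(self.history)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                anomalous = True
        self.history.append(value)
        return anomalous
```

Flagged windows can be persisted alongside the raw data leading up to them, giving data scientists exactly the kind of starting point described in point 2 above.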
If there is historical data located on-premise, this can be copied or migrated to the cloud, which in some cases can speed up the development of predictive models.
Other considerations
The main non-data-related considerations in this discussion are capex for equipment and the cost of maintenance. Whilst some degree of Edge processing is always required, increasing the number of workloads on the Edge also increases the local hardware requirement and the effort needed to deploy, update and monitor those workloads.
The good news is that there is increasing maturity to help with the workload side of this: Azure Device Update can help automate OS, security and Edge updates; well-written CI/CD pipelines can automate the distribution of Edge components; and IoT Edge now includes health monitoring that links to Azure Monitor, reducing the effort required to monitor the health of Edge devices and the processes running on them.