Big DATA = BIG problems

My roots go back to the '80s, when I studied engineering, and my professional journey brought me to solving business problems with data. Like many engineers, it was the affinity for the “rational” that attracted me to the field in the first place, and I'm fascinated by how the "Cloud sharks" secure unprecedented profits of billions while feeding their own model.

The problem is GRAVITY

In classical Newtonian theory, the force of gravity increases with mass but decreases with the square of the distance.
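
For reference, the Newtonian relationship McCrory borrows for the analogy can be written as:

```latex
F = G \frac{m_1 m_2}{r^2}
```

where m_1 and m_2 are the two masses, r is the distance between them, and G is the gravitational constant. Data gravity borrows the intuition: the larger a body of data grows, the stronger the pull it exerts on the applications and services around it.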

McCrory says that we can use this to visualise a key problem with (I)IoT data: the shift of data to cloud and cloud-adjacent colocation environments is creating larger data gravity forces, making it harder to run applications and store data far away from where our data originates. When it comes to application and system mobility, we are still constrained by latency and throughput, which makes such data movement hard, particularly when addressing vast Data Lakes. McCrory also identified the key factors preventing Data Gravity mitigation, including Network Bandwidth, Network Latency, Data Contention vs. Data Distribution, Volume of Data vs. Processing, Data Governance/Laws/Security/Provenance, Metadata Creation/Accumulation/Context, and Failure State.
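
A quick back-of-the-envelope calculation (illustrative numbers, not McCrory's) shows why bandwidth alone makes a large data mass hard to move:

```latex
t = \frac{\text{volume}}{\text{throughput}}
  = \frac{1\,\text{PB}}{10\,\text{Gbit/s}}
  = \frac{8 \times 10^{15}\,\text{bit}}{10^{10}\,\text{bit/s}}
  = 8 \times 10^{5}\,\text{s} \approx 9.3\ \text{days}
```

And that is before latency, contention, egress costs and governance constraints even come into play.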

In the case of data movement between clouds, the real puzzle is how to dilute and reduce data to its most essential form: a sequence of bits and bytes that never repeat themselves. Also known as data deduplication, this technology has been around for many years, but it has always been applied in a self-contained manner; that is, data is de-duplicated within a container, a drive, a host, a cluster, or on the wire.

If it were possible to de-duplicate application data at a global level, across data centers, across clouds, across Data Lakes, and across systems, then we would guarantee a very high level of data availability in every part of the globe, because data becomes ubiquitous and universal. Universal de-duplication makes data ubiquitous and universal, common to every possible application and system, while metadata takes on a vital role: building datasets, enforcing policies, and driving distribution.
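
A minimal sketch of the idea in Python, assuming simple fixed-size chunking and an in-memory fingerprint index standing in for a globally shared one (real systems use content-defined chunking and distributed indexes):

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunks; real systems use content-defined chunking

class DedupStore:
    """Stores each unique chunk once; files become lists of fingerprints (metadata)."""

    def __init__(self):
        self.chunks = {}     # fingerprint -> raw bytes (the "universal" data)
        self.catalog = {}    # name -> ordered fingerprints (the metadata layer)

    def put(self, name: str, data: bytes) -> None:
        fingerprints = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(fp, chunk)   # stored only if never seen before
            fingerprints.append(fp)
        self.catalog[name] = fingerprints

    def get(self, name: str) -> bytes:
        return b"".join(self.chunks[fp] for fp in self.catalog[name])

store = DedupStore()
store.put("site-a/log", b"sensor-frame" * 2000)
store.put("site-b/log", b"sensor-frame" * 2000)   # second copy adds no new chunks
print(len(store.chunks), "unique chunks for 2 logical copies")
```

The metadata (the per-file fingerprint lists) is what turns the shared chunk pool back into usable datasets, which is exactly the vital role described above.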

The wonderful cloud

In general, three key things have made the cloud wonderful. The first is REST, the second is essentially this notion of stateless computing, and the third is databases. What all machines want to do is transform noisy data into durable insights, facts and predictions, which you can store as long as you want, whether in the cloud or in a datacenter; the transformation into higher-level, more valuable insights is what matters.

This notion of REST, stateless computing, and databases, which made the cloud, is absolutely the wrong model for the edge, where you want to do stateful computing, in memory, in (machine) real time.

If you want to save the data, go for the cloud; just don't do it for mission-critical applications and for assets that operate business-critical processes. What intelligent machines need is an in-memory model that is continually and statefully evolved, which can then be embedded into learning, self-training, reinforcement learning, prediction, analysis and meaningful insights.

The REAL world is STATEFUL

The web and IaaS clouds only work because of REST and stateless computing

Stateless and stateful processing are compared here.

Input records are shown as black bars. The left diagram shows how a stateless operation transforms one input record at a time, producing each output based solely on that last record or event (white bar). The diagram on the right shows that a stateful program maintains state across all of the records processed so far and updates it with each new input, so that the output (gray bar) reflects results that take more than one event into account.
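
A minimal sketch of the difference, using a hypothetical stream of temperature readings (the names and numbers are illustrative, not from the diagram):

```python
from typing import Iterable, Iterator

def stateless(readings: Iterable[float]) -> Iterator[float]:
    """Each output depends only on the current record (the 'white bar')."""
    for r in readings:
        yield r * 9 / 5 + 32          # e.g. convert each Celsius reading to Fahrenheit

def stateful(readings: Iterable[float]) -> Iterator[float]:
    """Each output depends on all records seen so far (the 'gray bar')."""
    count, total = 0, 0.0
    for r in readings:
        count += 1
        total += r
        yield total / count           # running average evolves with every event

stream = [20.0, 22.0, 19.5, 23.0]
print(list(stateless(stream)))        # [68.0, 71.6, 67.1, 73.4]
print(list(stateful(stream)))         # [20.0, 21.0, 20.5, 21.125]
```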

Solving BIG problems with MICRO data

We all know by now the many failures driven by big data/learning projects: first, because people follow the hype that “big data is it” and end up with batch insights nobody can use; second, because many don't know how to stand up these complex cloud data pipelines; third, because many don't have the data scientists to do the work; and last but not least, because it is extremely expensive. It's a case of a lot of people with hammers looking for nails instead of going to the toolbox and picking the right tools. Most of the cloud-based learning pipelines and all the big data tooling came out of the cloud-native world. Spark and Hadoop and all these wonderful things came from the cloud-native folks, and they were using them in their way for their kinds of applications. That doesn't mean they apply to the real world and the physical processes that we want to optimise where the action is... @ the edge.

Where to START

We need to solve metadata first, by removing noise and detecting the usable bits and bytes that are valuable and can be fused into data streams. This obviously has to happen as close as possible to the source, in the first sensor data extraction layer where the sensors are bound to the mechanical assets. For AC-motor-driven assets, for example, the sensors for metadata extraction are typically current and voltage transformers. The latest intelligent sensors are able to distil the massive amounts of raw sensor data into universal data sets, reduced to their most essential form, a sequence of bits and bytes that never repeat themselves, which can be fed straight into machine learning models so that the intelligence is distributed at the EDGE and not in the Cloud.
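
A minimal sketch of that first extraction layer, assuming sampled current and voltage waveforms from the transformer inputs (the sampling rate, feature choice and names are illustrative assumptions, not a specific sensor's implementation):

```python
import math

SAMPLE_RATE_HZ = 4000  # assumed sampling rate of the current/voltage transformers

def rms(samples):
    """Root-mean-square value of one waveform window."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def extract_features(current, voltage):
    """Reduce a raw waveform window to a handful of bytes of 'micro data'."""
    i_rms, v_rms = rms(current), rms(voltage)
    return {
        "i_rms": round(i_rms, 3),
        "v_rms": round(v_rms, 3),
        "apparent_power": round(i_rms * v_rms, 1),  # VA, single-phase approximation
    }

# one 20 ms window of a 50 Hz signal, synthesised here in place of real sensor input
n = SAMPLE_RATE_HZ // 50
current = [10 * math.sin(2 * math.pi * 50 * t / SAMPLE_RATE_HZ) for t in range(n)]
voltage = [230 * math.sqrt(2) * math.sin(2 * math.pi * 50 * t / SAMPLE_RATE_HZ) for t in range(n)]
print(extract_features(current, voltage))   # a few numbers instead of thousands of samples
```

The point is the reduction: a few meaningful numbers per window travel onward, not the raw waveform.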

We need ZERO GRAVITY

Breakthroughs in unsupervised learning will enable digital twins of edge devices to self-train from fast data streams at the edge. Rather than saving vast amounts of data for later processing, developers will be able to link to systems they care about to gain key analytical insights and to predict future performance. 

I have now explained the limitations of the stateless REST architecture behind most of today’s web-centric applications. In REST-based applications, much can be gained from scaling out stateless services that rely on a database to manage state and synchronisation. But the penalties are high: processing an event requires a database read, a computation, and a database write. The latency of processing each event is dominated by database round-trip times, which may be hundreds of milliseconds, meaning billions of wasted cycles on the CPU processing the event.
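
To make the "billions of cycles" point concrete (illustrative numbers: a 3 GHz core and a 300 ms database round trip):

```latex
\text{idle cycles per event} \approx 0.3\,\text{s} \times 3 \times 10^{9}\,\text{cycles/s}
                             = 9 \times 10^{8}\ \text{cycles}
```

Close to a billion cycles spent waiting per event, before any useful analysis is done.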


In the reactive paradigm of edge computing applications, with stateful architectures such as Akka, Erlang, Orleans, and Swim, each event is processed by a stateful actor: an active object with code and state that persists between events. Those billions of CPU cycles can be put to use to perform analysis, training, and predictions in machine real time. Tasks that demand a lot of resources in the cloud can be trivially accomplished on modest devices at the edge. Rather than a database, the state of the system is the state of the actors, each of which statefully represents a row and some number of columns of the data and offers a real-time API for updates.
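
A minimal sketch of the actor idea in plain Python, standing in for frameworks like Akka or Swim (the class and method names are illustrative assumptions, not any framework's API):

```python
class MotorTwin:
    """A stateful 'digital twin' actor: its state lives in memory, not in a database."""

    def __init__(self, asset_id: str):
        self.asset_id = asset_id
        self.event_count = 0
        self.mean_current = 0.0          # running mean, updated per event

    def on_event(self, i_rms: float) -> dict:
        # check against history, then update state in place;
        # no read/compute/write round trip to a database
        alert = self.event_count > 0 and i_rms > 1.5 * self.mean_current
        self.event_count += 1
        self.mean_current += (i_rms - self.mean_current) / self.event_count
        return {
            "asset": self.asset_id,
            "events": self.event_count,
            "mean_current": round(self.mean_current, 3),
            "alert": alert,              # trivial real-time check
        }

twin = MotorTwin("motor-17")
for reading in [7.1, 7.0, 7.2, 11.9]:    # the last reading is an outlier
    print(twin.on_event(reading))
```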


Breakthroughs in the efficiency of in-stream training are needed to permit accurate analysis and predictions; this requires actively learning at the edge from data rather than through a convoluted cloud-based training workflow. It relies on developing algorithms that learn the way each of us learns: make a hypothesis and compare the result with the real world. If you’re wrong, the error can be used to adjust the model for next time. If you’re right (or close enough), the model is trained. This type of black-box learning, coupled with the actor model above, enables us to deliver digital twins that learn. Of course I’ve oversimplified the problem, but a tangible goal is to create generic learning frameworks that only require non-experts to set a few hyper-parameters that match their real-world intuition about the behavior of the environment.
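
A minimal sketch of that predict-compare-adjust loop, using a one-parameter online model with a single learning-rate hyper-parameter (the data and update rule are illustrative assumptions, not a specific product's algorithm):

```python
class OnlineModel:
    """Learns in-stream: predict, observe the real outcome, adjust from the error."""

    def __init__(self, learning_rate: float = 0.1):
        self.learning_rate = learning_rate   # the one hyper-parameter a non-expert sets
        self.weight = 0.0                    # model: prediction = weight * input

    def predict(self, x: float) -> float:
        return self.weight * x               # the hypothesis

    def learn(self, x: float, observed: float) -> float:
        error = observed - self.predict(x)   # compare with the real world
        self.weight += self.learning_rate * error * x   # adjust for next time
        return error

model = OnlineModel(learning_rate=0.05)
stream = [(1.0, 2.1), (2.0, 3.9), (1.5, 3.0), (3.0, 6.2)]  # (input, observed) pairs
for x, observed in stream:
    err = model.learn(x, observed)
    print(f"prediction error {err:+.2f}, weight now {model.weight:.2f}")
```

Each event both produces a prediction and improves the model, with no round trip to a cloud training pipeline.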
