Productionizing Machine Learning (ML) pipelines in real time - What can be learned from the past?
I was reading slides on SlideShare: "Big data architecture: Hadoop and Data Lake", published by William EL KAIM (Enterprise Architect). It is a three-part series worth skimming for anyone who works with big data and its associated applications and tools.
During my read, a few slides caught my attention (slide 53 in part 1), and I realized that it is still hard to automate and productionize Machine Learning (ML) pipelines in real time. Cloud vendors AWS, Azure, and Google have their own integrated environments, but when I did further research I found a slide and video about Databricks.
The question is: has the industry learned its lessons from the past (focus on the business and not on the moving parts of technology)? I do not think so.
Current state of ML tools: notice the percentage; it has to rise from 7-8% to above 50%.
Current popularity of ML libraries: these libraries should sit behind ML tools, and the tools should be intelligent enough to recommend libraries based on the business case, validate the choice against the data, and provide recommendations.
Background: There are many moving parts in the big data space (Hadoop, Spark, Hive, Kafka, etc.), and each of them needs its own expertise. Technology companies invested in grooming talent and were successful in building big data applications and products, first on premises, but they slowly moved to a cloud vendor once they realized the overhead.
A Gartner report from a couple of years ago pointed out many reasons why big data implementations were delayed across the industry. The biggest one was that it is very hard to set up an environment (clusters) on premises: a business loses its focus and tries to do what it is not good at. Thus businesses and state and federal applications are now moving rapidly to the cloud. Billions of dollars were wasted learning this. The question now is whether business, state, and federal decision-makers can afford another expensive mistake to learn that AI (the next big push) requires an integrated environment in the cloud.
The time is not far off when business, state, and federal customers will demand real-time AI (know before it happens). Data (lots of it) is the food for Machine Learning and AI, and an integrated AI environment makes it easier to harness the hidden intelligence in data without being diverted by technology and its moving parts. Data engineers and data scientists still struggle to find a seamless environment in which to conduct experiments and publish them at the click of a button.
Azure, AWS, and Google are working on it and will hopefully get better in the coming months and years.
Databricks is one vendor that is well integrated with Azure and AWS and has the potential to run ML models on live data.
The Databricks Community Edition (free) is worth a try.
#AWS, #Azure, #Google, #MachineLearning
Source of pictures: https://businessoverbroadway.com/2019/01/27/most-popular-machine-learning-frameworks-and-products-used-by-data-professionals/
Software Engineer - Data Platform
4 years ago: Databricks is an awesome tool. I played with it and designed a few ETL jobs using AWS.