
Productionizing Machine Learning (ML) pipelines in real time - What can be learned from the past?

Yogesh K.

Enabling organizations with their Data & GenAI transformation journey

Published: March 2, 2019

I was reading a slide deck on SlideShare, "Big data architecture: Hadoop and Data Lake", published by William EL KAIM (Enterprise Architect). It is a three-part series worth skimming for anyone who works with big data and its associated applications and tools.

While reading, a few slides caught my attention (slide 53 in part 1), and I realized that it is still hard to automate and productionize Machine Learning (ML) pipelines in real time. The cloud vendors AWS, Azure, and Google have their own integrated environments, but when I researched further I found a slide and a video about Databricks.

The question is whether the industry has learned its lessons from the past (focus on the business, not on the moving parts of technology). I do not think so.

Current state of ML tools: Notice the percentage; it has to rise from 7-8% to above 50%.

Current state of popularity of ML libraries: These libraries should sit behind ML tools, and the tools should be intelligent enough to recommend libraries based on the business case, validate the choice against the data, and provide recommendations.

Background: There are many moving parts in the big data space (Hadoop, Spark, Hive, Kafka, etc.), and each of them requires its own expertise. Technology companies invested in grooming talent and succeeded in building big data applications and products, first on premises, but slowly moved to a cloud vendor once they realized the overhead.

A Gartner report from a couple of years ago pointed out many reasons why big data implementations were delayed across the industry. The biggest one was that it is very hard to set up an environment (clusters) on premises: a business loses its focus and tries to do what it is not good at. Thus businesses and state and federal applications are now moving rapidly to the cloud. Billions of dollars were wasted to understand this fact. The question now is whether business, state, and federal decision-makers can afford another expensive mistake before learning that AI (the next big push) requires an integrated environment in the cloud.

The time is not far off when business, state, and federal customers will demand real-time AI (know before it happens). Data (lots of it) is the food for Machine Learning and AI, and an integrated AI environment makes it easier to harness the hidden intelligence in data without being diverted by technology and its moving parts. Data engineers and data scientists still struggle to find a seamless environment in which to conduct experiments and publish them at the click of a button.

Azure, AWS, and Google are working on it, and I hope they get better in the coming months and years.

Databricks is one of the vendors that is well integrated with Azure and AWS and has the potential to run ML models on live data.

The Databricks Community Edition (free) is worth a try.

#AWS, #Azure, #Google, #MachineLearning

Source of pictures: https://businessoverbroadway.com/2019/01/27/most-popular-machine-learning-frameworks-and-products-used-by-data-professionals/
