My Highlights from ODSC West '23

Last week I had the opportunity to attend the Open Data Science Conference in San Francisco. This year the attention shifted from traditional Machine Learning and MLOps to Large Language Models, LLMOps, and GenAI. Here are some key takeaways from my four days in the heart of Silicon Valley:

Large Language Models require a change in the way we think about Cloud Infrastructure.

  • Prior to the advent of LLMs, ML cloud infrastructure was designed mostly to provide the compute power and storage space to train models at scale. To support LLM use cases, enterprises need to leverage vector databases, and integrating them into your ML applications will play a determining role in the success of your use case in production. In addition, traditional ML use cases revolved around frameworks such as Spark MLlib, Dask, and XGBoost, whose model objects rarely reach a significant file size. Nowadays, just loading a GPU-trained model into your platform can represent a significant engineering challenge. Even with the rise of LangChain and the HuggingFace Transformers APIs, enterprise cloud storage and MLOps pipelines need to be rethought to accommodate models of much larger size, potentially requiring integration with advanced DevOps tooling and increased cloud resources. A minimal sketch of the vector retrieval pattern follows this list.
  • Distributed GPU workloads are becoming increasingly common. Until now these were primarily focused on training Deep Learning models from scratch. With LLMs, distributed GPU sessions are now increasingly vital at the model serving stage. Ray Serve is a framework for scaling production models to potentially thousands of replicas and batch-serving requests with GPUs, and Ray more broadly aims to simplify the challenges of training and fine-tuning models at scale; see the serving sketch after this list.
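To make the vector database point concrete, here is a minimal sketch of the retrieval pattern, using FAISS as a stand-in for a managed vector database. The embedding dimensionality and the vectors themselves are placeholders; in practice the embeddings would come from an embedding model applied to your documents.

```python
# Minimal sketch of the vector-retrieval pattern behind many LLM use cases,
# using FAISS as a stand-in for a vector database. Embeddings are random
# placeholders standing in for real document embeddings.
import faiss
import numpy as np

dim = 384                                  # embedding dimensionality (placeholder)
doc_embeddings = np.random.rand(10_000, dim).astype("float32")

index = faiss.IndexFlatL2(dim)             # exact L2 nearest-neighbor index
index.add(doc_embeddings)                  # ingest document vectors

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, k=5)  # top-5 most similar documents
print(ids[0])                              # ids of context to feed into the LLM prompt
```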
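And to illustrate the serving point, a minimal Ray Serve sketch using the Ray 2.x API. The model choice, replica count, and GPU allocation here are illustrative assumptions, not a prescribed setup.

```python
# Minimal Ray Serve sketch: scale a text-generation model across GPU replicas.
# Model name, replica count, and GPU allocation are illustrative assumptions.
from ray import serve
from transformers import pipeline

@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
class Generator:
    def __init__(self):
        # Each replica loads its own copy of the model onto its GPU.
        self.pipe = pipeline("text-generation", model="gpt2", device=0)

    async def __call__(self, request):
        prompt = (await request.json())["prompt"]
        return self.pipe(prompt, max_new_tokens=50)[0]["generated_text"]

app = Generator.bind()
serve.run(app)  # exposes an HTTP endpoint on localhost:8000
```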

MLOps vs LLMOps: the concept of the model pipeline is rapidly changing

  • In the context of traditional ML use cases, the model factory was a common pattern: a conveyor-belt-style workflow taking experiments built with frameworks such as Spark MLlib through a centralized QA testing process and on to a final production API endpoint. With LLMs, models are not trained from scratch but loaded through a third-party API and fine-tuned with proprietary, contextual data. Compared to traditional ML use cases, LLMs are fine-tuned less frequently but on larger, unstructured datasets. According to some speakers, LLMs will inevitably lead to fewer models in production, each applied to a broader range of ML tasks. However, not everybody at the conference seemed to agree.
  • LLM model selection and evaluation are more difficult. Whereas traditional ML use cases involved comparing models by means of established metrics and mature academic research, LLM evaluation techniques are new and less well understood. Quantifying ROI from LLM use cases is even more challenging, and quantifying the risk associated with a badly tuned production LLM is a similarly complex problem.
  • Establishing proper ML governance around LLM requirements requires rethinking MLOps pipelines and tooling. DevOps and data engineering pipelines supporting the development and maintenance of production LLMs require an increased emphasis on data security, user security, and governance. Datasets no longer necessarily reside in structured tables that can be easily monitored and queried via traditional cloud and data warehousing tools. Similarly, experimentation pipelines supporting model fine-tuning need to integrate with leading LLM APIs (e.g. HuggingFace Transformers). MLflow has been adding support for new integrations and for now remains the de facto standard for machine learning engineering in the enterprise, but the conference saw a number of vendors vying for position with tooling designed solely for LLM use cases. A minimal experiment-tracking sketch follows this list.
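As a concrete illustration of the tracking side, here is a minimal MLflow sketch for logging a fine-tuning run. The experiment name, parameters, metric values, and artifact are placeholders, not a prescribed workflow.

```python
# Minimal MLflow sketch for tracking an LLM fine-tuning experiment.
# Experiment name, parameters, and metric values are placeholders.
import mlflow

mlflow.set_experiment("llm-fine-tuning")

with mlflow.start_run(run_name="lora-adapter-v1"):
    mlflow.log_params({
        "base_model": "meta-llama/Llama-2-7b-hf",  # assumed base checkpoint
        "learning_rate": 2e-4,
        "epochs": 3,
    })
    # ... the fine-tuning loop itself would go here ...
    mlflow.log_metric("eval_loss", 1.23)
    # Assumes the fine-tuning loop produced this config file on disk.
    mlflow.log_artifact("adapter_config.json")
```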

To what extent will LLMs impact Data Engineering Pipelines?

  • As mentioned above, Ray has entered the ML space and gained a lot of attention, primarily by lowering the entry barrier for distributed GPU training and by providing integrations with many ML and LLM frameworks and tools. RayOnSpark and SparkOnRay are two recent projects that attempt to integrate Ray and Spark. In general, Spark offers far more advanced capabilities than Ray for data engineering use cases: it is well tested at scale and the de facto standard for map-reduce-based workloads. At least for now, Ray is complementary to Spark and a primary candidate only for machine learning workloads that are not map-reduce based. Although it provides a distributed dataset API, data partitioning capabilities, and the ability to work with data at scale, Ray does not seem a good alternative for building and supporting data engineering pipelines at scale. For the foreseeable future, enterprises should aim at integrating pipelines between Spark and Ray; a minimal handoff sketch follows this list.
  • Feature stores were introduced in the last few years to manage the end-to-end lifecycle of features: from training models, to batch inference, to providing low-latency access to features for online inference by online applications. In simple terms, feature stores allow us to think of ML use cases in terms of actionable, quality data pipelines rather than just disparate models and datasets. It will be really interesting to see if and how feature stores are reinvented in the LLM era, as models will increasingly not require training from scratch. According to industry leaders, however, feature stores will increasingly be used to feed streams of user-specific data to prompting applications in order to supercharge them with up-to-date contextual information (see the prompt-augmentation sketch below). I am very curious to see how this develops. In either case, it seems that feature stores will not go away, as they play a central role in traditional ML use cases.
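Here is a minimal sketch of the Spark-to-Ray handoff pattern described above, assuming the Spark pipeline has already written its feature table to Parquet. The storage paths, column name, and scoring logic are placeholders.

```python
# Minimal sketch of a Spark-to-Ray handoff: Spark handles the heavy data
# engineering and writes Parquet; Ray Data picks it up for distributed,
# GPU-friendly scoring. Paths and scoring logic are placeholders.
import ray

ray.init()

# Read the feature table that the Spark pipeline produced (assumed path).
ds = ray.data.read_parquet("s3://my-bucket/features/")

def score_batch(batch):
    # Stand-in for model inference over a batch of rows.
    batch["score"] = batch["feature_a"] * 0.5  # hypothetical feature column
    return batch

predictions = ds.map_batches(score_batch, batch_format="pandas")
predictions.write_parquet("s3://my-bucket/predictions/")
```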
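And to sketch the prompt-augmentation idea, this is roughly what fetching online features into a prompt might look like with Feast. The feature names, entity key, and repo layout are hypothetical assumptions.

```python
# Hypothetical sketch: pulling fresh, user-specific features from a feature
# store (Feast here) to augment an LLM prompt with up-to-date context.
# Feature names, entity keys, and the repo layout are all assumptions.
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # assumes a configured Feast repo

features = store.get_online_features(
    features=[
        "user_profile:recent_purchases",  # hypothetical feature
        "user_profile:support_tier",      # hypothetical feature
    ],
    entity_rows=[{"user_id": 1001}],
).to_dict()

prompt = (
    f"The user recently purchased {features['recent_purchases'][0]} "
    f"and is on the {features['support_tier'][0]} support tier. "
    "Draft a personalized product recommendation."
)
```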

The conference featured guest speakers from many leading tech innovators. Among them:

  • Google reviewed PaLM 2, a state-of-the-art language model with improved multilingual, reasoning, and coding capabilities. PaLM 2 was made public in May 2023 and is used to generate outputs such as articles, stories, poems, emails, and even programming code; Bard AI is powered by it. The LLM was reportedly trained on a corpus of 3.6 trillion tokens.
  • Meta held a talk on Llama 2, a family of LLMs in the vein of GPT-3 and PaLM 2. The Llama 2 Chat models are optimized for dialogue use cases and outperform open-source chat models on most benchmarks Meta tested. Based on its human evaluations for helpfulness and safety, the company says Llama 2 may be “a suitable substitute for closed source models.” A minimal loading sketch follows this list.
  • Cerebras discussed the Wafer Scale Engine (WSE), a chip roughly 50x larger than the largest Nvidia GPU, containing hundreds of thousands of AI-optimized cores. The chip can be used by enterprises to develop their own GPT models. Recently, Cerebras announced the launch of Jais 30B in collaboration with Core42, a UAE-based national-scale enabler for Gen AI: a 30-billion-parameter pre-trained bilingual LLM for Arabic and English, trained on a dataset containing 126 billion Arabic tokens, 251 billion English tokens, and 50 billion code tokens. Most notably, with the WSE Cerebras aims to reduce the complexity of working with LLMs by giving you a cluster-scale AI compute resource with the programming ease of a single desktop machine.
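For reference, here is a minimal sketch of loading a Llama 2 Chat model through the HuggingFace Transformers pipeline API. It assumes you have been granted access to the gated meta-llama checkpoint and have enough GPU memory available.

```python
# Minimal sketch: loading a Llama 2 Chat model via HuggingFace Transformers.
# Assumes access to the gated meta-llama checkpoint and sufficient GPU memory.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.float16,  # half precision to fit on a single GPU
    device_map="auto",          # spread layers across available devices
)

out = pipe(
    "Summarize the key themes of ODSC West 2023 in two sentences.",
    max_new_tokens=128,
)
print(out[0]["generated_text"])
```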

While the conference was primarily focused on LLMs and GenAI, there were still a few very interesting talks and workshops I had the pleasure of attending:

  • Dynamic Time Warping: a technique for measuring similarity between time series, used in both supervised and unsupervised learning. I first experimented with DTW four years ago in Cloudera Machine Learning, and I was happy to learn that it is finally being applied to heart signals and other healthcare use cases. A minimal sketch of the algorithm follows this list.
  • Monitoring Data Drift: Evidently.AI offers a full suite of tools for monitoring data drift in machine learning use cases. I first used Evidently in Cloudera Machine Learning two years ago with the Model Monitoring Applied Machine Learning Prototype. A minimal drift-report sketch follows this list.
  • Data Labeling: just like feature stores, LLMs have the potential to change the dynamics around this fundamental challenge in machine learning. I learned about Label Studio, a fully open-source tool for annotating data, which recently launched a capability to augment data labeling tasks with an AI-powered agent. I look forward to trying Label Studio in Cloudera Machine Learning.
  • Dagster: a data engineering pipeline orchestrator with data quality capabilities built in from the ground up. It compares itself directly with Great Expectations and Airflow and claims to offer a superior alternative by better aligning data quality checks and orchestration logic within a fully declarative pipeline model, although it does not necessarily aim to replace Great Expectations. I look forward to trying Dagster in Cloudera Machine Learning to orchestrate an MLOps pipeline; a minimal asset sketch follows this list.
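First, the DTW point: a minimal NumPy sketch of the classic dynamic programming recurrence that aligns two series of different lengths. The example signals are synthetic.

```python
# Minimal NumPy sketch of Dynamic Time Warping: the dynamic programming
# recurrence that aligns two time series of different lengths.
import numpy as np

def dtw_distance(x, y):
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])              # local distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# Two similar signals that are out of phase still score as close.
a = np.sin(np.linspace(0, 2 * np.pi, 50))
b = np.sin(np.linspace(0, 2 * np.pi, 60) + 0.3)
print(dtw_distance(a, b))
```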
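Next, the drift-monitoring point: a minimal sketch of a data drift check with Evidently's Report API. The reference and current frames are synthetic placeholders standing in for training-time and production data.

```python
# Minimal sketch of a data drift check with Evidently (Report API).
# The reference/current frames are synthetic placeholders.
import numpy as np
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

rng = np.random.default_rng(42)
reference = pd.DataFrame({"feature": rng.normal(0.0, 1.0, 1000)})
current = pd.DataFrame({"feature": rng.normal(0.5, 1.0, 1000)})  # shifted mean

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")  # per-feature drift tests and plots
```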
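Finally, the Dagster point: a minimal sketch of its declarative asset model with an inline data quality gate. The asset names, the toy data, and the check itself are illustrative assumptions.

```python
# Minimal Dagster sketch: declarative assets with an inline quality check.
# Asset names, the toy data, and the check are illustrative assumptions.
import pandas as pd
from dagster import Definitions, asset, materialize

@asset
def raw_events() -> pd.DataFrame:
    # Stand-in for an extraction step (e.g. reading from object storage).
    return pd.DataFrame({"user_id": [1, 2, None], "amount": [10.0, 5.5, 3.2]})

@asset
def cleaned_events(raw_events: pd.DataFrame) -> pd.DataFrame:
    cleaned = raw_events.dropna(subset=["user_id"])
    # Declarative pipelines still benefit from explicit quality gates.
    assert cleaned["user_id"].notna().all(), "null user_ids survived cleaning"
    return cleaned

defs = Definitions(assets=[raw_events, cleaned_events])

if __name__ == "__main__":
    materialize([raw_events, cleaned_events])  # run the asset graph locally
```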

With this, we come to the end of my highlights from ODSC West '23. In summary, the conference saw a big shift in attention towards LLMs and Gen AI. While the ML industry has witnessed the advent of new tools and products before, this is a truly revolutionary moment, and Gen AI use cases are only going to become more and more relevant in our day-to-day activities.

Although evaluation and selection techniques are still in relatively early stages and undergoing significant development efforts, enterprise ML pipelines are already changing as large organizations are racing to productionize use cases with potential for unprecedented productivity gains and competitive advantage.

In 2024 it will be interesting to see which AI-powered use cases put the most models into production. True success and ROI in the enterprise will depend on whether organizations can integrate the complex and fast-changing requirements around Gen AI. In other words, the ol' Hidden Technical Debt in Machine Learning Systems problem will continue to be a fundamental theme for years to come.

Comments

Shaun Ahmadian, Product at Cloudera (1y):
Great points. How generalized LLMs can become is up for debate. Knowledge retrieval seems the most promising area. It almost seems like Vector DBs are in some ways the feature store for LLMs: storing knowledge features (vectors) to be used for augmentation. It's a fascinating space to watch. Curious if any talks centered around regulations?

Robert Hryniewicz, GTM | Enterprise AI (1y):
Thx for the writeup!
