Can you use AutoML for Generative AI Development?
[Image: DALL-E, robots building robots]

If you've built a Generative AI app, you'll know that much of getting generative AI "to work" involves endless prompt engineering, testing an ever-expanding list of LLMs, and figuring out what “good” even looks like.

For someone like me who did traditional ML and deep learning for many years, this sounds very much like the undifferentiated work everyone disliked in traditional ML: should you use a random forest or a logistic regression? Should you use 40 trees or 60? What about tree depth?

Traditional ML practitioners used automation and techniques like Bayesian hyperparameter optimization, collectively termed AutoML, to remove this undifferentiated work.

While AutoML has not been without limitations (overfitting being the primary one), it has become an effective way to (1) empower teams with limited DS/ML expertise to perform simple data science and (2) give experts a surprisingly useful starting point. No DS/ML platform today would be complete without AutoML capabilities.

The success of AutoML for traditional ML raises the question:

Can we use AutoML techniques to similarly remove the undifferentiated work in building Generative AI and get to high-quality results faster?

That is the vision we set out to build with the Verta GenAI Workbench and it's been exciting to see the effectiveness of AutoML techniques for GenAI.

So, in this post, I’m going to draw the parallels between AutoML for traditional vs. generative AI, highlight what’s solved, what remains open, and where we can go from here.

First, a quick primer on AutoML for Traditional (predictive) ML:

AutoML takes an input dataset {X, y} and seeks to produce a function f such that f(X) → y maximizes some quality metric (e.g., accuracy). It does so by exploring the space of models (the f's), their hyperparameters, and different transformations of X and y.

Here are the steps involved and the output:

  1. Select candidate models
  2. Select model hyperparameter variations
  3. Select feature engineering strategies
  4. For all (or a smartly sampled subset of) combinations of the above, plus potential ensembles, train models on the training split of the dataset
  5. Evaluate models on the test split dataset (or via cross-validation)
  6. Select the model that maximizes the chosen quality metric
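
To make this loop concrete, here is a minimal sketch using scikit-learn. The candidate models, hyperparameter grids, and metric below are illustrative choices, not a prescription; real AutoML systems such as auto-sklearn also search feature pipelines and build ensembles.

```python
# Minimal sketch of the AutoML loop above using scikit-learn.
# Candidate models, grids, and the metric are illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Steps 1-3: candidate models, hyperparameter variations, feature engineering.
candidates = {
    "logreg": (LogisticRegression(max_iter=1000),
               {"model__C": [0.1, 1.0, 10.0]}),
    "rf": (RandomForestClassifier(),
           {"model__n_estimators": [40, 60], "model__max_depth": [4, 8]}),
}

best_score, best_model = -1.0, None
for name, (model, grid) in candidates.items():
    pipeline = Pipeline([("scale", StandardScaler()), ("model", model)])
    # Steps 4-5: train each combination and evaluate via cross-validation.
    search = GridSearchCV(pipeline, grid, scoring="accuracy", cv=5)
    search.fit(X_train, y_train)
    if search.best_score_ > best_score:
        best_score, best_model = search.best_score_, search.best_estimator_

# Step 6: keep the model that maximizes the chosen quality metric.
print(best_model, best_model.score(X_test, y_test))
```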

Output from AutoML typically looks like this:

[Image: sample auto-sklearn output]


Now on to GenAI: how can we use the same AutoML techniques for generative AI?

AutoML for GenAI has the same philosophy as AutoML for traditional ML with a few tweaks as shown in the table below. For example, since LLMs are typically pre-trained, there is no need to perform model training. Similarly, instead of varying hyperparameters during model training, we vary model input parameters like prompt, temperature, chunking, and so on.

[Table: AutoML for traditional ML vs. AutoML for generative AI]
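
To make the analogy concrete, here is a minimal sketch of what the search looks like when the knobs are prompts and inference parameters rather than tree depths. The prompt templates, model identifiers, and the generate/score helpers are placeholders, not any real provider's API or the Workbench's implementation.

```python
import itertools

# Illustrative search space: prompts and inference parameters take the
# place of traditional hyperparameters. All values are placeholders.
prompt_templates = [
    "Summarize the following notes as a blog post:\n{notes}",
    "You are a technical writer. Turn these notes into a blog post:\n{notes}",
]
models = ["model-a", "model-b"]   # hypothetical model identifiers
temperatures = [0.2, 0.7]

def generate(prompt, model, temperature):
    # Placeholder for an LLM call; swap in your provider's SDK here.
    return f"<output of {model} at T={temperature}>"

def score(output):
    # Placeholder quality metric; in practice this is the hard part
    # (human ranking, pairwise Elo, or an LLM judge -- see below).
    return len(output)  # dummy score so the sketch runs end to end

notes = "raw engineering notes go here"
results = []
for template, model, temperature in itertools.product(
        prompt_templates, models, temperatures):
    output = generate(template.format(notes=notes), model, temperature)
    results.append(((template, model, temperature), score(output)))

# Analogous to step 6 in traditional AutoML: pick the best variant.
best_variant, best_score = max(results, key=lambda r: r[1])
print(best_variant, best_score)
```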


While, on the surface, it seems like AutoML for GenAI can skip several steps, the steps that remain are challenging to automate and currently ill-defined. Specifically, applying AutoML to generative AI raises three challenges:

(1) Experimenting with prompt variations: While hyperparameters are usually numeric (e.g., the “C” value) or drawn from a small number of categories (e.g., the gradient descent algorithm), the universe of prompts is much more open-ended and complex: you can view a prompt as a very high-dimensional vector (as some work does). As a result, techniques like Bayesian optimization are much more challenging to apply effectively.

(2) Datasets: Often, when beginning a GenAI project, the train/test dataset is hard to build. GenAI may represent a new type of task for which training data has never been captured or simply doesn't exist, e.g., when building a bot to turn engineering notes into blogs, there may be no data about past engineering notes to draw on.

(3) Evaluation & metrics: While computing accuracy in traditional ML is very formulaic, evaluating LLM results today is extremely ad-hoc, with many teams resorting to “vibe checks.” In addition, the automated LLM evaluators available today still have inconsistent performance. Finally, evaluations are further complicated by the fact that it can be hard to describe what good may look like unless you see some examples (“the language is too flowery”, “this is too long.”)

Some Solutions:

Although these challenges make AutoML for GenAI different from AutoML for traditional ML, we have found that these hurdles are not insurmountable. Here are some of the techniques we have been using in the Workbench.

Autoprompting and AI-powered prompt refinement: New and better ways to prompt LLMs are constantly being developed (e.g., Chain-of-Thought, Reason-and-Act (ReAct), etc.). In addition, meta-prompting work has begun to show promise for helping craft effective prompts. LLMs' natural writing ability, combined with meta-prompting, is now good enough to generate strong prompts automatically. In Verta PromptBrew, we utilize these techniques to create diverse, high-quality prompts. Automatically generated prompts are not perfect, but in most cases they are far better than what a beginner would write.

[Image: Verta PromptBrew creating prompts for the task "Write a LinkedIn post from a blog article"]
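
To illustrate the meta-prompting idea (this is a sketch, not the actual PromptBrew implementation), one common pattern is to ask an LLM to draft several candidate prompts for a task description. The `call_llm` helper and the meta-prompt wording below are assumptions; any chat-completion API can be slotted in.

```python
# Minimal meta-prompting sketch: ask an LLM to draft candidate prompts
# for a task description. `call_llm` is a placeholder for any chat API.
META_PROMPT = """You are an expert prompt engineer.
Write {n} different system prompts that would make an LLM perform this task well:

Task: {task}

Vary the style (step-by-step instructions, role-play, few-shot) across candidates.
Return one prompt per line."""

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your provider's chat-completion call.
    raise NotImplementedError

def generate_candidate_prompts(task: str, n: int = 5) -> list[str]:
    response = call_llm(META_PROMPT.format(task=task, n=n))
    return [line.strip() for line in response.splitlines() if line.strip()]

# Example usage:
# candidates = generate_candidate_prompts("Write a LinkedIn post from a blog article")
```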

Datasets: Depending on the task, it is possible to synthetically generate a starter dataset for your GenAI app. Such approaches are frequently used for RAG problems, e.g., as described in this Hugging Face blog. Beyond synthetic data, this step can be hard to automate. The good news is that to get to decent results, you don't need hundreds of examples; you need tens. That order of magnitude makes the problem much more tractable.
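
As a sketch of what a synthetic starter dataset can look like (again, `call_llm` and the prompt wording are placeholders, and real pipelines usually add filtering and deduplication), one pattern is to have an LLM produce input/output pairs from whatever raw material does exist:

```python
import json

def call_llm(prompt: str) -> str:
    # Placeholder for any chat-completion API.
    raise NotImplementedError

GEN_EXAMPLE_PROMPT = """Here are some raw engineering notes:

{notes}

Write one realistic input/output pair for a bot that turns such notes into a
short blog paragraph. Return JSON with keys "input" and "output"."""

def synthesize_examples(raw_notes: list[str]) -> list[dict]:
    examples = []
    for notes in raw_notes:
        raw = call_llm(GEN_EXAMPLE_PROMPT.format(notes=notes))
        try:
            examples.append(json.loads(raw))
        except json.JSONDecodeError:
            # Drop malformed generations; real pipelines also filter for quality.
            continue
    return examples

# A few dozen such examples are typically enough to start ranking variants.
```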

Evaluation: LLM evaluation is hard, subjective, and an active research problem. However, a few techniques can go a long way.

(1) For AutoML, we are looking to establish a relative ranking of variants to pick the best one. Pairwise comparisons producing Elo scores are a perfect fit for this task. Moreover, you typically only need tens of comparisons, and that number can be reduced further by choosing the pairs to compare smartly. This is a key approach we use in the Workbench; a minimal sketch follows this list.

(2) Off-the-shelf LLM-based evaluators are improving daily and can be used to augment human labeling. However, we have found that although these evaluators provide a decent sanity check, they aren't great at fine-grained quality checks.

(3) Ultimately, we think that human labeling plus few-shot-prompting-based evaluators are the key to building evaluators that capture human preferences. This is where some of our latest experiments have focused.
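
As promised in (1), here is a minimal sketch of turning pairwise preferences into Elo-style ratings. The K-factor, starting ratings, and variant names are illustrative assumptions, not the Workbench's actual implementation; the preference votes can come from humans or from a few-shot LLM judge as in (3).

```python
# Minimal Elo-style ranking from pairwise comparisons between variants.
# K and the starting rating are conventional choices, not Workbench internals.
K = 32
ratings = {"variant_a": 1000.0, "variant_b": 1000.0, "variant_c": 1000.0}

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def record_comparison(winner: str, loser: str) -> None:
    """Update ratings after a single pairwise preference (human or LLM judge)."""
    e_w = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - e_w)
    ratings[loser] -= K * (1.0 - e_w)

# A few dozen judgments are usually enough to separate good variants from bad.
record_comparison("variant_b", "variant_a")
record_comparison("variant_b", "variant_c")
record_comparison("variant_a", "variant_c")
print(sorted(ratings.items(), key=lambda kv: kv[1], reverse=True))
```

Because each update only needs a winner and a loser, the same loop works unchanged whether a human or an automated evaluator supplies the judgment.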

With these techniques, spanning prompting through evaluation, a Verta Workbench user can get from a user task description to a high-quality app in 23 minutes.

That's pretty darn great. Again, the goal of AutoML is not to get to the SOTA result; it is to get to a result that is "good enough" and iterate from there.

[Image: Verta Workbench leaderboard example]


I'm excited about the promise of AutoML for generative AI as a means to get to useful generative AI apps faster. If you've done these explorations yourself, we would love to hear your experiences and collaborate with you.

And if you want to see how well this can work in real life, give it a spin at app.verta.ai!

Comments

Vincent Valentine, CEO at Cognitive.Ai (7 months ago):
Excited to see the promising results! Manasi Vartak
Manasi Vartak, Chief AI Architect at Cloudera, previously Founder & CEO at Verta (acquired by Cloudera) (7 months ago):
If you are curious about what PromptBrew looks like without signing up, here's a separate link: https://www.verta.ai/promptbrew
