LLM Trends and Technologies
Image generated with Stable Diffusion v. 2.1

NLP Summit 10/2023 write-up pt. 1/2

The past year has been eventful in the fields of Machine Learning (ML) and Artificial Intelligence (AI). From October 3rd to 5th, 2023, John Snow Labs organized their second Natural Language Processing (NLP) Summit of the year. This event included a variety of informative presentations from both emerging start-ups and established players in the Language Model sector. This is the first part of a two-part conference write-up; it discusses trends and technologies, while the subsequent part will focus on applications in healthcare.

Executive Summary

Companies and organizations constructed small to medium-sized proofs of concept (POCs) utilizing Large Language Models (LLMs) in 2023, achieving varying levels of success. As always, finding Product/Model-Market-Fit is essential. A common challenge arises from a misalignment between customer expectations and initial LLM capabilities. Multiple methods can be employed to tailor an LLM to a specific business use case, including Re-Training, Fine-Tuning, external Connectors, Prompt-Engineering, and Retrieval Augmented Generation (RAG). In practice, most successful implementations integrate an ensemble of "conventional" ML models, such as Classifiers or Regressions, with an LLM to preprocess data and reduce the volume of LLM queries. This approach primarily addresses the limitations associated with context length (the amount of text that fits into an LLM prompt) and the cost-per-query for LLM inferences.
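To make the ensemble idea concrete, here is a minimal sketch of the "cheap model first, LLM second" pattern. Everything in it - the toy training data, the routing labels, and the call_llm stub - is invented for illustration and not taken from any specific talk:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny, made-up training set: label which queries a canned answer covers.
texts = [
    "reset my password",
    "where is my invoice",
    "summarize this 40-page contract",
    "compare these two insurance policies",
]
labels = ["canned", "canned", "llm", "llm"]

# Cheap, conventional model that runs in microseconds.
gate = make_pipeline(TfidfVectorizer(), LogisticRegression())
gate.fit(texts, labels)

def call_llm(query: str) -> str:
    # Stand-in for a real (and expensive) LLM API call.
    return f"[LLM answer for: {query!r}]"

def answer(query: str) -> str:
    # Only hard cases hit the LLM; everything else stays cheap.
    if gate.predict([query])[0] == "canned":
        return "Please see our FAQ."
    return call_llm(query)

print(answer("please reset my password"))  # no LLM call needed
print(answer("summarize this contract"))   # forwarded to the LLM
```

The point is the cost profile, not the specific classifier: every query the gate handles is an LLM inference you never pay for.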

How do I bring my POC into production?

M Waleed Kadous from Anyscale gave a very good talk on how to bring LLMs and different models into production. My main takeaway was the concept of hybrid implementations of Retrieval Augmented Generation (RAG) with a Supervised Classifier that decides which model to run and what context to provide to each prompt. Anyscale also open-sourced code and documentation for their APIs and general concepts on GitHub.
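Here is a hedged sketch of that routing idea. The keyword rule stands in for a real supervised classifier, retrieve() stands in for an actual vector-database lookup, and the model names are placeholders, not Anyscale's setup:

```python
from dataclasses import dataclass

@dataclass
class RoutingDecision:
    model: str
    context: list[str]

def retrieve(query: str, top_k: int = 4) -> list[str]:
    # Stand-in for a vector-database similarity search.
    return [f"passage {i} relevant to {query!r}" for i in range(top_k)]

def route(query: str) -> RoutingDecision:
    # Stand-in for a trained supervised classifier; a keyword rule here.
    domain_terms = ("contract", "policy", "claim")
    if not any(term in query.lower() for term in domain_terms):
        # Generic chit-chat: cheap model, no retrieval needed.
        return RoutingDecision(model="small-instruct-model", context=[])
    # Domain question: stronger model plus retrieved grounding passages.
    return RoutingDecision(model="large-model", context=retrieve(query))

print(route("hi, how are you?"))
print(route("what does clause 7 of my contract mean?"))
```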

Of course GPT-4 performs better - but the bills are going to haunt you in your sleep!

Yuval Belfer from AI21 Labs gave a very entertaining talk on his experiences advising clients on how to take initial LLM POCs into production and why the largest model doesn't guarantee the best output. Unit costs per inference are also a big issue. I also learned about a new type of bias from him, called "Recency Bias", which refers to prompt design and the fact that most models place a lot of emphasis on the last sentences of the prompt. Therefore, restructuring your prompts to include the most important instructions at the end can drastically improve your results.
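A toy illustration of that tip - the document, question, and instruction are made up, and how much the restructuring helps will vary by model:

```python
document = "..."  # imagine a long contract pasted here

# Naive prompt: the key instruction sits at the very top,
# far away from where the model starts generating.
naive_prompt = (
    "Answer in one sentence and cite the relevant clause.\n\n"
    f"Document:\n{document}\n\n"
    "Question: What is the notice period?"
)

# Restructured prompt: same content, but the most important
# instruction is moved to the end, where recency bias favors it.
restructured_prompt = (
    f"Document:\n{document}\n\n"
    "Question: What is the notice period?\n\n"
    "Answer in one sentence and cite the relevant clause."
)
```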

He also shared his rule of thumb on when to fine-tune and when to use RAG in order to get the best results for a given use case:

  • "The model should write like ..." --> Finetuning to change the style
  • "The model should write about ..." --> RAG to decrease hallucinations


Retrieval Augmented Generation (RAG) as presented by Waleed Kadous
RAG as presented by Yuval Belfer

RAG and Vector Databases for the win

To enable RAG, fast retrieval of embedding vectors is a must. Bob van Luijt from Weaviate presented a demo on how to integrate LLM APIs with a vector database. This enables embeddings to improve over time by filling gaps with generated data in the background. Their write-up can be found here. Also, this blog post from Roie Schwaber-Cohen provides a good primer on vector databases.
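The gap-filling idea looks roughly like this. The demo used Weaviate; the sketch below uses Chroma (which appears in the next section) purely as a stand-in store, with an invented call_llm stub and an arbitrary distance threshold:

```python
import chromadb  # Chroma used here purely as a stand-in vector store

client = chromadb.Client()
kb = client.create_collection(name="knowledge")
kb.add(ids=["doc-1"], documents=["Our return window is 30 days."])

def call_llm(prompt: str) -> str:
    # Stand-in for a real LLM API call.
    return f"[generated answer for: {prompt!r}]"

def answer(query: str) -> str:
    hits = kb.query(query_texts=[query], n_results=1)
    docs, distances = hits["documents"][0], hits["distances"][0]
    if docs and distances[0] < 0.5:  # arbitrary "close enough" threshold
        # A stored passage matches: use it as grounding context.
        return call_llm(f"Context: {docs[0]}\nQuestion: {query}")
    # Gap in the knowledge base: generate an answer, then store it so
    # the next similar query can be answered from retrieval.
    generated = call_llm(query)
    kb.add(ids=[f"gen-{kb.count()}"], documents=[generated])
    return generated
```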

OK, so I need a Vector-DB. Now what?

Anton Troynikov, founder of Chroma, an open-source Vector-DB, gave a great talk about how to conceptualize Vector-DBs as "programmable memory" for ML models (he also had the best mic setup!). This was easily one of the highlights of the first day, since he gave a no-BS overview of the current problems in information retrieval as well as research areas to watch in the near future. I highly recommend checking out the Chroma blog and one of his interviews.

Interestingly, Veysel Kocaman, PhD, from John Snow Labs also highlighted some of the same issues and proposed treating the whole RAG process as a pipeline with many tunable parameters, starting at preprocessing and OCR. He summed it up in one sentence:

"I have seen many talks and tutorials where people are playing with 10 PDF documents. However, in production, selection of the Embedding-model and Vector-DB really matters"
All RAG sub-processes that can be individually tuned, as presented by Veysel Kocaman. See also "Random, valuable tips and tricks" below for some context on how to improve OCR accuracy.
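To make the "pipeline with many tunable parameters" idea concrete, here is a sketch of the knobs such a pipeline typically exposes. The parameter names and defaults are illustrative choices of mine, not taken from the talk:

```python
from dataclasses import dataclass

@dataclass
class RagPipelineConfig:
    # Ingestion: quality here caps everything downstream.
    ocr_engine: str = "tesseract"          # or a layout-aware alternative
    binarize_scans: bool = True            # see the tips section below
    # Chunking.
    chunk_size_tokens: int = 512
    chunk_overlap_tokens: int = 64
    # Embedding and retrieval: the two choices the quote singles out.
    embedding_model: str = "all-MiniLM-L6-v2"  # placeholder choice
    vector_db: str = "chroma"                  # placeholder choice
    top_k: int = 5
    rerank_results: bool = False
    # Generation.
    llm: str = "large-model"
    temperature: float = 0.1
```

Treating these as one configuration object makes it natural to tune and evaluate the pipeline end-to-end instead of optimizing the LLM prompt in isolation.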

As you've probably noticed by now - RAG is all the hype, both because fine-tuning is expensive and because it allows production systems to easily integrate additional, up-to-date knowledge.

Itay Zitvar from Hyro also pointed out that they prefer RAG over fine-tuning, since use cases change frequently and training adapters for every customer seems infeasible. However, he shared some lessons learned during their hybrid approach to integrating LLMs into their production pipelines:

Lessons learned as presented by Itay Zitvar

Quality of Inference, Risk and Bias

Generated with iloveimg.com

Unsurprisingly, a lot of presenters highlighted that they encountered biases in production that hadn't surfaced during initial validation. John Snow Labs released their new tooling "langtest", an evolution of last year's "nlp-test" library, which can be used to generate and execute automated tests for language models. I've used nlp-test in the past and really liked the platform-agnostic approach that integrates with all the popular NLP libs.
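A minimal usage sketch, roughly following the Harness pattern from the langtest documentation - check the current docs for exact signatures and supported hubs:

```python
from langtest import Harness

# Wrap a model (here: a spaCy NER pipeline) in a test harness.
harness = Harness(
    task="ner",
    model={"model": "en_core_web_sm", "hub": "spacy"},
)

# Generate test cases (robustness, bias, ...), run them, and report.
harness.generate().run().report()
```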

For high-risk applications (healthcare, chemical engineering, finance, and so on), all presenters also showed Human-In-The-Loop (HITL) steps towards the end of their pipelines - so "automatic validation" and "ML judges" still seem to exist mostly in academic settings and low-risk applications.

Clinician Oversight for AI Tasks

During Medical School, we used to joke that "when in doubt - always choose answer C" in our multiple-choice tests. Turns out LLMs do the same thing!

Additionally, biases also emerge when evaluating LLM output with LLMs. Davis Blalock from Databricks Mosaic Research gave a great talk on how "the big boys" train LLMs and why our experiences with scikit-learn do not translate to the scale of LLM training. I urge you to read this statement and draw your own conclusions before jumping from "this is so cool! Accuracy 1.0!!!" to "let's publish / put it into production". Here's a more comprehensive write-up of the incident: read me.

Random, valuable tips and tricks

Prof. Chreston Miller from Virginia Tech presented their results on developing an OCR pipeline for old legal documents, which included some surprising findings on how to best preprocess images before OCR even starts:

Image-preprocessing for best OCR results, as presented by Chreston Miller PhD
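For context, here is a sketch of commonly used preprocessing steps before OCR - grayscale conversion, upscaling, denoising, and Otsu binarization. These are standard techniques, not necessarily the exact pipeline from the talk:

```python
import cv2
import pytesseract

def preprocess_for_ocr(path: str):
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Upscale low-resolution scans; OCR engines prefer ~300 DPI inputs.
    gray = cv2.resize(gray, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
    # Remove noise, then binarize with Otsu's method.
    gray = cv2.fastNlMeansDenoising(gray, h=10)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary

print(pytesseract.image_to_string(preprocess_for_ocr("scan.png")))
```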



