Evaluating the AI Oracle approach

Recently, I came across this tweet about the AI Oracle approach for improving the accuracy and quality of responses for your LLM application. The technique is super simple:

https://twitter.com/mattshumer_/status/1777382373283299365

  1. Send the request to 3 LLMs - Claude, GPT4, and Perplexity.
  2. Give all 3 responses back to Claude and prompt it to pick the best and most accurate response (a minimal sketch follows below).
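
To make the pipeline concrete, here is a minimal sketch of the two steps in Python. It assumes the official openai and anthropic SDKs, Perplexity's OpenAI-compatible endpoint, and API keys in the environment; the model names and the judging prompt are illustrative, not the exact ones from the tweet.

```python
# Minimal sketch of the AI Oracle approach described above. Assumes the
# `openai` and `anthropic` Python SDKs and ANTHROPIC_API_KEY, OPENAI_API_KEY,
# PERPLEXITY_API_KEY in the environment. Prompts and model choices are
# illustrative.
import os
from concurrent.futures import ThreadPoolExecutor

import anthropic
from openai import OpenAI

claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
gpt = OpenAI()                  # reads OPENAI_API_KEY
# Perplexity exposes an OpenAI-compatible endpoint
pplx = OpenAI(api_key=os.environ["PERPLEXITY_API_KEY"],
              base_url="https://api.perplexity.ai")


def ask_claude(question: str) -> str:
    msg = claude.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        messages=[{"role": "user", "content": question}],
    )
    return msg.content[0].text


def ask_openai_compatible(client: OpenAI, model: str, question: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content


def ai_oracle(question: str) -> str:
    # Step 1: fan the question out to all three models in parallel to reduce
    # the latency penalty of calling them sequentially.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [
            pool.submit(ask_claude, question),
            pool.submit(ask_openai_compatible, gpt, "gpt-4", question),
            pool.submit(ask_openai_compatible, pplx, "pplx-70b-online", question),
        ]
        answers = [f.result() for f in futures]

    # Step 2: hand all three answers back to Claude and ask it to pick the
    # most accurate one.
    judge_prompt = (
        f"Question: {question}\n\n"
        + "\n\n".join(f"Candidate {i + 1}:\n{a}" for i, a in enumerate(answers))
        + "\n\nPick the most accurate candidate and return only its text."
    )
    return ask_claude(judge_prompt)
```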

I got curious and decided to run some evaluations on this approach. I am sharing the metrics and measurements in this post.

This one is pretty obvious: the latency of having all 3 LLMs generate a response and then picking the best of the 3 is high. But, I do recognize that this can be improved by parallelizing the calls.

Overall Latency

I ran the following tests for both the combined AI Oracle approach and each LLM individually:

  1. Factual Accuracy - Evaluated the correctness of responses.
  2. Realtime data - Evaluated by asking questions that require realtime information.
  3. Adversarial Testing - Evaluated whether the LLM can pick up the signal correctly when the question is buried in a bunch of garbage data. The LLM was given a positive score if it correctly answered the question without mentioning the garbage data (a rough sketch of this setup follows the list).
  4. Consistency checks - Evaluated whether the LLM responded consistently when the same question was asked many times. Mainly looked for structural consistency of the response.
  5. Quality - Evaluated on overall quality: sentence structure, adherence to the prompt, etc.
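
For illustration, here is roughly how an adversarial test case could be constructed and scored. This is a sketch of the general idea, not the actual harness or data used in these evaluations; ask_llm is a placeholder for any of the model calls shown earlier.

```python
# Rough illustration of building and scoring one adversarial test case.
# The noise text and the keyword-based scoring rule are placeholder
# examples, not the exact rubric used in these evaluations.
import random
import string


def make_adversarial_prompt(question: str, noise_lines: int = 20) -> str:
    # Bury the real question in the middle of lines of random "garbage" text,
    # so the model has to locate the actual signal.
    garbage = [
        "".join(random.choices(string.ascii_lowercase + " ", k=60))
        for _ in range(noise_lines)
    ]
    half = noise_lines // 2
    return "\n".join(garbage[:half] + [question] + garbage[half:])


def score_adversarial(answer: str, expected_fact: str) -> bool:
    # Positive score: the answer contains the expected fact and does not
    # comment on the surrounding noise. The keyword check is a crude stand-in
    # for a fuller evaluation rubric.
    mentions_noise = any(
        w in answer.lower() for w in ("garbage", "gibberish", "random text")
    )
    return expected_fact.lower() in answer.lower() and not mentions_noise


# Example usage (ask_llm is hypothetical):
# prompt = make_adversarial_prompt("What is the capital of France?")
# passed = score_adversarial(ask_llm(prompt), "Paris")
```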

AI Oracle Approach

Results for the AI Oracle approach: For some reason, it could not pick up the realtime information even once. I am sure this metric can be improved with some prompt engineering. It did poorly on Adversarial testing - mostly because of Claude's and Perplexity's responses.

Claude (claude-3-opus-20240229)

As expected, Claude did not do well on Realtime testing. But, interestingly, it did not do great on the adversarial and consistency tests either.

Claude - Evaluation Results

GPT4

Again, GPT4 does not have realtime capabilities. But it did extremely well on everything else except consistency checks, where the responses were structured quite differently each time.

GPT4 - Evaluation Results

Perplexity (pplx-70b-online)

As expected, Perplexity's realtime capabilities are unmatched. But it did not do that well on the adversarial and consistency tests, which in turn skewed the metrics for the AI Oracle approach as well. Notably, the quality of Perplexity's responses was far better than the rest.

Perplexity - Evaluation Results

In conclusion, you can get the AI Oracle approach to a near-perfect score with a bit of prompt engineering. But you definitely lose performance in the process: even when the first stage is parallelized, it is still at least as slow as the slowest of the three LLMs, plus the final judging call. Token usage and cost are also going to be higher.
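
As a back-of-the-envelope illustration of the extra token usage (the numbers here are hypothetical, not measurements from these runs):

```python
# Rough token math for the oracle approach, assuming a ~P-token prompt,
# ~R-token answers from each model, and a ~R-token verdict from the judge.
P, R = 200, 400  # hypothetical prompt and response sizes in tokens

single_model = P + R                   # one prompt, one answer           -> 600
oracle_stage1 = 3 * (P + R)            # three prompts, three answers     -> 1800
oracle_stage2 = (P + 3 * R) + R        # judge reads all answers, replies -> 1800
total = oracle_stage1 + oracle_stage2  # ~3600 tokens, roughly 6x a single call
```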

Finally, if you are curious, all these evaluations were done using Langtrace - an open source LLM monitoring and evaluations tool that we are currently developing.

Sign up for free here: https://langtrace.ai/signup

Check out the project on GitHub: https://github.com/Scale3-Labs/langtrace
