Evaluating the AI Oracle approach

Recently, I came across this tweet about the AI Oracle approach for improving the accuracy and quality of responses for your LLM application. The technique is super simple:

https://twitter.com/mattshumer_/status/1777382373283299365

  1. Send the request to 3 LLMs - Claude, GPT4, and Perplexity.
  2. Give all 3 responses back to Claude and prompt it to pick the best and most accurate response (a minimal sketch follows below).
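
To make the pipeline concrete, here is a minimal sketch of the two steps in Python. It assumes the official openai and anthropic SDKs, Perplexity's OpenAI-compatible endpoint, and API keys in the environment; the model names and the judging prompt are illustrative, not the exact ones from the tweet.

```python
# Minimal sketch of the AI Oracle approach described above. Assumes the
# `openai` and `anthropic` Python SDKs and ANTHROPIC_API_KEY, OPENAI_API_KEY,
# PERPLEXITY_API_KEY in the environment. Prompts and model choices are
# illustrative.
import os
from concurrent.futures import ThreadPoolExecutor

import anthropic
from openai import OpenAI

claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
gpt = OpenAI()                  # reads OPENAI_API_KEY
# Perplexity exposes an OpenAI-compatible endpoint
pplx = OpenAI(api_key=os.environ["PERPLEXITY_API_KEY"],
              base_url="https://api.perplexity.ai")


def ask_claude(question: str) -> str:
    msg = claude.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        messages=[{"role": "user", "content": question}],
    )
    return msg.content[0].text


def ask_openai_compatible(client: OpenAI, model: str, question: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content


def ai_oracle(question: str) -> str:
    # Step 1: fan the question out to all three models in parallel to reduce
    # the latency penalty of calling them sequentially.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [
            pool.submit(ask_claude, question),
            pool.submit(ask_openai_compatible, gpt, "gpt-4", question),
            pool.submit(ask_openai_compatible, pplx, "pplx-70b-online", question),
        ]
        answers = [f.result() for f in futures]

    # Step 2: hand all three answers back to Claude and ask it to pick the
    # most accurate one.
    judge_prompt = (
        f"Question: {question}\n\n"
        + "\n\n".join(f"Candidate {i + 1}:\n{a}" for i, a in enumerate(answers))
        + "\n\nPick the most accurate candidate and return only its text."
    )
    return ask_claude(judge_prompt)
```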

I got curious and decided to run some evaluations on this approach. I am sharing the metrics and measurements in this post.

This one is pretty obvious: the latency of having all 3 LLMs generate a response and then picking the best of the 3 is high. But, I do recognize that this can be improved by parallelizing the calls.

Overall Latency

I ran the following tests for both the combined AI Oracle approach and each LLM individually:

  1. Factual Accuracy - Evaluated the correctness of responses.
  2. Realtime data - Evaluated by asking questions that require realtime information.
  3. Adversarial Testing - Evaluated whether the LLM can pick up the signal correctly when the question is buried in a bunch of garbage data. The LLM was given a positive score if it correctly answered the question without mentioning the garbage data (a rough sketch of this setup follows the list).
  4. Consistency checks - Evaluated whether the LLM responded consistently when the same question was asked many times. Mainly looked for structural consistency of the response.
  5. Quality - Evaluated on overall quality: sentence structure, adherence to the prompt, etc.
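
For illustration, here is roughly how an adversarial test case could be constructed and scored. This is a sketch of the general idea, not the actual harness or data used in these evaluations; ask_llm is a placeholder for any of the model calls shown earlier.

```python
# Rough illustration of building and scoring one adversarial test case.
# The noise text and the keyword-based scoring rule are placeholder
# examples, not the exact rubric used in these evaluations.
import random
import string


def make_adversarial_prompt(question: str, noise_lines: int = 20) -> str:
    # Bury the real question in the middle of lines of random "garbage" text,
    # so the model has to locate the actual signal.
    garbage = [
        "".join(random.choices(string.ascii_lowercase + " ", k=60))
        for _ in range(noise_lines)
    ]
    half = noise_lines // 2
    return "\n".join(garbage[:half] + [question] + garbage[half:])


def score_adversarial(answer: str, expected_fact: str) -> bool:
    # Positive score: the answer contains the expected fact and does not
    # comment on the surrounding noise. The keyword check is a crude stand-in
    # for a fuller evaluation rubric.
    mentions_noise = any(
        w in answer.lower() for w in ("garbage", "gibberish", "random text")
    )
    return expected_fact.lower() in answer.lower() and not mentions_noise


# Example usage (ask_llm is hypothetical):
# prompt = make_adversarial_prompt("What is the capital of France?")
# passed = score_adversarial(ask_llm(prompt), "Paris")
```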

AI Oracle Approach

Results for the AI Oracle approach: For some reason, it could not pick up the realtime information even once. I am sure this metric can be improved with some prompt engineering. It did poorly on Adversarial testing - mostly because of Claude's and Perplexity's responses.

Claude (claude-3-opus-20240229)

As expected, Claude did not do well on Realtime testing. But, interestingly, it did not do great on the adversarial and consistency tests either.

Claude - Evaluation Results

GPT4

Again, GPT4 does not have realtime capabilities. But it did extremely well on everything else except consistency checks, where the responses were structured quite differently each time.

GPT4 - Evaluation Results

Perplexity (pplx-70b-online)

As expected, Perplexity's realtime capabilities are unmatched. But it did not do that well on the adversarial and consistency tests, which in turn skewed the metrics for the AI Oracle approach as well. Notably, the quality of Perplexity's responses was far better than the rest.

Perplexity - Evaluation Results

In conclusion, you can get the AI Oracle approach to a near-perfect score with a bit of prompt engineering. But you definitely lose performance in the process: even when the first stage is parallelized, it is still at least as slow as the slowest of the three LLMs, plus the final judging call. Token usage and cost are also going to be higher.
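
As a back-of-the-envelope illustration of the extra token usage (the numbers here are hypothetical, not measurements from these runs):

```python
# Rough token math for the oracle approach, assuming a ~P-token prompt,
# ~R-token answers from each model, and a ~R-token verdict from the judge.
P, R = 200, 400  # hypothetical prompt and response sizes in tokens

single_model = P + R                   # one prompt, one answer           -> 600
oracle_stage1 = 3 * (P + R)            # three prompts, three answers     -> 1800
oracle_stage2 = (P + 3 * R) + R        # judge reads all answers, replies -> 1800
total = oracle_stage1 + oracle_stage2  # ~3600 tokens, roughly 6x a single call
```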

Finally, if you are curious, all these evaluations were done using Langtrace - an open source LLM monitoring and evaluations tool that we are currently developing.

Sign up for free here: https://langtrace.ai/signup

Check out the project on GitHub: https://github.com/Scale3-Labs/langtrace
