Evaluating the AI Oracle approach
Karthik Kalyanaraman
Cofounder and CTO, Langtrace AI | OpenTelemetry Contributor
Recently, I came across this tweet about the AI Oracle approach for improving the accuracy and quality of responses for your LLM application. The technique is super simple:
I got curious about this and decided to do some evaluations on this approach. Sharing some metrics/measurements in this post.
This one is pretty obvious, the latency on having all 3 LLMs generate a response and picking the best out of the 3 is high. But, I do recognize that this can be improved by parallelizing the operations.
Ran the following tests for both the combined AI Oracle approach and using a single LLM:
AI Oracle Approach
Results for the AI Oracle approach: For some reason, it could not pick up the realtime information even once. I am sure with some prompt engineering, this metric can be improved. It did poorly on Adversarial testing - mostly because Claude and Pplx's responses.
Claude (claude-3-opus-20240229)
As expected, Claude did not do well on Realtime testing. But, interestingly, it did not do great with adversarial and consistency tests either.
领英推荐
GPT4
Again, GPT4 does not have realtime capabilities. But it did extremely well on everything else except consistency checks where the responses were structured quite differently each time.
Perplexity (pplx-70b-online)
As expected Perplexity's realtime capabilities are unmatched. But, it did not do that well with adversarial and consistency tests which in turn skewed the metrics for AI Oracle approach as well.Notably, the quality of responses from Perplexity were far better than the rest.
In conclusion, you can get to a near perfect score for the AI Oracle approach with a bit of prompt engineering. But you definitely lose performance in the process. Even when parallelized, it is only as slow as the slowest LLM. Token usage/cost is also going to be higher.
Finally, if you are curious, all these evaluations were done using Langtrace - an open source LLM monitoring and evaluations tool that we are currently developing.
Signup for a free here: https://langtrace.ai/signup
Checkout the project on Github: https://github.com/Scale3-Labs/langtrace