“Hallucination Index” Ranks LLMs for Popular AI Use Cases
DTP #28: PLUS: Open-source AI model libraries on the rise
To fill the need for a comprehensive benchmark report that measures LLM hallucinations, Galileo Labs has put together a “Hallucination Index” featuring prominent existing LLMs.
They claim to have covered 3 major blind spots that are overlooked by other LLM benchmarks.
Below, we look at the 3 major use cases for which they have ranked LLM capabilities.
AI in Business
Replicate Raises $40 Million For Open Source AI Model Library
An article from Forbes highlights Replicate’s recent funding round, which values the startup at $350 million. The platform hosts over 25,000 open-source AI models and has seen a surge in popularity among its 2 million software developers.
Key Points:
Replicate faces competition from other startups such as Together AI, as well as from established tech giants like Nvidia, Google, Amazon, and Microsoft, all of which offer similar cloud-based machine learning services.
Question & Answer without RAG
A use case where models answer questions directly from their own (parametric) knowledge, without retrieving context from a predefined database or external source. By contrast, RAG (Retrieval-Augmented Generation), introduced by researchers at Facebook AI (now Meta AI), combines aspects of both retrieval-based and generative models.
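As a rough illustration of the difference (the prompt template below is our own sketch, not Galileo’s test setup), a closed-book question goes to the model with no supporting context attached:

```python
def build_closed_book_prompt(question: str) -> str:
    # No retrieved context is supplied: the model must answer
    # from its parametric (training-time) knowledge alone,
    # which is where hallucinations tend to creep in.
    return (
        "Answer from your own knowledge.\n"
        f"Question: {question}\n"
        "Answer:"
    )

print(build_closed_book_prompt("Who wrote 'War and Peace'?"))
```

The Correctness Score in the Index measures how well models handle exactly this setting, where nothing grounds the answer.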
GPT-4 by OpenAI stands out as the top performer in Question & Answer without RAG, boasting a high Correctness Score of 0.77. Its exceptional accuracy and minimal tendency to hallucinate reinforce its dominance in applications involving general knowledge.
In the realm of open-source models, Meta's Llama-2-70b leads the pack with a Correctness Score of 0.65. However, other models such as Meta’s Llama-2-7b-chat and Mosaic ML’s MPT-7b-instruct exhibited higher susceptibility to hallucinations, scoring 0.52 and 0.40, respectively, in similar tasks.
The evaluation suggests GPT-4-0613 as the recommended choice for dependable and precise AI performance within this task category.
Question & Answer with RAG
A use case that combines elements of retrieval-based and generative models to facilitate question answering. RAG integrates transformer-based language models (such as GPT models) with dense vector retrievers (like DPR, Dense Passage Retrieval) to enhance the process of answering queries.

OpenAI's GPT-4-0613 demonstrated its prowess here, securing the lead with a Context Adherence score of 0.76, while the more economical and faster GPT-3.5-turbo models, specifically -0613 and -1106, closely matched its performance, attaining scores of 0.75 and 0.74, respectively.
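The retrieve-then-generate flow can be sketched in a few lines. The toy retriever below uses naive word-overlap scoring as a stand-in for a dense retriever like DPR, uses a made-up three-document corpus, and stops short of the actual LLM call:

```python
import re

# Toy corpus; a real system would index many documents
# with dense embeddings rather than word overlap.
documents = [
    "GPT-4 led Galileo's RAG task with a 0.76 Context Adherence score.",
    "Llamafile packages LLM weights into a single cross-OS binary.",
    "Replicate hosts over 25,000 open-source AI models.",
]

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    # Rank documents by word overlap with the query -- a crude
    # stand-in for cosine similarity over dense vectors (DPR).
    q = tokens(query)
    ranked = sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)
    return ranked[:k]

def build_rag_prompt(query: str, docs: list[str]) -> str:
    # The retrieved passage is prepended so the model can ground
    # its answer in it, which is what Context Adherence measures.
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context above."

print(build_rag_prompt("How many models does Replicate host?", documents))
```

The Context Adherence scores above reflect how faithfully each model sticks to the retrieved context in the final generation step.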
Unexpectedly, Hugging Face's Zephyr-7b, an open-source model, surpassed the notably larger Llama-2-70b from Meta, securing a Context Adherence Score of 0.71 compared to 0.68, challenging assumptions about the inherent superiority of larger models.
Conversely, TII UAE's Falcon-40b (Context Adherence Score = 0.60) and Mosaic ML's MPT-7b (Context Adherence Score = 0.58) lagged in this specific task.
For this task type, the Index recommends GPT-3.5-turbo-0613 as the suitable choice.
Long-form Text Generation
The ability of these models to generate extended, coherent, and contextually relevant text passages or documents.
Once again, GPT-4-0613 from OpenAI exhibited superior performance, showcasing a minimal tendency for hallucination with a high Correctness Score of 0.83. The GPT-3.5-turbo versions (1106 and 0613) closely matched this proficiency, scoring 0.82 and 0.81, respectively, offering a potentially more cost-effective alternative.
Notably, among open-source alternatives, Meta's Llama-2-70b-chat competed neck and neck with GPT-4, displaying similar capabilities with a Correctness Score of 0.82 and providing a viable and efficient option for this task. Conversely, TII UAE's Falcon-40b (Correctness Score = 0.65) and Mosaic ML's MPT-7b (Correctness Score = 0.53) lagged in effectiveness.
The recommendation from the Index suggests Llama-2-70b-chat as an optimal choice, striking a balance between cost efficiency and performance in the domain of Long-form Text Generation.
Final Thoughts
OpenAI's Superiority:
Open-Source Cost Efficiency:
Models for Specific Task Types:
Long-form Text Generation: Meta's open-source Llama-2-13b-chat emerges as a commendable alternative to OpenAI's models.
Question & Answer with RAG: Hugging Face's Zephyr model stands as a nimble yet powerful substitute for OpenAI, with an inference cost 10 times lower than GPT-3.5 Turbo.
Galileo's Evaluation Metrics
Platform Highlight
Akkio: A business analytics and forecasting tool for data analysis and outcome prediction. Aids in predictive analysis, marketing, and sales.
Together AI: Cloud-based Gen AI platform providing tools for constructing open-source generative AI and infrastructure for developing AI models.
Chingu AI: AI-powered content creation, project management, and productivity platform.
From the Web
Mozilla's innovation group introduced llamafile, an open-source solution converting multi-gigabyte LLM (Large Language Model) weight sets into a single cross-OS binary, streamlining distribution across macOS, Windows, Linux, and BSD systems without installations.
NVIDIA Unveils Enhanced NeMo Framework, Improves LLM Training on H200 GPU
NVIDIA introduced an upgraded NeMo framework, enhancing LLM training on their H200 GPU, specifically benefiting complex models like Llama 2. These advancements focus on cloud-native capabilities, improved parallelism, and enhanced performance, meeting growing demands for efficient and diverse LLM training.
Why OpenAI developing an artificial intelligence that’s good at maths is such a big deal
OpenAI's Q* algorithm represents an advancement in AI's quest for comprehensive reasoning skills. Despite AI's foundation in mathematics, challenges persist, such as limitations in reasoning and creativity.
Social Highlight
Data Scientists on Reddit discuss the most common fundamentals they see Data Scientists and MLEs lacking: Post
How much better than GPT-3.5 is GPT-4?: A tweet
Prompt of the week
I want you to act as a SQL code optimizer. The following code is slow. Can you help me speed it up? [Insert SQL]
See you next week!