Math Hallucinations with OpenAI, But Also Some Great Results
This is not just another rant about OpenAI. I actually have something very positive to say, even if in the end, the answer to my prompt was wrong. Many of the hallucinations would not be a real issue if OpenAI provided references and links to any piece of information returned to the user. Indeed, this was the main reason why I created xLLM (see details here).
Perhaps in the future, someone -- maybe me -- will create a meta-LLM that parses prompt results from OpenAI, Mistral, Perplexity, and other platforms to get the best out of the mix, blending them with internal embeddings as augmented data. The first step consists of generating billions of synthetic prompts, then running them through the various apps, maybe even including old-fashioned Google search.
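To make the idea concrete, here is a minimal sketch, assuming hypothetical query_openai, query_mistral, and query_perplexity wrappers that you would write around each vendor's own client library; the platform list and the voting rule are illustrative assumptions, not an existing implementation:

    # Minimal sketch of a meta-LLM front end: send the same prompt to
    # several platforms and flag disagreement as a hallucination signal.
    from collections import Counter

    def query_openai(prompt: str) -> str:
        raise NotImplementedError("wrap the OpenAI client here")  # placeholder

    def query_mistral(prompt: str) -> str:
        raise NotImplementedError("wrap the Mistral client here")  # placeholder

    def query_perplexity(prompt: str) -> str:
        raise NotImplementedError("wrap the Perplexity client here")  # placeholder

    PLATFORMS = {
        "openai": query_openai,
        "mistral": query_mistral,
        "perplexity": query_perplexity,
    }

    def meta_answer(prompt: str) -> dict:
        # Collect one answer per platform, then take a majority vote.
        answers = {name: fn(prompt) for name, fn in PLATFORMS.items()}
        votes = Counter(answers.values())
        best, count = votes.most_common(1)[0]
        return {
            "answers": answers,
            "consensus": best if count > 1 else None,
            "needs_review": count == 1,  # no two platforms agree: suspect answer
        }

In practice, exact-string voting is too crude for free-form text; comparing embeddings of the answers, or blending them with internal embeddings as described above, would be the natural refinement.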
For now, we have to deal with a single platform at a time. This article focuses on OpenAI and my most recent math query. I tried Gemini too, but the results were a lot worse.
To summarize, OpenAI gave a wrong answer to my question. Thankfully, I knew it was wrong. But what if you don't know, and you publish an article or provide paid advice based on that answer? Anyway, I tried to get more details with a different prompt. Then some magic happened: OpenAI launched a Python script out of nowhere, ran it in real time, and essentially told me that the answer to my question was in the output produced by that script. A simple analysis of that output would yield the answer. This was great, because it helped me discover a new Python library that is very useful for what I do.
Unfortunately, OpenAI decided to add one concluding paragraph after that, telling me that the wrong answer (the one from the first prompt) matched the correct answer obtained in the second prompt.
Read the full article here.
Comments

Co-Founder, BondingAI.io:
I also tried another prompt: count the number of occurrences of “000” in all binary strings of length 5. Then, the same prompt with “000” replaced by “010”. OpenAI claims the answer is 8 in both cases; Gemini claims it is 3. Both justify the wrong answer using incorrect logic. Even Python gets it wrong, counting non-overlapping occurrences only, coming up with 8 for “000” (wrong) and 11 for “010” (correct).
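For reference, a minimal Python sketch that enumerates all 32 binary strings of length 5 and counts each pattern under three conventions: Python's built-in str.count (non-overlapping), overlapping occurrences, and the number of strings containing the pattern at least once:

    # Count "000" and "010" in all 2^5 = 32 binary strings of length 5,
    # under three different counting conventions.
    from itertools import product

    def count_overlapping(s: str, pattern: str) -> int:
        # Slide a window one position at a time, so overlaps are counted.
        return sum(s[i:i + len(pattern)] == pattern
                   for i in range(len(s) - len(pattern) + 1))

    strings = ["".join(bits) for bits in product("01", repeat=5)]

    for pattern in ("000", "010"):
        non_overlapping = sum(s.count(pattern) for s in strings)  # str.count skips overlaps
        overlapping = sum(count_overlapping(s, pattern) for s in strings)
        containing = sum(pattern in s for s in strings)
        print(pattern, non_overlapping, overlapping, containing)
    # Output: 000 8 12 8
    #         010 11 12 11

So the answer depends on the convention: 12 overlapping occurrences for both patterns, 8 versus 11 non-overlapping occurrences, and 8 versus 11 strings containing at least one match. Part of the difficulty is that the prompt itself does not pin down which convention is meant.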
Glad you are sharing this content; it is on the cutting edge and very useful. Thanks.
AI/ML in Fin Crimes and Compliance, Automation, genAI, data analytics translation, innovation, solution architecture - connecting business with data analytics, data science and engineering:
I would like to know your opinion on hallucinations in general. How can we control them in production? Will you be writing about this in the future?