Would Temperature Control Help Against ChatGPT's Hallucinations?
Bertalan Meskó, MD, PhD
Director of The Medical Futurist Institute (Keynote Speaker, Researcher, Author & Futurist)
Since the launch of ChatGPT, one of the most significant arguments against the use of large language models (LLMs) has been their tendency to hallucinate. Numerous articles and studies have been published addressing this issue, and you can even check a regularly updated leaderboard showing how different models compare on hallucinations.
With the release of newer versions, like GPT-4o, you might assume that such issues would be nearly nonexistent.
However, just the other day, we asked ChatGPT for citations on a specific topic (AI in diagnostics), and despite multiple attempts, it failed to provide a single working link. When we asked for the URL, it turned out that the journal existed, but the cited article did not.
We set out to test this and got completely mixed results: under some circumstances GPT-4o offered valid URLs, while in other cases the answers were utterly useless.
Recently, I came across a great paper, "Addressing 6 challenges in generative AI for digital health: A scoping review", that not only lists the six major concerns but also suggests solutions for them. The review synthesized 120 articles on generative AI in medicine and identified the six most pressing problems: bias, privacy, hallucination, regulatory compliance, and the less well-known issues of overreliance on text models and adversarial misprompting (jailbreaking, in layman's terms).
The solutions the paper suggests against AI hallucinations are knowledge graphs and temperature control. Knowledge graphs are not our main topic here, and they are entirely in the developers' hands, so let's just quickly summarise them in a few sentences.
A knowledge graph is a structured set of information. It allows an LLM to verify generated statements against established facts, reducing the likelihood of hallucination. Imagine it as a spatial (3D) map of structured data that shows the relationships and connections between the individual items.
Here is a super simple example:
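Expressed in code rather than as a diagram, a knowledge graph boils down to a set of subject-predicate-object triples, plus the ability to look facts up and check new claims against them. The sketch below is purely illustrative: the Python structure, the example medical facts, and the helper functions are our own simplification, not something taken from the cited review.

```python
# Purely illustrative: a tiny knowledge graph stored as
# (subject, predicate, object) triples, with a simple check that an
# LLM-generated claim could be run through before it reaches the user.
knowledge_graph = [
    ("metformin", "treats", "type 2 diabetes"),
    ("metformin", "is_a", "biguanide"),
    ("type 2 diabetes", "is_a", "chronic disease"),
    ("HbA1c", "measures", "long-term blood glucose"),
]

def facts_about(entity: str) -> list[tuple[str, str, str]]:
    """Return every stored fact that mentions the given entity."""
    return [fact for fact in knowledge_graph if entity in (fact[0], fact[2])]

def is_supported(claim: tuple[str, str, str]) -> bool:
    """True only if the exact claim already exists in the graph."""
    return claim in knowledge_graph

print(facts_about("metformin"))
# [('metformin', 'treats', 'type 2 diabetes'), ('metformin', 'is_a', 'biguanide')]

print(is_supported(("metformin", "treats", "type 2 diabetes")))  # True
print(is_supported(("metformin", "treats", "hypertension")))     # False -> flag for review
```

A production system would of course use a curated medical ontology and far more flexible matching, but the principle is the same: generated statements are checked against facts that are known to be true.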
In essence, a knowledge graph can act as a safeguard against inaccurate or misleading outputs by providing a foundation of verified information that the LLM can draw upon. This can be particularly important in medical applications where accuracy and reliability are non-negotiable.
When models are fortified with comprehensive, accurate, and up-to-date structured information, they can more easily spot when something is "out of the ordinary" and double-check it.
Why do AI models have a temperature? Can they run a fever?
So let's take a look at the other suggested solution, temperature, starting with the basics.
In the context of LLMs, temperature is a parameter that balances predictability against creativity in the generated text. Lower temperatures make the model stick closely to its most probable predictions, resulting in more deterministic and typically more factually conservative outputs. Higher temperatures encourage exploration, leading to more diverse and creative responses, but with a greater risk of fabrication.
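For the technically curious, here is a rough sketch of what the parameter does under the hood: the model's raw scores (logits) for each candidate next token are divided by the temperature before being converted into probabilities, so values below 1 sharpen the distribution and values above 1 flatten it. The token scores below are invented purely for illustration.

```python
import math
import random

def sample_next_token(logits: dict[str, float], temperature: float) -> str:
    """Pick the next token after scaling the model's raw scores by the temperature."""
    # Dividing by the temperature sharpens the distribution when T < 1
    # (more deterministic) and flattens it when T > 1 (more diverse, riskier).
    scaled = {token: score / temperature for token, score in logits.items()}
    normaliser = sum(math.exp(s) for s in scaled.values())
    probabilities = {token: math.exp(s) / normaliser for token, s in scaled.items()}
    return random.choices(list(probabilities), weights=list(probabilities.values()))[0]

# Invented scores for the token following "Metformin is used to treat ..."
logits = {"type 2 diabetes": 4.0, "hypertension": 2.0, "migraine": 0.5}

random.seed(0)
print(sample_next_token(logits, temperature=0.2))  # almost always "type 2 diabetes"
print(sample_next_token(logits, temperature=1.5))  # less likely tokens appear far more often
```

At a temperature close to zero the most likely token wins almost every time; crank it up and the long tail of unlikely (and sometimes fabricated) continuations starts to appear.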
In essence, the temperature setting acts as a dial that can be adjusted to prioritise either accuracy or creativity, depending on the specific use case. This concept is particularly relevant for medical professionals, who require accuracy and reliability in AI-generated information.
By understanding and potentially manipulating the temperature setting, healthcare providers can better use the power of generative AI while mitigating the risks of hallucinations and other inaccuracies.
So, there might be a future where we have adjustable sliders to control the "temperature" of LLM responses, especially if companies like OpenAI take note and decide to address this issue. But let's not wait for that to happen. It's crucial to understand that this temperature parameter exists and that the quality of responses isn't necessarily constant.
The Medical Futurist team has been experimenting with this in recent weeks, and we've found that, as average users, we cannot consistently influence the "temperature" of responses. However, providing ChatGPT with detailed context significantly improves the model’s performance.
Therefore, it's a good idea to maintain separate threads for specific purposes, starting with a few bullet points to clearly define the context, goal, target audience, etc.
If you have an account, you can check the OpenAI Playground, the developer interface behind ChatGPT, and see a temperature slider there, but experimenting with various temperature settings is sadly not part of the "regular" ChatGPT subscription.
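For those who do want to experiment, the same parameter is exposed through the API that the Playground sits on top of. Below is a minimal sketch assuming the openai Python package (the v1-style client) and an API key in the OPENAI_API_KEY environment variable; the exact interface may change as the SDK evolves, and the system message simply illustrates the "detailed context" advice from above.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.1,  # low temperature: favour the most probable, "safer" completions
    messages=[
        # Detailed context up front, as suggested above: role, goal, audience.
        {"role": "system", "content": (
            "You are assisting a physician. Goal: summarise evidence on AI in "
            "diagnostics. Audience: medical professionals. Only cite sources "
            "you are confident exist, and say so when you are not sure."
        )},
        {"role": "user", "content": "List three peer-reviewed papers on AI-assisted diagnostics."},
    ],
)

print(response.choices[0].message.content)
```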
If you ask ChatGPT directly whether it's possible to adjust its temperature, it will tell you that you can, simply by asking for it in your prompt.
However, our experience shows that this claim is false: you can't reliably count on GPT-4 (or 4o, or any other model we regularly use) to maintain a consistent "factuality standard". You can indeed try a prompt like "Set your answer's temperature to 0.1" when you need factual results, but in our testing you may still get hallucinated or useless outputs, which suggests that the actual sampling temperature of these public models is a default value set by the developers. It is therefore essential to keep a critical eye on the responses LLMs generate.
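If you want to reproduce this kind of check yourself, a minimal sketch (again assuming the openai client from the previous snippet) is to ask the same question twice: once with the temperature request written into the prompt, and once with the parameter genuinely set through the API.

```python
from openai import OpenAI

client = OpenAI()
question = "Give one citation, with a working URL, for a 2023 paper on AI in diagnostics."

# 1) Instruction only: the sampling temperature stays at the provider's default.
prompt_only = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Set your answer's temperature to 0.1. " + question}],
)

# 2) Parameter actually set: sampling itself becomes more deterministic.
param_set = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.1,
    messages=[{"role": "user", "content": question}],
)

print("Prompt-only:", prompt_only.choices[0].message.content)
print("Parameter:  ", param_set.choices[0].message.content)
```

Only the second call actually changes how the model samples; the first is just another instruction that the model may or may not follow.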
This inherent unpredictability in response quality is a good reminder that while these models are excellent, they are not infallible, and their outputs require careful interpretation and validation, especially in critical fields like medicine. That is why it would be useful if all users had direct control over the temperature setting; hopefully, we'll see such sliders in ChatGPT and Gemini soon.
Reader comments:
Oregon Training and Consultation (4 months ago): Temperature change, in my opinion, would be a great way to stop or control hallucinations.
Another reader (5 months ago): On the bright side, Oracle Autonomous Database and temperature control both aim to reduce hallucinations in large language models (LLMs), but through distinct mechanisms. Oracle Autonomous Database enhances data quality with robust validation, error-checking, automated management, and real-time updates, ensuring LLMs are trained on consistent, accurate, and secure datasets. It also integrates advanced analytics and AI for preprocessing. Conversely, temperature control directly influences LLM output during inference. Lower temperatures make responses more deterministic and focused, reducing randomness and irrelevant outputs, while higher temperatures increase creativity. Fine-tuning and few-shot tuning further improve LLM performance by tailoring the model to specific tasks and incorporating new data with minimal examples. Combining these approaches (Oracle Autonomous Database for data integrity, temperature control for output regulation, and fine-tuning methodologies) provides the ultimate solution for minimising hallucinations in LLM deployments.