A.I. nears human-level forecasting quality
Probabilistic forecasts are integral to decision making, whether they are produced by statistical tools or by human judgment. Statistical forecasts rely on historical data under the assumption that the past is representative of the future; judgmental forecasts draw on human expertise, domain knowledge, and intuition. This Analytica blog post explores judgmental forecasting through a challenging real-world example.
Several websites, including Metaculus, Good Judgment Inc, Infer, Polymarket, and Manifold, invite individuals to submit probability estimates for geopolitical and technological events. Because these events eventually resolve (they either occur or do not by a specified date), the quality of the earlier forecasts can be measured. These sites and related academic experiments have found that a small percentage of human forecasters consistently outperform everyone else by a substantial margin, which has given rise to the title of "superforecaster". In addition, as one would expect, aggregations of all submitted forecasts consistently perform better than individual forecasters.
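To make "forecast quality" concrete: resolved binary questions are commonly scored with the Brier score, the squared error between the stated probability and the 0/1 outcome, and a simple aggregation baseline is the mean of all submitted probabilities. A minimal sketch (the numbers are invented for illustration, not from any of the sites above):

```python
from statistics import mean

def brier_score(forecast: float, outcome: int) -> float:
    """Squared error between a probability in [0, 1] and a 0/1 outcome.
    0.0 is a perfect forecast; a constant 0.5 guess always earns 0.25."""
    return (forecast - outcome) ** 2

# Three forecasters on one question that resolved "yes" (outcome = 1).
forecasts = [0.9, 0.6, 0.4]
outcome = 1

avg_individual = mean(brier_score(p, outcome) for p in forecasts)  # ~0.177

# Crowd aggregate: average the submitted probabilities, then score once.
crowd = mean(forecasts)                  # ~0.633
crowd_score = brier_score(crowd, outcome)  # ~0.134, beats the average individual
```

The aggregate beating the average individual here is not a coincidence: for a convex loss like the Brier score, the score of the mean forecast is never worse than the mean of the individual scores.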
LLMs are able to produce probabilistic estimates of future events when prompted appropriately. Smaller LLMs like GPT-3.5, Gemini Pro, Claude 2.1, Mixtral-8x7B, and Llama-2-13B perform poorly on these assessment tasks. However, the largest LLMs like GPT-4-Turbo demonstrate an impressive ability to make such assessments, yet still fall far short of human-level quality.
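For illustration, eliciting such an estimate can be as simple as asking the model to end its answer with a probability on a fixed format, then parsing it out. A minimal sketch using the OpenAI Python client; the model name, prompt wording, and parsing are our assumptions, not the setup used in any of the papers discussed here:

```python
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "Will <some geopolitical event> occur before 2025-12-31?"
prompt = (
    f"Question: {question}\n"
    "Reason step by step about the evidence for and against, then give your "
    "final answer on the last line in the form 'Probability: 0.NN'."
)

response = client.chat.completions.create(
    model="gpt-4-turbo",  # assumption: any capable chat model works here
    messages=[{"role": "user", "content": prompt}],
)
text = response.choices[0].message.content

# Pull the trailing probability out of the free-form answer.
match = re.search(r"Probability:\s*([01](?:\.\d+)?)", text)
probability = float(match.group(1)) if match else None
print(probability)
```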
Researchers from UC Berkeley created a system that makes subjective probabilistic assessments of binary-valued geopolitical questions, described in their recent paper "Approaching human-level forecasting with language models". The system processes an assessment question in multiple stages, using LLMs heavily as subroutines in each stage. First, it gathers relevant articles from news feeds, then reasons about and weighs them over several passes of LLM prompting, and finally aggregates all the information into a single probability estimate. Across all test questions, their system performs close to, but slightly worse than, crowd-aggregated human scores. On questions where the crowd estimate is highly uncertain (between 0.3 and 0.7, which accounts for more than 50% of all questions), it beats the crowd, but it performs worse when crowd estimates are very confident. Despite this form of underconfidence, its estimates are very well calibrated.
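At a high level, that pipeline is retrieval, then per-article LLM passes, then aggregation of several reasoning passes. A schematic sketch of the flow under those assumptions; all helper names are hypothetical stand-ins, not the authors' code, and the two stubs would need real news-API and LLM calls:

```python
from statistics import median

def fetch_news(question: str, max_articles: int = 10) -> list[str]:
    """Hypothetical retrieval stage: query news feeds for articles
    relevant to the question and return their text."""
    raise NotImplementedError  # stand-in for a real news-API call

def summarize_and_rate(article: str, question: str) -> str:
    """Hypothetical per-article LLM pass: summarize the article and
    judge its relevance to the question (prompting details omitted)."""
    raise NotImplementedError  # stand-in for an LLM call

def estimate_probability(question: str, evidence: list[str]) -> float:
    """Hypothetical reasoning pass: have an LLM weigh the evidence and
    return a probability in [0, 1] parsed from its final line."""
    raise NotImplementedError  # stand-in for an LLM call

def forecast(question: str, n_passes: int = 5) -> float:
    """Retrieve evidence once, then combine several independent
    LLM reasoning passes into a single probability estimate."""
    evidence = [summarize_and_rate(a, question) for a in fetch_news(question)]
    passes = [estimate_probability(question, evidence) for _ in range(n_passes)]
    return median(passes)  # median damps the occasional outlier pass
```

Aggregating repeated reasoning passes, rather than trusting a single one, mirrors the crowd-aggregation effect described above, but within one model.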
This work provides yet another hint at how we can expect A.I. advancements to transform model-based decision making and the field of decision analysis.