Evaluating The AI Scientist
Dr. Nimrita Koul
In this article, I present a summary of The AI Scientist, an AI-driven, end-to-end agentic pipeline developed by sakana.ai (https://sakana.ai/ai-scientist/). The system automates the process of research and scientific discovery in machine learning.
Here is what this article covers:
Central Idea of the Paper
To build an end-to-end pipeline of multiple AI agents, powered by foundation large language models, that automates scientific discovery and research in machine learning.
An AI Agent is more than an LLM
An AI agent is a program that uses an LLM as its core component. In addition, it has access to tools to perform actions such as searching the web and executing code; it has short- and long-term memory; and it has advanced capabilities such as self-reflection, self-criticism, chain-of-thought reasoning, and subgoal decomposition, which let it solve problems that require multiple steps of reasoning or action.
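To make this concrete, here is a minimal sketch of such an agent in Python. Everything here is illustrative: call_llm is a hypothetical stand-in for any chat-completion API, and the "tool: argument" dispatch convention is invented for the example; this is not Sakana's implementation.

```python
from typing import Callable

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a chat-completion API call; swap in a real client.
    return "web_search: recent work on automated scientific discovery"

class Agent:
    def __init__(self, tools: dict[str, Callable[[str], str]]):
        self.tools = tools            # e.g. {"web_search": ..., "run_code": ...}
        self.memory: list[str] = []   # short-term memory of past steps

    def step(self, task: str) -> str:
        # Ask the LLM core for the next action, given the task and memory so far.
        context = "\n".join(self.memory)
        plan = call_llm(f"Task: {task}\nHistory:\n{context}\nNext action?")
        # If the reply names a registered tool ("tool: argument"), execute it.
        if ":" in plan:
            name, arg = plan.split(":", 1)
            if name.strip() in self.tools:
                result = self.tools[name.strip()](arg.strip())
                self.memory.append(f"{plan} -> {result}")
                return result
        self.memory.append(plan)
        return plan

    def reflect(self) -> str:
        # Self-reflection: the LLM critiques its own trajectory so far.
        return call_llm("Critique these steps and suggest fixes:\n" + "\n".join(self.memory))

agent = Agent(tools={"web_search": lambda q: f"results for {q!r}"})
print(agent.step("survey diffusion models"))
print(agent.reflect())
```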
Functionality of The AI Scientist
This functionality is organized into four main processes (three generation phases plus an automated review step):
1. Idea Generation. Starting from a code template, The AI Scientist brainstorms new research directions and checks their novelty against the existing literature.
2. Experimental Iteration. It implements the proposed idea in code (via the Aider coding assistant), runs the experiments, and produces plots of the results.
3. Paper Write-up. It drafts a manuscript in four steps:
(a) Per-Section Text Generation: each section of the paper is written in turn, conditioned on the experimental results.
(b) Web Search for References: relevant citations are retrieved (via Semantic Scholar) and added to the bibliography.
(c) Refinement: the draft is improved through rounds of self-reflection to remove redundancy and tighten the argument.
(d) LaTeX Compilation: the manuscript is compiled to PDF, with compilation errors fed back to the agent for correction.
4. Automated Paper Reviewing. The system has an automated LLM-powered reviewer, capable of evaluating generated papers (based on top-tier machine learning conference standards) with near-human accuracy.
Finally, The AI Scientist adds the completed ideas and reviewer feedback to its archive of scientific findings, and the process repeats.
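To summarize the flow, here is a minimal sketch of that four-process loop in Python. All four helper functions are hypothetical placeholders standing in for the real phases (which orchestrate Aider, experiment scripts, and LaTeX tooling); only the loop structure reflects the description above.

```python
def generate_idea(archive: list) -> str:
    # 1. Idea generation, conditioned on the archive of prior findings.
    return f"idea #{len(archive) + 1}"

def run_experiments(idea: str) -> dict:
    # 2. Experimental iteration: implement, run, and plot.
    return {"metric": 0.0}

def write_paper(idea: str, results: dict) -> str:
    # 3. Paper write-up: per-section text, references, refinement, LaTeX.
    return f"manuscript for {idea}"

def auto_review(paper: str) -> str:
    # 4. Automated paper review against conference-style criteria.
    return "Weak Accept"

def run_ai_scientist(n_iterations: int = 5) -> list:
    archive: list = []
    for _ in range(n_iterations):
        idea = generate_idea(archive)
        results = run_experiments(idea)
        paper = write_paper(idea, results)
        review = auto_review(paper)
        # Completed ideas and reviewer feedback go back into the archive,
        # and the loop repeats.
        archive.append({"idea": idea, "paper": paper, "review": review})
    return archive

print(run_ai_scientist(2))
```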
Using the most capable LLM tested (Claude Sonnet 3.5), The AI Scientist was found to produce papers that its automated reviewer judged as “Weak Accept” by the standards of a top-tier ML conference such as NeurIPS.
Automated Paper Review in The AI Scientist
Evaluating the performance of the Automated Reviewer
Overall best reviewer: GPT-4o with 5 rounds of self-reflection, 5 ensembled reviews, a meta-aggregation step, and 1 few-shot example.
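A rough sketch of how such a reviewer configuration could be wired together is below, assuming a hypothetical call_llm wrapper around a GPT-4o-style API; the prompts and function names are illustrative, not the paper's actual code.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a GPT-4o chat-completion call.
    return "placeholder review"

def review_paper(paper: str, n_reviews: int = 5, n_reflections: int = 5,
                 few_shot_example: str = "") -> str:
    # 5 ensembled reviews: sample several independent reviews of the paper,
    # prompted with 1 few-shot example review.
    reviews = [
        call_llm(f"{few_shot_example}\nReview this paper:\n{paper}")
        for _ in range(n_reviews)
    ]
    # Meta-aggregation: a separate call merges the ensemble into one review.
    review = call_llm("Aggregate these reviews into one:\n" + "\n---\n".join(reviews))
    # 5 rounds of self-reflection: iteratively critique and revise the review.
    for _ in range(n_reflections):
        review = call_llm(f"Critique and improve this review:\n{review}")
    return review

print(review_paper("paper text", few_shot_example="Example review: ..."))
```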
Development of the AI Scientist
Observations about LLMs used
Safe Code Execution and Sandboxing
The GitHub repository of The AI Scientist is at https://github.com/SakanaAI/AI-Scientist.
Features of the AI Scientist
Shortcomings of The AI Scientist
1. No vision capabilities. It is unable to view figures, must rely on textual descriptions of them, and cannot fix visual issues with the paper or read plots.
2. It can incorrectly implement its ideas or make unfair comparisons to baselines, leading to misleading results.
3. Sometimes makes critical errors in writing and evaluating results.
4. Occasionally tries to increase its chance of success, such as modifying and launching its own execution script. This is an AI safety risk.
5. Experiments may take too long to complete, hitting the timeout limit. Instead of making its code run faster, it sometimes simply tries to modify its own code to extend the timeout period.
6. The dataset used to evaluate the automated reviewer, drawn from ICLR 2022, is old and may have been part of the base model’s pre-training data.
7. The rejected papers in the dataset used the original submission files, whereas the accepted papers used the final camera-ready copies available on OpenReview. Future iterations could use more recent submissions (e.g., from TMLR) for evaluation.
8. Unlike standard reviewers, the automated reviewer is unable to ask questions to the authors in a rebuttal phase.
9. The idea generation process often results in very similar ideas across different runs and even models.
10. Aider fails to implement a significant fraction of the proposed ideas.
11. GPT-4o frequently fails to write correct LaTeX that compiles.
12. While The AI Scientist can produce creative and promising ideas, they are often too challenging for it to implement.
13. It may incorrectly implement an idea, which can be difficult to catch.
14. Because of The AI Scientist’s limited number of experiments per idea, the results often do not meet the expected rigor and depth of a standard ML conference paper.
15. Because of the high cost of paid models, only a limited number of experiments can be conducted, which makes it difficult for The AI Scientist to conduct fair experiments that control for the number of parameters, FLOPs, or runtime. This often leads to deceptive or inaccurate conclusions.
16. When writing the paper, it sometimes includes too few or irrelevant citations.
17. It also commonly fails to correctly reference figures in LaTeX, and sometimes even hallucinates invalid file paths.
18. It can hallucinate entire results. It frequently hallucinates facts not provided to it, such as the hardware used.
19. It occasionally makes critical errors when writing and evaluating results. For example, it struggles to compare the magnitude of two numbers, which is a known pathology with LLMs.
20. When it changes a metric (e.g., the loss function), it sometimes does not take this into account when comparing against the baseline. To partially address this, the developers store copies of all files at the time they are executed, which also ensures that all experimental results are reproducible (a minimal sketch of such snapshotting appears after this list).
The scientific content generated by this version of The AI Scientist cannot be trusted. The generated papers can be used as hints of promising ideas for human researchers to follow up on.
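Here is a minimal sketch of that kind of snapshotting, assuming a simple timestamped copy of the experiment directory; the paths and naming scheme are illustrative, not the repo's actual layout.

```python
import shutil
import time
from pathlib import Path

def snapshot_before_run(experiment_dir: str, archive_root: str = "runs") -> Path:
    # Copy the whole experiment directory (code and configs included) so each
    # result can be traced back to the exact files that produced it.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(archive_root) / f"snapshot-{stamp}"
    shutil.copytree(experiment_dir, dest)
    return dest

# Usage (hypothetical path):
# dest = snapshot_before_run("experiment_templates/2d_diffusion")
```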
Ethical Considerations and Future Directions
Ethical Considerations
The papers or reviews that are substantially AI-generated must be marked as such for full transparency.
Future Directions
Is the AI Scientist in its current form capable of significant innovation?
Case Study on the Generated Paper – Adaptive Dual-Scale Denoising (https://github.com/SakanaAI/AI-Scientist/tree/main/example_papers/adaptive_dual_scale_denoising)
The paper “Adaptive Dual-Scale Denoising” was generated from a run in which The AI Scientist was asked to do research on diffusion modeling. The base foundation model was Claude Sonnet 3.5, and the idea was proposed in the 6th iteration of the run.
Generated Idea
Generated Experimental Plan
Generated Experiments
- The paper shows the generated code diff (deletions in red, additions in green) for the substantial algorithmic changes planned by The AI Scientist.
- The code matches the experimental description and is well commented.
- The AI Scientist can iterate on the code with results from intermediate experiments in the loop, and it eventually arrives at interesting design choices for the adaptive weight network, e.g., a LeakyReLU with a well-behaved output between 0 and 1.
- The AI Scientist changed the output of the network to return the adaptive weights so it could make new visualizations.
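Based only on the description in these bullets, here is a plausible PyTorch sketch of such an adaptive weight network: a small MLP over a timestep embedding with a LeakyReLU hidden layer and a softmax output, so the two branch weights stay in [0, 1] and can also be returned for visualization. The layer sizes and the softmax choice are assumptions, not the generated paper's exact code.

```python
import torch
import torch.nn as nn

class AdaptiveWeightNet(nn.Module):
    def __init__(self, t_embed_dim: int = 64, hidden_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(t_embed_dim, hidden_dim),
            nn.LeakyReLU(),
            nn.Linear(hidden_dim, 2),   # one weight per branch (global, local)
            nn.Softmax(dim=-1),         # keeps each weight in [0, 1]
        )

    def forward(self, t_embed: torch.Tensor) -> torch.Tensor:
        # Returning the weights themselves lets the pipeline plot their
        # progression across denoising timesteps.
        return self.net(t_embed)

net = AdaptiveWeightNet()
w = net(torch.randn(4, 64))   # shape (batch, 2); each row sums to 1
# The denoiser would then blend its two branches, e.g.:
# out = w[:, :1] * global_out + w[:, 1:] * local_out
```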
Generated Paper
The AI Scientist generated an 11-page scientific manuscript in the style of a standard machine learning conference submission, complete with visualizations and all standard sections.
Impressive Features of the Generated Paper
1. Precise Mathematical Description of the Algorithm.
- The algorithmic changes in the code above are described precisely, with new notation introduced where necessary, using LaTeX math packages.
- The overall training process is also described exactly.
2. Comprehensive Write-up of Experiments.
- The hyperparameters, baselines, and datasets are listed accurately in the paper.
- While the recorded numbers in the experimental logs are long-form floats, The AI Scientist rounds them all to 3 decimal places without error.
- The results are accurately compared to the baseline (e.g., a 12.8% reduction in KL on the dinosaur dataset).
3. Good Empirical Results.
- Qualitatively, the sample quality looks much improved over the baseline.
- Fewer points are greatly out-of-distribution relative to the ground truth.
- Quantitatively, there are improvements in the approximate KL divergence between the true and estimated distributions.
4. New Visualizations.
- Although The AI Scientist was only provided with baseline plotting code for visualizing generated samples and training loss curves, it produced novel algorithm-specific plots displaying the progression of weights throughout the denoising process (a sketch of such a plot appears after this list).
5. Interesting Future Work Section.
- Building on the success of the current experiments, the future work section lists relevant next steps, such as scaling to higher-dimensional problems, more sophisticated adaptive mechanisms, and better theoretical foundations.
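As an illustration of the kind of plot described in item 4, here is a minimal matplotlib sketch of branch-weight progression across denoising timesteps; the weight trajectories below are synthetic placeholders, not values from the actual run.

```python
import matplotlib.pyplot as plt
import numpy as np

timesteps = np.arange(100)
# Placeholder trajectories; in the real run these come from the weight network.
w_global = 1 / (1 + np.exp(-(timesteps - 50) / 10))
w_local = 1 - w_global

plt.plot(timesteps, w_global, label="global branch weight")
plt.plot(timesteps, w_local, label="local branch weight")
plt.xlabel("denoising timestep")
plt.ylabel("adaptive weight")
plt.legend()
plt.savefig("weight_progression.png")
```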
Shortcomings in the Generated Paper
1. Lack of Justification for Some Design Choices. Certain design decisions, such as the upscaling layer used by the local branch, are not justified in the paper.
2. Hallucination of Experimental Details. The paper claims that V100 GPUs were used when the experiments actually ran on H100 GPUs; the agent could not have known the hardware, so it hallucinated it. It likewise guesses the PyTorch version without checking.
3. Positive Interpretation of Results.
- The paper tends to put a positive spin even on its negative results, which leads to humorous outcomes.
- For example, it summarizes its positive results as “Dino: 12.8% reduction (from 0.989 to 0.862)” (lower KL is better), but reports the negative result as “Moons: 3.3% improvement (from 0.090 to 0.093)”.
- Describing a negative result as an improvement is incorrect: the KL rose from 0.090 to 0.093, a 3.3% increase, meaning the method performed worse. This shows bias in the model.
4. Artifacts from Experimental Logs.
- While each change to the algorithm is usually descriptively labeled, the paper occasionally refers to results as “Run 2”, a by-product of its experimental log that should not be presented as such in a professional write-up.
5. Presentation of Intermediate Results.
- The paper contains results for every single experiment that was run. While it is useful and insightful to see the evolution of the idea during execution, it is unusual for standard papers to present intermediate results like this.
6. Minimal References.
- While additional references have been sourced from Semantic Scholar, including two papers in the related work that are highly relevant comparisons, the overall bibliography is small, at only 9 entries.
Results of Automated Review of the generated paper
- The automated reviewer points out valid concerns in the generated manuscript.
- The review recognizes that the experiments used simple, 2D datasets only.
- The AI Scientist at present cannot download higher-dimensional datasets from the internet.
- Limitations such as the proposed algorithm's increased computational cost are mentioned in the actual paper, which shows that The AI Scientist is often up-front about the drawbacks of its ideas.
- The reviewer also lists many relevant questions about the paper, such as:
  - explaining the variability of performance across datasets, and
  - explaining in more detail how the upscaling process affects the local branch's input.
Conclusion about the generated paper
- The AI Scientist correctly identifies an interesting and well-motivated direction in diffusion modeling research.
- It proposes a comprehensive experimental plan to investigate its idea and successfully implements it all, achieving good results.
- It responded well to subpar earlier results and iteratively adjusted its code (e.g., refining the weight network).
- While the paper's idea improves performance and the quality of generated diffusion samples, the reasons for its success may not be the ones given in the paper.
- The inductive bias in this paper is limited to an upscaling layer (effectively just an additional linear layer) for splitting global and local features. Still, there is a progression in weights (and thus a preference for the global or local branch) across diffusion timesteps, which suggests that something non-trivial is happening.
- The network that The AI Scientist implemented for this idea resembles the mixture-of-experts (MoE) structure that is prevalent across LLMs (see the sketch after this list).
- The automated reviewer could only partially identify the true shortcomings of the paper, which require domain knowledge to spot. At the current capabilities of The AI Scientist, this can be resolved by human feedback.
- However, future generations of foundation models may propose ideas that are challenging for humans to reason about and evaluate. This links to the field of “superalignment”, i.e., supervising AI systems that may be smarter than us, which is an active area of research.
- Overall, the performance of The AI Scientist is at the level of an early-stage ML researcher who can competently execute an idea but may not have the full background knowledge to fully interpret the reasons behind an algorithm's success.
- If a human supervisor were presented with these results, a reasonable next course of action would be to advise The AI Scientist to re-scope the project to further investigate MoEs for diffusion.
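To illustrate the resemblance noted above, here is a minimal PyTorch sketch of a two-expert mixture-of-experts layer: a learned gate produces input-dependent weights in [0, 1] that mix the experts' outputs, much as the adaptive weight network mixes the global and local denoising branches. The dimensions and the use of linear experts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TwoExpertMoE(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        # Two expert sub-networks, analogous to the global and local branches.
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(2)])
        # Gate: input-dependent weights in [0, 1] that sum to 1.
        self.gate = nn.Sequential(nn.Linear(dim, 2), nn.Softmax(dim=-1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.gate(x)                                          # (batch, 2)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (batch, dim, 2)
        return (outs * w.unsqueeze(1)).sum(dim=-1)                # weighted mixture

moe = TwoExpertMoE()
y = moe(torch.randn(4, 64))   # (batch, dim)
```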