Evaluating The AI Scientist
The original paper describing The AI Scientist (https://www.arxiv.org/pdf/2408.06292)

In this article, I present a summary of the AI-driven end-to-end agentic pipeline developed by sakana.ai (https://sakana.ai/ai-scientist/) and described in the paper above. This system automates the process of research and scientific discovery in the area of machine learning.

Here is what is covered in this article:

  • Overview
  • Detailed Functionality
  • Automated Reviewer
  • Development of the AI Scientist
  • Shortcomings of the AI Scientist
  • Ethical Considerations and Future Directions
  • Case Study of a Generated Paper


Central Idea of the Paper

The central idea is to build an end-to-end pipeline with multiple AI agents for automatic scientific discovery and research in machine learning, powered by foundation large language models.

  • The paper presents a fully AI-driven end-to-end pipeline for automated machine learning research using state of the art LLMs.
  • The quality of the generated research papers is automatically reviewed with near-human-level review performance (evaluated on the ICLR 2022 OpenReview dataset).
  • The ICLR data is a dataset of scientific peer reviews. It contains information on over 10K papers and the corresponding accept/reject decisions at the ICLR conference, as well as over 40K textual peer reviews written by human reviewers.
  • The system can generate hundreds of medium-quality papers in a week.
  • The entire system is open-sourced under the Apache 2.0 license.

An AI Agent is more than an LLM

An AI Agent is a program that uses an LLM as its core component. In addition, it has access to tools to perform actions such as searching the web and executing code, it has short- and long-term memory, and it has advanced capabilities such as self-reflection and self-criticism, chain-of-thought reasoning, and subgoal decomposition to solve problems that require multiple steps of reasoning or action.

Image Source: https://lilianweng.github.io


Functionality of The AI Scientist

Functionality of the AI Scientist

The functionality shown above is divided into three main phases, followed by an automated reviewing step, giving four main processes in total:


Image Source: https://sakana.ai/assets/ai-scientist/schematic_2.png

Four main processes of the AI Scientist

1. Idea Generation.

  • Given a starting code template of an existing topic, the system first “brainstorms” a diverse set of novel research directions.
  • The starting code base could be code that trains a small transformer (e.g., https://github.com/karpathy/nanoGPT by Andrej Karpathy). The AI Scientist then explores any possible research direction.
  • It then evaluates the novelty of each generated idea by searching for existing papers on Semantic Scholar (using a web-access tool and the Semantic Scholar search API); a minimal sketch of such a query appears after this list.
  • Each idea comprises a description, an experiment execution plan, and numerical scores for interestingness, novelty, and feasibility.
  • Multiple rounds of chain-of-thought reasoning and self-reflection are used to refine and develop each idea.
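To make the novelty check concrete, here is a minimal sketch of such a query (my illustration, not the actual AI Scientist code): it calls the public Semantic Scholar Graph API and applies a toy `looks_novel` heuristic. The endpoint and fields are real; the heuristic and the example idea are made up.

```python
import requests

S2_SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"

def search_related_papers(query: str, limit: int = 10) -> list[dict]:
    """Query the Semantic Scholar Graph API for papers matching `query`."""
    response = requests.get(
        S2_SEARCH_URL,
        params={"query": query, "limit": limit,
                "fields": "title,abstract,year,citationCount"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("data", [])

def looks_novel(idea_title: str, related: list[dict]) -> bool:
    """Toy heuristic: treat the idea as non-novel if an existing title overlaps heavily."""
    idea_words = set(idea_title.lower().split())
    for paper in related:
        title_words = set(paper["title"].lower().split())
        overlap = len(idea_words & title_words) / max(len(idea_words), 1)
        if overlap > 0.8:
            return False
    return True

if __name__ == "__main__":
    idea = "Adaptive dual-scale denoising for low-dimensional diffusion models"
    papers = search_related_papers(idea)
    print("novel" if looks_novel(idea, papers) else "not novel")
```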

2. Experimental Iteration.

  • Given an idea and a template, the second phase of the system first executes the proposed experiments.
  • It uses Aider (https://github.com/paul-gauthier/aider) to first plan a list of experiments to run and then executes them in order.
  • After the completion of each experiment, Aider is given the results and told to take notes in the style of an experimental journal. Based on the results, it then re-plans and implements the next experiment. This process is repeated up to five times (a minimal sketch of this plan-run-replan loop appears after this list).
  • Upon completion of experiments, Aider is prompted to edit a plotting script to create figures for the paper using Python.
  • The AI Scientist makes a note describing what each plot contains, enabling the saved figures and experimental notes to provide all the information required to write up the paper.
  • At all steps, Aider has access to its full history of execution.
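A minimal sketch of the plan-run-replan loop, assuming a hypothetical `ask_coder` helper that stands in for the programmatic Aider calls; the file name, command-line flag, and prompts are illustrative, not the actual implementation.

```python
import subprocess

MAX_ITERATIONS = 5                   # the system re-plans and re-runs up to five times
EXPERIMENT_FILE = "experiment.py"    # illustrative file name

def ask_coder(prompt: str) -> str:
    """Hypothetical stand-in for a programmatic Aider call; replace with a real coding agent."""
    print(f"[coder prompt]\n{prompt}\n")
    return "(coder response placeholder)"

def run_experiment(run_dir: str, timeout_s: int = 7200) -> str:
    """Execute the current experiment script and capture everything it prints."""
    result = subprocess.run(
        ["python", EXPERIMENT_FILE, "--out_dir", run_dir],
        capture_output=True, text=True, timeout=timeout_s,
    )
    return result.stdout + result.stderr

notes = []
for i in range(MAX_ITERATIONS):
    output = run_experiment(run_dir=f"run_{i}")
    # After each run, the coder is given the results, asked to take notes in the style
    # of an experimental journal, and asked to plan and implement the next change.
    notes.append(ask_coder(
        f"Results of run_{i}:\n{output}\n"
        "Record these results in your experimental journal and implement the next experiment."
    ))

# Once the experiments are done, the coder is prompted to edit a plotting script.
ask_coder("Edit plot.py so it produces the figures described in your notes.")
```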

3. Paper Write-up.

  • The third phase of The AI Scientist produces a concise and informative write-up of its progress in the style of a standard machine learning conference proceeding in LaTeX. It uses Semantic Scholar to autonomously find relevant papers to cite. The template includes a LaTeX folder that contains style files and section headers for paper writing.
  • As writing good LaTeX is a complex process, several steps are taken to make it more robust:

(a) Per-Section Text Generation:

  • The recorded notes and plots are passed to Aider, which is prompted to fill in a blank conference template section by section. This goes in order of introduction, background, methods, experimental setup, results, and then the conclusion (all sections apart from the related work).
  • All previous sections of the paper it has already written are in the context of the language model.
  • At each step of writing, Aider is prompted to only use real experimental results in the form of notes and figures generated from code, and real citations to reduce hallucination.
  • Each section is initially refined with one round of self-reflection as it is being written.

(b) Web Search for References:

  • As for idea generation, The AI Scientist is allowed 20 rounds to poll the Semantic Scholar API to search for references for the paper.
  • Alongside each selected paper, a short description is produced of where and how to include the citation, which is then passed to Aider.
  • The paper’s BibTeX is automatically appended to the LaTeX file to guarantee correctness.

(c) Refinement:

  • After the previous two stages, The AI Scientist has a completed first draft but can often be overly verbose and repetitive.
  • To resolve this, one final round of self-reflection section-by-section is performed to remove any duplicated information and streamline the arguments of the paper.

(d) LaTeX Compilation:

  • Once the LaTeX template has been filled in with all the appropriate results, this is fed into a LaTeX compiler.
  • The pipeline uses a LaTeX linter, and compilation errors are fed back into Aider so that it can automatically correct any issues. A minimal sketch of such a compile-and-fix loop follows.
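In the sketch below, `pdflatex` and the `chktex` linter are standard LaTeX tools, while the `ask_coder` stub, retry budget, and prompts are illustrative assumptions rather than the actual implementation.

```python
import subprocess

def ask_coder(prompt: str) -> None:
    """Hypothetical stand-in for the Aider call that applies fixes to the LaTeX source."""
    print(f"[coder prompt]\n{prompt}\n")

def compile_latex(tex_file: str) -> tuple[bool, str]:
    """Run pdflatex in non-interactive mode and return (success, combined log)."""
    result = subprocess.run(
        ["pdflatex", "-interaction=nonstopmode", tex_file],
        capture_output=True, text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr

def lint_latex(tex_file: str) -> str:
    """Run the chktex linter and return its warnings."""
    return subprocess.run(["chktex", tex_file], capture_output=True, text=True).stdout

for attempt in range(3):                         # illustrative retry budget
    ok, log = compile_latex("paper.tex")
    warnings = lint_latex("paper.tex")
    if ok and not warnings:
        break
    # Compiler errors and linter warnings are fed back so the agent can correct them.
    ask_coder(f"Fix these LaTeX problems in paper.tex:\n{log}\n{warnings}")
```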


4. Automated Paper Reviewing. The system has an automated LLM-powered reviewer, capable of evaluating generated papers (based on top-tier machine learning conference standards) with near-human accuracy.

Finally, The AI Scientist adds the completed ideas and reviewer feedback to its archive of scientific findings, and the process repeats.

Using the most capable LLM (Claude Sonnet 3.5), The AI Scientist was found to produce papers that its automated reviewer judged as a “Weak Accept” at a top machine learning conference such as NeurIPS.

Automated Paper Review in The AI Scientist

  • An LLM Reviewer Agent. The paper used a GPT-4o-based agent to conduct paper reviews based on the Neural Information Processing Systems (NeurIPS) conference review guidelines.
  • The review agent processes the raw text of the PDF manuscript using the PyMuPDF parsing library.
  • The output contains numerical scores (soundness, presentation, contribution, overall confidence), lists of weaknesses and strengths as well as a preliminary binary decision (accept or reject).
  • This automated reviewing process is used to obtain an initial evaluation of the papers generated by The AI Scientist; a minimal sketch of such a review call appears below.
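A minimal sketch of how such a review call could be structured, assuming the OpenAI Python client: PyMuPDF (`fitz`) extracts the manuscript text and a single chat-completion call asks for a structured review. The prompt and JSON fields mirror the description above but are illustrative, not the paper's exact prompts.

```python
import json
import fitz                      # PyMuPDF, used to parse the PDF manuscript
from openai import OpenAI

def extract_pdf_text(path: str) -> str:
    """Concatenate the raw text of every page in the manuscript."""
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)

def review_paper(pdf_path: str) -> dict:
    """Ask an LLM reviewer for scores, strengths/weaknesses, and an accept/reject decision."""
    client = OpenAI()
    paper_text = extract_pdf_text(pdf_path)
    prompt = (
        "You are reviewing a machine learning paper under NeurIPS guidelines. "
        "Return JSON with keys: soundness, presentation, contribution, overall, "
        "confidence, strengths, weaknesses, decision (Accept or Reject).\n\n" + paper_text
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```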



Evaluating the performance of the Automated Reviewer

  • To evaluate the LLM-based reviewer’s performance, its decisions about the papers were compared with ground truth data for 500 ICLR 2022 papers extracted from the publicly available OpenReview dataset.
  • Then self-reflection with few-shot examples and response ensembling was used to improve the base LLM’s decision-making process.
  • With GPT-4o, The AI Scientist’s reviewing procedure achieves 70% accuracy when combining 5 rounds of self-reflection, 5 ensembled reviews, and a 1-shot review example taken from the ICLR 2022 review guidelines.
  • During a later LLM-based meta-review, which prompts the agent to act as an Area Chair, the automated reviewer achieves superhuman F1 scores (0.57 vs. 0.49) and human-level AUC (0.65 for both) when thresholding the decision at a score of 6 (a “Weak Accept” in the NeurIPS review guidelines); a minimal sketch of this kind of evaluation appears after this list. This threshold corresponds roughly to the average score of accepted papers.
  • The dataset used to evaluate the automated reviewer (the ICLR 2022 paper dataset) is very class-imbalanced, i.e., it contains many more rejected papers. When considering a balanced dataset of papers, The AI Scientist’s reviewing process achieves human-level accuracy (0.65 vs. 0.66).
  • The False Negative Rate (FNR) is much lower than the human baseline (0.39 vs. 0.52), i.e., the LLM-based review agent rejects fewer high-quality papers. The False Positive Rate (FPR) is higher (0.31 vs. 0.17), highlighting the need for potential future improvements.
  • To further validate the performance of the automated reviewer, the consistency of overall paper scores was compared in two ways: between pairs of anonymous OpenReview reviewers randomly sampled per paper, and between the average of all human reviewers and the LLM score.
  • For the set of 500 ICLR 2022 papers, the correlation between the score of two human reviewers is smaller (0.14) than the correlation between the LLM score and the average score across the reviewers (0.18).
  • Across all metrics, the LLM-based reviews can not only provide valuable feedback but also align more closely with the average human reviewer score than individual human reviewers align with each other.
  • Each review is generated for $0.25 to $0.50 in API costs. While Claude Sonnet 3.5 and GPT-4o-mini provide a more cost-efficient approach, their performance was substantially worse.
  • The scores were thresholded at 8 for Sonnet 3.5 to obtain calibrated results, due to persistent over-optimism bias. Llama 3.1 405B struggled to follow the reviewer output template consistently.
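A minimal sketch of this kind of evaluation, assuming arrays of ground-truth decisions and LLM overall scores; the data below is fake, and the threshold of 6 follows the text above.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score

# Fake stand-in data: 1 = accepted, 0 = rejected; LLM "overall" scores on a 1-10 scale.
human_decisions = np.array([1, 0, 0, 1, 0, 1, 0, 0])
llm_scores = np.array([7, 4, 6, 8, 3, 5, 2, 6])

# Threshold the LLM score at 6 ("Weak Accept") to obtain a binary decision.
llm_decisions = (llm_scores >= 6).astype(int)

print("F1  :", f1_score(human_decisions, llm_decisions))
print("AUC :", roc_auc_score(human_decisions, llm_scores))
print("BAcc:", balanced_accuracy_score(human_decisions, llm_decisions))

# False negative / false positive rates relative to the human ground truth.
fn = np.sum((human_decisions == 1) & (llm_decisions == 0))
fp = np.sum((human_decisions == 0) & (llm_decisions == 1))
print("FNR :", fn / np.sum(human_decisions == 1))
print("FPR :", fp / np.sum(human_decisions == 0))
```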




  • Various prompt configurations for GPT-4o were compared and it was found that both Reflexion (+2%) and one-shot prompting (+2%) increase review accuracy.
  • Using review ensembling does not improve the reviewer’s performance substantially but can reduce variance.

Overall Best Reviewer: GPT-4o with 5 rounds of self-reflection, 5 ensembled reviews, a meta-aggregation step, and 1 few-shot example.

Development of the AI Scientist

  • The AI Scientist was extensively evaluated on three templates across different publicly available LLMs: Claude Sonnet 3.5, GPT-4o, DeepSeek Coder, and Llama-3.1 405b.
  • The first two models are only available via a public API, whilst the latter two models are open-weight.
  • For each run, the developers provided 1-2 basic seed ideas as examples (e.g., modifying the learning rate or batch size) and had The AI Scientist generate another 50 new ideas.
  • Each run of around fifty ideas in total took approximately 12 hours on 8× NVIDIA H100 GPUs.
  • The number of ideas that pass the automated novelty check, successfully complete experiments, and result in valid compilable manuscripts are reported.
  • The mean and max reviewer scores of the generated papers and the total cost of the run are reported.

Observations about LLMs used

  • Claude Sonnet 3.5 consistently produces the highest quality papers, with GPT-4o coming in second.
  • GPT-4o struggles with writing LaTeX, which prevents it from completing many of its papers.
  • For the open-weight models, DeepSeek Coder is significantly cheaper but often fails to correctly call the Aider tools.
  • Llama-3.1 405b performed the worst overall but was the most convenient to work with, as the developers were frequently rate-limited by other providers.
  • Both DeepSeek Coder and Llama-3.1 405b often had missing sections and results in their generated papers.

Safe Code Execution and Sandboxing

  • The current implementation of The AI Scientist has minimal direct sandboxing in the code, leading to several unexpected and sometimes undesirable outcomes if not appropriately guarded against (a sketch of one possible mitigation appears after this list).
  • For example, in one run, The AI Scientist wrote code in the experiment file that initiated a system call to relaunch itself, causing an uncontrolled increase in Python processes and eventually necessitating manual intervention.
  • In another run, The AI Scientist edited the code to save a checkpoint for every update step, which took up nearly a terabyte of storage.
  • In some cases, when The AI Scientist’s experiments exceeded imposed time limits, it attempted to edit the code to extend the time limit arbitrarily instead of trying to shorten the runtime.
  • While this is a creative move, the act of bypassing experimenter-imposed constraints is dangerous from the point of view of AI safety.
  • It sometimes imported unfamiliar Python libraries.
  • It can create folders or change files on the drive due to the lack of guardrails.
  • It can include results and plots that differ significantly from the provided templates.
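As an illustration of the kind of guardrail that is currently missing, here is a minimal sketch (my assumption, not part of the released code) of running a generated experiment script with a hard wall-clock timeout and OS-level resource limits on Linux:

```python
import resource
import subprocess

def limit_resources():
    """Applied in the child process: cap address space (~8 GB) and file size (~1 GB)."""
    resource.setrlimit(resource.RLIMIT_AS, (8 * 1024**3, 8 * 1024**3))
    resource.setrlimit(resource.RLIMIT_FSIZE, (1 * 1024**3, 1 * 1024**3))

def run_sandboxed(script: str, workdir: str, timeout_s: int = 3600) -> int:
    """Run a generated experiment in its own working directory with a hard timeout."""
    try:
        proc = subprocess.run(
            ["python", script],
            cwd=workdir,                 # run inside its own directory (relative paths stay there)
            preexec_fn=limit_resources,  # POSIX only; not available on Windows
            timeout=timeout_s,
        )
        return proc.returncode
    except subprocess.TimeoutExpired:
        print("Experiment exceeded the time limit and was terminated.")
        return -1
```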

The GitHub repository of The AI Scientist is at https://github.com/SakanaAI/AI-Scientist.

Features of the AI Scientist

  • Uses modern LLM techniques like chain-of-thought prompting and self-reflection to improve decision-making.
  • Compute efficient.
  • The cost of developing each idea into a full paper is approximately $15.
  • Has the potential to democratize research and significantly accelerate scientific progress.

Shortcomings of The AI Scientist

1. No vision capabilities. It is unable to view figures, must rely on textual descriptions of them, and cannot fix visual issues with the paper or read plots.

  • The generated plots are sometimes unreadable, tables sometimes exceed the width of the page, and the page layout is suboptimal. Adding multi-modal foundation models could fix this.

2. Can incorrectly implement its ideas or make unfair comparisons to baselines, leading to misleading results.

3. Sometimes makes critical errors in writing and evaluating results.

  • Makes errors when comparing the magnitude of two numbers. To partially address this, the authors have made all experimental results reproducible by storing all files that are executed.

4. Occasionally tries to increase its chance of success, such as modifying and launching its own execution script. This is an AI safety risk.

  • For example, in one run, it edited the code to perform a system call to run itself. This led to the script endlessly calling itself.

5. Experiments may take too long to complete, hitting the timeout limit. Instead of making its code run faster, it simply tried to modify its own code to extend the timeout period.

6. The dataset used for the automated reviewer, from ICLR 2022, is old. It may have been a part of the base model’s pre-training data.

7. The rejected papers in the dataset used the original submission files, whereas the accepted papers used the final camera-ready copies available on OpenReview. Future iterations could use more recent submissions (e.g., from TMLR) for evaluation.

8. Unlike standard reviewers, the automated reviewer is unable to ask questions to the authors in a rebuttal phase.

9. The idea generation process often results in very similar ideas across different runs and even models.

10. Aider fails to implement a significant fraction of the proposed ideas.

11. GPT-4o frequently fails to write correct LaTeX that compiles.

12. While The AI Scientist can produce creative and promising ideas, they are often too challenging for it to implement.

13. It may incorrectly implement an idea, which can be difficult to catch.

14. Because of The AI Scientist’s limited number of experiments per idea, the results often do not meet the expected rigor and depth of a standard ML conference paper.

15. Due to the limited number of experiments that can be conducted (given the high costs of using paid models), it is difficult for The AI Scientist to conduct fair experiments that control for the number of parameters, FLOPs, or runtime. This often leads to deceptive or inaccurate conclusions.

16. When writing the paper, it sometimes includes very few or irrelevant citations.

17. It also commonly fails to correctly reference figures in LaTeX, and sometimes even hallucinates invalid file paths.

18. It can hallucinate entire results. It frequently hallucinates facts not provided to it, such as the hardware used.

19. It occasionally makes critical errors when writing and evaluating results. For example, it struggles to compare the magnitude of two numbers, which is a known pathology with LLMs.

20. When it changes a metric (e.g., the loss function), it sometimes does not take this into account when comparing it to the baseline. To partially address this, the developers store copies of all files when they are executed (this also ensures that all experimental results are reproducible).

The scientific content generated by this version of The AI Scientist cannot be trusted. The generated papers can be used as hints of promising ideas for human researchers to follow up on.

Ethical Considerations and Future Directions

Ethical Considerations

  • Has significant potential for misuse in unethical and unsafe research, for example to create new, dangerous biological viruses or poisons (if given access to wet labs) or to create more dangerous computer viruses and malware.
  • The ability to automatically generate and submit papers to academic venues could greatly increase the workload for reviewers, potentially overwhelming the peer review process and compromising scientific quality control.
  • The Automated Reviewer, if deployed online by reviewers, may significantly lower review quality and impose undesirable biases on papers.
  • The unethical practices of buying and selling authorship of papers, and the illegal paper mills running in many countries like India, may grow explosively.

The papers or reviews that are substantially AI-generated must be marked as such for full transparency.

Future Directions

  • Integrating vision capabilities for better plot and figure handling.
  • Incorporating human feedback and interaction to refine the AI’s outputs.
  • Enabling The AI Scientist to automatically expand the scope of its experiments by pulling in new data and models from the internet, provided this can be done safely.
  • The AI Scientist could follow up on its best ideas or even perform research directly on its own code in a self-referential manner.
  • Expanding the framework to other scientific domains like biology, physics, material sciences.
  • Use of better open-source models.
  • Fully AI-driven scientific ecosystem including LLM-driven researchers, reviewers, area chairs and entire conferences.
  • Increasing reliability and reducing hallucination through a more in-depth automatic verification of the reported results. This could be done by directly linking code and experiments, or by seeing if an automated verifier can independently reproduce the results.

Is the AI Scientist in its current form capable of significant innovation?

  • Though the current iteration of The AI Scientist can innovate on top of well-established ideas, such as diffusion modeling or Transformers, it is doubtful whether it can do original research.
  • It is yet to be discovered whether such systems can propose genuinely paradigm-shifting ideas such as diffusion modeling, or produce the next Transformer architecture.
  • Or invent concepts as fundamental as the artificial neural network, or information theory.


Case Study on the Generated Paper – Adaptive Dual-Scale Denoising (https://github.com/SakanaAI/AI-Scientist/tree/main/example_papers/adaptive_dual_scale_denoising)

The paper “Adaptive Dual-Scale Denoising” is generated from a run where The AI Scientist is asked to do research on diffusion modeling. The base foundation model was Claude Sonnet 3.5. The idea was proposed in the 6th iteration of a Sonnet 3.5 run.

Generated Idea

  • The AI Scientist first generates an idea based on the provided template and its previous archive of discoveries.
  • The idea in the selected paper was proposed in the 6th iteration of the algorithm and aims to improve the ability of diffusion models to capture both global structure and local details in a 2D dataset, by proposing two branches in the standard denoiser network.
  • This is a well-motivated direction and a novel approach.

Generated Experimental Plan

  • The AI Scientist generates an impressive experimental plan that includes the proposed code modification, comparison to baselines, evaluation metrics, and the design of additional plots.
  • LLMs have a positivity bias, reflected in the over-estimation of an idea’s interestingness, feasibility, or novelty.
  • The AI Scientist attaches a “novel” flag at the end to indicate that it believes the idea is novel after searching for related papers using the Semantic Scholar API.


Experimental Plan for the Generated Idea (https://arxiv.org/pdf/2408.06292)

Generated Experiments

The paper shows the generated code diff (deletions are in red, and additions are in green) for the substantial algorithmic changes planned by The AI Scientist.

The code matches the experimental description and is well-commented.

The AI Scientist can iterate on the code with results from intermediate experiments in the loop, and it eventually ends up with interesting design choices for the adaptive weight network, e.g., a LeakyReLU with a well-behaved output between 0 and 1.

The AI Scientist changed the output of the network to return the adaptive weights to make new visualizations.
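An illustrative PyTorch reconstruction of the kind of dual-branch denoiser with an adaptive weight network that the description suggests; the layer sizes, the upscaling layer, and the softmax output are my assumptions, not the code The AI Scientist actually generated.

```python
import torch
import torch.nn as nn

class DualScaleDenoiser(nn.Module):
    """Toy dual-branch denoiser for 2D data: a global branch on the raw input and a
    'local' branch on an upscaled input, mixed by learned per-sample weights in [0, 1]."""

    def __init__(self, hidden: int = 128):
        super().__init__()
        self.upscale = nn.Linear(2, 4)   # upscaling layer feeding the local branch
        self.global_branch = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, 2))
        self.local_branch = nn.Sequential(nn.Linear(5, hidden), nn.ReLU(), nn.Linear(hidden, 2))
        # Adaptive weight network: LeakyReLU hidden layer, softmax keeps outputs in [0, 1].
        self.weight_net = nn.Sequential(
            nn.Linear(3, 32), nn.LeakyReLU(), nn.Linear(32, 2), nn.Softmax(dim=-1)
        )

    def forward(self, x: torch.Tensor, t: torch.Tensor):
        xt = torch.cat([x, t], dim=-1)               # condition both branches on the timestep
        g = self.global_branch(xt)
        l = self.local_branch(torch.cat([self.upscale(x), t], dim=-1))
        w = self.weight_net(xt)                      # returned so the weights can be visualized
        out = w[:, :1] * g + w[:, 1:] * l
        return out, w

# Usage: x is a batch of 2D points, t a (batch, 1) tensor of diffusion timesteps.
model = DualScaleDenoiser()
noise_pred, weights = model(torch.randn(16, 2), torch.rand(16, 1))
```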


Plan for code changes 1 (https://www.arxiv.org/abs/2408.06292)


Plan for code changes 2 (https://www.arxiv.org/abs/2408.06292)

Generated Paper

The AI Scientist generates an 11-page scientific manuscript in the style of a standard machine learning conference submission, complete with visualizations and all standard sections.


The paper generated by The AI Scientist (https://sakana.ai/assets/ai-scientist/adaptive_dual_scale_denoising.pdf)

Impressive Features of the Generated Paper

1. Precise Mathematical Description of the Algorithm.

The algorithmic changes in the code above are described precisely, with new notation introduced where necessary, using LaTeX math packages.

The overall training process is also described exactly.

2. Comprehensive Write-up of Experiments.

The hyperparameters, baselines, and datasets are listed accurately in the paper.

While the recorded numbers in the experimental logs are long-form floats, The AI Scientist chooses to round them all to 3 decimal places without error.

The results are accurately compared to the baseline (e.g., 12.8% reduction in KL on the dinosaur dataset).

3. Good Empirical Results.

Qualitatively, the sample quality looks much improved from the baseline.

Fewer points are greatly out-of-distribution with respect to the ground truth.

Quantitatively, there are improvements in the approximate KL divergence between the true and estimated distributions (a minimal sketch of one way to compute such an estimate follows).
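A minimal sketch of one way such an estimate can be computed, using a simple histogram-based KL estimator on 2D point clouds; the template's actual estimator may differ, and the bin count, smoothing constant, and synthetic data below are assumptions.

```python
import numpy as np

def approx_kl_2d(true_samples: np.ndarray, gen_samples: np.ndarray, bins: int = 50) -> float:
    """Histogram-based estimate of KL(true || generated) for 2D point clouds."""
    lo = np.minimum(true_samples.min(axis=0), gen_samples.min(axis=0))
    hi = np.maximum(true_samples.max(axis=0), gen_samples.max(axis=0))
    edges = [np.linspace(lo[d], hi[d], bins + 1) for d in range(2)]
    p, _, _ = np.histogram2d(true_samples[:, 0], true_samples[:, 1], bins=edges)
    q, _, _ = np.histogram2d(gen_samples[:, 0], gen_samples[:, 1], bins=edges)
    eps = 1e-8                                   # smoothing to avoid division by zero
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return float(np.sum(p * np.log(p / q)))

# Example with synthetic stand-in data (not the paper's datasets).
rng = np.random.default_rng(0)
true_pts = rng.normal(size=(10_000, 2))
gen_pts = rng.normal(scale=1.1, size=(10_000, 2))
print(approx_kl_2d(true_pts, gen_pts))
```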

4. New Visualizations.

Although The AI Scientist was provided only with some baseline plotting code for visualizing generated samples and the training loss curves, it produced novel algorithm-specific plots displaying the progression of weights throughout the denoising process.

5. Interesting Future Work Section.

Building on the success of the current experiments, the future work section lists relevant next steps such as scaling to higher-dimensional problems, more sophisticated adaptive mechanisms, and better theoretical foundations.

Shortcomings in the Generated Paper

1. Subtle Error in Upscaling Network. While a linear layer upscales the input to the denoiser network, only the first two dimensions are used for the “local” branch, so the upscaling layer is effectively just a linear layer that preserves the same dimensionality. In other words, no real upscaling is done.

2. Hallucination of Experimental Details. The paper claims that V100 GPUs were used, while H100 GPUs were actually used. The agent hallucinates the GPU model because it could not have known the actual hardware, and it also guesses the PyTorch version without checking.

3. Positive Interpretation of Results.

The paper is biased in that it tends to put a positive spin even on its negative results, which leads to humorous outcomes.

For example, while it summarizes its positive results as “Dino: 12.8% reduction (from 0.989 to 0.862)” (lower KL is better), the negative results are reported as “Moons: 3.3% improvement (from 0.090 to 0.093)”.

Describing a negative result (the KL went up) as an improvement is incorrect and shows the model’s bias.

4. Artifacts from Experimental Logs.

While each change to the algorithm is usually descriptively labeled, it occasionally refers to results as “Run 2”, which is a by-product of its experimental log and should not be presented as such in a professional write-up.

5. Presentation of Intermediate Results.

The paper contains results for every single experiment that was run. While this is useful and insightful for seeing the evolution of the idea during execution, it is unusual for standard papers to present intermediate results like this.

6. Minimal References.

While additional references have been sourced from Semantic Scholar, including two papers in the related work that are very relevant comparisons, overall the bibliography is small, at only 9 entries.


Results of Automated Review of the generated paper

The automated reviewer points out valid concerns in the generated manuscript.

The review recognizes that the experiments were conducted with simple, 2D datasets only.

The AI Scientist at present cannot download higher-dimensional datasets from the internet.

Limitations such as the proposed algorithm’s increased computational cost are mentioned in the actual paper, which shows that The AI Scientist is often up-front about the drawbacks of its idea.

The reviewer also lists many relevant questions about the paper, such as:

  • explaining the variability of performance across datasets,
  • explaining in more detail how the upscaling process affects the local branch’s input.

Conclusion about the generated paper

The AI Scientist correctly identifies an interesting and well-motivated direction in diffusion modeling research.

It proposes a comprehensive experimental plan to investigate its idea, and successfully implements it all, achieving good results.

It responded well to subpar earlier results and iteratively adjusted its code (e.g., refining the weight network).

While the paper’s idea improves performance and the quality of generated diffusion samples, the reasons for its success may not be as explained in the paper.

The inductive bias in this paper is limited to an upscaling layer (effectively just an additional linear layer) for splitting global and local features. However, there is a progression in weights (and thus a preference for the global or local branch) across diffusion timesteps, which suggests that something non-trivial is happening.

The network that The AI Scientist has implemented for this idea resembles the mixture-of-experts structure that is prevalent across LLMs.

The automated reviewer could only partially identify the true shortcomings of the paper, which require domain knowledge to spot. At the current capability level of The AI Scientist, this can be resolved by human feedback.

However, future generations of foundation models may propose ideas that are challenging for humans to reason about and evaluate. This links to the field of “superalignment”, or supervising AI systems that may be smarter than us, which is an active area of research.

Overall, the performance of The AI Scientist is at the level of an early-stage ML researcher who can competently execute an idea but may not have the full background knowledge to fully interpret the reasons behind an algorithm’s success.

If a human supervisor were presented with these results, a reasonable next course of action could be to advise The AI Scientist to re-scope the project to further investigate mixtures of experts (MoEs) for diffusion.


References:

  • https://github.com/SakanaAI/AI-Scientist
  • https://arxiv.org/pdf/2408.06292
  • https://sakana.ai/ai-scientist/
  • https://sakana.ai/assets/ai-scientist/adaptive_dual_scale_denoising.pdf
  • https://github.com/paul-gauthier/aider
