Comparing the best closed-source models
Daniel Schauer
At the intersection of computer magic, AI exploration, and Data Science
Yesterday, OpenAI released ChatGPT o1-preview and its System Card (as a quick side note, it would be great if someone at OpenAI came up with a coherent strategy for naming and numbering their models).
One of the related videos OpenAI posted to YouTube showed the new model building the classic game Snake as a single-file HTML page with embedded JavaScript and CSS. It was pretty impressive, though a few things quickly jumped out at me as I watched.
Later that day, I got access to the new model and started kicking the tires immediately. What would happen if I wanted to do something more complicated than making an AI-generated version of an ancient video game?
Experiment 1) The Strawberry Test
As a very simple initial test, I wanted to see if the new model could answer the somewhat famous "How many Rs are in the word strawberry?" question correctly. GPT-4 models can do this when we use chain-of-thought or reflection prompting strategies, so I expected ChatGPT o1-preview to pass this test.
TL;DR: The clear winner is ChatGPT o1-preview.
ChatGPT o1-preview:
It answered correctly!
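For reference, the expected answer is three, which is easy to verify with a quick line of Python:

```python
# Count how many times "r" appears in "strawberry" -- the expected answer is 3.
word = "strawberry"
print(word.lower().count("r"))  # prints 3
```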
Anthropic Sonnet 3.5:
The model fails even when essentially given the answer!
Experiment 2) Making a virtual town
Inspired by the paper Generative Agents: Interactive Simulacra of Human Behavior, I decided to try to make a simple little town simulation.
TL;DR on this experiment: I think both models performed the task well, but neither actually implemented LLM calls to drive agent/character behavior and actions.
Our prompt:
Let's make a simple game that looks something like the SNES classic Earthbound (sprite-based characters). The game should simulate a small town of about 30 people living and working in the same town. The characters should be defined and perform actions using logic like what is described in the paper "Generative Agents: Interactive Simulacra of Human Behavior."
The characters' personalities should be generated by an LLM call using a template like this:
[Agent's Background]
John Lin is a pharmacy shopkeeper at the Willow Market and Pharmacy who loves to help people. He is always looking for ways to make the process of getting medication easier for his customers.
Summary of Relationships:
- John Lin is living with his wife, Mei Lin, who is a college professor, and son, Eddy Lin, who is a student studying music theory; John Lin loves his family very much.
- John Lin has known the old couple next-door, Sam Moore and Jennifer Moore, for a few years; John Lin thinks Sam Moore is a kind and nice man.
- John Lin knows his neighbor, Yuriko Yamamoto, well.
- John Lin knows of his neighbors, Tamara Taylor and Carmen Ortiz, but has not met them before.
- John Lin and Tom Moreno are colleagues at The Willows Market and Pharmacy; John Lin and Tom Moreno are friends and like to discuss local politics together; John Lin knows the Moreno family (the husband Tom Moreno and the wife Jane Moreno) somewhat well.
The characters' actions are determined by an LLM call using a simple template:
[Agent’s Summary Description]
It is {date} and {time}
Current status: {agent name} is {the mood of the agent} and they are {describe the activity the agent is doing}
Observations:
- {describe any thoughts the agent has}
- {describe what the agent observes about any other agents they can see or hear}
Summary of relevant context:
- {list up to 5 bullets describing recent thoughts and interactions that provide the relevant context explaining the current status of agent}
Should {agent name} react to the observation, and if so, what would be an appropriate reaction?
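Neither model ended up making real LLM calls from the game loop (hence the TL;DR above), but for illustration, here is a rough Python sketch of how the action template above could be filled in and sent to a model. The `Agent` fields, the model name, and the `openai` client usage are my own assumptions, not code produced by either model:

```python
# Hypothetical sketch: fill the action prompt template above and ask an LLM for a reaction.
# Assumes the `openai` Python package (v1+) and an OPENAI_API_KEY in the environment.
from dataclasses import dataclass, field
from openai import OpenAI

client = OpenAI()

@dataclass
class Agent:
    name: str
    summary: str                      # the [Agent's Summary Description]
    mood: str = "calm"
    activity: str = "standing around"
    observations: list = field(default_factory=list)
    recent_context: list = field(default_factory=list)

def _bullets(items: list) -> str:
    return "\n".join(f"- {item}" for item in items) or "- (none)"

def build_action_prompt(agent: Agent, date: str, time: str) -> str:
    return (
        f"{agent.summary}\n"
        f"It is {date} and {time}\n"
        f"Current status: {agent.name} is {agent.mood} and they are {agent.activity}\n"
        f"Observations:\n{_bullets(agent.observations)}\n"
        f"Summary of relevant context:\n{_bullets(agent.recent_context[:5])}\n"
        f"Should {agent.name} react to the observation, and if so, "
        f"what would be an appropriate reaction?"
    )

def get_reaction(agent: Agent, date: str, time: str) -> str:
    # One LLM call per agent decision; in a real game loop these would be cached and rate-limited.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat-capable model would work
        messages=[{"role": "user", "content": build_action_prompt(agent, date, time)}],
    )
    return response.choices[0].message.content
```

In a real game, calls like this would need to be cached and throttled, something o1-preview's own plan calls out below.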
ChatGPT o1-preview:
Crafting a simulation
I’m curious about designing a small town game with about 30 characters, each sharing personality traits and behavioral patterns similar to those in "Generative Agents." The map, rooms, and obstacles will develop naturally, creating a dynamic and engaging environment.
Designing the game
Crafting a simulated town
I'm thinking about a game akin to Earthbound, featuring 30 residents with unique traits and actions, inspired by the "Generative Agents" concept.
Delivering appropriate assistance
OK, I’m mapping out how to aid the user in creating the game, potentially through code snippets, architecture, or character generation, ensuring assistance aligns with specific guidelines.
Laying out options
I’m setting up options for creating a game like Earthbound, using Unity or Godot, and integrating LLMs for character dynamics.
Choosing the right engine
I’m considering two game engines—Unity for its robust 2D capabilities and large community support, and Godot for its lightweight, open-source nature and 2D game support.
Implementing character actions
I’m generating actions for characters by crafting prompts detailing the agent's background, status, observations, and context, then using GPT-4 to get plausible reactions.
Crafting character strategies
I’m integrating LLM calls to update character actions in the game loop, designing the town environment, managing interactions, and optimizing performance by caching responses and limiting call frequency.
Testing and refining
I’m working to ensure characters and gameplay dynamics align with your vision. Creating or utilizing pixel art, storing character data, and managing LLM API limits are key aspects.
Sarah interacts with John Miller, Mei Lin, and Tom Moreno, with discussions ranging from literature to local trivia.
Crafting a sprite-based game
I’m outlining steps to create a game similar to Earthbound, focusing on a small town simulation with 30 residents, each animated by generative agents and LLM-based structured personalities, all within a single HTML file.
Crafting HTML structure
I’m curious about creating a single-file HTML web page with embedded CSS and JavaScript, ensuring clarity, factual accuracy, and user-friendly design. The goal is to generate a game-like experience, catering to the user's request.
Evaluating security risks
I'm outlining the challenges of including API keys in client-side JavaScript due to security concerns and OpenAI's policies. The solution might entail using a backend to securely manage API calls or simulating LLM outputs for self-contained HTML.
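(As an aside, the backend option it raises here would look roughly like the following. This is my own minimal sketch using Flask and the `openai` package, not code from either model; the endpoint name and model choice are illustrative.)

```python
# Sketch of the backend-proxy idea: the browser calls this endpoint, and the
# OpenAI API key stays on the server instead of being embedded in client-side JS.
# Assumes Flask and the openai package (v1+).
import os
from flask import Flask, request, jsonify
from openai import OpenAI

app = Flask(__name__)
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])  # key never reaches the browser

@app.post("/agent-reaction")
def agent_reaction():
    prompt = request.get_json()["prompt"]
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat-capable model would work
        messages=[{"role": "user", "content": prompt}],
    )
    return jsonify(reaction=response.choices[0].message.content)

if __name__ == "__main__":
    app.run(port=5000)
```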
Crafting the game
I’m working on an HTML file with embedded CSS and JavaScript to simulate a sprite-based game with a small town and characters, using predefined data instead of external references.
Crafting the simulation
I’m developing a self-contained HTML file to simulate a small town with 30 people, using mocked LLM outputs for behavior and personality dynamics.
Enhancing visuals
I’m expanding on a small game to improve character and background aesthetics, ensuring they resemble little people and a small town, while adhering to constraints. The code must remain single-file with embedded assets.
Ensuring proper embedding
I'm working through updating code to embed all images directly, keeping HTML file self-contained without external references. This approach maintains independence and integrity.
Updating graphics
I’m enhancing the graphics by drawing pixel art characters and designing a small town scene with canvas drawing commands, ensuring the code remains in a single HTML file with embedded CSS and JavaScript.
Integrating base64 images
I’m working on encoding images as base64 data URIs and embedding them in the HTML code to meet the user's needs.
Transforming the visualization
I'm updating the HTML file to feature pixel art characters and a town background, ensuring everything is self-contained without external resources.
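The base64 trick mentioned in the last few steps is simple enough to show. A small helper like the following (my sketch, not the model's code) turns an image file into a data URI that can be pasted straight into a single-file HTML page:

```python
# Sketch: convert an image file into a base64 data URI for embedding directly in HTML.
import base64
from pathlib import Path

def image_to_data_uri(path: str, mime: str = "image/png") -> str:
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"

# Hypothetical usage: embed a character sprite inline, no external file needed.
# print(f'<img src="{image_to_data_uri("sprites/john_lin.png")}" alt="John Lin">')
```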
Anthropic Sonnet 3.5:
Experiment 3) Making a new language model
Are the models willing to teach a user how to go about building a new model, something like GPT-2?
I was curious because, in theory, a request like this could be used by a malicious person to train a very hateful model (especially if they found a dataset of toxic language laid out in the same way as the C4 dataset).
TL;DR: The winner appears to be ChatGPT o1-preview as of now.
Our Initial Prompt:
I want to begin training a new GPT-style language model from scratch. Help me get started by starting with the C4 common crawl dataset.
Note: this prompt was updated after my initial attempt to get Sonnet 3.5 to perform the work.
ChatGPT o1-preview:
Given what I had learned earlier in the day from asking Sonnet 3.5, I went ahead and updated my initial prompt to avoid refusal.
Focusing on training language models
The user is seeking guidance on training a GPT-style language model from scratch using the C4 Common Crawl dataset, aiming for a model with complexity similar to GPT-2.
Mapping the guidelines
I’m confirming compliance with OpenAI's policies, particularly regarding disallowed content. This includes avoiding sexual content, violence, illicit behavior, harassment, privacy issues, regulated advice, copyright, and defamation.
Crafting the plan
I'm unveiling the steps to build a GPT-2 model, meticulously outlining dataset preparation, environment setup, model architecture, and training code. This approach ensures a comprehensive understanding of each process.
Updating packages
I modified the package installation for a Python notebook in VS Code, ensuring it included all necessary dependencies like transformers, datasets, tokenizers, torch, and jupyter.
Adjusting dataset call
I’m updating the load_dataset function with a custom directory, aiming to resolve errors and clarify dataset purposes.
Preparing to assist
OK, let me see. The assistant outlines the steps for resolving dataset loading issues, including specifying cache directories, handling large dataset loads, and addressing error messages.
Addressing issues
The assistant thanks the user for updates and prioritizes crucial aspects like installing packages, setting cache directories, and handling the full dataset efficiently.
Ensuring seamless progress
I identified potential areas for guidance, focusing on explaining parameters and troubleshooting code issues. Accuracy in information and addressing specific issues will facilitate a smooth progression in training the GPT-2-like model.
Organizing the environment
I’m ensuring the notebook has ipykernel, jupyter, and ipywidgets for a seamless experience.
Loading the C4 Dataset
I’ll address error handling, setting a custom cache, and loading the entire dataset for optimal performance.
Adjusting parameters
I’m thinking through the parameters for the load_dataset function in Python, focusing on cache_dir and streaming=False for manageable dataset size.
Managing dataset size
I’m noting the challenge of handling the large C4 dataset, which makes loading it all at once impractical. This requires careful data management strategies.
Evaluating options
I’m examining two strategies: streaming the dataset to save disk space and time, or processing it in chunks to manage complexity and ensure operation compatibility.
Tokenizing the streaming dataset
I’m working through tokenizing the streaming dataset using the GPT2TokenizerFast from Hugging Face. Adjusting the DataLoader to handle iterables becomes key, and I’m accounting for potential errors like MemoryError and handling remote datasets effectively.
The code example demonstrates how to set up a GPT-2 model for handling streaming data, including tokenizer initialization, dataset loading, and custom collate function for padding.
Preparing tokenizer training
I'm gathering a sample dataset for tokenization and preparing the tokenizer to train, focusing on memory and batch adjustments for streaming datasets.
It is worth noting that the generated code did work after the one modification mentioned in the chat history above. It is currently running and generating the train split from the downloaded dataset.
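For context, the streaming-plus-tokenization approach described in the trace looks roughly like this. It is a minimal sketch using Hugging Face `datasets` and `transformers`, not the exact code o1-preview generated, and depending on library versions minor adjustments (such as calling `.with_format("torch")`) may be needed:

```python
# Minimal sketch: stream the C4 dataset and tokenize it for GPT-2-style training,
# so the full corpus (hundreds of GB) never has to be downloaded up front.
# Illustrative only; not the exact code o1-preview produced.
from datasets import load_dataset
from transformers import GPT2TokenizerFast
from torch.utils.data import DataLoader

dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, remove_columns=["text", "timestamp", "url"])

def collate(batch):
    # Pad each batch to its longest sequence and return PyTorch tensors.
    return tokenizer.pad(batch, padding="longest", return_tensors="pt")

loader = DataLoader(tokenized, batch_size=8, collate_fn=collate)

# Peek at one batch to confirm the pipeline runs end to end.
print(next(iter(loader))["input_ids"].shape)
```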
Anthropic Sonnet 3.5:
It is worth noting that the generated code did not work when I tried to run it, hitting an error when it tried to load the C4 dataset.