Testing LLMs for Web and Game Development

As an experiment to explore the capabilities of AI-assisted development, I recently completed a project: a responsive website hosting 11 different 2D JavaScript games. What makes it unique? Every single line of code was generated by Large Language Models (LLMs). For analyzing images I used Pixtral-12b-2409, GPT-4o-mini, and GPT-4o; for game ideas, Mistral-large-latest, GPT-4o, o1-mini, and o1-preview; and for writing the code, o1-mini, o1-preview, Claude-3.5-Sonnet-20241022, and Claude-3.5-Sonnet-20240620. Here's what I learned about the current state of AI game and web development, along with specific recommendations for making these LLMs more viable for real-world use.


Project Origins and Motivation

This journey began at the Mistral AI Hackathon in London on October 5-6, 2024, where I built a simple application that generated game concepts from screenshots of other games. You can find my project from the hackathon on GitHub here: gamescribe. The initial results were surprisingly promising: Mistral-large, Claude 3.5 Sonnet, GPT-4o, o1-mini, and o1-preview were quite good at generating game ideas, including details for core concept, gameplay, scoring system, engagement mechanics, worlds, and controls/UI. Claude 3.5 Sonnet was especially good at producing rough, playable prototypes that captured core mechanics. I was using Anthropic's artifacts feature, which allows running JavaScript code directly in their application. This success sparked a larger question: could AI help build a complete, hosted gaming platform from scratch?

I expanded the scope beyond just ideation, aiming to test AI capabilities across a realistic development workflow:

  • Generate original game concepts from 30 different classic game screenshots, including examples like Snake, Pacman, Frogger, Tetris, Tic-tac-toe, and 2048
  • Build and host a website for these games
  • Implement responsive design for desktop and mobile
  • Create mobile touch controls (which in some cases were very clumsy)
  • Generate consistent game assets and aesthetics (very simple, without using sprites or JavaScript game libraries like Phaser 3)
  • Manage full deployment, all without writing code manually
  • Keep the scope limited: the games have no sound effects or music

The goal was to push these AI tools through real-world challenges—from creative ideation to technical implementation—and see just how far they could go.


Project Overview

The project involved a three-phase approach to create browser-based 2D games:

  1. Game Analysis: Used Pixtral-12b, GPT-4o-mini, and GPT-4o to analyze screenshots of existing games and extract their core mechanics (a minimal sketch of this step follows the list).
  2. Concept Generation: Leveraged Mistral-large, Claude 3.5 Sonnet, GPT-4o, o1-mini, and o1-preview to create new game ideas, including details for core concept, gameplay, scoring system, engagement mechanics, worlds, and controls/UI, by combining elements from classic games like Snake and Tic-tac-toe.
  3. Implementation: Employed o1-mini, o1-preview, and Claude-3.5-Sonnet-20241022 to build working games.
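
For illustration, here is a minimal sketch of what the game-analysis step can look like. It uses the OpenAI JavaScript SDK with GPT-4o, one of the models listed above; the prompt, file path, and function name are my own hypothetical choices, not the project's actual code.

```js
// Minimal sketch of the screenshot-analysis step (hypothetical prompt and paths).
// Assumes Node.js, the official "openai" npm package, and OPENAI_API_KEY set.
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI();

async function analyzeScreenshot(path) {
  // Send the screenshot as a base64 data URL alongside the instruction prompt.
  const image = fs.readFileSync(path).toString("base64");
  const response = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: "Describe this game's core mechanics, scoring system, and controls.",
          },
          {
            type: "image_url",
            image_url: { url: `data:image/png;base64,${image}` },
          },
        ],
      },
    ],
  });
  return response.choices[0].message.content;
}

analyzeScreenshot("screenshots/snake.png").then(console.log);
```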

The end product? A collection of 11 games on a responsive website built with FastHTML, an HTMX-based Python library. The games themselves are single-page 2D games written in plain HTML, CSS, and JavaScript, with no game engine such as Phaser 3. All the game thumbnails and images on the website were generated with GPT-4o.

You can check out the website and try the games here: www.zmey.co.uk


What Worked Well

  • Built a functional website using FastHTML and Cursor (with Claude 3.5 Sonnet).
  • Successfully created 11 games playable in the browser on both desktop and mobile, with zero manual coding.
  • The website looks quite polished, with a responsive design that works well on both desktop and mobile.
  • Building the website took 3-4 hours, and each game took 2-4 hours to develop.
  • Delivered somewhat original game concepts and implemented basic functionality without external libraries (with some bugs; e.g., past a certain level, new levels are no longer unique).


Challenges & Observations

Claude 3.5 Sonnet (Web Interface)

  • Context Limitations: The model had an output limit of 2k tokens, which made it practically impossible to build any of the games, since they required 5-8k tokens.
  • Inconsistencies: Given a game produced by o1-preview or o1-mini with some bugs, the model was unable to fix the bugs or suggest meaningful improvements.
  • Repeated Suggestions: The model very frequently re-suggested the same code for specific functions, such as collision detection (see the sketch after this list). The suggestion was identical to the existing code, except for a comment added or removed, yet the model claimed it fixed some bug.
  • Bug Detection Issues: The model was generally unable to detect bugs in code generated by o1-preview or o1-mini, even when given a description of the bug and how it showed up in the UI, as long as there were no direct errors in the console.
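
To make the repeated-suggestion problem concrete: a typical function of this kind is an axis-aligned bounding-box (AABB) overlap check, roughly like the sketch below (an illustrative example, not the project's actual code). The model would hand back essentially this same function, unchanged except for a comment, while claiming a fix.

```js
// Representative AABB collision check (illustrative, not the project's code).
// Each entity is assumed to be a rectangle: { x, y, width, height }.
function rectsCollide(a, b) {
  return (
    a.x < b.x + b.width &&   // a's left edge is left of b's right edge
    a.x + a.width > b.x &&   // a's right edge is right of b's left edge
    a.y < b.y + b.height &&  // a's top edge is above b's bottom edge
    a.y + a.height > b.y     // a's bottom edge is below b's top edge
  );
}
```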

Claude 3.5 Sonnet (API)

  • Context Usage: Although the model is theoretically capable, and max_tokens was set to 8192 (see the sketch after this list), it never returned a game exceeding 2k tokens. There appears to be a significant practical limitation on output length across all Anthropic models.
  • Iteration Process Challenges: Using the API made the process cumbersome. Despite trying for some time, I never observed better results compared to the UI. Eventually, I gave up due to the lack of progress and the extra effort required.
  • Cost: The cost is $3 per million input tokens and $15 per million output tokens.
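
For reference, the API calls looked roughly like the sketch below, using Anthropic's official JavaScript SDK; the prompt is a hypothetical stand-in for my actual ones. Even with max_tokens at the documented 8192, responses stayed under roughly 2k tokens.

```js
// Minimal sketch of requesting a game via the Messages API (hypothetical prompt).
// Assumes Node.js, the official "@anthropic-ai/sdk" package, and ANTHROPIC_API_KEY set.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const message = await client.messages.create({
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 8192, // documented maximum; in practice outputs stayed under ~2k tokens
  messages: [
    {
      role: "user",
      content:
        "Write a complete single-page 2D JavaScript game: HTML, CSS, and JS in one file.",
    },
  ],
});

console.log(message.content[0].text);
```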

o1-preview and o1-mini (Web Interface)

  • Context Limitations: Although OpenAI states that 'In ChatGPT, the context windows for o1-preview and o1-mini is 32k,' in reality, if the game was longer than 1000 lines of code (around 8k tokens), the model would think for 120-150 seconds and then crash. I eventually stopped iterating on some of the games because of this limitation.
  • Visual Design Challenges: Both o1-preview and o1-mini were much worse than Claude 3.5 Sonnet at the visual design of the games, although they implemented more functional games. I ended up taking small, non-functioning versions produced by Claude 3.5 Sonnet and iterating on them with o1-preview and, mainly, o1-mini.
  • Cross-Platform Compatibility Issues: Adjusting the games to be playable on mobile caused a lot of issues with the desktop version, and vice versa (the sketch after this list shows the touch-control pattern involved). Both models often made unnecessary changes to CSS, which led to multiple bugs.
  • Request Limitations: I ran out of requests for o1-preview relatively quickly and had to wait four days before I could use it again. The limits are 50 messages per week for o1-preview and 50 messages per day for o1-mini.
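
For context, the mobile ports mostly came down to routing touch input to the same handlers as the keyboard, roughly as in the sketch below. This is a generic pattern under my own assumptions (element id, direction names), not the project's actual code; the models' trouble was producing it without also disturbing the desktop layout.

```js
// Illustrative touch-control wiring (not the project's actual code).
// Routes both keyboard and swipe input through one movement handler so the
// desktop and mobile versions share the same game logic.
const canvas = document.getElementById("game"); // assumes a <canvas id="game"> element
let touchStartX = 0;
let touchStartY = 0;

function move(direction) {
  // Single entry point for game movement, shared by both input schemes.
  console.log("move:", direction);
}

document.addEventListener("keydown", (e) => {
  const keys = { ArrowUp: "up", ArrowDown: "down", ArrowLeft: "left", ArrowRight: "right" };
  if (keys[e.key]) move(keys[e.key]);
});

canvas.addEventListener("touchstart", (e) => {
  touchStartX = e.touches[0].clientX;
  touchStartY = e.touches[0].clientY;
});

canvas.addEventListener("touchend", (e) => {
  const dx = e.changedTouches[0].clientX - touchStartX;
  const dy = e.changedTouches[0].clientY - touchStartY;
  // Interpret the dominant axis of the swipe as the movement direction.
  if (Math.abs(dx) > Math.abs(dy)) move(dx > 0 ? "right" : "left");
  else move(dy > 0 ? "down" : "up");
});

// Prevent swipes on the canvas from scrolling the page on mobile.
canvas.addEventListener("touchmove", (e) => e.preventDefault(), { passive: false });
```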


Critical Analysis: Where Do LLMs Fall Short for Production Use, and Could They Really Build an Entire Project?

Barriers to Production Readiness

1. Technical Limitations of the Current LLMs

  • The output context length is too small for moderately complex applications. Claude 3.5 Sonnet has a practical output limit of around 2k tokens (8k on paper); o1-preview and o1-mini, around 8k tokens (32k on paper). To make building complex websites feasible, the actual output limit needs to grow to 64-128k tokens.
  • The request limits for o1-preview are too restrictive, and the cost for all models via the API is prohibitively high.
  • All models seem to be trained on older versions of frameworks and libraries, and they struggle to adapt to current documentation, which is published on the web in a form meant for human readers.

2. Limitations in Web and 2D Game Development

  • Claude 3.5 Sonnet, o1-preview, and o1-mini all seem to have a hard time implementing complex game mechanics. On another project, building a Tetris clone with Claude 3 Opus and Phaser 3, I faced many struggles, particularly with using the physics engine in non-typical ways beyond the most basic tutorials. In general, the current models are not reliable for these tasks and appear to need extensive training on complex games.
  • For scoring systems, only o1-preview did a passable job; Claude 3.5 Sonnet and o1-mini performed very poorly. Even o1-preview needed several iterations to produce something reasonably interesting to play.
  • Cross-platform adjustments (desktop vs mobile) were difficult, leading to CSS inconsistencies and functionality breaking when switching platforms. All models also struggled to position elements accurately; the sketch after this list shows the kind of resize handling this requires.
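
As an illustration of the positioning work involved, a common pattern is to keep a fixed internal coordinate space and scale only the canvas's on-screen size, as below. This is a generic pattern under my own assumptions, not code from the project.

```js
// Illustrative aspect-ratio-preserving canvas scaling (not the project's code).
// The game logic keeps using the fixed 800x600 coordinate space; only the
// on-screen CSS size changes, so positions stay consistent across devices.
const canvas = document.getElementById("game"); // assumes a <canvas id="game"> element
canvas.width = 800;  // internal resolution used by the game logic
canvas.height = 600;

function fitCanvas() {
  const scale = Math.min(
    window.innerWidth / canvas.width,
    window.innerHeight / canvas.height
  );
  canvas.style.width = `${canvas.width * scale}px`;
  canvas.style.height = `${canvas.height * scale}px`;
}

window.addEventListener("resize", fitCanvas);
fitCanvas();
```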

3. Maintainability

  • Currently, the models cannot iterate reliably on even moderately complex applications, including single-page applications where the entire codebase sits in one file of around 8k tokens, far below the input context limits of Claude 3.5 Sonnet or any OpenAI model. They fail to apply changes consistently or tackle complex bugs, such as UI elements that are difficult to position correctly across devices. As a result, the generated code is hard to maintain and lacks the modular structure that would allow easy updates or feature additions.
  • Few frameworks ship documentation in a form that is easily consumable by LLMs. The only one I know of is FastHTML, whose llms-ctx.txt file provides documentation that Claude 3.5 Sonnet picked up very well in Cursor. Without such a file, a framework upgrade is effectively unusable for LLM-built websites: the only way to get the new documentation to the model is to wait for a future release trained on it, which is not practical. As a result, insufficient documentation made understanding and modifying the generated code a significant challenge.
  • Due to the models' unreliability and their tendency to introduce new bugs while addressing minor issues, maintaining a moderately complex application remains a significant challenge. The frequent regressions and inconsistent behaviour make it difficult to use in production.


Could LLMs Really Build an Entire Project?

The experiment of building an entire project using LLMs for web and game development was insightful but ultimately frustrating. While it showed that LLMs can generate functional and somewhat original game concepts and implement them into playable games without manual coding, the process was plagued by challenges. The responsive website and the 11 games highlight the potential of these AI tools, but significant limitations hinder their production readiness. Technical constraints such as context length, request limits, and high costs pose barriers. Moreover, the models struggle with complex game mechanics, scoring systems, and cross-platform compatibility, making the generated code difficult to maintain and iterate upon.

To make LLMs more viable for real-world use, several improvements are necessary. Increasing the output context length, relaxing request limits, and lowering costs are essential first steps. Beyond that, training the models more extensively on complex games and improving how they consume documentation and handle maintainability will be crucial.


Next Steps: Iterating on Current Games

I plan to continue iterating on the games to see if new models can improve them. This will serve as a test for the abilities of newer LLMs and whether they demonstrate meaningful advancements compared to the current models.

LinkedIn | GitHub

#GameDevelopment #AI #LLMTesting #AIWebDevelopment #AIProgramming
