Practical AI Coding Test: Creating a Basic But Useful Web App
Update 3-Feb-2025
The Deepseek-R1 671B model with 4-bit quantisation successfully one-shotted this prompt, putting it on par with Claude 3.5 v2 and Gemini 1206. I ran it on CPU (r7i.24xlarge), which took 20 minutes at about 4 tokens/second. With num_ctx set to 12288, the model requires about 600GB of memory, so it does not fit on eight Nvidia L40S GPUs (368GB VRAM); it needs eight A100 80GB or H100 GPUs to run on GPU.
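For anyone reproducing the CPU run, the larger context window can be set by deriving a custom Ollama model along these lines; the model tag and the derived model name are assumptions, so adjust them to whatever quant you actually pulled:

# Modelfile: raise the context window of the 4-bit R1 quant (tag is an assumption)
FROM deepseek-r1:671b
PARAMETER num_ctx 12288

# build and run the derived model
ollama create deepseek-r1-671b-12k -f Modelfile
ollama run deepseek-r1-671b-12k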
The Llama-3.3-70B Deepseek distill with temperature 0.7 also one-shots the challenge prompt, running on an AWS Inferentia2 instance (inf2.48xlarge) at 33 tokens/second. The higher precision on Inferentia2 appears to produce better results than the 4-bpw quant I tested previously.
Update 27-Jan-2025
The Deepseek-R1 LLM was released last week with open weights (MIT licence) and claimed GPT-4o-beating performance. Since then, quantised distillations (as well as the full model, quantised) have become available for Ollama and llama.cpp.
TL;DR: the 70B Llama-3 distillation can produce a (mostly) working, though ugly, implementation of the challenge prompt in this article, at a throughput of 5.5 tokens/second. Here's the generated code. I say "mostly" because the customisation modal doesn't work, and manually editing the exported JSON to change the child's name does not stick on import. However, the majority of the functionality works.
Introduction
It's been an exciting week in the SOTA LLM space, with the releases of Llama-3.3-70B, QwQ-32B-Preview, and Amazon Nova. Google has just released the experimental 1206 version of Gemini (6-Dec-2024), which is currently #1 on the LMArena leaderboard, while Athene-v2-Chat-72B and Llama-3.1-Nemotron-70B-Instruct are the highest-scoring open-weight LLMs.
These models represent significant advances in code generation capabilities, with each claiming substantial improvements over older models in areas like reasoning, instruction following, and code synthesis.
This very recent article, LLM Comparison/Test: 25 SOTA LLMs (including QwQ) through 59 MMLU-Pro CS benchmark runs, claims that QwQ-32B-Preview tops most other SOTA models (including the cloud-only models) on the author's specific benchmarks. My experience was rather... different.
While benchmarks like HumanEval and MMLU provide standardised ways to compare these models, real-world applications often present more nuanced challenges. Code generation tasks in particular require models to simultaneously handle natural language understanding, technical specification comprehension, and the production of functional code.
This article presents a practical test of these models' capabilities through a specific challenge: generating a complete, functional web application from a detailed prompt. The task is particularly interesting because it requires the model to handle specification comprehension, UI layout, state management, and data persistence simultaneously, all within a single HTML file.
The test is probably unrealistic, in that most developers would use smaller prompts iteratively to generate and improve small sections of code rather than producing an entire application (albeit a small one) in one go.
Background and Motivation
The Python code generation article I wrote in late September 2024, and have been regularly updating, is no longer quite so interesting: Claude 3.5 is no longer the only model that can zero-shot the most complex prompt in that article (Amazon Nova Pro and Nemotron both generated functional code; I have not tested QwQ-32B-Preview or Llama-3.3-70B).
Time for a slightly more rigorous test.
Over the past week or so, I have been trying to find an app that can be used to track behaviour (a "brownie point ledger," as it were), incentivise good behaviour, and withhold incentives for unacceptable behaviour. I could not find anything I liked, and I didn't want to set up infrastructure just to keep track of it. So I decided to write a prompt that would generate such an app for me.
The key requirements were:
- browser-only, with no backend or infrastructure to maintain: a single HTML file
- log merit, demerit, and redemption points against a running total
- persistence across sessions via localStorage, plus JSON export and import
- usable on a phone without horizontal scrolling
Test Methodology
My testing framework evaluated the models' ability to generate a complete, functional web application in a single zero-shot attempt. Here are the details of the testing process:
Test Environment
Model Configurations
Evaluation Criteria
The generated code was evaluated across these key dimensions:
1. Initial Rendering
2. Core Functionality
3. UI Components
Testing Process
1. Copied the model's complete output into a new HTML file
2. Opened the file in Chrome on macOS (or uploaded the HTML to S3 and downloaded it into Chrome on iOS)
3. Checked the browser console for any immediate errors on macOS
4. Tested each feature against a standard checklist
Success Criteria
For a model's output to be considered "fully functional", it needed to:
1. Render without JavaScript errors
2. Implement all specified features
3. Maintain data persistence across page reloads
4. Handle import/export without data corruption
5. Display correctly on both desktop and mobile
Notes on Methodology Limitations
Disclaimers and Limitations
Development Process
It took me several hours of prompting to get everything right with my model of choice. Once I got the desired results, I manually consolidated all my prompts and follow-up chat messages into a single, extremely detailed prompt. I then started a new chat session, injected the single large prompt, and validated the generated code.
If I had done the same iterations with the other models, it's possible or perhaps even likely that I would have gotten functional code, but the prompt is quite straightforward and doesn't do anything model-specific.
The Challenge Prompt
Write a browser-only web application that allows to log merit, demerit, and redemption points for good behaviour. The generated artifact must be a single HTML file. use tailwind and lucide for css. do not use lucide for the gear, use a gear icon or emoji. use a nice rounded sans serif font like Inter for the user interface. do not use the tailwind development CDN, use cloudflare to fetch the scripts.
Ensure that UI elements fit onscreen and do not require left-right scrolling on a phone.
Add no-cache directives to the top of the HTML.
"Total Points: X" is shown with the running total of points on its own line at the very top right of the form.
On the next line is the title "Child's Points Ledger" left-justified in larger text.
Display a gear icon to the left of the title for customisation. Clicking the gear will open a form to customise the child's name, and a checkbox with "Clear data" to clear the entire table. When the child's name is customised, the title should update.
The next line has these elements:
- "Date" label (with a calendar dropdown, defaulting to the current date); Dates should be in dd/mm/yy format.
; on a phone, selecting a date should accept it, without needing to press "Done"
- "Point Type" (merit, demerit, or redemption) with dropdown
- a numeric entry field with the points. This field must switch the browser to digits only entry mode.
On the next line is "Description:" label and a text field with the description.
Demerits and redemptions should subtract from the total score. Only merits add to the total.
On the next line are buttons for Log Entry, Export, and Import.
Next is a table showing all the accrued points, most recent first. Demerit entries have a red background and white text, redemptions green background and black text, and merits black text on white background. There is a single pixel black line between cell rows, and the table is scrollable and displays 15 rows at a time. The table text is a smaller font than the rest of the form.
When Log Entry is pressed, the entered data (if all present) are saved into localStorage. The newest entry is added to top of the table. The total points is recalculated and display updated. On first load, all data must be loaded from localStorage and the total points calculated. Make sure that the table is sorted by date descending on load.
On Export, the child's configured name and all data are exported to a JSON file. All fields must properly escape special characters.
On import, load the JSON file, update the child's name, populate the table, and recalculate points. Make sure that the table is sorted by date descending on load.
Here is a sample JSON document for export or import:
<sample_json>
{
  "version": "1.0",
  "settings": {
    "childName": "Emily",
    "lastModified": "2024-12-06T10:30:00Z"
  },
  "entries": [
    {
      "id": "2024120601",
      "date": "2024-12-06",
      "type": "merit",
      "points": 5,
      "description": "for helping set and clear the table"
    },
    {
      "id": "2024120602",
      "type": "redemption",
      "date": "2024-12-06",
      "points": 10,
      "description": "Toca Boca World furniture pack"
    }
  ]
}
</sample_json>
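To make the specification concrete, here is a minimal hand-written sketch (not any model's output) of the core mechanics the prompt asks for: no-cache directives, a digits-only points field, localStorage persistence, signed totals, dd/mm/yy display dates, and JSON export/import. The storage key, element id, and function names are illustrative assumptions.

<!-- Minimal sketch of the mechanics the prompt asks for; not any model's output.
     The storage key, element id, and function names are illustrative assumptions. -->
<meta http-equiv="Cache-Control" content="no-cache, no-store, must-revalidate">
<meta http-equiv="Pragma" content="no-cache">
<meta http-equiv="Expires" content="0">

<!-- Numeric points field: inputmode switches phones to a digits-only keypad -->
<input id="points" type="number" inputmode="numeric" min="0">

<script>
const STORAGE_KEY = "pointsLedger";   // assumed key name

// Load the saved state on first load; fall back to an empty ledger.
function loadState() {
  const raw = localStorage.getItem(STORAGE_KEY);
  return raw ? JSON.parse(raw)
             : { version: "1.0", settings: { childName: "Child" }, entries: [] };
}

// Persist the whole state and stamp the modification time.
function saveState(state) {
  state.settings.lastModified = new Date().toISOString();
  localStorage.setItem(STORAGE_KEY, JSON.stringify(state));
}

// Only merits add to the total; demerits and redemptions subtract.
function totalPoints(entries) {
  return entries.reduce((sum, e) => sum + (e.type === "merit" ? e.points : -e.points), 0);
}

// Newest first; ISO yyyy-mm-dd strings sort correctly as plain strings.
function sortedEntries(entries) {
  return [...entries].sort((a, b) => b.date.localeCompare(a.date));
}

// Display dates as dd/mm/yy.
function formatDate(isoDate) {
  const [y, m, d] = isoDate.split("-");
  return `${d}/${m}/${y.slice(2)}`;
}

// Export: JSON.stringify escapes special characters in every string field.
function exportJson(state) {
  const blob = new Blob([JSON.stringify(state, null, 2)], { type: "application/json" });
  const a = document.createElement("a");
  a.href = URL.createObjectURL(blob);
  a.download = `${state.settings.childName}-points.json`;
  a.click();
  URL.revokeObjectURL(a.href);
}

// Import: read the JSON file, replace the stored state, then re-render via callback.
function importJson(file, render) {
  const reader = new FileReader();
  reader.onload = () => { const state = JSON.parse(reader.result); saveState(state); render(state); };
  reader.readAsText(file);
}
</script>

A real submission still has to wire these pieces into the Tailwind/Lucide UI, the gear modal, and the scrollable table described above.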
Models Evaluated
Results and Analysis
Both Claude-3.5 Sonnet v2 and Google Gemini Experimental 1206 produced functional code with the above prompt.
Interestingly (and corroborating the LMArena results), Athene was able to produce working code, although the layout was incorrect and the settings popup was not a popup at all. However, the basic functionality (logging of points, import and export) was there.
The other LLMs produced HTML that either didn't render properly, didn't work at all (e.g. JavaScript errors in the browser developer console), or had only partial functionality (some actions worked). Nemotron produced partially working code, better than the other open-weight models. You can try all of the versions at the links above.
Conclusions
While this test represents a narrow use case, it provides an insight into the current state of LLM capabilities in code generation tasks. The results suggest that while we're making progress in automated code generation, we're still in a phase where human expertise and iterative development remain crucial for successful outcomes. I find that writing an effective prompt for code generation is very much still programming, except with natural language.
Here are the key findings and their implications:
Model Performance Gap
The still-significant gap between the successful generations from Claude 3.5 Sonnet v2 and Google Gemini and the failures of the other models, including those with comparable parameter counts, suggests that scale alone (the "scaling law" behind the industry trend toward ever-larger models) is not the sole differentiator for specific use cases.
Important: I'm too cheap to pay for OpenAI or Microsoft frontier model access. I would be grateful if one of my dear readers would feed the prompt into those models and see what they get.
Zero-Shot Complexity Threshold
This test reveals an interesting threshold in zero-shot code generation capabilities. While most tested models can handle simpler coding tasks (as evidenced by my previous Python code generation tests), the complexity of an entire web application with state management, UI interactions, and data persistence appears to be beyond the current capabilities of most models without iterative refinement.
Development Process Implications
The fact that it took several hours of iterative prompting to arrive at a working solution with Claude suggests that even the most capable models still benefit from human guidance and refinement. This indicates that LLMs are currently best viewed as collaborative tools rather than autonomous developers, even for seemingly straightforward applications.
Engineering Trade-offs
The test highlights an interesting trade-off in prompt engineering: while a single comprehensive prompt might be ideal for reproducibility and testing, it may not be the most effective approach for actually developing applications. The iterative approach that eventually led to the working solution suggests that breaking down complex requirements into smaller, manageable chunks might be more practical for real-world development.