Practical AI Coding Test: Creating a Basic But Useful Web App

Update 3-Feb-2025

The Deepseek-R1 671B model with 4-bit quantisation successfully one-shotted this prompt, putting it on par with Claude 3.5v2 and Gemini 1206. I ran it on CPU (r7i.24xlarge), which took 20 minutes at about 4 tokens/second. With num_ctx set to 12288, the model requires about 600GB of memory, so it does not fit on eight Nvidia L40S (384GB VRAM); running it on GPU requires eight A100 80GB or H100 cards.
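
For reference, num_ctx is just a per-request option in Ollama. The snippet below is a minimal sketch of setting it through Ollama's local HTTP API from JavaScript; the model tag is a placeholder for whichever quantised build you pulled, and it assumes an async context (ES module or async function).

// Sketch: passing num_ctx as an Ollama request option (model tag is a placeholder).
const response = await fetch('http://localhost:11434/api/generate', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'deepseek-r1:671b',   // placeholder; substitute the quantised build you actually pulled
    prompt: 'Write a browser-only web application that ...',
    stream: false,
    options: { num_ctx: 12288 }  // the context window discussed above
  })
});
const result = await response.json();
console.log(result.response);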

The Llama-3.3-70B Deepseek distill with temperature 0.7 on an AWS Inferentia inf2.48xlarge also successfully one-shots the challenge prompt, generating at 33 tokens/second. It appears that the higher precision on Inferentia2 does result in better outcomes than the 4-bpw quant that I tested previously.


Update 27-Jan-2025

The Deepseek-R1 open-weights (MIT licence) LLM was released last week, with claimed GPT-4o-beating performance. Since then, quantised distillations (as well as the full, quantised model) have become available for Ollama and llama.cpp.

TL;DR - the 70B Llama-3 distillation can produce a (mostly) working, though ugly, implementation of the challenge prompt in this article, at a throughput of 5.5 tokens/second. Here's the generated code. I say "mostly" because the customisation modal doesn't work, and a child's name edited manually in the exported JSON does not persist on import. However, the majority of the functionality works.


Introduction

It's been an exciting week in the SOTA LLM space with the release of Llama-3.3-70B, QwQ-32B-Preview, and Amazon Nova. Google has just released the experimental 1206 release of Gemini (6-Dec-2024) which is currently #1 on the LMArena leaderboard, while Athene-v2-Chat-72B and Llama-3.1-Nemotron-70B-Instruct are the highest-scoring open weight LLMs.

These models represent significant advances in code generation capabilities, with each claiming substantial improvements over older models in areas like reasoning, instruction following, and code synthesis.

This very recent article LLM Comparison/Test: 25 SOTA LLMs (including QwQ) through 59 MMLU-Pro CS benchmark runs claims that QwQ-32B-Preview tops most other SOTA models (including the cloud-only models) on the author's specific benchmarks. My experience was rather... different.

While benchmarks like HumanEval and MMLU provide standardised ways to compare these models, real-world applications often present more nuanced challenges. Code generation tasks in particular require models to simultaneously handle natural language understanding, technical specification comprehension, and the production of functional code.

This article presents a practical test of these models' capabilities through a specific challenge: generating a complete, functional web application from a detailed prompt. The task is particularly interesting because it requires the model to:

  • Generate properly structured HTML, CSS, and JavaScript
  • Implement complex UI interactions
  • Handle data persistence and state management
  • Process user input and validate data
  • Manage file imports and exports
  • Create a responsive, mobile-friendly interface

The test is probably unrealistic because most developers would use smaller prompts iteratively to generate and improve small sections of code, not an entire application (albeit a small one).

Background and Motivation

The Python code generation article I wrote in late September 2024 and have been regularly updating is no longer quite so interesting - Claude 3.5 is no longer the only model that can zero-shot the most complex prompt in that article (Amazon Nova Pro and Nemotron both successfully generated functional code; I have not tested QwQ-32B-Preview or Llama-3.3-70B).

Time for a slightly more rigorous test.

Over the past week or so, I have been trying to find an app that can be used to track behaviour (a "brownie point ledger," as it were), incentivise good behaviour, and withhold incentives for unacceptable behaviour. I could not find anything I liked, and I didn't want to set up infrastructure just to keep track of it. So I decided to write a prompt that would generate such an app for me.

The key requirements were:

  • Must be static HTML
  • Must require no server infrastructure beyond an S3 bucket to hold the static HTML
  • Must have local persistence (no remote database; see the sketch after this list)
  • Must provide a way to import and export records
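
The local-persistence requirement maps directly onto the browser's localStorage API. The following is only a minimal illustration of that idea; the storage key and function names are my own placeholders, not taken from any generated app.

// Minimal sketch of browser-only persistence with localStorage.
// The storage key and entry shape here are illustrative placeholders.
const STORAGE_KEY = 'pointsLedger';

function loadEntries() {
  // Returns [] on first load, before anything has been saved.
  const raw = localStorage.getItem(STORAGE_KEY);
  return raw ? JSON.parse(raw) : [];
}

function saveEntries(entries) {
  // localStorage stores strings only, so serialise to JSON.
  localStorage.setItem(STORAGE_KEY, JSON.stringify(entries));
}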


Test Methodology

My testing framework evaluated the models' ability to generate a complete, functional web application in a single zero-shot attempt. Here are the details of the testing process:

Test Environment

  • Browser: Google Chrome Version 131.0.6778.109 (Official Build) (x86_64) on macOS Sonoma 14.7; Google Chrome 131.0.6778.73 (64-bit) on iOS 18.1.1
  • Date of Testing: 2024-12-06
  • Device: MacBook Pro i9; iPhone 14 Pro

Model Configurations

  • Claude 3.5 Sonnet v2 20241022: via Anthropic web interface, default parameters
  • Google Gemini Experimental 1206: temperature 0, topP 0.95, maximum output 8192
  • Amazon Nova Pro 1.0: via Bedrock playground, temperature 0, topP 0.9, maximum output 5120
  • ChatGPT-4o (free tier): via web interface, default parameters
  • Llama-3.3-70B: via Ollama, 4-bit quantised, 2x Nvidia P40 (hardware described in this article), otherwise default model parameters
  • QwQ-32B-Preview: via Ollama, 4-bit quantised, same hardware as Llama-3.3, otherwise default model parameters
  • Llama-3.1-Nemotron-70B-Instruct: via Ollama, 4-bit quantised, same hardware as Llama-3.3, otherwise default model parameters
  • Athene-V2-Chat-72B: via Ollama, 4-bit quantised, same hardware as Llama-3.3, otherwise default model parameters; this was particularly slow as only 78 of the 81 layers loaded onto the GPU, resulting in 1-2 tokens/second


Evaluation Criteria

The generated code was evaluated across these key dimensions:

1. Initial Rendering

  • Does the HTML file load without errors?
  • Are all UI elements visible and properly positioned?
  • Is the mobile layout functional?


2. Core Functionality

  • Data entry and validation
  • Point calculation
  • Local storage persistence
  • Import/export functionality


3. UI Components

  • Date picker behaviour
  • Settings modal
  • Table display and scrolling
  • Color coding of entries


Testing Process

1. Copied the model's complete output into a new HTML file

2. Opened the file in Chrome on macOS (or uploaded the HTML to S3 and loaded it in Chrome on iOS)

3. Checked the browser console for any immediate errors on macOS

4. Tested each feature according to a standard checklist:

  • Check if it renders properly (only one model produced code that rendered properly)
  • Enter new merit point
  • Enter new demerit point
  • Enter new redemption
  • Verify point calculations
  • Export data
  • Import test data
  • Clear all data
  • Change child's name

Success Criteria

For a model's output to be considered "fully functional", it needed to:

1. Render without JavaScript errors

2. Implement all specified features

3. Maintain data persistence across page reloads

4. Handle import/export without data corruption

5. Display correctly on both desktop and mobile


Notes on Methodology Limitations

  • Testing was limited to a single attempt per model
  • Edge cases (non-Latin characters, large datasets, etc.) were not tested
  • Performance optimisation was not considered
  • Security considerations were not evaluated



Disclaimers and Limitations

  • I am aware that this benchmark reports broadly different results from what I describe here
  • Meta has claimed that Llama-3.3 outperforms Amazon Nova Pro and several other SOTA models on various benchmarks
  • These results are for a sample size of exactly 1, with my very specific prompt; take my conclusions as the personal experience of one AI experimenter
  • The opinions expressed in this article are my own, and do not reflect those of my employer, nor are they endorsed or approved by my employer


Development Process

It took me several hours of prompting to get everything right with my model of choice. Once I got the desired results, I manually consolidated all my prompts and follow-up chat messages into a single, extremely detailed prompt. I then started a new chat session, injected the single large prompt, and validated the generated code.

If I had done the same iterations with the other models, it's possible or perhaps even likely that I would have gotten functional code, but the prompt is quite straightforward and doesn't do anything model-specific.


The Challenge Prompt

Write a browser-only web application that allows to log merit, demerit, and redemption points for good behaviour. The generated artifact must be a single HTML file. use tailwind and lucide for css. do not use lucide for the gear, use a gear icon or emoji. use a nice rounded sans serif font like Inter for the user interface. do not use the tailwind development CDN, use cloudflare to fetch the scripts.

Ensure that UI elements fit onscreen and do not require left-right scrolling on a phone.

Add no-cache directives to the top of the HTML.

"Total Points: X" is shown with the running total of points on its own line at the very top right of the form.

On the next line is the title "Child's Points Ledger" left-justified in larger text.

Display a gear icon to the left of the title for customisation. Clicking the gear will open a form to customise the child's name, and a checkbox with "Clear data" to clear the entire table. When the child's name is customised, the title should update.

The next line has these elements:
- "Date" label (with a calendar dropdown, defaulting to the current date); Dates should be in dd/mm/yy format.
; on a phone, selecting a date should accept it, without needing to press "Done"
- "Point Type" (merit, demerit, or redemption) with dropdown
- a numeric entry field with the points. This field must switch the browser to digits only entry mode.

On the next line is "Description:" label and a text field with the description. 

Demerits and redemptions should subtract from the total score. Only merits add to the total.

On the next line are buttons for Log Entry, Export, and Import.

Next is a table showing all the accrued points, most recent first. Demerit entries have a red background and white text, redemptions green background and black text, and merits black text on white background. There is a single pixel black line between cell rows, and the table is scrollable and displays 15 rows at a time. The table text is a smaller font than the rest of the form.

When Log Entry is pressed, the entered data (if all present) are saved into localStorage. The newest entry is added to top of the table. The total points is recalculated and display updated. On first load, all data must be loaded from localStorage and the total points calculated. Make sure that the table is sorted by date descending on load.

On Export, the child's configured name and all data are exported to a JSON file. All fields must properly escape special characters.

On import, load the JSON file, update the child's name, populate the table, and recalculate points. Make sure that the table is sorted by date descending on load.

Here is a sample JSON document for export or import:

<sample_json>
{
  "version": "1.0",
  "settings": {
    "childName": "Emily",
    "lastModified": "2024-12-06T10:30:00Z"
  },
  "entries": [
    {
      "id": "2024120601",
      "date": "2024-12-06",
      "type": "merit",
      "points": 5,
      "description": "for helping set and clear the table"
    },
    {
      "id": "2024120602",
      "type": "redemption",
      "date": "2024-12-06",
      "points": 10,
      "description": "Toca Boca World furniture pack"
    }
  ]
}
</sample_json>        
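
To make the scoring and export requirements concrete, here is a rough sketch of how a total and an export payload can be derived from entries shaped like the sample JSON above. It is an illustration only, with my own function names and file name; it is not the code any of the tested models produced.

// Sketch only: total calculation and JSON export for entries shaped like the sample above.
// Function names and the download file name are my own, not taken from any model's output.
function calculateTotal(entries) {
  // Only merits add to the total; demerits and redemptions subtract.
  return entries.reduce((total, entry) =>
    entry.type === 'merit' ? total + entry.points : total - entry.points, 0);
}

function exportLedger(childName, entries) {
  const payload = {
    version: '1.0',
    settings: { childName, lastModified: new Date().toISOString() },
    // Sorted by date descending, matching the on-load ordering in the prompt.
    entries: [...entries].sort((a, b) => b.date.localeCompare(a.date))
  };
  // JSON.stringify escapes special characters in every string field.
  const blob = new Blob([JSON.stringify(payload, null, 2)], { type: 'application/json' });
  const link = document.createElement('a');
  link.href = URL.createObjectURL(blob);
  link.download = 'points-ledger.json';
  link.click();
  URL.revokeObjectURL(link.href);
}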

Models Evaluated

  • Claude 3.5 Sonnet v2, using my Anthropic subscription (output)
  • Google Gemini Experimental 1206 (output)
  • Amazon Nova Pro 1.0, using the Amazon Bedrock chat playground (output)
  • Llama-3.1-Nemotron-70B-Instruct on Ollama, 4-bit quant (output)
  • ChatGPT-4o, using the free tier (hence: not as capable as their paid models, but I didn't want to restart my ChatGPT subscription) (output)
  • Llama-3.3-70B on Ollama, 4-bit quant (output)
  • QwQ-32B-Preview on Ollama, 4-bit quant (output)
  • Athene-v2-Chat-72B on Ollama, 4-bit quant (output)


Results and Analysis

Both Claude-3.5 Sonnet v2 and Google Gemini Experimental 1206 produced functional code with the above prompt.

Interestingly (and corroborating the LMArena results), Athene was able to produce working code, although the layout was incorrect and the settings popup was not a popup at all. However, the basic functionality (logging of points, import, and export) was there.

The other LLMs produced HTML that either didn't render properly, didn't work at all (e.g. JavaScript errors in the browser developer console), or had only partial functionality (some actions worked). Nemotron produced partially working code, better than the other open-weight models. You can try all of the versions at the links above.

Conclusions

While this test represents a narrow use case, it provides an insight into the current state of LLM capabilities in code generation tasks. The results suggest that while we're making progress in automated code generation, we're still in a phase where human expertise and iterative development remain crucial for successful outcomes. I find that writing an effective prompt for code generation is very much still programming, except with natural language.

Here are the key findings and their implications:


Model Performance Gap

The still significant gap between the successful generations from Claude 3.5 Sonnet v2 and Google Gemini and the lack of success of the other models, including those with comparable parameter counts, suggests that raw scale, the focus of the "scaling law" and the industry trend toward ever-larger models, is not the sole differentiator for specific use cases.

Important: I'm too cheap to pay for OpenAI or Microsoft frontier model access. I would be grateful if one of my dear readers would feed the prompt into those models and see what they get.


Zero-Shot Complexity Threshold

This test reveals an interesting threshold in zero-shot code generation capabilities. While most tested models can handle simpler coding tasks (as evidenced by my previous Python code generation tests), the complexity of an entire web application with state management, UI interactions, and data persistence appears to be beyond the current capabilities of most models without iterative refinement.


Development Process Implications

The fact that it took several hours of iterative prompting to arrive at a working solution with Claude suggests that even the most capable models still benefit from human guidance and refinement. This indicates that LLMs are currently best viewed as collaborative tools rather than autonomous developers, even for seemingly straightforward applications.


Engineering Trade-offs

The test highlights an interesting trade-off in prompt engineering: while a single comprehensive prompt might be ideal for reproducibility and testing, it may not be the most effective approach for actually developing applications. The iterative approach that eventually led to the working solution suggests that breaking down complex requirements into smaller, manageable chunks might be more practical for real-world development.
