Practical AI Coding Test: Creating a Basic But Useful Web App
Update 3-Feb-2025
The Deepseek-R1 671B model with 4-bit quantisation successfully one-shotted this prompt, putting it on par with Claude 3.5 v2 and Gemini 1206. I ran it on CPU (r7i.24xlarge), which took 20 minutes at about 4 tokens/second. With num_ctx set to 12288, the model requires about 600GB of memory, so it does not fit on eight Nvidia L40S GPUs (368GB VRAM); it needs eight A100 80GB or H100 GPUs to run on GPU.
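For anyone reproducing the CPU run, the larger context window can be set by deriving a custom Ollama model along these lines; the model tag and the derived model name are assumptions, so adjust them to whatever quant you actually pulled:

# Modelfile: raise the context window of the 4-bit R1 quant (tag is an assumption)
FROM deepseek-r1:671b
PARAMETER num_ctx 12288

# build and run the derived model
ollama create deepseek-r1-671b-12k -f Modelfile
ollama run deepseek-r1-671b-12k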
The Llama-3.3-70B Deepseek distill with temperature 0.7 also one-shots the challenge prompt, running on an AWS Inferentia2 instance (inf2.48xlarge) at 33 tokens/second. The higher precision on Inferentia2 appears to produce better results than the 4-bpw quant I tested previously.
Update 27-Jan-2025
The Deepseek-R1 LLM was released last week with open weights (MIT licence) and claimed GPT-4o-beating performance. Since then, quantised distillations (as well as the full model, quantised) have become available for Ollama and llama.cpp.
TL;DR: the 70B Llama-3 distillation can produce a (mostly) working, though ugly, implementation of the challenge prompt in this article, at a throughput of 5.5 tokens/second. Here's the generated code. I say "mostly" because the customisation modal doesn't work, and manually editing the exported JSON to change the child's name does not stick on import. However, the majority of the functionality works.
Introduction
It's been an exciting week in the SOTA LLM space, with the releases of Llama-3.3-70B, QwQ-32B-Preview, and Amazon Nova. Google has just released the experimental 1206 version of Gemini (6-Dec-2024), which is currently #1 on the LMArena leaderboard, while Athene-v2-Chat-72B and Llama-3.1-Nemotron-70B-Instruct are the highest-scoring open-weight LLMs.
These models represent significant advances in code generation capabilities, with each claiming substantial improvements over older models in areas like reasoning, instruction following, and code synthesis.
This very recent article, LLM Comparison/Test: 25 SOTA LLMs (including QwQ) through 59 MMLU-Pro CS benchmark runs, claims that QwQ-32B-Preview tops most other SOTA models (including the cloud-only models) on the author's specific benchmarks. My experience was rather... different.
While benchmarks like HumanEval and MMLU provide standardised ways to compare these models, real-world applications often present more nuanced challenges. Code generation tasks in particular require models to simultaneously handle natural language understanding, technical specification comprehension, and the production of functional code.
This article presents a practical test of these models' capabilities through a specific challenge: generating a complete, functional web application from a detailed prompt. The task is particularly interesting because it requires the model to handle specification comprehension, UI layout, state management, and data persistence simultaneously, all within a single HTML file.
The test is probably unrealistic, in that most developers would use smaller prompts iteratively to generate and improve small sections of code rather than producing an entire application (albeit a small one) in one go.
Background and Motivation
The Python code generation article I wrote in late September 2024, and have been regularly updating, is no longer quite so interesting: Claude 3.5 is no longer the only model that can zero-shot the most complex prompt in that article (Amazon Nova Pro and Nemotron both generated functional code; I have not tested QwQ-32B-Preview or Llama-3.3-70B).
Time for a slightly more rigorous test.
Over the past week or so, I have been trying to find an app that can be used to track behaviour (a "brownie point ledger," as it were), incentivise good behaviour, and withhold incentives for unacceptable behaviour. I could not find anything I liked, and I didn't want to set up infrastructure just to keep track of it. So I decided to write a prompt that would generate such an app for me.
The key requirements were:
- browser-only, with no backend or infrastructure to maintain: a single HTML file
- log merit, demerit, and redemption points against a running total
- persistence across sessions via localStorage, plus JSON export and import
- usable on a phone without horizontal scrolling
Test Methodology
My testing framework evaluated the models' ability to generate a complete, functional web application in a single zero-shot attempt. Here are the details of the testing process:
Test Environment
Model Configurations
Evaluation Criteria
The generated code was evaluated across these key dimensions:
1. Initial Rendering
2. Core Functionality
3. UI Components
Testing Process
1. Copied the model's complete output into a new HTML file
2. Opened the file in Chrome on macOS (or uploaded the HTML to S3 and downloaded it into Chrome on iOS)
3. Checked the browser console for any immediate errors on macOS
4. Tested each feature against a standard checklist
Success Criteria
For a model's output to be considered "fully functional", it needed to:
1. Render without JavaScript errors
2. Implement all specified features
3. Maintain data persistence across page reloads
4. Handle import/export without data corruption
5. Display correctly on both desktop and mobile
Notes on Methodology Limitations
Disclaimers and Limitations
Development Process
It took me several hours of prompting to get everything right with my model of choice. Once I got the desired results, I manually consolidated all my prompts and follow-up chat messages into a single, extremely detailed prompt. I then started a new chat session, injected the single large prompt, and validated the generated code.
If I had done the same iterations with the other models, it's possible or perhaps even likely that I would have gotten functional code, but the prompt is quite straightforward and doesn't do anything model-specific.
The Challenge Prompt
Write a browser-only web application that allows to log merit, demerit, and redemption points for good behaviour. The generated artifact must be a single HTML file. use tailwind and lucide for css. do not use lucide for the gear, use a gear icon or emoji. use a nice rounded sans serif font like Inter for the user interface. do not use the tailwind development CDN, use cloudflare to fetch the scripts.
Ensure that UI elements fit onscreen and do not require left-right scrolling on a phone.
Add no-cache directives to the top of the HTML.
"Total Points: X" is shown with the running total of points on its own line at the very top right of the form.
On the next line is the title "Child's Points Ledger" left-justified in larger text.
Display a gear icon to the left of the title for customisation. Clicking the gear will open a form to customise the child's name, and a checkbox with "Clear data" to clear the entire table. When the child's name is customised, the title should update.
The next line has these elements:
- "Date" label (with a calendar dropdown, defaulting to the current date); Dates should be in dd/mm/yy format.
; on a phone, selecting a date should accept it, without needing to press "Done"
- "Point Type" (merit, demerit, or redemption) with dropdown
- a numeric entry field with the points. This field must switch the browser to digits only entry mode.
On the next line is "Description:" label and a text field with the description.
Demerits and redemptions should subtract from the total score. Only merits add to the total.
On the next line are buttons for Log Entry, Export, and Import.
Next is a table showing all the accrued points, most recent first. Demerit entries have a red background and white text, redemptions green background and black text, and merits black text on white background. There is a single pixel black line between cell rows, and the table is scrollable and displays 15 rows at a time. The table text is a smaller font than the rest of the form.
When Log Entry is pressed, the entered data (if all present) are saved into localStorage. The newest entry is added to top of the table. The total points is recalculated and display updated. On first load, all data must be loaded from localStorage and the total points calculated. Make sure that the table is sorted by date descending on load.
On Export, the child's configured name and all data are exported to a JSON file. All fields must properly escape special characters.
On import, load the JSON file, update the child's name, populate the table, and recalculate points. Make sure that the table is sorted by date descending on load.
Here is a sample JSON document for export or import:
<sample_json>
{
  "version": "1.0",
  "settings": {
    "childName": "Emily",
    "lastModified": "2024-12-06T10:30:00Z"
  },
  "entries": [
    {
      "id": "2024120601",
      "date": "2024-12-06",
      "type": "merit",
      "points": 5,
      "description": "for helping set and clear the table"
    },
    {
      "id": "2024120602",
      "type": "redemption",
      "date": "2024-12-06",
      "points": 10,
      "description": "Toca Boca World furniture pack"
    }
  ]
}
</sample_json>
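To make the specification concrete, here is a minimal hand-written sketch (not any model's output) of the core mechanics the prompt asks for: no-cache directives, a digits-only points field, localStorage persistence, signed totals, dd/mm/yy display dates, and JSON export/import. The storage key, element id, and function names are illustrative assumptions.

<!-- Minimal sketch of the mechanics the prompt asks for; not any model's output.
     The storage key, element id, and function names are illustrative assumptions. -->
<meta http-equiv="Cache-Control" content="no-cache, no-store, must-revalidate">
<meta http-equiv="Pragma" content="no-cache">
<meta http-equiv="Expires" content="0">

<!-- Numeric points field: inputmode switches phones to a digits-only keypad -->
<input id="points" type="number" inputmode="numeric" min="0">

<script>
const STORAGE_KEY = "pointsLedger";   // assumed key name

// Load the saved state on first load; fall back to an empty ledger.
function loadState() {
  const raw = localStorage.getItem(STORAGE_KEY);
  return raw ? JSON.parse(raw)
             : { version: "1.0", settings: { childName: "Child" }, entries: [] };
}

// Persist the whole state and stamp the modification time.
function saveState(state) {
  state.settings.lastModified = new Date().toISOString();
  localStorage.setItem(STORAGE_KEY, JSON.stringify(state));
}

// Only merits add to the total; demerits and redemptions subtract.
function totalPoints(entries) {
  return entries.reduce((sum, e) => sum + (e.type === "merit" ? e.points : -e.points), 0);
}

// Newest first; ISO yyyy-mm-dd strings sort correctly as plain strings.
function sortedEntries(entries) {
  return [...entries].sort((a, b) => b.date.localeCompare(a.date));
}

// Display dates as dd/mm/yy.
function formatDate(isoDate) {
  const [y, m, d] = isoDate.split("-");
  return `${d}/${m}/${y.slice(2)}`;
}

// Export: JSON.stringify escapes special characters in every string field.
function exportJson(state) {
  const blob = new Blob([JSON.stringify(state, null, 2)], { type: "application/json" });
  const a = document.createElement("a");
  a.href = URL.createObjectURL(blob);
  a.download = `${state.settings.childName}-points.json`;
  a.click();
  URL.revokeObjectURL(a.href);
}

// Import: read the JSON file, replace the stored state, then re-render via callback.
function importJson(file, render) {
  const reader = new FileReader();
  reader.onload = () => { const state = JSON.parse(reader.result); saveState(state); render(state); };
  reader.readAsText(file);
}
</script>

A real submission still has to wire these pieces into the Tailwind/Lucide UI, the gear modal, and the scrollable table described above.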
Models Evaluated
Results and Analysis
Both Claude-3.5 Sonnet v2 and Google Gemini Experimental 1206 produced functional code with the above prompt.
Interestingly (and corroborating the LMArena results), Athene was able to produce working code, although the layout was incorrect and the settings popup was not a popup at all. However, the basic functionality (logging of points, import and export) was there.
The other LLMs produced HTML that either didn't render properly, didn't work at all (e.g. JavaScript errors in the browser developer console), or had only partial functionality (some actions worked). Nemotron produced partially working code, better than the other open-weight models. You can try all of the versions at the links above.
Conclusions
While this test represents a narrow use case, it provides an insight into the current state of LLM capabilities in code generation tasks. The results suggest that while we're making progress in automated code generation, we're still in a phase where human expertise and iterative development remain crucial for successful outcomes. I find that writing an effective prompt for code generation is very much still programming, except with natural language.
Here are the key findings and their implications:
Model Performance Gap
The still-significant gap between the successful generations from Claude 3.5 Sonnet v2 and Google Gemini and the failures of the other models, including those with comparable parameter counts, suggests that scale alone (the "scaling law" behind the industry trend toward ever-larger models) is not the sole differentiator for specific use cases.
Important: I'm too cheap to pay for OpenAI or Microsoft frontier model access. I would be grateful if one of my dear readers would feed the prompt into those models and see what they get.
Zero-Shot Complexity Threshold
This test reveals an interesting threshold in zero-shot code generation capabilities. While most tested models can handle simpler coding tasks (as evidenced by my previous Python code generation tests), the complexity of an entire web application with state management, UI interactions, and data persistence appears to be beyond the current capabilities of most models without iterative refinement.
Development Process Implications
The fact that it took several hours of iterative prompting to arrive at a working solution with Claude suggests that even the most capable models still benefit from human guidance and refinement. This indicates that LLMs are currently best viewed as collaborative tools rather than autonomous developers, even for seemingly straightforward applications.
Engineering Trade-offs
The test highlights an interesting trade-off in prompt engineering: while a single comprehensive prompt might be ideal for reproducibility and testing, it may not be the most effective approach for actually developing applications. The iterative approach that eventually led to the working solution suggests that breaking down complex requirements into smaller, manageable chunks might be more practical for real-world development.