Geek Out Time: AI in the Browser - Run WebLLM for Powerful, Local LLM Experiences
(Also on Constellar tech blog https://nedvedyang.medium.com/geek-out-time-ai-in-the-browser-run-webllm-for-powerful-local-llm-experiences-f89f80c77e78)
WebLLM brings Large Language Models (LLMs) directly into your browser, leveraging WebGPU for on-device GPU computation. In this updated guide, we’ll cover everything from local installation to CDN integration, letting you run WebLLM on a local Mac or directly on online platforms like CodePen or JSFiddle.
1. What is WebLLM?
WebLLM is an open-source project from MLC-AI that enables in-browser execution of large language models (LLMs). It uses WebGPU to harness the power of your GPU, allowing models like LLaMA, Gemma, or Mistral to run efficiently in the browser. Open https://webgpureport.org/ in your browser to check whether WebGPU is supported.
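If you prefer a quick programmatic check (for example, before trying to load a model), you can also probe for WebGPU from a page script. A minimal sketch, runnable in the browser console or a module script:
// Minimal WebGPU availability check.
if (!("gpu" in navigator)) {
  console.log("WebGPU is not available in this browser.");
} else {
  const adapter = await navigator.gpu.requestAdapter();
  console.log(adapter ? "WebGPU adapter found." : "WebGPU is exposed, but no adapter was returned.");
}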
2. Option 1: Local Setup for WebLLM
For a full experience, you can run WebLLM locally.
Step 2.1: Clone the Repository
git clone https://github.com/mlc-ai/web-llm.git
cd web-llm/examples/get-started
Step 2.2: Install Dependencies and Start
npm install
npm start
Step 2.3: Edit the code to load the model
Edit get_started.ts to load "gemma-2-2b-it-q4f16_1-MLC", a much smaller model than Llama, so the browser doesn't hang on my Mac. :-)
const selectedModel = "gemma-2-2b-it-q4f16_1-MLC";
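If you are unsure which model IDs are available (or want something even smaller for your hardware), you can print the prebuilt model list that ships with WebLLM. A quick sketch, runnable anywhere the @mlc-ai/web-llm package is importable:
import * as webllm from "@mlc-ai/web-llm";

// Print every prebuilt model ID so you can pick one that fits your machine.
const modelIds = webllm.prebuiltAppConfig.model_list.map((m) => m.model_id);
console.log(modelIds);
// e.g. narrow the list down to the smaller Gemma builds
console.log(modelIds.filter((id) => id.toLowerCase().includes("gemma")));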
Also extract the response text and print it to the console:
// Extract and print the response text from the first choice
if (reply0.choices && reply0.choices.length > 0) {
console.log('Assistant Response:', reply0.choices[0].message.content);
} else {
console.log('No valid response from the model.');
}
Here is the full code.
import * as webllm from "@mlc-ai/web-llm";

function setLabel(id: string, text: string) {
  const label = document.getElementById(id);
  if (label == null) {
    throw Error("Cannot find label " + id);
  }
  label.innerText = text;
}

async function main() {
  const initProgressCallback = (report: webllm.InitProgressReport) => {
    setLabel("init-label", report.text);
  };
  // Option 1: If we do not specify appConfig, we use `prebuiltAppConfig` defined in `config.ts`
  const selectedModel = "gemma-2-2b-it-q4f16_1-MLC";
  const engine: webllm.MLCEngineInterface = await webllm.CreateMLCEngine(
    selectedModel,
    {
      initProgressCallback: initProgressCallback,
      logLevel: "INFO", // specify the log level
    },
    // customize kv cache, use either context_window_size or sliding_window_size (with attention sink)
    {
      context_window_size: 2048,
      // sliding_window_size: 1024,
      // attention_sink_size: 4,
    },
  );

  const reply0 = await engine.chat.completions.create({
    messages: [{ role: "user", content: "List three Singapore universities." }],
    // the configurations below are all optional
    n: 3,
    temperature: 1.5,
    max_tokens: 4000,
    // These token IDs come from the original Llama-3.1-8B-Instruct example
    // (46510/7188 = "California", 8421/51325 = "Texas"); Gemma uses a different
    // tokenizer, so treat this logit_bias block as illustrative only.
    logit_bias: {
      "46510": -100,
      "7188": -100,
      "8421": 5,
      "51325": 5,
    },
    logprobs: true,
    top_logprobs: 2,
  });

  console.log("*********");
  console.log(reply0);
  console.log(reply0.usage);

  // Extract and print the response text from the first choice
  if (reply0.choices && reply0.choices.length > 0) {
    console.log("Assistant Response:", reply0.choices[0].message.content);
  } else {
    console.log("No valid response from the model.");
  }

  // To change model, either create a new engine via `CreateMLCEngine()`, or call `engine.reload(modelId)`
}

main();
Step 2.4: Load and Run a Model
Open the local dev server URL printed by npm start (for example, http://localhost:3000) in the browser; the initialization progress appears on the page and the model's response is printed in the browser console.
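As the comment at the end of the example notes, you can try a different model without creating a new engine by calling engine.reload(). A short sketch, assuming the engine from the code above and a model ID that exists in webllm.prebuiltAppConfig.model_list:
// Swap to a different prebuilt model without recreating the engine.
await engine.reload("Llama-3.1-8B-Instruct-q4f32_1-MLC-1k");
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Which model are you running as now?" }],
});
console.log(reply.choices[0].message.content);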
3. Option 2: CDN Integration for WebLLM
For quick prototyping or embedding in web applications, you can use WebLLM via a CDN.
Step 3.1: Basic HTML Integration
Create a basic HTML file:
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>WebLLM via CDN</title>
  </head>
  <body>
    <h1>WebLLM Chatbot</h1>
    <div id="chat-container"></div>
    <script type="module">
      import * as webllm from "https://esm.run/@mlc-ai/web-llm";

      async function runWebLLM() {
        // Download and initialize the model; progress is logged to the console.
        const engine = await webllm.CreateMLCEngine("gemma-2-2b-it-q4f16_1-MLC", {
          initProgressCallback: (report) => console.log(report.text),
        });
        console.log("WebLLM Initialized");

        // Ask a question and show the answer on the page.
        const reply = await engine.chat.completions.create({
          messages: [{ role: "user", content: "Say hello from the browser!" }],
        });
        document.getElementById("chat-container").textContent =
          reply.choices[0].message.content;
      }

      runWebLLM();
    </script>
  </body>
</html>
For a fuller chatbot with model selection, streaming output, and token statistics, you can use the page below (adapted from the WebLLM chat examples):
<!doctype html>
<html>
<head>
<title>Simple Chatbot</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta charset="UTF-8" />
<link rel="stylesheet" href="./index.css" />
</head>
<body>
<p>Step 1: Initialize WebLLM and Download Model</p>
<div class="download-container">
<select id="model-selection"></select>
<button id="download">Download</button>
</div>
<p id="download-status" class="hidden"></p>
<p>Step 2: Chat</p>
<div class="chat-container">
<div id="chat-box" class="chat-box"></div>
<div id="chat-stats" class="chat-stats hidden"></div>
<div class="chat-input-container">
<input type="text" id="user-input" placeholder="Type a message..." />
<button id="send" disabled>Send</button>
</div>
</div>
<script type="module">
import * as webllm from "https://esm.run/@mlc-ai/web-llm";
/*************** WebLLM logic ***************/
const messages = [
{
content: "You are a helpful AI agent helping users.",
role: "system",
},
];
const availableModels = webllm.prebuiltAppConfig.model_list.map(
(m) => m.model_id,
);
let selectedModel = "Llama-3.1-8B-Instruct-q4f32_1-MLC-1k";
// Callback function for initializing progress
function updateEngineInitProgressCallback(report) {
console.log("initialize", report.progress);
document.getElementById("download-status").textContent = report.text;
}
// Create engine instance
const engine = new webllm.MLCEngine();
engine.setInitProgressCallback(updateEngineInitProgressCallback);
async function initializeWebLLMEngine() {
document.getElementById("download-status").classList.remove("hidden");
selectedModel = document.getElementById("model-selection").value;
const config = {
temperature: 1.0,
top_p: 1,
};
await engine.reload(selectedModel, config);
}
async function streamingGenerating(messages, onUpdate, onFinish, onError) {
try {
let curMessage = "";
let usage;
const completion = await engine.chat.completions.create({
stream: true,
messages,
stream_options: { include_usage: true },
});
for await (const chunk of completion) {
const curDelta = chunk.choices[0]?.delta.content;
if (curDelta) {
curMessage += curDelta;
}
if (chunk.usage) {
usage = chunk.usage;
}
onUpdate(curMessage);
}
const finalMessage = await engine.getMessage();
onFinish(finalMessage, usage);
} catch (err) {
onError(err);
}
}
/*************** UI logic ***************/
function onMessageSend() {
const input = document.getElementById("user-input").value.trim();
const message = {
content: input,
role: "user",
};
if (input.length === 0) {
return;
}
document.getElementById("send").disabled = true;
messages.push(message);
appendMessage(message);
document.getElementById("user-input").value = "";
document
.getElementById("user-input")
.setAttribute("placeholder", "Generating...");
const aiMessage = {
content: "typing...",
role: "assistant",
};
appendMessage(aiMessage);
const onFinishGenerating = (finalMessage, usage) => {
updateLastMessage(finalMessage);
document.getElementById("send").disabled = false;
const usageText =
`prompt_tokens: ${usage.prompt_tokens}, ` +
`completion_tokens: ${usage.completion_tokens}, ` +
`prefill: ${usage.extra.prefill_tokens_per_s.toFixed(4)} tokens/sec, ` +
`decoding: ${usage.extra.decode_tokens_per_s.toFixed(4)} tokens/sec`;
document.getElementById("chat-stats").classList.remove("hidden");
document.getElementById("chat-stats").textContent = usageText;
};
streamingGenerating(
messages,
updateLastMessage,
onFinishGenerating,
console.error,
);
}
function appendMessage(message) {
const chatBox = document.getElementById("chat-box");
const container = document.createElement("div");
container.classList.add("message-container");
const newMessage = document.createElement("div");
newMessage.classList.add("message");
newMessage.textContent = message.content;
if (message.role === "user") {
container.classList.add("user");
} else {
container.classList.add("assistant");
}
container.appendChild(newMessage);
chatBox.appendChild(container);
chatBox.scrollTop = chatBox.scrollHeight; // Scroll to the latest message
}
function updateLastMessage(content) {
const messageDoms = document
.getElementById("chat-box")
.querySelectorAll(".message");
const lastMessageDom = messageDoms[messageDoms.length - 1];
lastMessageDom.textContent = content;
}
/*************** UI binding ***************/
availableModels.forEach((modelId) => {
const option = document.createElement("option");
option.value = modelId;
option.textContent = modelId;
document.getElementById("model-selection").appendChild(option);
});
document.getElementById("model-selection").value = selectedModel;
document.getElementById("download").addEventListener("click", function () {
initializeWebLLMEngine().then(() => {
document.getElementById("send").disabled = false;
});
});
document.getElementById("send").addEventListener("click", function () {
onMessageSend();
});
</script>
</body>
</html>
Step 3.2: Test on Cloud IDEs
Copy the above code into online platforms like CodePen or JSFiddle (the page links to ./index.css, so supply your own stylesheet or remove that link).
Run the code, and the chatbot will initialize directly in the browser.
4. Online demo
WebLLM also has an online demo page at https://chat.webllm.ai/. It has many models to play with.
5. Thoughts
WebLLM marks an important step toward shifting large language models (LLMs) from centralized cloud servers to end-user devices. By running directly in the browser using WebGPU, WebLLM removes the need for constant server communication, reducing latency, improving privacy, and enabling offline usage. This approach makes it possible to integrate LLMs into a wide range of applications, from browser-based productivity tools and educational assistants to interactive chat interfaces that run entirely on the user’s device.
One significant advantage of WebLLM is data locality — all data processing happens on the user’s device rather than being sent to a remote server. This is especially valuable for applications handling sensitive information, such as healthcare tools, financial assistants, or legal document analysis. By keeping data local, WebLLM not only ensures greater privacy but also complies more easily with data protection regulations in different regions.
The ability to run LLMs locally also means lower operational costs for developers, as reliance on expensive backend infrastructure decreases. Furthermore, it opens up opportunities for AI applications in low-connectivity areas or environments where consistent internet access isn’t guaranteed. In short, WebLLM provides a more scalable, accessible, and secure way to deliver AI-powered experiences, laying the groundwork for broader adoption of LLMs across devices and platforms.
Enjoy coding and have fun!