Geek Out Time: AI in the Browser - Run WebLLM for Powerful, Local LLM Experiences
(Also on Constellar tech blog https://nedvedyang.medium.com/geek-out-time-ai-in-the-browser-run-webllm-for-powerful-local-llm-experiences-f89f80c77e78)
WebLLM brings Large Language Models (LLMs) directly into your browser, leveraging WebGPU for on-device GPU computation. In this updated guide, we’ll cover everything from local installation to CDN integration, letting you run WebLLM on a local Mac or directly on online platforms like CodePen or JSFiddle.
1. What is WebLLM?
WebLLM is an open-source project from MLC-AI that enables in-browser execution of large language models (LLMs). It uses WebGPU to harness the power of your GPU, allowing models like LLaMA, Gemma, or Mistral to run efficiently in the browser. Open https://webgpureport.org/ in your browser to check whether WebGPU is supported.
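If you prefer a quick programmatic check (for example, before trying to load a model), you can also probe for WebGPU from a page script. A minimal sketch, runnable in the browser console or a module script:
// Minimal WebGPU availability check.
if (!("gpu" in navigator)) {
  console.log("WebGPU is not available in this browser.");
} else {
  const adapter = await navigator.gpu.requestAdapter();
  console.log(adapter ? "WebGPU adapter found." : "WebGPU is exposed, but no adapter was returned.");
}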
2. Option 1: Local Setup for WebLLM
For a full experience, you can run WebLLM locally.
Step 2.1: Clone the Repository
git clone https://github.com/mlc-ai/web-llm.git
cd web-llm/examples/get-started
Step 2.2: Install Dependencies and Start
npm install
npm start
Step 2.3: Edit the code to load the model
Edit get_started.ts to load "gemma-2-2b-it-q4f16_1-MLC", a much smaller model than Llama, so the browser doesn't hang on my Mac. :-)
const selectedModel = "gemma-2-2b-it-q4f16_1-MLC";
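If you are unsure which model IDs are available (or want something even smaller for your hardware), you can print the prebuilt model list that ships with WebLLM. A quick sketch, runnable anywhere the @mlc-ai/web-llm package is importable:
import * as webllm from "@mlc-ai/web-llm";

// Print every prebuilt model ID so you can pick one that fits your machine.
const modelIds = webllm.prebuiltAppConfig.model_list.map((m) => m.model_id);
console.log(modelIds);
// e.g. narrow the list down to the smaller Gemma builds
console.log(modelIds.filter((id) => id.toLowerCase().includes("gemma")));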
Also extract the response text and print it to the console:
// Extract and print the response text from the first choice
if (reply0.choices && reply0.choices.length > 0) {
console.log('Assistant Response:', reply0.choices[0].message.content);
} else {
console.log('No valid response from the model.');
}
Here is the full code.
import * as webllm from "@mlc-ai/web-llm";

function setLabel(id: string, text: string) {
  const label = document.getElementById(id);
  if (label == null) {
    throw Error("Cannot find label " + id);
  }
  label.innerText = text;
}

async function main() {
  const initProgressCallback = (report: webllm.InitProgressReport) => {
    setLabel("init-label", report.text);
  };
  // Option 1: If we do not specify appConfig, we use `prebuiltAppConfig` defined in `config.ts`
  const selectedModel = "gemma-2-2b-it-q4f16_1-MLC";
  const engine: webllm.MLCEngineInterface = await webllm.CreateMLCEngine(
    selectedModel,
    {
      initProgressCallback: initProgressCallback,
      logLevel: "INFO", // specify the log level
    },
    // customize kv cache, use either context_window_size or sliding_window_size (with attention sink)
    {
      context_window_size: 2048,
      // sliding_window_size: 1024,
      // attention_sink_size: 4,
    },
  );

  const reply0 = await engine.chat.completions.create({
    messages: [{ role: "user", content: "List three Singapore universities." }],
    // the configurations below are all optional
    n: 3,
    temperature: 1.5,
    max_tokens: 4000,
    // These token IDs come from the original Llama-3.1-8B-Instruct example
    // (46510/7188 = "California", 8421/51325 = "Texas"); Gemma uses a different
    // tokenizer, so treat this logit_bias block as illustrative only.
    logit_bias: {
      "46510": -100,
      "7188": -100,
      "8421": 5,
      "51325": 5,
    },
    logprobs: true,
    top_logprobs: 2,
  });

  console.log("*********");
  console.log(reply0);
  console.log(reply0.usage);

  // Extract and print the response text from the first choice
  if (reply0.choices && reply0.choices.length > 0) {
    console.log("Assistant Response:", reply0.choices[0].message.content);
  } else {
    console.log("No valid response from the model.");
  }

  // To change model, either create a new engine via `CreateMLCEngine()`, or call `engine.reload(modelId)`
}

main();
Step 2.4: Load and Run a Model
Open the local dev server URL printed by npm start (for example, http://localhost:3000) in the browser; the initialization progress appears on the page and the model's response is printed in the browser console.
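As the comment at the end of the example notes, you can try a different model without creating a new engine by calling engine.reload(). A short sketch, assuming the engine from the code above and a model ID that exists in webllm.prebuiltAppConfig.model_list:
// Swap to a different prebuilt model without recreating the engine.
await engine.reload("Llama-3.1-8B-Instruct-q4f32_1-MLC-1k");
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Which model are you running as now?" }],
});
console.log(reply.choices[0].message.content);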
3. Option 2: CDN Integration for WebLLM
For quick prototyping or embedding in web applications, you can use WebLLM via a CDN.
Step 3.1: Basic HTML Integration
Create a basic HTML file:
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>WebLLM via CDN</title>
  </head>
  <body>
    <h1>WebLLM Chatbot</h1>
    <div id="chat-container"></div>
    <script type="module">
      import * as webllm from "https://esm.run/@mlc-ai/web-llm";

      async function runWebLLM() {
        // Download and initialize the model; progress is logged to the console.
        const engine = await webllm.CreateMLCEngine("gemma-2-2b-it-q4f16_1-MLC", {
          initProgressCallback: (report) => console.log(report.text),
        });
        console.log("WebLLM Initialized");

        // Ask a question and show the answer on the page.
        const reply = await engine.chat.completions.create({
          messages: [{ role: "user", content: "Say hello from the browser!" }],
        });
        document.getElementById("chat-container").textContent =
          reply.choices[0].message.content;
      }

      runWebLLM();
    </script>
  </body>
</html>
For a fuller chatbot with model selection, streaming output, and token statistics, you can use the page below (adapted from the WebLLM chat examples):
<!doctype html>
<html>
<head>
<title>Simple Chatbot</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta charset="UTF-8" />
<link rel="stylesheet" href="./index.css" />
</head>
<body>
<p>Step 1: Initialize WebLLM and Download Model</p>
<div class="download-container">
<select id="model-selection"></select>
<button id="download">Download</button>
</div>
<p id="download-status" class="hidden"></p>
<p>Step 2: Chat</p>
<div class="chat-container">
<div id="chat-box" class="chat-box"></div>
<div id="chat-stats" class="chat-stats hidden"></div>
<div class="chat-input-container">
<input type="text" id="user-input" placeholder="Type a message..." />
<button id="send" disabled>Send</button>
</div>
</div>
<script type="module">
import * as webllm from "https://esm.run/@mlc-ai/web-llm";
/*************** WebLLM logic ***************/
const messages = [
{
content: "You are a helpful AI agent helping users.",
role: "system",
},
];
const availableModels = webllm.prebuiltAppConfig.model_list.map(
(m) => m.model_id,
);
let selectedModel = "Llama-3.1-8B-Instruct-q4f32_1-MLC-1k";
// Callback function for initializing progress
function updateEngineInitProgressCallback(report) {
console.log("initialize", report.progress);
document.getElementById("download-status").textContent = report.text;
}
// Create engine instance
const engine = new webllm.MLCEngine();
engine.setInitProgressCallback(updateEngineInitProgressCallback);
async function initializeWebLLMEngine() {
document.getElementById("download-status").classList.remove("hidden");
selectedModel = document.getElementById("model-selection").value;
const config = {
temperature: 1.0,
top_p: 1,
};
await engine.reload(selectedModel, config);
}
async function streamingGenerating(messages, onUpdate, onFinish, onError) {
try {
let curMessage = "";
let usage;
const completion = await engine.chat.completions.create({
stream: true,
messages,
stream_options: { include_usage: true },
});
for await (const chunk of completion) {
const curDelta = chunk.choices[0]?.delta.content;
if (curDelta) {
curMessage += curDelta;
}
if (chunk.usage) {
usage = chunk.usage;
}
onUpdate(curMessage);
}
const finalMessage = await engine.getMessage();
onFinish(finalMessage, usage);
} catch (err) {
onError(err);
}
}
/*************** UI logic ***************/
function onMessageSend() {
const input = document.getElementById("user-input").value.trim();
const message = {
content: input,
role: "user",
};
if (input.length === 0) {
return;
}
document.getElementById("send").disabled = true;
messages.push(message);
appendMessage(message);
document.getElementById("user-input").value = "";
document
.getElementById("user-input")
.setAttribute("placeholder", "Generating...");
const aiMessage = {
content: "typing...",
role: "assistant",
};
appendMessage(aiMessage);
const onFinishGenerating = (finalMessage, usage) => {
updateLastMessage(finalMessage);
document.getElementById("send").disabled = false;
const usageText =
`prompt_tokens: ${usage.prompt_tokens}, ` +
`completion_tokens: ${usage.completion_tokens}, ` +
`prefill: ${usage.extra.prefill_tokens_per_s.toFixed(4)} tokens/sec, ` +
`decoding: ${usage.extra.decode_tokens_per_s.toFixed(4)} tokens/sec`;
document.getElementById("chat-stats").classList.remove("hidden");
document.getElementById("chat-stats").textContent = usageText;
};
streamingGenerating(
messages,
updateLastMessage,
onFinishGenerating,
console.error,
);
}
function appendMessage(message) {
const chatBox = document.getElementById("chat-box");
const container = document.createElement("div");
container.classList.add("message-container");
const newMessage = document.createElement("div");
newMessage.classList.add("message");
newMessage.textContent = message.content;
if (message.role === "user") {
container.classList.add("user");
} else {
container.classList.add("assistant");
}
container.appendChild(newMessage);
chatBox.appendChild(container);
chatBox.scrollTop = chatBox.scrollHeight; // Scroll to the latest message
}
function updateLastMessage(content) {
const messageDoms = document
.getElementById("chat-box")
.querySelectorAll(".message");
const lastMessageDom = messageDoms[messageDoms.length - 1];
lastMessageDom.textContent = content;
}
/*************** UI binding ***************/
availableModels.forEach((modelId) => {
const option = document.createElement("option");
option.value = modelId;
option.textContent = modelId;
document.getElementById("model-selection").appendChild(option);
});
document.getElementById("model-selection").value = selectedModel;
document.getElementById("download").addEventListener("click", function () {
initializeWebLLMEngine().then(() => {
document.getElementById("send").disabled = false;
});
});
document.getElementById("send").addEventListener("click", function () {
onMessageSend();
});
</script>
</body>
</html>
Step 3.2: Test on Cloud IDEs
Copy the above code into online platforms like CodePen or JSFiddle (the page links to ./index.css, so supply your own stylesheet or remove that link).
Run the code, and the chatbot will initialize directly in the browser.
4. Online demo
WebLLM also has an online demo page at https://chat.webllm.ai/. It has many models to play with.
5. Thoughts
WebLLM marks an important step toward shifting large language models (LLMs) from centralized cloud servers to end-user devices. By running directly in the browser using WebGPU, WebLLM removes the need for constant server communication, reducing latency, improving privacy, and enabling offline usage. This approach makes it possible to integrate LLMs into a wide range of applications, from browser-based productivity tools and educational assistants to interactive chat interfaces that run entirely on the user’s device.
One significant advantage of WebLLM is data locality — all data processing happens on the user’s device rather than being sent to a remote server. This is especially valuable for applications handling sensitive information, such as healthcare tools, financial assistants, or legal document analysis. By keeping data local, WebLLM not only ensures greater privacy but also complies more easily with data protection regulations in different regions.
The ability to run LLMs locally also means lower operational costs for developers, as reliance on expensive backend infrastructure decreases. Furthermore, it opens up opportunities for AI applications in low-connectivity areas or environments where consistent internet access isn’t guaranteed. In short, WebLLM provides a more scalable, accessible, and secure way to deliver AI-powered experiences, laying the groundwork for broader adoption of LLMs across devices and platforms.
Enjoy coding and have fun!