Geek Out Time: AI in the Browser - Run WebLLM for Powerful, Local LLM Experiences

(Also on Constellar tech blog https://nedvedyang.medium.com/geek-out-time-ai-in-the-browser-run-webllm-for-powerful-local-llm-experiences-f89f80c77e78)

WebLLM brings Large Language Models (LLMs) directly into your browser, leveraging WebGPU for on-device GPU computation. In this updated guide, we’ll cover everything from local installation to CDN integration, allowing you to run WebLLM on a local Mac or directly on online platforms like CodePen or JSFiddle.

1. What is WebLLM?

WebLLM is an open-source project from MLC-AI that enables in-browser execution of large language models (LLMs). It uses WebGPU to harness the power of your GPU, allowing models like LLaMA, Gemma, or Mistral to run efficiently in the browser. Open https://webgpureport.org/ in your browser to check whether WebGPU is supported.
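If you prefer to check programmatically, here is a minimal sketch you can run in the browser’s developer console; navigator.gpu is only defined when WebGPU is available:

// Minimal WebGPU capability check (run in the browser's developer console or a module script).
async function checkWebGPU() {
  if (!("gpu" in navigator)) {
    console.log("WebGPU is not supported in this browser.");
    return false;
  }
  // requestAdapter() resolves to null when no suitable GPU adapter is found
  const adapter = await navigator.gpu.requestAdapter();
  console.log(adapter ? "WebGPU is supported." : "WebGPU is exposed, but no GPU adapter is available.");
  return adapter !== null;
}

checkWebGPU();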

2. Option 1: Local Setup for WebLLM

For a full experience, you can run WebLLM locally.

Step 2.1: Clone the Repository

git clone https://github.com/mlc-ai/web-llm.git
cd web-llm/examples/get-started        

Step 2.2: Install Dependencies and Start

npm install
npm start        

Step 2.3: Edit the Code to Load the Model

Edit get_started.ts to load "gemma-2-2b-it-q4f16_1-MLC", a much smaller model than LLaMA, so it won’t cause the browser to hang on my Mac. :-)

const selectedModel = "gemma-2-2b-it-q4f16_1-MLC";        

Also extract the assistant’s message and print it to the console:

// Extract and print the response text from the first choice
    if (reply0.choices && reply0.choices.length > 0) {
      console.log('Assistant Response:', reply0.choices[0].message.content);
    } else {
      console.log('No valid response from the model.');
    }        

Here is the full code.

import * as webllm from "@mlc-ai/web-llm";

function setLabel(id: string, text: string) {
  const label = document.getElementById(id);
  if (label == null) {
    throw Error("Cannot find label " + id);
  }
  label.innerText = text;
}

async function main() {
  const initProgressCallback = (report: webllm.InitProgressReport) => {
    setLabel("init-label", report.text);
  };
  // Option 1: If we do not specify appConfig, we use `prebuiltAppConfig` defined in `config.ts`
  const selectedModel = "gemma-2-2b-it-q4f16_1-MLC";
  const engine: webllm.MLCEngineInterface = await webllm.CreateMLCEngine(
    selectedModel,
    {
      initProgressCallback: initProgressCallback,
      logLevel: "INFO", // specify the log level
    },
    // customize kv cache, use either context_window_size or sliding_window_size (with attention sink)
    {
      context_window_size: 2048,
      // sliding_window_size: 1024,
      // attention_sink_size: 4,
    },
  );


  const reply0 = await engine.chat.completions.create({
    messages: [{ role: "user", content: "List three Singapore universities." }],
    // below configurations are all optional
    n: 3,
    temperature: 1.5,
    max_tokens: 4000,
    // logit_bias example carried over from the upstream Llama-3.1-8B-Instruct sample:
    // token ids 46510/7188 ("California") are suppressed, 8421/51325 ("Texas") are boosted.
    // Token ids are tokenizer-specific, so with Gemma these particular ids carry no special meaning.
    logit_bias: {
      "46510": -100,
      "7188": -100,
      "8421": 5,
      "51325": 5,
    },
    logprobs: true,
    top_logprobs: 2,
  });
  console.log ('*********');
  console.log(reply0);
  console.log(reply0.usage);
    // Extract and print the response text from the first choice
    if (reply0.choices && reply0.choices.length > 0) {
      console.log('Assistant Response:', reply0.choices[0].message.content);
    } else {
      console.log('No valid response from the model.');
    }

  // To change model, either create a new engine via `CreateMLCEngine()`, or call `engine.reload(modelId)`
}

main();        

Step 2.4: Load and Run a Model

Open http://localhost:3000 in the browser; the page shows the model-loading progress, and the assistant’s response is printed in the browser’s developer console.
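To try a different model without recreating the engine, the comment at the end of the code above points to engine.reload(modelId). Here is a rough sketch; the Llama model ID below is only an assumed example, so pick any ID printed from prebuiltAppConfig.model_list:

// List the valid prebuilt model IDs, then switch the existing engine to one of them.
console.log(webllm.prebuiltAppConfig.model_list.map((m) => m.model_id));

// "Llama-3.2-1B-Instruct-q4f16_1-MLC" is an assumed example ID; replace it with one from the list above.
await engine.reload("Llama-3.2-1B-Instruct-q4f16_1-MLC");

const reply1 = await engine.chat.completions.create({
  messages: [{ role: "user", content: "List three Singapore universities." }],
});
console.log(reply1.choices[0].message.content);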

3. Option 2: CDN Integration for WebLLM

For quick prototyping or embedding in web applications, you can use WebLLM via a CDN.

Step 3.1: Basic HTML Integration

Create a basic HTML file:

<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>WebLLM via CDN</title>
</head>

<body>
    <h1>WebLLM Chatbot</h1>
    <div id="chat-container"></div>
    <script type="module">
        import * as webllm from "https://esm.run/@mlc-ai/web-llm";

        async function runWebLLM() {
            // Load the same small Gemma model used in the local example
            const engine = await webllm.CreateMLCEngine("gemma-2-2b-it-q4f16_1-MLC", {
                initProgressCallback: (report) => console.log(report.text),
            });
            console.log("WebLLM Initialized");
            const reply = await engine.chat.completions.create({
                messages: [{ role: "user", content: "List three Singapore universities." }],
            });
            document.getElementById("chat-container").textContent =
                reply.choices[0].message.content;
        }
        runWebLLM();
    </script>
</body>

</html>

Here is a fuller example with model selection, streaming responses, and usage statistics. It references an optional ./index.css stylesheet; the page still works without it, just unstyled.

<!doctype html>
<html>

<head>
    <title>Simple Chatbot</title>
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <meta charset="UTF-8" />
    <link rel="stylesheet" href="./index.css" />
</head>

<body>
    <p>Step 1: Initialize WebLLM and Download Model</p>
    <div class="download-container">
        <select id="model-selection"></select>
        <button id="download">Download</button>
    </div>
    <p id="download-status" class="hidden"></p>

    <p>Step 2: Chat</p>
    <div class="chat-container">
        <div id="chat-box" class="chat-box"></div>
        <div id="chat-stats" class="chat-stats hidden"></div>
        <div class="chat-input-container">
            <input type="text" id="user-input" placeholder="Type a message..." />
            <button id="send" disabled>Send</button>
        </div>
    </div>

    <script type="module">
        import * as webllm from "https://esm.run/@mlc-ai/web-llm";

        /*************** WebLLM logic ***************/
        const messages = [
            {
                content: "You are a helpful AI agent helping users.",
                role: "system",
            },
        ];

        const availableModels = webllm.prebuiltAppConfig.model_list.map(
            (m) => m.model_id,
        );
        let selectedModel = "gemma-2-2b-it-q4f16_1-MLC"; // default to the small Gemma model used earlier

        // Callback function for initializing progress
        function updateEngineInitProgressCallback(report) {
            console.log("initialize", report.progress);
            document.getElementById("download-status").textContent = report.text;
        }

        // Create engine instance
        const engine = new webllm.MLCEngine();
        engine.setInitProgressCallback(updateEngineInitProgressCallback);

        async function initializeWebLLMEngine() {
            document.getElementById("download-status").classList.remove("hidden");
            selectedModel = document.getElementById("model-selection").value;
            const config = {
                temperature: 1.0,
                top_p: 1,
            };
            await engine.reload(selectedModel, config);
        }

        async function streamingGenerating(messages, onUpdate, onFinish, onError) {
            try {
                let curMessage = "";
                let usage;
                const completion = await engine.chat.completions.create({
                    stream: true,
                    messages,
                    stream_options: { include_usage: true },
                });
                for await (const chunk of completion) {
                    const curDelta = chunk.choices[0]?.delta.content;
                    if (curDelta) {
                        curMessage += curDelta;
                    }
                    if (chunk.usage) {
                        usage = chunk.usage;
                    }
                    onUpdate(curMessage);
                }
                const finalMessage = await engine.getMessage();
                onFinish(finalMessage, usage);
            } catch (err) {
                onError(err);
            }
        }

        /*************** UI logic ***************/
        function onMessageSend() {
            const input = document.getElementById("user-input").value.trim();
            const message = {
                content: input,
                role: "user",
            };
            if (input.length === 0) {
                return;
            }
            document.getElementById("send").disabled = true;

            messages.push(message);
            appendMessage(message);

            document.getElementById("user-input").value = "";
            document
                .getElementById("user-input")
                .setAttribute("placeholder", "Generating...");

            const aiMessage = {
                content: "typing...",
                role: "assistant",
            };
            appendMessage(aiMessage);

            const onFinishGenerating = (finalMessage, usage) => {
                updateLastMessage(finalMessage);
                document.getElementById("send").disabled = false;
                const usageText =
                    `prompt_tokens: ${usage.prompt_tokens}, ` +
                    `completion_tokens: ${usage.completion_tokens}, ` +
                    `prefill: ${usage.extra.prefill_tokens_per_s.toFixed(4)} tokens/sec, ` +
                    `decoding: ${usage.extra.decode_tokens_per_s.toFixed(4)} tokens/sec`;
                document.getElementById("chat-stats").classList.remove("hidden");
                document.getElementById("chat-stats").textContent = usageText;
            };

            streamingGenerating(
                messages,
                updateLastMessage,
                onFinishGenerating,
                console.error,
            );
        }

        function appendMessage(message) {
            const chatBox = document.getElementById("chat-box");
            const container = document.createElement("div");
            container.classList.add("message-container");
            const newMessage = document.createElement("div");
            newMessage.classList.add("message");
            newMessage.textContent = message.content;

            if (message.role === "user") {
                container.classList.add("user");
            } else {
                container.classList.add("assistant");
            }

            container.appendChild(newMessage);
            chatBox.appendChild(container);
            chatBox.scrollTop = chatBox.scrollHeight; // Scroll to the latest message
        }

        function updateLastMessage(content) {
            const messageDoms = document
                .getElementById("chat-box")
                .querySelectorAll(".message");
            const lastMessageDom = messageDoms[messageDoms.length - 1];
            lastMessageDom.textContent = content;
        }

        /*************** UI binding ***************/
        availableModels.forEach((modelId) => {
            const option = document.createElement("option");
            option.value = modelId;
            option.textContent = modelId;
            document.getElementById("model-selection").appendChild(option);
        });
        document.getElementById("model-selection").value = selectedModel;
        document.getElementById("download").addEventListener("click", function () {
            initializeWebLLMEngine().then(() => {
                document.getElementById("send").disabled = false;
            });
        });
        document.getElementById("send").addEventListener("click", function () {
            onMessageSend();
        });
    </script>
</body>

</html>
        

Step 3.2: Test on Cloud IDEs

Copy the above code into platforms like CodePen or JSFiddle.

Run the code, and the chatbot will initialize directly in the browser.

4. Online demo

WebLLM also has an online demo page at https://chat.webllm.ai/. It has many models to play with.

5. Thoughts

WebLLM marks an important step toward shifting large language models (LLMs) from centralized cloud servers to end-user devices. By running directly in the browser using WebGPU, WebLLM removes the need for constant server communication, reducing latency, improving privacy, and enabling offline usage. This approach makes it possible to integrate LLMs into a wide range of applications, from browser-based productivity tools and educational assistants to interactive chat interfaces that run entirely on the user’s device.

One significant advantage of WebLLM is data locality — all data processing happens on the user’s device rather than being sent to a remote server. This is especially valuable for applications handling sensitive information, such as healthcare tools, financial assistants, or legal document analysis. By keeping data local, WebLLM not only ensures greater privacy but also complies more easily with data protection regulations in different regions.

The ability to run LLMs locally also means lower operational costs for developers, as reliance on expensive backend infrastructure decreases. Furthermore, it opens up opportunities for AI applications in low-connectivity areas or environments where consistent internet access isn’t guaranteed. In short, WebLLM provides a more scalable, accessible, and secure way to deliver AI-powered experiences, laying the groundwork for broader adoption of LLMs across devices and platforms.

Enjoy coding and have fun!
