How to Convert Audio and Video to Text Locally for Free Using Whisper WebGPU
Vladislav G.
Marketing Director & Lead Web Developer | Driving Data-Driven Marketing Strategies with Technical Expertise
In this article, I’m going to show you how to transcribe audio and video files on your own computer using Whisper WebGPU. Once the model has been downloaded, no internet connection is needed.
Initial Requirements to Run Whisper Model Locally
We are going to build a web application, so you will need:
The basic hardware requirements:
What is Whisper from OpenAI
Whisper is a speech recognition system developed by OpenAI. It transcribes spoken language into written text and can also translate speech from other languages into English. Whisper is known for its accuracy and its ability to handle a variety of accents, languages, and even background noise, which makes it one of the most reliable tools for converting audio to text.
One of the best things about Whisper is that it’s open-source, meaning anyone can access it and use it for free. It can be run on cloud servers or even on your local computer, depending on your needs.
Official website: https://openai.com/index/whisper/
Hugging Face’s Transformers.js and ONNX Runtime Web
We are going to use the Whisper WebGPU project: https://github.com/xenova/whisper-web/tree/experimental-webgpu
This project utilizes OpenAI’s Whisper model and runs entirely on your device using WebGPU. It also leverages Hugging Face’s Transformers.js and ONNX Runtime Web, allowing all computations to be performed locally on your device without the need for server-side processing. This means that once the model is loaded, you won’t need an internet connection.
Key Features of Whisper WebGPU:
How to Run Whisper Model Locally (Ubuntu, Linux)
I will show you how to run it on Ubuntu (Linux). If you use Windows or Mac, you can follow the same steps in your terminal, but the installation commands for Git, Node.js, and NPM will differ.
Step 1. Install Git, Node.js, and NPM
If you are using Ubuntu, Git should already be installed. If it is not, use these commands:
sudo apt update
sudo apt install git
Install Node.js:
sudo apt install nodejs
Install NPM:
sudo apt install npm
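Before moving on, it can help to confirm that all three tools are actually on your PATH. The snippet below is a minimal sketch and not part of the original setup; the helper name missing_tools is made up for illustration, and it assumes the Node.js binary is called node, which may differ on older Ubuntu releases.

```shell
#!/bin/sh
# Print the name of every listed tool that is not found on PATH.
# (missing_tools is a hypothetical helper, not part of any package.)
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

missing=$(missing_tools git node npm)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Not installed yet:" $missing
else
    # All three tools were found, so report their versions.
    git --version
    echo "node $(node --version)"
    echo "npm $(npm --version)"
fi
```

If any name is listed as not installed, rerun the matching apt command from this step.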
Step 2. Turn on WebGPU in the Browser
Ensure your browser is configured to support WebGPU. In the Chrome address bar, enter chrome://flags, find “Unsafe WebGPU Support”, enable it, and relaunch the browser.
This is still an experimental feature in some browsers, so you may need to enable it in browser settings.
You can check the WebGPU status by opening chrome://gpu/ in your browser.
In some cases on Ubuntu, WebGPU may still be disabled even after relaunching. If so, try starting the browser with the following command:
/opt/google/chrome/chrome --enable-unsafe-webgpu
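The path above depends on how Chrome was installed. As a rough sketch, the script below looks for the first Chrome executable it can find and prints the launch command to use. The candidate names google-chrome and google-chrome-stable are assumptions about a typical Ubuntu install, and first_available is a hypothetical helper name.

```shell
#!/bin/sh
# Print the path of the first candidate that resolves to an executable.
# (first_available is a hypothetical helper for this sketch.)
first_available() {
    for candidate in "$@"; do
        if path=$(command -v "$candidate" 2>/dev/null); then
            echo "$path"
            return 0
        fi
    done
    return 1
}

# Candidate names are assumptions; adjust them to match your installation.
if chrome=$(first_available google-chrome google-chrome-stable /opt/google/chrome/chrome); then
    echo "Launch with: $chrome --enable-unsafe-webgpu"
else
    echo "Chrome not found; use the full path to your Chrome binary instead."
fi
```

The script only prints the command rather than starting the browser, so you can review it first.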
Step 3. Clone Repository and Install Dependencies
Clone the experimental-webgpu branch of the Whisper WebGPU project with the following command:
git clone -b experimental-webgpu https://github.com/xenova/whisper-web.git
Once the cloning process is finished, change into the whisper-web folder:
cd whisper-web
Then run the following command:
npm install
After that run:
npm run dev
This starts a local web server. The URL of your web application is printed in the terminal window, e.g. https://localhost:5174/
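The commands in this step can also be collected into one small script that is safe to run again later. This is only a sketch under a few assumptions: the branch name is taken from the project link earlier in the article, and DRY_RUN, run, and setup_whisper_web are hypothetical names chosen for illustration. With DRY_RUN=1 the script prints each command instead of executing it.

```shell
#!/bin/sh
# Sketch: Step 3 as one re-runnable function.
REPO_URL="https://github.com/xenova/whisper-web.git"
BRANCH="experimental-webgpu"   # assumption: branch from the project link above

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

setup_whisper_web() {
    dir="${1:-whisper-web}"
    # Skip the clone when the folder already exists from an earlier run.
    if [ ! -d "$dir" ]; then
        run git clone -b "$BRANCH" "$REPO_URL" "$dir" || return 1
    fi
    run cd "$dir" || return 1
    run npm install || return 1
    run npm run dev
}

# Preview the commands without touching the network or the disk.
(DRY_RUN=1; setup_whisper_web whisper-web)

# To actually run the steps, call the function without DRY_RUN:
# setup_whisper_web whisper-web
```

Skipping the clone when the folder already exists means a second run goes straight to installing dependencies and starting the server.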
Step 4. Run the Application
Go to your browser and open the URL from the terminal to see your application.
This web application supports various audio and video formats and even recording from your microphone.
To start the transcription process, provide a URL to an audio file or upload an audio or video file from your computer.
Video Tutorial
Watch on YouTube: Audio and Video to text converter.
Conclusion
Whisper WebGPU brings AI-driven transcription and translation directly to your browser. By combining OpenAI’s Whisper model with WebGPU, Transformers.js, and ONNX Runtime Web, the project makes offline transcription accessible to everyone, and because all processing happens on your device, your audio stays private.