Weekend project: Local AI Speech-to-text using Whisper with Java
A short article to showcase how easy it is to transcribe speech locally using the WhisperJNI library in Java.
First, download a model from here (larger models are generally more accurate but slower and heavier on memory; use the "en" models when transcribing English).
Then add the library to your pom.xml and use the following code:
import io.github.givimad.whisperjni.WhisperContext;
import io.github.givimad.whisperjni.WhisperFullParams;
import io.github.givimad.whisperjni.WhisperJNI;
import java.nio.file.Path;

WhisperJNI.loadLibrary(); // Load the native whisper.cpp binaries for this platform
WhisperJNI whisper = new WhisperJNI();
float[] samples = readFile(); // see next section
WhisperContext ctx = whisper.init(Path.of("C:\\ggml-small-q5_1.bin")); // Path to the MODEL!
if (ctx == null) { throw new RuntimeException("Failed to initialize Whisper context"); }
try {
    WhisperFullParams params = new WhisperFullParams(); // default transcription parameters
    int result = whisper.full(ctx, params, samples, samples.length);
    if (result != 0) { throw new RuntimeException("Transcription failed with code " + result); }
    int numSegments = whisper.fullNSegments(ctx); // Get the number of segments
    System.out.println("Number of segments: " + numSegments);
    StringBuilder transcribedText = new StringBuilder();
    for (int i = 0; i < numSegments; i++) {
        transcribedText.append(whisper.fullGetSegmentText(ctx, i)).append(" ");
    }
    System.out.println("Transcribed text: " + transcribedText);
} finally {
    ctx.close(); // Free the native memory, even if the transcription fails
}
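The loop above glues the segments together with a space. As far as I can tell, segment texts can already start with a space of their own, so the result may contain doubled or trailing spaces. Here is a minimal sketch of a tidier join; TranscriptJoiner and join are hypothetical names of mine, not part of WhisperJNI, and the list of strings stands in for the values returned by fullGetSegmentText:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of WhisperJNI: joins segment texts with single spaces.
final class TranscriptJoiner {

    static String join(List<String> segments) {
        return segments.stream()
                .map(String::strip)          // drop leading/trailing whitespace of each segment
                .filter(s -> !s.isEmpty())   // skip segments that contained only whitespace
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Made-up segment texts, only for illustration
        System.out.println(join(List.of(" Hello there.", " How are you? ", "  ")));
    }
}
```

In the transcription loop you would collect each fullGetSegmentText result into a list and call join once at the end.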
The code to read the audio file is also quite short:
private static float[] readFile() throws UnsupportedAudioFileException, IOException, URISyntaxException {
    // Resolve the resource through its URI so that paths containing spaces are decoded correctly
    File wavFile = Paths.get(Main.class.getClassLoader().getResource("moonfull.wav").toURI()).toFile();
    try (AudioInputStream audioInputStream = AudioSystem.getAudioInputStream(wavFile)) {
        // Read the whole stream: a single read() call is not guaranteed to fill the buffer
        byte[] pcm = audioInputStream.readAllBytes();
        // The file is assumed to contain 16-bit little-endian PCM samples
        ShortBuffer shortBuffer = ByteBuffer.wrap(pcm).order(ByteOrder.LITTLE_ENDIAN).asShortBuffer();
        float[] samples = new float[shortBuffer.remaining()];
        int i = 0;
        while (shortBuffer.hasRemaining()) {
            // Scale each 16-bit sample to the [-1, 1] range
            samples[i++] = Float.max(-1f, Float.min(((float) shortBuffer.get()) / (float) Short.MAX_VALUE, 1f));
        }
        return samples;
    }
}
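To make the scaling step easier to see (and to test without any audio file), here is a small standalone sketch of the same conversion applied to a raw byte array. PcmSamples and toFloats are hypothetical names of mine, and the input is assumed to be 16-bit little-endian PCM:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.ShortBuffer;

// Hypothetical helper: converts 16-bit little-endian PCM bytes to floats in [-1, 1].
final class PcmSamples {

    static float[] toFloats(byte[] pcm) {
        ShortBuffer shorts = ByteBuffer.wrap(pcm).order(ByteOrder.LITTLE_ENDIAN).asShortBuffer();
        float[] samples = new float[shorts.remaining()];
        for (int i = 0; i < samples.length; i++) {
            float scaled = shorts.get() / (float) Short.MAX_VALUE;
            // Short.MIN_VALUE scales to slightly below -1, hence the clamp
            samples[i] = Math.max(-1f, Math.min(scaled, 1f));
        }
        return samples;
    }

    public static void main(String[] args) {
        // Three samples: silence, the largest positive value, the largest negative value
        byte[] pcm = { 0x00, 0x00, (byte) 0xFF, 0x7F, 0x00, (byte) 0x80 };
        for (float sample : toFloats(pcm)) {
            System.out.println(sample);
        }
    }
}
```

The clamp only matters for the most negative sample, since every other 16-bit value already lands inside [-1, 1] after the division.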
And voilà, we have a transcription:
You can find more usage examples in the WhisperJNI GitHub repository.
But wait, what about the file format? Surely we can't use just any file, right? Well... whisper.cpp works on 16 kHz mono audio, and the reader above assumes 16-bit PCM WAV, but you can use the ffmpeg-cli-wrapper to convert almost any file to the correct format:
private static void convert() throws IOException {
    FFmpeg ffmpeg = new FFmpeg("C:\\ffmpegbin\\ffmpeg.exe");
    // ffprobe is an external executable, not a classpath resource, so pass its path directly
    FFprobe ffprobe = new FFprobe("C:\\ffmpegbin\\ffprobe.exe");
    FFmpegBuilder builder = new FFmpegBuilder()
        .setInput("C:\\Users\\User\\Downloads\\whatsapp.ogg")
        .overrideOutputFiles(true)
        .addOutput("C:\\Users\\User\\Downloads\\whatsapp.wav")
        .setStrict(FFmpegBuilder.Strict.EXPERIMENTAL)
        .setFormat("wav")
        .setAudioChannels(1)        // mono
        .setAudioSampleRate(16000)  // whisper.cpp expects 16 kHz audio
        .setAudioCodec("pcm_s16le") // 16-bit little-endian PCM, as readFile() assumes
        .disableVideo()
        .done();
    FFmpegExecutor executor = new FFmpegExecutor(ffmpeg, ffprobe);
    // Run a one-pass encode
    executor.createJob(builder).run();
}
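Since whisper.cpp works on 16 kHz mono audio, it can be handy to check a WAV file before handing it to the transcription code. Below is a minimal sketch that relies only on the JDK's javax.sound.sampled classes. WavFormatCheck and isWhisperReady are hypothetical names of mine, and the 16-bit signed little-endian requirement reflects what the reading code in this article assumes rather than a rule of the library:

```java
import java.io.File;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;

// Hypothetical helper: checks that audio is 16 kHz, mono, 16-bit signed little-endian PCM.
final class WavFormatCheck {

    static boolean isWhisperReady(AudioFormat format) {
        return AudioFormat.Encoding.PCM_SIGNED.equals(format.getEncoding())
                && format.getSampleRate() == 16000f
                && format.getSampleSizeInBits() == 16
                && format.getChannels() == 1
                && !format.isBigEndian();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("Usage: java WavFormatCheck <file.wav>");
            return;
        }
        // Only the file header is inspected here, the samples are not decoded
        AudioFormat format = AudioSystem.getAudioFileFormat(new File(args[0])).getFormat();
        System.out.println(format + " -> " + (isWhisperReady(format) ? "ready" : "needs conversion"));
    }
}
```

If the check fails, running the file through the conversion step first should bring it to the expected format.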
I couldn't resist throwing a quick GUI together for easy use:
You'll find the project here, packaged for Windows and Linux: https://github.com/marl1/WhisperTextExtractor
Thanks to all the open source contributors, especially ggerganov for whisper.cpp, GiviMAD for WhisperJNI and bramp for the FFmpeg wrapper.
(I'm looking for a job in Singapore! Please contact me if you have a lead. Thank you!)