Weekend project: Local AI Speech-to-text using Whisper with Java

A short article showcasing how easy it is to transcribe speech locally using the WhisperJNI library in Java.

First, download a model from here (the bigger the better; use the ".en" models if you only need to transcribe English).

Then add the whisper-jni library to your pom.xml.
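
A minimal sketch of the dependency entry, assuming the io.github.givimad:whisper-jni coordinates on Maven Central (the version is left as a placeholder, pick the latest release):

<dependency>
    <groupId>io.github.givimad</groupId>
    <artifactId>whisper-jni</artifactId>
    <version><!-- latest version from Maven Central --></version>
</dependency>

With the dependency in place, the transcription itself only takes the following code: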

WhisperJNI.loadLibrary(); // load the bundled native whisper library
WhisperJNI whisper = new WhisperJNI();

float[] samples = readFile(); // see next section

WhisperContext ctx = whisper.init(Path.of("C:\\ggml-small-q5_1.bin")); // Path to the MODEL!
if (ctx == null) { throw new RuntimeException("Failed to initialize Whisper context"); }

WhisperFullParams params = new WhisperFullParams(); // default transcription parameters

int result = whisper.full(ctx, params, samples, samples.length);
if (result != 0) { throw new RuntimeException("Transcription failed with code " + result); }

int numSegments = whisper.fullNSegments(ctx); // Get the number of segments
System.out.println("Number of segments: " + numSegments);
StringBuilder transcribedText = new StringBuilder();
for (int i = 0; i < numSegments; i++) {
    transcribedText.append(whisper.fullGetSegmentText(ctx, i)).append(" ");
}
System.out.println("Transcribed text: " + transcribedText.toString());
ctx.close(); // free the native whisper context when done

The code to read the audio file is also quite short:

private static float[] readFile() throws UnsupportedAudioFileException, IOException {
    // Whisper expects 16 kHz, mono, 16-bit PCM audio, normalized to floats in [-1, 1]
    AudioInputStream audioInputStream = AudioSystem.getAudioInputStream(new File(
            Main.class.getClassLoader().getResource("moonfull.wav").getFile()
    ));
    ByteBuffer captureBuffer = ByteBuffer.allocate(audioInputStream.available());
    captureBuffer.order(ByteOrder.LITTLE_ENDIAN); // WAV stores PCM samples as little-endian
    int read = audioInputStream.read(captureBuffer.array());
    ShortBuffer shortBuffer = captureBuffer.asShortBuffer();
    float[] samples = new float[captureBuffer.capacity() / 2]; // two bytes per 16-bit sample
    int i = 0;
    while (shortBuffer.hasRemaining()) {
        // scale each 16-bit sample down to a float in [-1, 1]
        samples[i++] = Float.max(-1f, Float.min(((float) shortBuffer.get()) / (float) Short.MAX_VALUE, 1f));
    }
    return samples;
}

And voilà, we have a transcription:

You can find more usage examples in the WhisperJNI GitHub repository.

But wait, what about the file format? Surely we can't use just any file, right? Well... you can use the ffmpeg-cli-wrapper to convert almost any file to the format Whisper expects (16 kHz, mono, 16-bit PCM WAV):

private static void convert() throws IOException {
    FFmpeg ffmpeg = new FFmpeg("C:\\ffmpegbin\\ffmpeg.exe");
    FFprobe ffprobe = new FFprobe("C:\\ffmpegbin\\ffprobe.exe");

    FFmpegBuilder builder = new FFmpegBuilder()
            .overrideOutputFiles(true)
            .setInput("C:\\Users\\User\\Downloads\\whatsapp.ogg")
            .addOutput("C:\\Users\\User\\Downloads\\whatsapp.wav")
            .setStrict(FFmpegBuilder.Strict.EXPERIMENTAL)
            .setFormat("wav")
            .setAudioChannels(1)        // Whisper expects mono audio...
            .setAudioSampleRate(16000)  // ...sampled at 16 kHz
            .setAudioCodec("pcm_s16le") // 16-bit PCM
            .disableVideo()
            .done();

    FFmpegExecutor executor = new FFmpegExecutor(ffmpeg, ffprobe);

    // Run a one-pass encode
    executor.createJob(builder).run();
}
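
Putting the pieces together, a hypothetical main method could look like the sketch below. It assumes the convert() and readFile() methods shown above, and that readFile() is pointed at the converted WAV rather than the bundled resource:

public static void main(String[] args) throws Exception {
    convert();                    // OGG -> 16 kHz mono 16-bit WAV (method above)
    float[] samples = readFile(); // load the converted WAV as normalized floats
    // ...then run the WhisperJNI transcription shown earlier on `samples`
}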

I couldn't resist throwing a quick GUI together for easy use:

You'll find the project here, packaged for Windows and Linux: https://github.com/marl1/WhisperTextExtractor

Thanks to all the open source contributors, especially ggerganov for whisper.cpp, GiviMAD for WhisperJNI, and bramp for the FFmpeg wrapper.

(I'm looking for a job in Singapore! Please contact me if you have a lead, thank you!)
