Harnessing AI for Long-Form Audio: Building an Agentic Language Coach

Imagine having an AI that can generate entire audiobooks or comprehensive language lessons at the click of a button. That’s the vision behind my latest exploration, PlayingWithAudio, where I tackled the technical hurdles of creating seamless, long-form audio from text. This initiative is part of my broader goal to build an Agentic AI-assisted language coach, one that guides me through mastering German, which remains one of my biggest pending personal challenges.

The Motivation: Personalized Language Learning

Language learning is as much about consistency and engagement as it is about effective content delivery. In my journey to learn German, I envisioned an AI tutor capable of delivering not just interactive 1:1 conversations but also rich, non-interactive audio sessions. These sessions could take the form of long lectures or even complete audiobooks, offering a flexible learning tool that adapts to my needs.

However, several challenges stood in the way:

  • API and Model Limitations: Current TTS models and APIs support only about 4,096 characters per request, yielding roughly 4–6 minutes of audio per segment.
  • Rate Limits: With services like tts and tts-hd capping requests at three per minute, creating long audio sessions becomes a time-consuming endeavor.
  • Service Expiry and Updates: Although earlier expiry dates for services like TTS posed a risk, recent extensions (now valid until February 1, 2026) have provided the breathing room needed to innovate further.
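To put these limits in perspective, here is a quick back-of-the-envelope estimate. The figures are illustrative assumptions, not measurements from the repository:

    using System;

    class RateLimitEstimate
    {
        static void Main()
        {
            // Illustrative figures: a ~40,000-character lesson script,
            // the ~4,096-character request cap, and a 3-requests-per-minute limit.
            int textLength = 40_000;
            int maxCharsPerRequest = 4_096;
            int requestsPerMinute = 3;

            int chunks = (int)Math.Ceiling(textLength / (double)maxCharsPerRequest);
            double minutes = chunks / (double)requestsPerMinute;

            Console.WriteLine($"{chunks} chunks, roughly {minutes:F1} minutes of API wall time");
            // For this example: 10 chunks, roughly 3.3 minutes spent waiting on the rate limit alone.
        }
    }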


Overcoming the Challenges: The Technical Approach

To bypass these constraints and achieve long-form audio generation, I developed a workflow that leverages the strengths of modern AI models while sidestepping their limitations:

  • Text Chunking: The process begins by splitting a long text into manageable 4,096-character chunks. This ensures that each segment fits within the bounds of the TTS API limits, allowing us to process extensive texts piece by piece. Each cut is also made at the end of a sentence, avoiding any weird break mid-sentence.

// usage
List<string> segments = SplitTextIntoSegments(longText, MaxCharacters, BufferCharacters);

    // Punctuation marks treated as sentence endings when choosing a split point.
    private static readonly char[] SentenceEndings = { '.', '!', '?' };

    private static List<string> SplitTextIntoSegments(string text, int maxChars, int buffer)
    {
        List<string> segments = new();
        string remainingText = text.Trim();

        while (remainingText.Length > maxChars)
        {
            // Aim below the hard limit, then back off to the last sentence ending.
            int tentativeLength = maxChars - buffer;
            string segmentCandidate = remainingText.Substring(0, tentativeLength);
            int lastPunctuation = segmentCandidate.LastIndexOfAny(SentenceEndings);
            int splitIndex = (lastPunctuation > 0) ? lastPunctuation + 1 : tentativeLength;
            segments.Add(remainingText.Substring(0, splitIndex).Trim());
            remainingText = remainingText.Substring(splitIndex).Trim();
        }
        if (!string.IsNullOrEmpty(remainingText))
        {
            segments.Add(remainingText);
        }
        return segments;
    }

  • Audio Generation Using Preview Models: For each text chunk, I utilize advanced preview models—specifically gpt-4o-audio-preview and gpt-4o-mini-audio-preview—to generate high-quality audio. These models are at the cutting edge, offering capabilities that extend beyond traditional TTS systems.

    AzureOpenAIClient azureClient = new(
        new Uri(EnvironmentWellKnown.Gpt4oAudioEndpoint),
        new AzureKeyCredential(EnvironmentWellKnown.Gpt4oAudioApiKey),
        new AzureOpenAIClientOptions(AzureOpenAIClientOptions.ServiceVersion.V2025_01_01_Preview)
    );
    ChatClient chatClient = azureClient.GetChatClient(EnvironmentWellKnown.Gpt4oAudioDeploymentName);

    // Per segment: request both text and audio output from the preview model.
    ChatCompletionOptions options = new()
    {
        ResponseModalities = ChatResponseModalities.Text | ChatResponseModalities.Audio,
        AudioOptions = new(
            ChatOutputAudioVoice.Alloy,
            ChatOutputAudioFormat.Mp3)
    };

    ChatCompletion completion = await chatClient.CompleteChatAsync(messages, options);
    if (completion.OutputAudio is ChatOutputAudio outputAudio)
    {
        byte[] audioByteArray = outputAudio.AudioBytes.ToArray();
        audioByteSegments.Add(audioByteArray);
        Console.WriteLine($"Segment {index + 1} received: {audioByteArray.Length} bytes.");
    }
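If your deployment enforces a strict requests-per-minute cap, a simple pacing guard between segment requests keeps the loop under the limit. This is a hypothetical sketch assuming a 3-per-minute cap; the repository may handle throttling differently:

    // Assumed cap: 3 requests per minute, so ~21 seconds between calls stays safely under it.
    TimeSpan pause = TimeSpan.FromSeconds(21);

    for (int index = 0; index < segments.Count; index++)
    {
        if (index > 0)
        {
            await Task.Delay(pause); // space out the requests
        }
        // ... issue the CompleteChatAsync call for segments[index] here ...
    }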

  • Seamless Stitching: Once the individual audio segments are generated, they are stitched together to form a continuous audio stream. This approach not only overcomes the character limit but also maintains the natural flow of the narration, ensuring a smooth listening experience.

// usage
byte[] mergedAudio = MergeMp3ByteArrays(audioByteSegments);

    private static byte[] MergeMp3ByteArrays(List<byte[]> mp3ByteArrays)
    {
        // Uses NAudio's Mp3FileReader: each MP3 frame is self-contained,
        // so copying the raw frames back-to-back yields a valid merged file.
        using (var outputStream = new MemoryStream())
        {
            foreach (var mp3Bytes in mp3ByteArrays)
            {
                using (var ms = new MemoryStream(mp3Bytes))
                using (var reader = new Mp3FileReader(ms))
                {
                    Mp3Frame frame;
                    while ((frame = reader.ReadNextFrame()) != null)
                    {
                        outputStream.Write(frame.RawData, 0, frame.RawData.Length);
                    }
                }
            }
            return outputStream.ToArray();
        }
    }

I also tried this technique with Semantic Kernel, but only the tts and tts-hd models are supported there, and the biggest limitation remains the cap of three requests per minute.

The complete solution is open source and available in my PlayingWithAudio GitHub repository, offering a fully working example for anyone dreaming of creating AI-generated audiobooks or extensive language lessons.

If you like what you see, clone at will and give it a star if you think it deserves it ;).


The Bigger Picture: Future Applications

While this project is a stepping stone towards building my Agentic AI-assisted language coach, its implications are far-reaching:

  • Automated Audiobook Production: By overcoming the character and rate limitations, the technology paves the way for generating complete audiobooks with minimal manual intervention.
  • Personalized Learning Experiences: Imagine receiving tailored audio lessons that adapt to your progress and learning style. This system could revolutionize language education and beyond.
  • Expanded Use Cases: Beyond language coaching, long-form audio generation has applications in creating podcasts, storytelling, meditation guides, and more—empowering content creators across various fields.

(Don't expect a song from these models, though: they will "just" speak the languages well in the provided voice. Ah, but we can mix in different languages, which makes them suitable for my German coaching goal!)

The ideas behind this project also resonate with broader discussions in the developer community, highlighting a growing interest in more flexible and agentic AI solutions.


Conclusion

PlayingWithAudio is more than just a technical experiment—it's a glimpse into the future of AI-powered audio content. By addressing the limitations of traditional TTS systems, this project unlocks the potential for fully automated, long-form audio creation. Whether you’re a developer, an educator, or simply passionate about innovative tech, this approach opens up exciting new possibilities for personalized and engaging audio experiences.

Explore the project on GitHub and join the conversation: PlayingWithAudio on GitHub

Let’s push the boundaries of what AI can do—one audio segment at a time.


Thanks!!

I'd like to thank Roger Barreto from the Semantic Kernel team for his insights on using the latest Azure OpenAI SDK to leverage gpt-4o-audio-preview, along with a couple of tips. Thanks, Roger! If you are curious, this is the discussion that started it all: https://github.com/microsoft/semantic-kernel/discussions/10645


Next Steps

Roger also suggested that this could end up as a PR adding support for the latest preview models for voice generation. I would be happy to give back if I get a kickstart on how to approach it ;)


Curious about how Generative and Agentic AI are shaping the future, perhaps alongside Semantic Kernel and AutoGen?

Follow José Luis Latorre for real insights and practical examples of these technologies in action.


Jose Luis Latorre

IT & Dev Community Lead & Software Architect at Swiss Life AG | Generative AI & Agentic AI Engineer & Enthusiast | LinkedIn Learning Course Author | Helping people understand and apply AI | Microsoft AI MVP | Speaker
