Harnessing AI for Long-Form Audio: Building an Agentic Language Coach

Imagine having an AI that can generate entire audiobooks or comprehensive language lessons at the click of a button. That’s the vision behind my latest exploration, PlayingWithAudio, where I tackled the technical hurdles of creating seamless, long-form audio from text. This initiative is part of my broader goal to build an Agentic AI-assisted language coach, one that guides me through mastering German, which remains one of my biggest pending personal challenges.

The Motivation: Personalized Language Learning

Language learning is as much about consistency and engagement as it is about effective content delivery. In my journey to learn German, I envisioned an AI tutor capable of delivering not just interactive 1:1 conversations but also rich, non-interactive audio sessions. These sessions could take the form of long lectures or even complete audiobooks, offering a flexible learning tool that adapts to my needs.

However, several challenges stood in the way:

  • API and Model Limitations: Current TTS models and APIs support only about 4,096 characters per request, yielding roughly 4–6 minutes of audio per segment.
  • Rate Limits: With services like tts and tts-hd capping requests at three per minute, creating long audio sessions becomes a time-consuming endeavor.
  • Service Expiry and Updates: Although earlier expiry dates for services like TTS posed a risk, recent extensions (now valid until February 1, 2026) have provided the breathing room needed to innovate further.
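To put these limits in perspective, here is a quick back-of-the-envelope estimate. The figures are illustrative assumptions, not measurements from the repository:

    using System;

    class RateLimitEstimate
    {
        static void Main()
        {
            // Illustrative figures: a ~40,000-character lesson script,
            // the ~4,096-character request cap, and a 3-requests-per-minute limit.
            int textLength = 40_000;
            int maxCharsPerRequest = 4_096;
            int requestsPerMinute = 3;

            int chunks = (int)Math.Ceiling(textLength / (double)maxCharsPerRequest);
            double minutes = chunks / (double)requestsPerMinute;

            Console.WriteLine($"{chunks} chunks, roughly {minutes:F1} minutes of API wall time");
            // For this example: 10 chunks, roughly 3.3 minutes spent waiting on the rate limit alone.
        }
    }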


Overcoming the Challenges: The Technical Approach

To bypass these constraints and achieve long-form audio generation, I developed a workflow that leverages the strengths of modern AI models while sidestepping their limitations:

  • Text Chunking: The process begins by splitting a long text into manageable 4,096-character chunks. This ensures that each segment fits within the bounds of the TTS API limits, allowing us to process extensive texts piece by piece. Each cut is also made at the end of a sentence, avoiding any weird break mid-sentence.

// usage
List<string> segments = SplitTextIntoSegments(longText, MaxCharacters, BufferCharacters);

    // Punctuation marks treated as sentence endings when choosing a split point.
    private static readonly char[] SentenceEndings = { '.', '!', '?' };

    private static List<string> SplitTextIntoSegments(string text, int maxChars, int buffer)
    {
        List<string> segments = new();
        string remainingText = text.Trim();

        while (remainingText.Length > maxChars)
        {
            // Aim below the hard limit, then back off to the last sentence ending.
            int tentativeLength = maxChars - buffer;
            string segmentCandidate = remainingText.Substring(0, tentativeLength);
            int lastPunctuation = segmentCandidate.LastIndexOfAny(SentenceEndings);
            int splitIndex = (lastPunctuation > 0) ? lastPunctuation + 1 : tentativeLength;
            segments.Add(remainingText.Substring(0, splitIndex).Trim());
            remainingText = remainingText.Substring(splitIndex).Trim();
        }
        if (!string.IsNullOrEmpty(remainingText))
        {
            segments.Add(remainingText);
        }
        return segments;
    }

  • Audio Generation Using Preview Models: For each text chunk, I utilize advanced preview models—specifically gpt-4o-audio-preview and gpt-4o-mini-audio-preview—to generate high-quality audio. These models are at the cutting edge, offering capabilities that extend beyond traditional TTS systems.

    AzureOpenAIClient azureClient = new(
        new Uri(EnvironmentWellKnown.Gpt4oAudioEndpoint),
        new AzureKeyCredential(EnvironmentWellKnown.Gpt4oAudioApiKey),
        new AzureOpenAIClientOptions(AzureOpenAIClientOptions.ServiceVersion.V2025_01_01_Preview)
    );
    ChatClient chatClient = azureClient.GetChatClient(EnvironmentWellKnown.Gpt4oAudioDeploymentName);

    // Per segment: request both text and audio output from the preview model.
    ChatCompletionOptions options = new()
    {
        ResponseModalities = ChatResponseModalities.Text | ChatResponseModalities.Audio,
        AudioOptions = new(
            ChatOutputAudioVoice.Alloy,
            ChatOutputAudioFormat.Mp3)
    };

    ChatCompletion completion = await chatClient.CompleteChatAsync(messages, options);
    if (completion.OutputAudio is ChatOutputAudio outputAudio)
    {
        byte[] audioByteArray = outputAudio.AudioBytes.ToArray();
        audioByteSegments.Add(audioByteArray);
        Console.WriteLine($"Segment {index + 1} received: {audioByteArray.Length} bytes.");
    }
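If your deployment enforces a strict requests-per-minute cap, a simple pacing guard between segment requests keeps the loop under the limit. This is a hypothetical sketch assuming a 3-per-minute cap; the repository may handle throttling differently:

    // Assumed cap: 3 requests per minute, so ~21 seconds between calls stays safely under it.
    TimeSpan pause = TimeSpan.FromSeconds(21);

    for (int index = 0; index < segments.Count; index++)
    {
        if (index > 0)
        {
            await Task.Delay(pause); // space out the requests
        }
        // ... issue the CompleteChatAsync call for segments[index] here ...
    }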

  • Seamless Stitching: Once the individual audio segments are generated, they are stitched together to form a continuous audio stream. This approach not only overcomes the character limit but also maintains the natural flow of the narration, ensuring a smooth listening experience.

// usage
byte[] mergedAudio = MergeMp3ByteArrays(audioByteSegments);

    private static byte[] MergeMp3ByteArrays(List<byte[]> mp3ByteArrays)
    {
        // Uses NAudio's Mp3FileReader: each MP3 frame is self-contained,
        // so copying the raw frames back-to-back yields a valid merged file.
        using (var outputStream = new MemoryStream())
        {
            foreach (var mp3Bytes in mp3ByteArrays)
            {
                using (var ms = new MemoryStream(mp3Bytes))
                using (var reader = new Mp3FileReader(ms))
                {
                    Mp3Frame frame;
                    while ((frame = reader.ReadNextFrame()) != null)
                    {
                        outputStream.Write(frame.RawData, 0, frame.RawData.Length);
                    }
                }
            }
            return outputStream.ToArray();
        }
    }

I also tried this technique with Semantic Kernel, but only the tts and tts-hd models are supported there, and the biggest limitation remains the cap of three requests per minute.

The complete solution is open source and available in my PlayingWithAudio GitHub repository, offering a fully working example for anyone dreaming of creating AI-generated audiobooks or extensive language lessons.

If you like what you see, clone at will and give it a star if you think it deserves it ;).


The Bigger Picture: Future Applications

While this project is a stepping stone towards building my Agentic AI-assisted language coach, its implications are far-reaching:

  • Automated Audiobook Production: By overcoming the character and rate limitations, the technology paves the way for generating complete audiobooks with minimal manual intervention.
  • Personalized Learning Experiences: Imagine receiving tailored audio lessons that adapt to your progress and learning style. This system could revolutionize language education and beyond.
  • Expanded Use Cases: Beyond language coaching, long-form audio generation has applications in creating podcasts, storytelling, meditation guides, and more—empowering content creators across various fields.

(Don't expect a song from these models, though: they will "just" speak the languages well in the provided voice. Ah, but we can mix in different languages, which makes them suitable for my German coaching goal!)

The ideas behind this project also resonate with broader discussions in the developer community, highlighting a growing interest in more flexible and agentic AI solutions.


Conclusion

PlayingWithAudio is more than just a technical experiment—it's a glimpse into the future of AI-powered audio content. By addressing the limitations of traditional TTS systems, this project unlocks the potential for fully automated, long-form audio creation. Whether you’re a developer, an educator, or simply passionate about innovative tech, this approach opens up exciting new possibilities for personalized and engaging audio experiences.

Explore the project on GitHub and join the conversation: PlayingWithAudio on GitHub

Let’s push the boundaries of what AI can do—one audio segment at a time.


Thanks!!

I'd like to thank Roger Barreto from the Semantic Kernel team for his insights on using the latest Azure OpenAI SDK to leverage gpt-4o-audio-preview, along with a couple of tips. Thanks, Roger! If you are curious, this is the discussion that started it all: https://github.com/microsoft/semantic-kernel/discussions/10645


Next Steps

Roger also suggested that this could end up as a PR adding support for the latest preview models for voice generation. I would be happy to give back if I get a kickstart on how to approach it ;)


Curious about how Generative and Agentic AI are shaping the future, perhaps alongside Semantic Kernel and AutoGen?

Follow José Luis Latorre for real insights and practical examples of these technologies in action.


Jose Luis Latorre

IT & Dev Community Lead & Software Architect at Swiss Life AG | Generative AI & Agentic AI Engineer & Enthusiast | LinkedIn Learning Course Author | Helping people understand and apply AI | Microsoft AI MVP | Speaker
