Harnessing AI for Long-Form Audio: Building an Agentic Language Coach
Jose Luis Latorre
IT & Dev Community Lead & Software Architect at Swiss Life AG | Generative AI & Agentic AI Engineer & Enthusiast | LinkedIn Learning Course Author | Helping people understand and apply AI | Microsoft AI MVP | Speaker
Imagine having an AI that can generate entire audiobooks or comprehensive language lessons at the click of a button. That's the vision behind my latest exploration, PlayingWithAudio, where I tackled the technical hurdles of creating seamless, long-form audio from text. This initiative is part of my broader goal to build an Agentic AI-assisted language coach, one that guides me through mastering the German language, which is one of my biggest pending personal challenges...
The Motivation: Personalized Language Learning
Language learning is as much about consistency and engagement as it is about effective content delivery. In my journey to learn German, I envisioned an AI tutor capable of delivering not just interactive 1:1 conversations but also rich, non-interactive audio sessions. These sessions could take the form of long lectures or even complete audiobooks, offering a flexible learning tool that adapts to my needs.
However, several challenges stood in the way:
Overcoming the Challenges: The Technical Approach
To bypass these constraints and achieve long-form audio generation, I developed a workflow that leverages the strengths of modern AI models while sidestepping their limitations:
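// usage
List<string> segments = SplitTextIntoSegments(longText, MaxCharacters, BufferCharacters);

private static List<string> SplitTextIntoSegments(string text, int maxChars, int buffer)
{
    // Guard: a buffer of maxChars or more would leave no window to cut from.
    if (buffer < 0 || buffer >= maxChars)
        throw new ArgumentOutOfRangeException(nameof(buffer));

    List<string> segments = new();
    string remainingText = text.Trim();
    while (remainingText.Length > maxChars)
    {
        // Search for a sentence ending inside a window slightly shorter than the limit.
        int tentativeLength = maxChars - buffer;
        string segmentCandidate = remainingText.Substring(0, tentativeLength);
        int lastPunctuation = segmentCandidate.LastIndexOfAny(SentenceEndings);
        // Cut just after the last sentence ending, or hard-cut at the window size if none is found.
        int splitIndex = (lastPunctuation > 0) ? lastPunctuation + 1 : tentativeLength;
        string segment = remainingText.Substring(0, splitIndex).Trim();
        segments.Add(segment);
        remainingText = remainingText.Substring(splitIndex).Trim();
    }
    // Whatever is left fits within the limit and becomes the final segment.
    if (!string.IsNullOrEmpty(remainingText))
    {
        segments.Add(remainingText);
    }
    return segments;
}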
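The split-point rule inside the loop above can be looked at in isolation. The following is only a minimal sketch: `SentenceEndings` is not defined in the excerpt, so the three characters used here are an assumption, and `FindSplitIndex` and `SplitPointDemo` are hypothetical names introduced for illustration.

```csharp
using System;

public static class SplitPointDemo
{
    // Assumed definition: the article does not show how SentenceEndings is declared.
    private static readonly char[] SentenceEndings = { '.', '!', '?' };

    // Returns the index at which to cut: just after the last sentence ending
    // found inside the window, or the window size when no ending is found.
    public static int FindSplitIndex(string text, int window)
    {
        string candidate = text.Substring(0, Math.Min(window, text.Length));
        int last = candidate.LastIndexOfAny(SentenceEndings);
        // Mirrors the article's check: an ending at index 0 is ignored.
        return last > 0 ? last + 1 : candidate.Length;
    }

    public static void Main()
    {
        string text = "Guten Morgen. Wie geht es dir? Mir geht es gut!";
        int cut = FindSplitIndex(text, 25);
        Console.WriteLine(text.Substring(0, cut));      // prints "Guten Morgen."
        Console.WriteLine(text.Substring(cut).Trim());  // prints "Wie geht es dir? Mir geht es gut!"
    }
}
```

Cutting at the last sentence ending keeps each segment a complete sentence, which should help each generated clip start and stop at a natural pause.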
// Create the Azure OpenAI client against the preview API version.
AzureOpenAIClient azureClient = new(
    new Uri(EnvironmentWellKnown.Gpt4oAudioEndpoint),
    new AzureKeyCredential(EnvironmentWellKnown.Gpt4oAudioApiKey),
    new AzureOpenAIClientOptions(AzureOpenAIClientOptions.ServiceVersion.V2025_01_01_Preview)
);
ChatClient chatClient = azureClient.GetChatClient(EnvironmentWellKnown.Gpt4oAudioDeploymentName);

// Ask for both a text and an audio response, with the audio encoded as MP3.
ChatCompletionOptions options = new()
{
    ResponseModalities = ChatResponseModalities.Text | ChatResponseModalities.Audio,
    AudioOptions = new(
        ChatOutputAudioVoice.Alloy,
        ChatOutputAudioFormat.Mp3)
};

// `messages`, `index` and `audioByteSegments` come from the per-segment loop, which is not shown in this excerpt.
ChatCompletion completion = await chatClient.CompleteChatAsync(messages, options);
if (completion.OutputAudio is ChatOutputAudio outputAudio)
{
    var audioByteArray = outputAudio.AudioBytes.ToArray();
    audioByteSegments.Add(audioByteArray);
    Console.WriteLine($"Segment {index + 1} received: {audioByteArray.Length} bytes.");
}
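The snippet above handles one segment at a time; the `index` and `audioByteSegments` variables suggest it runs inside a loop over the segments. A minimal sketch of such a loop follows, under stated assumptions: `LongFormAudioPipeline`, `SynthesizeSegmentsAsync` and the `synthesize` delegate are hypothetical names, and the delegate stands in for the chat completion call so the example runs without credentials.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;

public static class LongFormAudioPipeline
{
    // Hypothetical helper: calls `synthesize` once per segment, in order,
    // and collects the returned audio bytes for a later merge step.
    public static async Task<List<byte[]>> SynthesizeSegmentsAsync(
        IReadOnlyList<string> segments,
        Func<string, Task<byte[]>> synthesize)
    {
        var audioByteSegments = new List<byte[]>();
        for (int index = 0; index < segments.Count; index++)
        {
            // Sequential on purpose: the clips stay in reading order.
            byte[] audio = await synthesize(segments[index]);
            audioByteSegments.Add(audio);
            Console.WriteLine($"Segment {index + 1} received: {audio.Length} bytes.");
        }
        return audioByteSegments;
    }

    public static async Task Main()
    {
        // Stand-in synthesizer (an assumption for this sketch): it returns the
        // UTF-8 bytes of the text rather than real MP3 data.
        Func<string, Task<byte[]>> fakeSynthesize =
            text => Task.FromResult(Encoding.UTF8.GetBytes(text));

        var segments = new List<string> { "Guten Morgen.", "Wie geht es dir?" };
        List<byte[]> audio = await SynthesizeSegmentsAsync(segments, fakeSynthesize);
        Console.WriteLine($"{audio.Count} segments collected.");
    }
}
```

In a real run the delegate would wrap the chat completion call and return the audio bytes of the response. Processing the segments one after another is slower than running them in parallel, but it keeps the ordering trivial.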
// usage
byte[] mergedAudio = MergeMp3ByteArrays(audioByteSegments);

// Requires the NAudio package (using NAudio.Wave;) for Mp3FileReader and Mp3Frame.
private static byte[] MergeMp3ByteArrays(List<byte[]> mp3ByteArrays)
{
    using (var outputStream = new MemoryStream())
    {
        foreach (var mp3Bytes in mp3ByteArrays)
        {
            using (var ms = new MemoryStream(mp3Bytes))
            using (var reader = new Mp3FileReader(ms))
            {
                // Copy the raw MP3 frames of each segment, one after another, into a single stream.
                Mp3Frame frame;
                while ((frame = reader.ReadNextFrame()) != null)
                {
                    outputStream.Write(frame.RawData, 0, frame.RawData.Length);
                }
            }
        }
        return outputStream.ToArray();
    }
}
I also tried this technique with Semantic Kernel, but it supports only the tts and tts-hd models, and the biggest limitation there is the cap of 3 per minute.
The complete solution is open source and available in my PlayingWithAudio GitHub repository, offering a fully working example for anyone dreaming of creating AI-generated audiobooks or extensive language lessons.
If you like what you see, clone at will and give it a star if you think it deserves it ;).
The Bigger Picture: Future Applications
While this project is a stepping stone towards building my Agentic AI-assisted language coach, its implications are far-reaching:
(But don't expect a song from these models: they will "just" speak the languages well in the provided voice. Ah, but we can mix in different languages, which makes them suitable for my German coaching goal!)
The ideas behind this project also resonate with broader discussions in the developer community, highlighting a growing interest in more flexible and agentic AI solutions.
Conclusion
PlayingWithAudio is more than just a technical experiment—it's a glimpse into the future of AI-powered audio content. By addressing the limitations of traditional TTS systems, this project unlocks the potential for fully automated, long-form audio creation. Whether you’re a developer, an educator, or simply passionate about innovative tech, this approach opens up exciting new possibilities for personalized and engaging audio experiences.
Explore the project on GitHub and join the conversation: PlayingWithAudio on GitHub
Let’s push the boundaries of what AI can do—one audio segment at a time.
Thanks!!
I'd like to thank Roger Barreto from the Semantic Kernel team for his insights on using the latest Azure OpenAI SDK to leverage the gpt-4o-audio-preview model, along with a couple of tips! Thx Roger! If you are curious, this is the discussion that started it all: https://github.com/microsoft/semantic-kernel/discussions/10645
Next Steps
Roger also suggested that this could end up as a PR adding support for the latest preview models for voice generation. I would be happy to give back if I get a kickstart on how to approach this ;)
Curious about how Generative and Agentic AI are shaping the future, maybe along with Semantic Kernel and AutoGen?
Follow José Luis Latorre for real insights and practical examples of these technologies in action.