A simulated interview with Andrew Ng about Voice GenAI

I thought it would be a refreshing idea to simulate an interview with Andrew Ng using ChatGPT-4 and his newsletter "The Batch - What Matters in AI Right Now," published February 27, 2025.

Here is what I did:

I copied the full newsletter into ChatGPT-4 and wrote the following prompt:

"Hello, today I want you to simulate an interview between me and Andrew Ng based on the following newsletter he published very recently. I am the interviewer and will ask you the questions, and you answer based on the newsletter context. Here is the newsletter: [Newsletter text]"

ChatGPT: Sounds great! Go ahead and ask your first question, and I'll respond as Andrew Ng based on the newsletter's insights.


Here is our conversation about voice-in voice-out generative models.

Andy:

Good morning Andrew, how are you doing?

"Andrew" said:

Good morning! I’m doing great, thanks for having me. Excited to talk about voice AI and share some insights!

Andy:

Generative voice AI is the big promise for natural user interfaces, which are currently in the making. Surprisingly, I have not yet experienced any improvement when calling the call centers of my large utility and telco providers; they still use very slow IVR systems. Why is that?

"Andrew" said:

That’s a great question! While generative voice AI is advancing rapidly, large utilities and telecom providers tend to adopt new technologies more slowly due to a few key reasons:

  1. Legacy Systems & Infrastructure – Many call centers still rely on older IVR (Interactive Voice Response) systems, which are deeply integrated into their existing workflows. Upgrading to AI-driven voice interfaces requires significant investment and restructuring.
  2. Accuracy & Control Challenges – Unlike text-based AI, controlling the output of voice-in, voice-out models is still difficult. Businesses need to ensure responses align with policies, comply with regulations, and provide accurate information - something IVR, though slow, still guarantees.
  3. Latency Sensitivity – Real-time voice AI must respond in under a second for a smooth conversation. Today’s AI pipelines (speech-to-text → LLM reasoning → text-to-speech) still introduce delays, which can make interactions feel unnatural. Techniques like "pre-responses" help, but many enterprises aren’t yet comfortable deploying them at scale.
  4. Risk Aversion – Large companies prioritize reliability. A voice AI making a mistake - like offering a refund it can’t process - could lead to compliance issues and customer dissatisfaction. Until voice AI is as controllable as traditional systems, adoption will be cautious.

That said, I believe we’ll see more advanced AI-driven voice assistants in these industries soon. The technology is maturing, and as companies see the benefits of improved customer experience, they’ll start making the transition.
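As an aside: the pipeline and pre-response ideas described above can be sketched in a few lines of Python. The stage delays below are illustrative placeholders, not real model latencies, and the function names are hypothetical stand-ins for actual speech-to-text, LLM, and text-to-speech calls.

```python
import asyncio
import time

# Hypothetical stage timings in seconds, for illustration only;
# real STT/LLM/TTS latencies vary by model and infrastructure.
STT_DELAY, LLM_DELAY, TTS_DELAY = 0.2, 1.0, 0.3

async def transcribe(audio: str) -> str:
    await asyncio.sleep(STT_DELAY)   # stand-in for a speech-to-text call
    return f"text({audio})"

async def reason(text: str) -> str:
    await asyncio.sleep(LLM_DELAY)   # stand-in for LLM reasoning
    return f"answer to {text}"

async def speak(text: str) -> str:
    await asyncio.sleep(TTS_DELAY)   # stand-in for text-to-speech
    return f"audio({text})"

async def handle_turn(audio: str) -> list[str]:
    """Run the pipeline, emitting a pre-response while the LLM thinks."""
    spoken = []
    text = await transcribe(audio)
    # Start the slow reasoning step and the filler phrase concurrently.
    answer_task = asyncio.create_task(reason(text))
    spoken.append(await speak("Let me check that for you."))  # pre-response
    spoken.append(await speak(await answer_task))             # real answer
    return spoken

start = time.perf_counter()
result = asyncio.run(handle_turn("user question"))
elapsed = time.perf_counter() - start
print(result, round(elapsed, 1))
```

Because the filler phrase plays while the LLM is still reasoning, the user hears something after roughly 0.5 s instead of waiting the full pipeline length in silence.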

Andy:

Sounds promising, but it brings us back to a hot topic: responsible and secure AI. How can that be solved while still responding quickly to users?

"Andrew" said:

Great point! Responsible and secure AI in voice interactions is a major challenge, especially when balancing speed, accuracy, and control. Here’s how we can address it:

  1. Hybrid Workflows for Control – Instead of relying solely on end-to-end voice models, companies can use structured pipelines (speech-to-text → LLM reasoning → text-to-speech), where each stage can be checked and constrained independently.
  2. Pre-Responses for Speed – To minimize latency, AI can generate quick pre-responses like “Let me check that for you”, while the system runs deeper reasoning checks. This keeps the conversation fluid without compromising accuracy.
  3. Real-time Monitoring & Intervention – AI models should operate with audit trails and human-in-the-loop oversight for sensitive cases. Some companies already flag uncertain responses for human review before execution.
  4. Security & Compliance Layers – AI models must respect data privacy regulations (GDPR, CCPA) and implement identity verification to prevent fraud, especially in financial or healthcare applications.
  5. Continuous Learning & Updates – Responsible AI means ongoing monitoring and fine-tuning based on user feedback to reduce biases and improve trustworthiness over time.

While we’re not fully there yet, these strategies make AI-driven voice systems both secure and practical. The key is blending automation with structured oversight, ensuring AI is helpful but also accountable.

Andy:

So you are suggesting a sort of short warm-up conversation to keep users engaged while the AI works behind the scenes to provide the most accurate answer? How much time is needed before those answers arrive? Consider multiple questions coming in from the user in sequence.

"Andrew" said:

Exactly! A short "warm-up" response keeps users engaged while AI processes their request in the background. The goal is to mask latency and create a smoother experience, much like how humans naturally pause or acknowledge before answering.

How much time is needed?

For a single response, we aim for 0.3–1 second (human-like conversational speed). The breakdown:

  • Pre-response (e.g., "Let me check that for you") → ~0.1–0.3 sec
  • AI processing (LLM reasoning, validation, API calls, etc.) → ~0.5–1.5 sec
  • Final response via TTS → ~0.2–0.5 sec

Advanced optimizations (like streaming responses or parallel processing) can reduce total time to ~0.5–1 second, making it feel seamless.
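To make the latency math concrete, here is a tiny calculation, using only the ranges quoted above, of how soon the user first hears something with and without the pre-response trick.

```python
# The stage ranges quoted above, in seconds: (min, max).
stages = {
    "pre_response": (0.1, 0.3),
    "processing":   (0.5, 1.5),
    "final_tts":    (0.2, 0.5),
}

# Time until the user first hears *anything*.
# Without a pre-response, nothing plays until processing and TTS finish.
first_audio_without = tuple(
    round(stages["processing"][i] + stages["final_tts"][i], 2) for i in range(2)
)
# With a pre-response, the filler phrase starts almost immediately.
first_audio_with = stages["pre_response"]

print("first audio without pre-response:", first_audio_without)
print("first audio with pre-response:   ", first_audio_with)
```

The total time to the final answer barely changes, but the silence the user experiences shrinks from up to two seconds to a few hundred milliseconds, which is what makes the conversation feel fluid.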

Handling Multiple Sequential Questions

When users ask follow-ups in quick succession, the AI can:

  1. Process in parallel – Continue reasoning about Question 1 while pre-responding to Question 2.
  2. Use context memory – Maintain conversation history so each response is contextual and flows naturally.
  3. Prioritize faster queries – Simple requests (e.g., "What's my balance?") get answered instantly, while complex ones (e.g., "Can I change my plan?") use the pre-response technique.

With ongoing improvements in LLM efficiency and multimodal AI, I expect real-time voice conversations with AI to feel much more natural within the next 1–2 years.
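The three points above can be sketched with `asyncio`. Here `answer` is a hypothetical stand-in for LLM reasoning, and the sleep times merely simulate simple versus complex queries; the history snapshot illustrates context memory.

```python
import asyncio

async def answer(question: str, history: list[str]) -> str:
    """Stand-in for LLM reasoning; a real system would call a model here."""
    # Pretend simple queries are fast and complex ones slow.
    await asyncio.sleep(0.1 if "balance" in question else 0.3)
    return f"reply({question}; context={len(history)} turns)"

async def run_dialog(questions: list[str]) -> list[str]:
    history: list[str] = []
    tasks = []
    for q in questions:
        # Start reasoning about each question as soon as it arrives,
        # snapshotting the conversation history it should see.
        tasks.append(asyncio.create_task(answer(q, list(history))))
        history.append(q)
    # Deliver replies in conversation order, even though they ran in parallel.
    return [await t for t in tasks]

replies = asyncio.run(run_dialog(["What's my balance?", "Can I change my plan?"]))
for r in replies:
    print(r)
```

Launching the tasks eagerly means the total wait approaches the slowest single question rather than the sum of all of them, while the ordered await keeps the dialog coherent.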

Andy:

Great to hear about the advancements GenAI is making in the voice-in, voice-out field, Andrew. I hope we can have another interview like this soon, where you can share the first real-life examples with us.

"Andrew" said:

Absolutely! The progress in voice-in, voice-out AI is exciting, and I look forward to seeing more real-world applications emerge. As the technology matures, we’ll see more real-time, intelligent, and natural voice interactions across industries.

Let’s definitely reconnect soon; I’d love to share new breakthroughs and real-life deployments when they happen. Thanks for the great conversation! Keep building!


I hope you liked this simulated-interview conversation with ChatGPT.

I truly enjoyed it and hope to be able to have a real conversation with Andrew Ng one day soon!


P.S. This edition of "The Batch" includes an incredible article titled:

Reading Minds, No Brain Implant Required

To date, efforts to decode what people are thinking from their brain waves have often relied on electrodes implanted in the cortex. New work used sensors outside the head to pick up brain signals that enabled an AI system to accurately guess what a subject was typing as they typed.
