Back in November, we had a Tavus retreat in a remote location in Georgia. We have an incredible team, and I felt inspired as we headed home after a week of collaboration and talking about our vision. There was one issue that so many people were having with early versions of CVI. “It interrupts me!” They wished it knew when to speak and when not to.
I ran the idea past Ari Korin and Hassaan Raza: what if we could predict when the user was done speaking? As I recall it, they both kinda shrugged and said: build it. That’s one of the things I like about our culture: build it. My favorite two words.
On the plane ride back I started building the training pipeline for a new turn detection model. It started with some research, chatting with AI about how to build AI. What’s the state of the art? How can I do X? What can I expect from Y? What’s the fastest way to produce Z? By the time I landed I had trained the first version of what would become Sparrow. It was an LSTM trained on one thousand utterances.
It was a proof of concept that showed there was something learnable. At the time we had to “shine the flashlight” on the unknown. So the task changed: how could we deploy something that starts improving conversations right away? This led to the very first, beta version of Sparrow, and this is when it earned its name.
That first version had a 300 ms latency and was built on top of a prompted LLM and function calling. For CVI, that might have been problematically slow. However, at the time we had to slow down responses anyway so users could finish their thought. So, we shipped it to our demo.
The result was appreciable. Carter (soon to be replaced by Charlie), our demo persona, would actually wait for you to finish your thought. Then, he would respond almost immediately when you were done. It was fantastic. By December we had this in production.
For the past two weeks, we have had Sparrow-0 in private beta. Yesterday it went GA. Sparrow-0 represents a step change from that first version (Sparrow -1?).
Sparrow-0 has a 10 ms latency and produces smoother, more natural turn transitions. How? Well, it uses a combination of a customized Transformer architecture, our hyper fast voice recognition system, and human response-time modeling with a sort of naturalness function. Sparrow-0 models the Gaussian distribution of human response times in conversation, and chooses an appropriate response time based on semantic, lexical, and context information.
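To make the response-time idea a bit more concrete, here is a toy sketch in Python. It is not Sparrow-0's actual code. Everything in it is an assumption for illustration: `p_done` stands in for whatever end-of-turn score the model produces, the `QUICK` and `PATIENT` parameters are made-up numbers, and the linear blend between them is just one simple way to pick a Gaussian from context.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponseTimeModel:
    """A Gaussian over response delays, in milliseconds."""
    mean_ms: float
    std_ms: float


# Hypothetical parameters, chosen only for illustration.
QUICK = ResponseTimeModel(mean_ms=200.0, std_ms=50.0)      # user is clearly done
PATIENT = ResponseTimeModel(mean_ms=900.0, std_ms=200.0)   # user may still be thinking


def choose_model(p_done: float) -> ResponseTimeModel:
    """Blend PATIENT and QUICK according to p_done.

    p_done is an assumed end-of-turn probability in [0, 1]; in a real system
    it would come from a model looking at semantic, lexical, and context cues.
    """
    if not 0.0 <= p_done <= 1.0:
        raise ValueError("p_done must be between 0 and 1")
    mean = PATIENT.mean_ms + p_done * (QUICK.mean_ms - PATIENT.mean_ms)
    std = PATIENT.std_ms + p_done * (QUICK.std_ms - PATIENT.std_ms)
    return ResponseTimeModel(mean_ms=mean, std_ms=std)


def sample_delay_ms(p_done: float, rng: random.Random) -> float:
    """Draw one response delay from the chosen Gaussian, clamped at zero."""
    model = choose_model(p_done)
    return max(0.0, rng.gauss(model.mean_ms, model.std_ms))


if __name__ == "__main__":
    rng = random.Random(0)
    for p in (0.1, 0.5, 0.95):
        print(f"p_done={p:.2f} -> wait {sample_delay_ms(p, rng):.0f} ms")
```

The point of sampling instead of using a fixed timeout is that the wait varies the way a person's would: short when the thought is obviously finished, longer when it might not be.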
The result is fantastic in action!
I work among some of the smartest and most talented engineers, designers, marketers, salespeople, support staff, ops folks, researchers, managers, and leaders. The support of the Tavus team and our investors for our vision, along with the inspiration, guidance, access, and resources at every step: those are the raw materials for success.
From retreats in remote locations to the best equipment, the best team, and the best vision: it’s been a really inspiring moment in time for me, and what a time to be alive! :)