Field notes: Exploring AI transcription services for inclusion
Chris Bush
Head of Design at Nexer. With 20 years’ experience, I help organisations from start-ups to multinationals build UX capabilities that deliver value and create accessible, inclusive, user-centred products and services.
Can AI transcription improve people’s access to content?
A few weeks ago, Molly introduced Hilary and me to Hector Minto, a senior accessibility technology evangelist at Microsoft.
As a company, we often help organisations navigate the challenges of digital transformation, typically supporting them in improving internal business systems and staff engagement. We do this under the banner of Tomorrow Worklife, which focuses on user behaviour, motivation, usability and change communication.
Tomorrow Worklife incorporates Microsoft technologies, and making our tools and systems as accessible and inclusive as possible is very important to us. So we wanted to pick Hector’s brains, learn a bit more about the different types of inclusive technology Microsoft offers, and investigate whether there was an opportunity to build them more closely into our own tools and service offerings.
Our discussion with Hector gave us some great insight, especially when we found out that there are a lot of great services already in the public domain, albeit mostly in beta form. One of the services we discussed was Microsoft Translator, a platform that records the audio from a presenter’s device and transcribes it in real time into a multitude of languages. Multiple users can log in to the site on their personal devices and choose which language they receive the transcription in – amazing from a meeting/event engagement point of view.
The issue of transcription accuracy came up quite quickly. Whilst services like this are generally good at transcribing conversational content, their accuracy for specialist content can be quite limited. Translator addresses this with AI: for the first 30–60 seconds of any given recording, the user is served a verbatim translation while the AI sits quietly in the background, calibrating. After that first minute, the AI kicks in and starts transcribing the content using a more contextual model.
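Translator itself is an end-user app, so there’s nothing to wire up, but for anyone curious about experimenting with the same kind of capability programmatically, Microsoft’s Azure Speech SDK offers real-time speech translation. The sketch below is purely illustrative and makes some assumptions: a Python environment with the azure-cognitiveservices-speech package installed, placeholder values for the subscription key and region, and French and Danish picked arbitrarily as example target languages.

```python
# Illustrative sketch only: live speech translation with the Azure Speech SDK.
# Assumes `pip install azure-cognitiveservices-speech` and placeholder credentials.
import azure.cognitiveservices.speech as speechsdk

# Placeholder key/region - substitute the details of your own Azure Speech resource.
translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription="YOUR_SPEECH_KEY",
    region="YOUR_REGION",
)
translation_config.speech_recognition_language = "en-GB"

# Each attendee-facing language is added as a translation target (examples only).
for language in ("fr", "da"):
    translation_config.add_target_language(language)

# Capture audio from the presenter's default microphone.
audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config,
    audio_config=audio_config,
)

def show_translations(evt):
    # evt.result.translations is a dict keyed by target language code.
    for lang, text in evt.result.translations.items():
        print(f"[{lang}] {text}")

# Fires on each finalised phrase; interim results arrive via the `recognizing` event.
recognizer.recognized.connect(show_translations)
recognizer.start_continuous_recognition()

input("Transcribing - press Enter to stop.\n")
recognizer.stop_continuous_recognition()
```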
Also appearing on our radar was Otter.ai, another AI tool, this time focused purely on transcription. Whilst at UX Copenhagen, I met Quinn Keast, a UX designer who wrote a brilliant article on consent & ethics in tracking user behaviour last year. Quinn is hard of hearing, and during a workshop we both attended, he used Otter to automatically transcribe the group’s conversations. The app did a reasonable job of filtering out background chatter and, as the workshop progressed, the AI was quick to revise previously transcribed text as it learnt more about the evolving conversation around our table.
When it comes to using the technology, it’s still early days, but some of our first thoughts have been around things like:
Automatic transcription, for example:
- Conference calls, to help increase access in real time and, by extension, to assist with note-taking.
- Media content – reducing transcription and editing time.
Using this technology as a standard part of our research and analysis framework:
- We’ve already started running some of our usability sessions through a video management service called Stream. Stream can provide captions and transcripts out of the box, and my first impressions are quite hopeful. As you may expect, transcript quality is quite variable and depends on sound quality, but the captions it produces are easy to edit and tidy up, and as any researcher will tell you, a decent transcript of a session is worth its weight in gold.
- Otter.ai has already won the hearts of a few members of our team (including me) and we’re trying it out in meetings to see how well it works in the wild. If all goes well, we’re going to look at using it in our interview sessions too.
At a personal level, I’m also interested in how these technologies could help us deliver more inclusive events. It was fantastic to read about a TEDx event in Seattle where the organisers used Microsoft services to provide live transcriptions throughout the day (https://mismatch.design/stories/2018/12/14/how-automated-captioning-saved-tedxseattle-for-this-fan/), and I’m curious to explore whether the technology is ready to offer attendees of this year’s Camp Digital a similar experience. To be honest, from what we’ve learnt so far, it sounds like these AI systems currently need quite a lot of behind-the-scenes configuration to transcribe event presentations well. But given that more and more events are trying them out (check out Otter’s transcriptions of TechCrunch’s Disrupt event) and the quality of the transcriptions is improving rapidly, it feels like we might be in a place to run a proof of concept or tech demo at Camp Digital and see what feedback we get.