Generative AI is in the limelight right now, and with the attention comes plenty of noise; despite all the hype, the field can still be tricky to navigate. With new AI dubbing providers appearing all the time, it’s more important than ever to ask potential providers the right questions to secure the best service and, more importantly, the best quality.

This is a guide to the all-important questions that will help you identify the best-quality AI dubbing providers and ensure that premium content gets the flawless dubbing it deserves.

Here we break down one sub-segment of AI video localization: AI dubbing, the process of localizing a video for different languages by creating new voice tracks through AI-powered solutions. 

AI dubbing companies vary in the type of service they offer to end users. When choosing a service, ask: 

"Are the voices you offer curated (or proprietary), or are they “off-the-shelf” (third party)?"

To answer this question, it’s essential to understand the difference between 'curated' (known in the industry as proprietary) voices and 'off-the-shelf' (third-party) voices. Off-the-shelf voices are built for a wide set of use cases – for instance, as the voice of a home assistant or for reading the weather on an app – where the expressiveness of the voice is not the main aim. In short, off-the-shelf voices are not usually designed for dubbing.

Curated (proprietary) voices, conversely, are designed to suit a particular format. Papercup’s voices, for instance, are specifically designed with dubbing for video in mind.

For most production-ready content, proprietary voices should be used over off-the-shelf solutions; they will be far closer to the voices in the original content and therefore deliver an experience that meets customer expectations. Off-the-shelf voices can be used in limited cases – say, for instance, training content where the expressiveness of the voice is not the main aim.

Ask for a sample to check output quality before moving ahead with a provider. A good AI dubbing provider will not only suggest samples as a first step but will also encourage you to gather feedback from native speakers in your organization who can accurately assess the quality of the sample. To ascertain whether a company uses experts to check quality, ask:

"Is a human quality check included, to ensure output is a consistently high standard?"

Trust is currency in the media. Yet most off-the-shelf AI dubbing solutions don’t have any human quality assurance, meaning brand terms can be mispronounced, mistranslations can go unnoticed, and the wrong tone of voice can be applied.

In the absence of human quality control, companies risk falling short of the brand standards their audiences expect, diluting or, worse, undermining the trustworthiness of their brand.

With a premium AI dubbing provider, such errors should be few and far between. Clients we work with, like Fremantle and Insider, opt for a provider with human quality assurance so that they can be confident that any AI dubbing meets the high standards associated with their brands.

Companies should ask providers how they assess dubbing output. At Papercup, we use professional translators to not only assure translation quality but also optimize localized content by modifying the tone and pronunciation if needed. 

Papercup's quality assurance process uses real translators to adjust the translation, pronunciation, and delivery of the new language.

"Is the Machine Learning (ML) team large enough to support the scope of languages offered?"

Be wary of providers with large language offerings, such as catalogs of over 30 or 40 languages. Achieving the levels of expressivity demanded by large media brands is difficult at such scale. Often, these services pull in off-the-shelf voices and do minimal work on top.

Before engaging a company offering an extensive library, ask about the size of the machine learning team – and dig into the specifics of how they improve expressivity across so many languages. Improving the expressivity across a large language offering requires more than one or two machine learning engineers. 

What to look out for when evaluating AI voices

When it comes to AI voice technology, there are many different options. But while some can solve media companies’ problems, others could complicate things further. 

Ask potential AI dubbing providers which of the following technologies they use to create their voices. Each has different use cases and some come with pitfalls to be aware of:

  1. Voice cloning

Voice cloning is the process of mimicking a specific individual’s vocal characteristics and speech patterns using artificial intelligence. Successful clones aim to be virtually indistinguishable from an individual’s voice. 

With the increasing volume of viral hoaxes, disinformation, and fake news, media companies need to be careful with this technology. If voice cloning is used, its use should be clearly disclosed to avoid mistrust and confusion.

They should also consider the ethical and legal implications. Has the person consented to hand over the rights to their voice IP? If so, is it clear how far that consent extends?

  2. Voice conversion

Voice conversion uses speech processing techniques to mimic the intonation and vocal nuances of a target individual. While voice cloning aims to copy the identity of the original speaker, voice conversion copies the elements that make up speech, i.e. accent, intonation, and rhythm, to create new voices.

Voice cloning and voice conversion can be used together. For example, cloned audio files could be used alongside a voice conversion control to dub a whole movie with just one voice actor. 

  3. Off-the-shelf AI voices

The cheapest and quickest form of dubbing – but also the lowest quality – uses widely-available voices generated by large tech companies like Google and Microsoft. 

These AI voices aren’t optimized for video and aren’t well suited to long-form content. They’re versatile and cover a lot of languages, so they’re great for mobile phones, assistants, and accessibility. However, they lack nuance, expressivity, and naturalness – the most important qualities for a professional-sounding dub.

  4. Bespoke voices

Machine Learning models can take an amalgamation of data points and create bespoke synthetic voices. This option combines the quality of voice cloning with the ease and speed of off-the-shelf voices for accurate dubbing that scales. Because the voice is new, it doesn't mimic any particular individual and avoids the ethical and legal issues of voice cloning.

At Papercup, we use our own data to create bespoke voices when dubbing audio into a new language. We record our voice artists in our in-house studio or at partner studios globally, and we also use the vast back catalogs of the large media companies we partner with as training data for their projects. We have a bank of over one hundred bespoke voices across the world’s most spoken languages, enabling us to create high-quality, customized dubbed audio for some of the world’s most high-profile media companies.

Finding the right AI dubbing partner

As media companies begin to explore a brave new AI landscape, a lot rests on their choice of technology partner. Organizations will need to balance cost and scalability against results that audiences can trust to improve engagement, retention, and revenue.

At Papercup, we’re a principles-first AI company. We help businesses leverage AI in a way that enhances their current operations and produces the best results ethically and safely. That’s why our technology mixes industry-leading AI advances (and the ML resources to support them) with detailed human-in-the-loop quality assurance. It’s also why brands like Sky News, Bloomberg, and Insider trust us with their content libraries. 

To learn more about our AI dubbing process for media companies, speak with a consultant.