https://www.papercup.com/blog/large-language-models Article by Papercup's Head of Product, Kilian Butler
For hundreds of years, people have been striving to replicate the capabilities of human speech, employing technologies as varied as resonance tubes and machine learning. From HAL 9000 to C-3PO, KITT from ‘Knight Rider’ and Samantha from ‘Her’, fiction is rich with examples of computers that have mastered human speech.
AI and spatial computing are transitioning from science fiction into reality, marking the onset of a new generation of technology. Speech will be a core medium of interaction with AIs as they improve. To many it may feel like speech is mature compared to technologies like LLMs, but this underestimates how much information we convey in a sentence. Speech is a nuanced process of meaning-making that requires true understanding of human communication. In this blog we dive into this challenge: prosody generation!
The last few decades have seen significant improvements in electronic speech synthesis, progress that is audible when comparing the robotic tones of Stephen Hawking’s 1980s speech system, Equalizer, to the realistic speech produced by AI dubbing available on YouTube and streaming platforms today. Yet despite these recent advances, computers still lag well behind humans in speech capabilities.
Machine learning techniques have advanced significantly, but we must recognise that the creation of human speech is an extraordinarily complex process involving an intricate partnership between mind and body.
Speech enables humans to communicate using sound. It conveys information in two central ways: the words spoken (linguistic information) and how the words are said (prosodic information). How something is said can be as important as what is said. Prosody (how things are said) includes the patterns of stress, rhythm and intonation that determine how an utterance is delivered. It communicates information beyond the words themselves, which allows us to infer meaning and intentions. Commonly referred to as expressivity, the prosody of a sentence can convey the speaker’s emotions, certainty, or any number of aspects related to their physical or mental state. Are they sincere or insincere? Was their speech planned or off-the-cuff? Prosody planning, the process by which we determine how to speak, is still relatively understudied and crucially cannot be separated from language production itself.
C-3PO, the metallic humanoid robot from the Star Wars series, represents the platonic ideal of a synthetic speaker.
He communicates his feelings clearly by producing the tones and intonations that leave no doubt of his worries and complaints, all delivered in a Received Pronunciation British accent. He’s a perfect example of an advanced synthesis system that generates prosodically appropriate speech. To advance towards this vision, what must the next generation of speech synthesis models be capable of?
The primary role of prosodic features in speech is to enhance communication between the speaker and the listener. Different subtexts in conversation can be communicated through prosodic information like:
The pitch, duration, inflections, intensity, loudness, and a whole host of other elements all represent meaning, or contribute to a function in spoken communications. This function refers to the illocutionary force of an utterance, which is defined as ‘the speaker’s intention in producing that utterance’. For example, if you didn’t want to go to an event, but your friend wanted you to join them, you might say “Yeah, maybe we should” but with a pause on “yeah” to show that you’re uncertain. The illocutionary force here is you communicating that you don’t want to go (despite the words saying otherwise). The very common speech synthesis use case of audible directions (e.g. Garmin, Google Maps) is notable for its lack of illocutionary force. Research teams for these products will be looking to improve on the passive intonation, where all directions are treated equally.
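Two of the cues named above, pitch and intensity, can be measured directly from a waveform. The following is a minimal sketch rather than a production method: it assumes a single clean, voiced frame, and the function names, the 50 to 400 Hz search range and the synthetic 200 Hz test tone are all illustrative choices, not something taken from a real speech system.

```python
import math

def rms_intensity(samples):
    """Root-mean-square amplitude, a rough proxy for perceived loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def estimate_pitch_hz(samples, sample_rate, fmin=50.0, fmax=400.0):
    """Estimate the fundamental frequency from the autocorrelation peak.

    Assumes the frame is voiced and longer than one period at fmin.
    """
    min_lag = int(sample_rate / fmax)  # shortest period considered
    max_lag = int(sample_rate / fmin)  # longest period considered
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Similarity between the frame and a copy of itself shifted by `lag`.
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Hypothetical test signal: 0.1 s of a 200 Hz tone standing in for a voiced frame.
sample_rate = 8000
frame = [0.5 * math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]

pitch = estimate_pitch_hz(frame, sample_rate)
intensity = rms_intensity(frame)
print(f"pitch ~ {pitch:.1f} Hz, RMS intensity ~ {intensity:.3f}")
```

Real speech is far messier than a pure tone, and these raw measurements say nothing about what a pitch rise or a pause means in context, which is the harder problem this article is concerned with.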
Speech is an incredibly complex problem with multiple axes of variation. Minor changes in prosody can indicate significant changes in meaning, much in the same way that changes to the tone, style or grammar can convey subtext in language (try asking ChatGPT to make your next email passive aggressive).
Speech includes an array of other elements that are key in communications, but challenging to model: sarcasm, attitude (towards yourself or someone else), interruptions, laughter, filler phrases (ums and ahs etc.) and other non-verbal utterances. Speech is further enriched in its role as a communicative function by the presence of disfluencies and non-verbal speech.
Current speech synthesis systems struggle to convey the depth of information that a human is capable of. The fantastic performance of voice cloning models can often obscure the relative paucity of prosodic features. Models with prosody generation and cross-lingual prosody transfer capabilities (speech-to-speech translation) are starting to show exciting promise, but there are unique aspects of the medium that will need to be addressed.
Synthetic speech products must generate both words and prosody from a given context. Large Language Models (LLMs) have displayed impressive improvements in computers’ ability to generate contextual language in the form of text. However, this addresses only half of what it takes to build communicative, intelligent systems. With LLMs, the task is to generate words; in speech models the task is to generate prosody (since synthetic speech can now achieve high acoustic quality and accurate pronunciation). Future modelling improvements will necessitate appropriate prosody generation in a less sequential manner. Current state-of-the-art multi-modal LLMs generate linguistic information and feed it to text-to-speech systems to infer prosody. In humans, however, the prosody and linguistic planning processes are more closely intertwined.
Even experts in the field of prosody still do not have a complete understanding of the structure and rules that we follow when communicating. English prosody is comparatively well understood, but there is still substantial debate among academics about how it really works. Understanding of non-English prosody is very limited and sparse, and understanding of how prosody maps across languages, dialects and cultures is more limited still.
The prosody problem exists among a set of problems that were previously out of reach for traditional software. These are challenges that relate to things like natural language, images, and physical space. One could refer to this type of thing as an ‘AI hard’ problem: a problem that can now conceivably be unlocked with modern machine learning techniques and hardware. Our limited understanding of prosody means that we cannot write an exhaustive set of rules which would govern how prosody changes meaning in context. In the same way that LLMs work better than rules-based algorithms for contextual language, speech models must learn from data the ability to generate appropriate results from a given context. But what are the other unique aspects of speech that will pose challenges for product teams and researchers?
Speech is a continuous signal processing challenge. This contrasts with image or language generation, which is considerably easier to represent. In this way, speech generation models are more analogous to generative video modelling, which is considerably less mature than image or language generation. This means speech is harder to tokenise and all tokens must be converted into analog signals. This last mile is not present in text or image generation.
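To make the tokenisation point concrete, here is a minimal sketch using mu-law companding, a classic scheme from telephony that maps a continuous amplitude to one of 256 integer tokens and back. It is only an illustration of the round trip from signal to tokens to signal; it is not a claim about how any particular modern speech model tokenises audio, and the function names and test tone are assumptions made for the example.

```python
import math

MU = 255  # 8-bit mu-law: 256 possible tokens

def encode(sample):
    """Map an amplitude in [-1, 1] to an integer token in [0, 255]."""
    sample = max(-1.0, min(1.0, sample))  # clip out-of-range input
    compressed = math.copysign(
        math.log1p(MU * abs(sample)) / math.log1p(MU), sample
    )
    return int((compressed + 1) / 2 * MU + 0.5)

def decode(token):
    """Map a token back to an approximate amplitude (the 'last mile')."""
    compressed = 2 * token / MU - 1
    return math.copysign(((1 + MU) ** abs(compressed) - 1) / MU, compressed)

# Hypothetical input: a short 200 Hz tone sampled at 8 kHz.
sample_rate = 8000
waveform = [0.8 * math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(400)]

tokens = [encode(s) for s in waveform]
reconstructed = [decode(t) for t in tokens]
max_error = max(abs(a - b) for a, b in zip(waveform, reconstructed))
print(f"{len(tokens)} tokens, max reconstruction error {max_error:.4f}")
```

Even this simple scheme needs one token per audio sample, so a single second of 8 kHz audio becomes 8,000 tokens, and the reconstruction is lossy. That gives a sense of why speech is more awkward to represent discretely than text.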
Spoken languages are also extremely fragmented. Roughly two thirds of the world speak the top five languages. The remaining third of the world speaks a long tail of thousands of languages. The relative performance of LLMs across different languages (especially low-resource languages) is indicative of the challenges that speech generation will face.
The commercial attempts at productizing personal assistants can give a sense of the scale of the challenge. Apple’s Siri was launched in 2011 and Amazon’s Alexa has been funded to the tune of tens of billions of dollars. Both systems are still largely limited in their ability to generate prosodically appropriate speech despite extensive research and development. Both have made efforts to add some realistic prosodic feature generation, but when these features are applied to the wrong context it can be jarring, invoking the uncanny valley.
Despite the long history of vocoders and text-to-speech, speech is a much less mature field from a machine learning perspective, with a comparatively smaller pool of talent working on the challenges. For several decades, text-to-speech systems have been able to produce intelligible speech that communicates lexical information (the words themselves). In essence, text-to-speech was ‘good enough’ for a limited set of use cases for a long time. This influenced the shape of the research industry itself. Less machine learning talent flowed to speech (in favour of areas like computer vision or natural language processing), resulting in optimizations being applied at a component level (like vocoders) rather than in a single end-to-end system.
Exciting progress has been made in recent years and there is significant potential upside in solving for prosody generation. As a result speech is already attracting more and more talent within machine learning, compounding the benefit to end users.
Many modelling, data, architectural and operational techniques have yet to be applied in full to speech synthesis. However, we are beginning to see green shoots here, and there are undoubtedly great strides to be made in porting the learnings from other fields.
Machine learning teams will learn from and collaborate with anthropologists, linguists and other experts in the field of speech and language to deploy prosodic models globally. Communication is not a one-size-fits-all system, with high- and low-context cultures deploying differing methods. Germanic cultures, for instance, communicate directly with language, whereas Asian or Latin cultures are more nuanced in their comm