Microsoft researchers are working on a text-to-speech (TTS) model that can mimic a person’s voice – complete with emotion and intonation – from a mere three-second audio sample.
The technology – called VALL-E and outlined in a 15-page research paper released this month on the arXiv research site – is a significant step forward for Microsoft. TTS is a highly competitive niche that includes other heavyweights such as Google, Amazon, and Meta.
Redmond is already using artificial intelligence for natural language processing (NLP) through its Nuance business – which it bought for $20 billion last year and which includes both speech recognition and TTS technology. And it’s aggressively investing in and using technology from startup OpenAI – including its ChatGPT tool – possibly in its Bing search engine and its Office suite of applications.
A demo of VALL-E can be found on GitHub.
In the paper, the researchers argue that while the rise of neural networks and end-to-end modeling has rapidly improved the technologies around speech synthesis, there are still problems with the similarity of the voices used and the lack of natural speaking patterns in TTS products. They aren’t the robotic voices of a decade or two ago, but they don’t come off as completely human either.
A lot of work is being put into improving this, but there are serious challenges, according to the Microsoft eggheads. Some systems require clean voice data from a recording studio to capture high-quality speech. And they have to rely on relatively small amounts of training data, because large-scale speech libraries found on the internet are not clean enough for the work.
For current zero-shot TTS generators – where the software has to mimic voices that were not included in its training data – the work is complex. It can take hours for the system to apply a person’s voice to typed text.
“Instead of designing a complex and specific network for this problem, the ultimate solution is to train a model with large and diverse data as much as possible, motivated by success in the field of text synthesis,” the researchers wrote, noting that the amount of data being used in text language models in recent years has grown from 16GB of uncompressed text to about a terabyte.
VALL-E is “the first language model-based TTS framework leveraging large, diverse, and multi-speaker speech data,” according to the boffins.
They trained VALL-E with Libri-Light – an open source dataset from Meta that includes 60,000 hours of English speech with more than 7,000 unique speakers. By comparison, other TTS systems are trained using dozens of hours of single-speaker data or hundreds of hours with data from multiple speakers.
VALL-E can keep the acoustic environment of the voice. So if the snippet of voice used as the acoustic prompt in the model is recorded on the telephone, the synthesized spoken text would also sound like it’s coming through the phone.
Emotion is captured in a similar way, the researchers claim. If the few seconds of recorded voice in the acoustic prompt convey anger, the synthesized speech based on that voice will also sound angry.
The result is a TTS model that outperforms others in such areas as natural sounding speech and speaker similarity. Testing also indicates that “the synthesized speech of unseen speakers is as natural as human recordings,” they assert.
The researchers noted some issues that need to be resolved – including that some words in the synthesized speech end up missing, are unclear, or are duplicated. There also isn’t enough coverage of speakers with accents, and there needs to be greater diversity in speaking styles.
The global TTS market is estimated to grow to tens of billions of dollars by the end of the decade, with both established players and startups driving development of the technology. Microsoft’s Nuance business has its own TTS product, and the software behemoth offers a TTS service in Azure. Amazon has Polly, Meta has Meta-TTS, and Google Cloud also offers a service.
All that makes for a crowded space.
The rapid improvement in the technology raises various ethical and legal issues. A person’s voice could be captured and synthesized for use in a wide range of areas – from ads or spam calls to video games or chatbots. A synthesized voice could also be used in deepfakes, with the voice of a politician or celebrity combined with an image to spread disinformation or foment anger.
Patrick Harr, CEO of anti-phishing firm SlashNext, told The Register that TTS could also become yet another tool for cybercriminals, who could use it for vishing campaigns – attacks using fraudulent phone calls or voice messages that appear to come from a contact the victim knows. It could also be used in more traditional phishing attacks.
“This technology could be extremely dangerous in the wrong hands,” Harr said.
The Microsoft researchers noted the risk of synthesized speech that retains the speaker’s identity. They said it would be possible to build a detection model to discern whether an audio clip is real or synthesized using VALL-E.
Harr said that within a few years, everyone could have “a unique digital DNA pattern powered by blockchain that can be applied to their voice, content they write, their virtual avatar, etc. This would make it much harder for threat actors to leverage AI for voice impersonation of company executives for example, because those impersonations will lack the ‘fingerprint’ of the actual executive.”
Here’s hoping, anyway. ®