Modern text-to-speech algorithms are incredibly capable, and you need look no further for evidence than Google’s recently open-sourced SpecAugment or Translatotron — the latter can directly translate a person’s voice into another language while retaining tone and tenor. But there’s always room for improvement.
Toward that end, researchers at Microsoft recently detailed in a paper (“Almost Unsupervised Text to Speech and Automatic Speech Recognition”) an AI system that leverages unsupervised learning — a branch of machine learning that gleans knowledge from unlabeled, unclassified, and uncategorized data — to achieve 99.84% word intelligibility accuracy for text to speech and an 11.7% phoneme error rate (PER) for automatic speech recognition. All the more impressive, the model required only 200 audio clips and corresponding transcriptions.
The key turned out to be Transformers, a novel type of neural architecture introduced in a 2017 paper coauthored by scientists at Google Brain, Google’s AI research division. As with all deep neural networks, Transformers contain neurons (mathematical functions loosely modeled after biological neurons) arranged in interconnected layers that transmit “signals” from input data and slowly adjust the synaptic strength — weights — of each connection. (That’s how the model extracts features and learns to make predictions.) Uniquely, though, Transformers have attention: every output element is connected to every input element, and the weightings between them are calculated dynamically.
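To make that attention idea concrete, here is a minimal sketch of scaled dot-product attention, the dynamic weighting scheme at the heart of the Transformer. The function and variable names are illustrative conventions, not code from the Microsoft paper.

```python
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """Connect every output position to every input position,
    with weights computed dynamically from the data itself."""
    d_k = queries.shape[-1]
    # Similarity between each output (query) and each input (key) position.
    scores = queries @ keys.T / np.sqrt(d_k)
    # Softmax turns similarities into weightings that sum to 1 per output.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output element is a weighted mix of all input values.
    return weights @ values

# Toy example: 4 input positions, 3 output positions, 8-dimensional features.
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8))
k = rng.normal(size=(4, 8))
v = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(q, k, v).shape)  # (3, 8)
```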
The Microsoft researchers incorporated a Transformer component into their AI system that could take either speech or text as input or output, and they sourced the publicly available LJSpeech data set — which contains 13,100 English audio snippets and corresponding transcripts — for training data. The team chose the aforementioned 200 clips at random to create the paired training set, and they leveraged a denoising auto-encoder component to reconstruct corrupted speech and text.
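As a rough illustration of the denoising auto-encoder idea, the sketch below corrupts a text sequence by randomly masking tokens; a model such as the Transformer above would then be trained to reconstruct the original from the corrupted version. The masking probability and the <mask> symbol are assumptions made for illustration, not the paper’s exact corruption scheme.

```python
import random

MASK = "<mask>"

def corrupt(tokens, drop_prob=0.15, seed=None):
    """Randomly replace tokens with a mask symbol to create a corrupted input."""
    rng = random.Random(seed)
    return [MASK if rng.random() < drop_prob else t for t in tokens]

original = "the quick brown fox jumps over the lazy dog".split()
noisy = corrupt(original, seed=1)
print(noisy)
# A denoising auto-encoder is trained to map `noisy` back to `original`,
# letting it learn useful representations from unpaired data.
```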
The results weren’t half bad, considering the small corpus — the researchers note that the system handily outperformed three baseline algorithms in tests. And several of the published generated samples sound convincingly human-like, save a slight robotic twang.
The coauthors leave to future work “pushing the limit” of unsupervised learning by purely leveraging unpaired speech and text data with the help of other pre-training methods. “In this work, we have proposed the almost unsupervised method for text to speech and automatic speech recognition, which leverages only few paired speech and text data and extra unpaired data,” they wrote. “We demonstrate in our experiments that our designed components are necessary to develop the capability of speech and text transformation with few paired data.”
The paper will be presented at the International Conference on Machine Learning in Long Beach, California, later this year, and the team plans to release the code in the coming weeks.