Meta Platforms’ AI lab unveiled Voicebox last week, a machine learning model that converts text to speech. Unlike previous text-to-speech models, Voicebox can also edit recordings, remove noise, and transfer speaking styles, none of which it was specifically trained to do.
Meta’s researchers trained the model with a methodology of their own design. Initial results are encouraging and could power many future applications, but Meta has not released Voicebox owing to ethical concerns about potential abuse.
What We Mean By Flow Matching
Voicebox is a generative model that can synthesise speech in English, French, Spanish, German, Polish, and Portuguese. Like large language models (LLMs), it is trained on a highly general task. But where LLMs learn the statistical regularities of individual words and word sequences, Voicebox learns the patterns that map speech audio samples to their transcripts.
Once trained, such a model can be applied to a wide variety of downstream tasks with little adjustment. “The goal is to build a single model that can perform many text-guided speech generation tasks through in-context learning,” the Meta researchers wrote in the paper (PDF) describing Voicebox’s technical details.
Meta’s “flow matching” methodology is more efficient and generalisable than the diffusion-based techniques often used to train generative models. It allows Voicebox to “learn from varied speech data without those variations having to be carefully labelled.” Voicebox was trained on 50,000 hours of speech and transcripts from audiobooks, with no manual labelling required from the researchers.
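Meta’s paper gives the exact formulation; purely as a rough illustration, here is a minimal sketch of a conditional flow-matching objective with a straight-line probability path, written in PyTorch. The VelocityNet below is a toy stand-in for the real audio model, and all names and shapes are our own, not Meta’s:

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy stand-in for the audio model: predicts the flow's velocity field."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x_t, t], dim=-1))

def flow_matching_loss(model: nn.Module, x1: torch.Tensor) -> torch.Tensor:
    """Conditional flow matching with a linear (optimal-transport) path."""
    x0 = torch.randn_like(x1)        # noise endpoint of the path
    t = torch.rand(x1.shape[0], 1)   # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1      # point on the straight path noise -> data
    target_velocity = x1 - x0        # constant velocity of that path
    return nn.functional.mse_loss(model(x_t, t), target_velocity)

# One training step on random vectors standing in for speech features.
model = VelocityNet(dim=80)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
optimizer.zero_grad()
loss = flow_matching_loss(model, torch.randn(32, 80))
loss.backward()
optimizer.step()
```

The appeal of the straight path is that the regression target is a constant velocity, which is part of why flow matching tends to train and sample more efficiently than diffusion.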
The model is trained with “text-guided speech infilling”: it learns to predict a segment of speech given the surrounding audio and the complete text transcript. In practice, the model receives a training audio sample with a masked segment, along with the sample’s accompanying text, and tries to reconstruct the masked segment from the surrounding audio and the transcript. With enough training, the model generalises to the point where it can create natural-sounding speech from text.
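To make the masking step concrete, here is a small sketch, with frame shapes and names of our own choosing, of how a contiguous span of spectrogram frames might be blanked out before being handed to an infilling model:

```python
import torch

def mask_span(mel: torch.Tensor, mask_ratio: float = 0.3):
    """Zero out one contiguous span of frames; return the masked input and mask."""
    n_frames = mel.shape[0]
    span = max(1, int(n_frames * mask_ratio))
    start = torch.randint(0, n_frames - span + 1, (1,)).item()
    mask = torch.zeros(n_frames, dtype=torch.bool)
    mask[start:start + span] = True
    masked = mel.clone()
    masked[mask] = 0.0
    return masked, mask

# ~4 s of dummy 80-bin mel frames; the model would be trained to reconstruct
# mel[mask] given (masked, transcript), with the loss computed on masked frames.
mel = torch.randn(400, 80)
masked_mel, mask = mask_span(mel)
```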
Language-Agnostic Voice Duplication, Error-Free Speech Editing, and More
Unlike generative models trained for a single purpose, Voicebox can perform many tasks it has never been trained for. To see how the approach works, consider how a two-second voice sample can be used to generate new speech for written text in that same voice. According to Meta, this capability could give people with speech impairments a way to communicate, and could be used to personalise the voices of non-playable game characters and virtual assistants.
Voicebox can also perform various kinds of style transfer. Given two audio-and-text examples, the model uses the first audio sample as a reference for the overall tone and delivery of the second. The model can even do this across languages, which could be used to “help people communicate in a natural, authentic way — even if they don’t speak the same languages.”
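Both voice duplication and style transfer reduce to the infilling trick described above. As a sketch under our own assumptions (Voicebox has no public API, so the function and shapes below are illustrative), zero-shot generation can be framed as infilling a blank continuation of the audio prompt:

```python
import torch

def zero_shot_tts_inputs(prompt_mel: torch.Tensor, prompt_text: str,
                         target_text: str, target_frames: int):
    """Frame zero-shot TTS as infilling: the 'masked' region is the new speech.

    The model would see [prompt frames | zeros] plus the concatenated
    transcript, and fill the zeros with speech matching the prompt's voice.
    """
    blank = torch.zeros(target_frames, prompt_mel.shape[1])
    context = torch.cat([prompt_mel, blank])
    transcript = f"{prompt_text} {target_text}"
    return context, transcript

# A ~2 s dummy prompt followed by ~3 s of frames to generate.
ctx, txt = zero_shot_tts_inputs(torch.randn(200, 80), "hello there",
                                "nice to meet you", 300)
```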
The model can also perform several editing operations. If a dog starts barking while you’re recording, for example, you can give Voicebox the audio and text and mask off the noisy portion. The model will use the text to recreate the corrupted segment of the original audio.
The same method can be used to correct voice recordings. If you realise you misspoke a word, you can send Voicebox the audio with that word masked out, along with a transcript of the corrected text. The model will fill in the gap with new speech that matches the style of the rest of the recording.
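Noise removal and word correction are, mechanically, the same operation: mask the offending span and let the model regenerate it against the desired transcript. A minimal sketch, with an infill_fn placeholder standing in for a trained model (all names here are hypothetical):

```python
import torch

def edit_segment(mel: torch.Tensor, start: int, end: int,
                 transcript: str, infill_fn) -> torch.Tensor:
    """Replace frames [start, end) with whatever infill_fn generates for them."""
    masked = mel.clone()
    masked[start:end] = 0.0                       # hide the noisy or wrong span
    new_frames = infill_fn(masked, transcript, start, end)
    out = mel.clone()
    out[start:end] = new_frames                   # splice the regenerated audio in
    return out

# Dummy infill_fn so the sketch runs end to end; a real one would return
# frames conditioned on the surrounding audio and the corrected transcript.
dummy_infill = lambda m, txt, s, e: torch.zeros(e - s, m.shape[1])
edited = edit_segment(torch.randn(400, 80), 100, 160,
                      "corrected transcript", dummy_infill)
```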
Voice sampling is one of Voicebox’s most intriguing uses. Given a text sequence, the model can output a wide variety of synthetic voice samples. This capability can be used to generate synthetic data for training other speech processing models. Whereas synthetic speech from previous text-to-speech models degraded error rates by 45–70%, “our results show that speech recognition models trained on Voicebox-generated synthetic speech perform almost as well as models trained on real speech,” Meta writes.
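As a sketch of how such a synthetic corpus for speech recognition training might be assembled, assuming a hypothetical synthesize() function in place of the unreleased model:

```python
import torch

def make_synthetic_corpus(transcripts, synthesize, samples_per_text=5):
    """Pair each transcript with several distinct synthetic renditions."""
    corpus = []
    for text in transcripts:
        for _ in range(samples_per_text):
            corpus.append((synthesize(text), text))   # (waveform, label) pairs
    return corpus

# Dummy synthesize() so the sketch runs; a real system would return audio
# in a different sampled voice on each call.
dummy_synthesize = lambda text: torch.randn(16000)   # 1 s of fake 16 kHz audio
corpus = make_synthetic_corpus(["hello world", "good morning"], dummy_synthesize)
```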
Voicebox has its limitations as well. Because it was trained on audiobook data, it does not adapt well to informal, conversational speech with non-verbal sounds. It also does not let you fine-tune every attribute of the generated voice, such as pitch, volume, inflection, and timbre. The Meta research team is exploring solutions to these limitations.
A Model Not Yet Released
The potential dangers of AI-generated media are an increasingly prominent topic of discussion. Recently, fraudsters used an AI-generated voice to impersonate a woman’s grandchild in an attempted phone scam. Voicebox and other advanced speech synthesis systems could be exploited for such schemes or for other malicious purposes, such as forging evidence or tampering with genuine recordings.
Meta acknowledged in its AI blog that “as with other powerful new AI innovations, this technology brings the potential for misuse and unintended harm.” Because of these concerns, Meta did not release the model, though it did publish a technical paper detailing the model’s architecture and training procedure. To help mitigate potential harms, the paper also describes a classifier model that can distinguish speech and audio generated by Voicebox from authentic recordings.
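Meta’s paper gives the details of that classifier; purely as an illustration of the general idea, here is a generic sketch of a binary detector over mel-spectrogram clips (an architecture of our own invention, not Meta’s):

```python
import torch
import torch.nn as nn

class SyntheticSpeechDetector(nn.Module):
    """Binary classifier over mel clips: real (0) vs. generated (1)."""
    def __init__(self, n_mels: int = 80):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # pool over time to one vector per clip
        )
        self.head = nn.Linear(128, 1)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> one logit per clip
        return self.head(self.encoder(mel).squeeze(-1))

detector = SyntheticSpeechDetector()
logits = detector(torch.randn(4, 80, 400))   # 4 dummy clips
probs = torch.sigmoid(logits)                # probability each clip is synthetic
```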