Amazon AGI develops BASE TTS, the largest text-to-speech model to date

Amazon AGI, a research division of the e-commerce giant, has announced BASE TTS, the largest text-to-speech model built to date. The model has 980 million parameters and was trained on 100,000 hours of speech data from various sources, mostly in English. It generates natural-sounding speech from text and can adapt to different speaking styles, languages, and emotions.

Text-to-speech (TTS) is a technology that converts written text into audible speech, which can be used for various applications, such as content creation, e-learning, telephony, and accessibility. TTS models typically consist of two components: a text encoder that converts text into a hidden representation, and a speech decoder that converts the hidden representation into a waveform. The quality and naturalness of the speech depend on the size and complexity of the model, as well as the amount and diversity of the training data.
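To make the two-stage structure concrete, here is a minimal PyTorch sketch of a generic encoder/decoder TTS pipeline. The module names, layer counts, and dimensions are illustrative assumptions rather than BASE TTS internals; a real system would add duration or autoregressive modeling and a vocoder to turn the acoustic frames into a waveform.

    # Minimal sketch of the generic two-stage TTS pipeline described above.
    # All sizes and names are illustrative assumptions, not BASE TTS internals.
    import torch
    import torch.nn as nn

    class TextEncoder(nn.Module):
        """Maps a sequence of character/phoneme IDs to hidden representations."""
        def __init__(self, vocab_size=256, hidden_dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden_dim)
            layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=4)

        def forward(self, token_ids):                    # (batch, text_len)
            return self.encoder(self.embed(token_ids))   # (batch, text_len, hidden_dim)

    class SpeechDecoder(nn.Module):
        """Maps hidden representations to acoustic frames (e.g. a mel
        spectrogram); a separate vocoder would turn these into a waveform."""
        def __init__(self, hidden_dim=512, n_mels=80):
            super().__init__()
            self.proj = nn.Linear(hidden_dim, n_mels)

        def forward(self, hidden):                       # (batch, text_len, hidden_dim)
            return self.proj(hidden)                     # (batch, text_len, n_mels)

    tokens = torch.randint(0, 256, (1, 32))              # a dummy tokenized sentence
    mel = SpeechDecoder()(TextEncoder()(tokens))
    print(mel.shape)                                     # torch.Size([1, 32, 80])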

BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities, is a state-of-the-art TTS model that surpasses previous models in terms of size, data, and performance. The model has three main features:

  • Big: At 980 million parameters, the model is well over ten times the size of earlier production TTS models such as Tacotron 2 (roughly 28 million parameters). The added capacity lets it capture finer-grained details and nuances of speech, such as prosody, intonation, and pronunciation.
  • Adaptive: The model adapts to different speaking styles, languages, and emotions by conditioning on the input text and a reference speech sample. It can also be trained further on new data and feedback to improve over time.
  • Streamable: The model generates speech in real time, with low latency and high throughput, using a novel speech tokenizer and speechcode decoder that enable fast, efficient synthesis (see the sketch after this list).
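The streaming property can be illustrated with a toy pipeline: an autoregressive stage emits discrete speechcodes one at a time, and a lightweight decoder converts each chunk to audio as soon as it is full, so playback can start before synthesis finishes. The function names, codebook size, and chunking below are assumptions for illustration, not the actual BASE TTS components.

    # Toy sketch of streamable synthesis: speechcodes are decoded chunk by
    # chunk, so audio can be played back while later codes are still being
    # generated. All names and sizes here are illustrative assumptions.
    from typing import Iterator, List

    def generate_speechcodes(text: str) -> Iterator[int]:
        """Stand-in for the autoregressive stage: yields one discrete
        speechcode per character (a toy mapping, not a real tokenizer)."""
        for ch in text:
            yield ord(ch) % 1024          # pretend codebook of 1024 entries

    def decode_chunk(codes: List[int]) -> bytes:
        """Stand-in for the speechcode decoder: maps a chunk of codes to
        audio samples (raw bytes here as a placeholder for PCM audio)."""
        return bytes(c % 256 for c in codes)

    def stream_tts(text: str, chunk_size: int = 8) -> Iterator[bytes]:
        """Emit audio incrementally instead of waiting for the full utterance."""
        chunk: List[int] = []
        for code in generate_speechcodes(text):
            chunk.append(code)
            if len(chunk) == chunk_size:
                yield decode_chunk(chunk)
                chunk = []
        if chunk:                          # flush the final partial chunk
            yield decode_chunk(chunk)

    for audio_chunk in stream_tts("Hello, world! This is streaming TTS."):
        pass                               # hand each chunk to the audio player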

The model also exhibits emergent abilities: higher-level skills that are not explicitly programmed or supervised but arise from the interaction of the model's components and data. For example, the model can do all of the following (sample probe sentences follow the list):

  • Use compound nouns, such as “text-to-speech” and “e-commerce”
  • Express emotions, such as happiness, sadness, and anger
  • Use foreign words, such as “au contraire” and “adios, amigo”
  • Apply paralinguistics, such as pauses, breaths, and laughs
  • Render punctuation, such as commas, periods, and question marks, with appropriate phrasing
  • Ask questions, with the correct word emphasis and intonation
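One way to probe such abilities is with a small set of challenge sentences, one per category. The sentences below are illustrative examples written for this article, not items from Amazon's actual evaluation set, and the <breath> tag is hypothetical markup for a paralinguistic cue.

    # Illustrative probe sentences for the emergent abilities listed above.
    # These are made-up examples, not Amazon's evaluation data.
    emergent_ability_probes = {
        "compound_nouns":  "The text-to-speech demo ran on the e-commerce site.",
        "emotions":        "I can't believe it... we actually won!",
        "foreign_words":   "Au contraire, mon ami, the plan is perfect.",
        "paralinguistics": "Shh <breath> wait, did you hear that?",
        "punctuation":     "Wait, really? Yes. Absolutely, positively, yes!",
        "questions":       "Are you coming, or should I leave without you?",
    }

    for category, sentence in emergent_ability_probes.items():
        print(f"{category}: {sentence}")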

The researchers at Amazon AGI tested the model on several metrics, such as mean opinion score (MOS), which measures the perceived naturalness of speech, and word error rate (WER), which gauges intelligibility by transcribing the synthesized speech with an automatic speech recognizer and counting transcription errors. The results showed that BASE TTS scored higher than previous models and was comparable to recordings of human speech.
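MOS comes from human listening tests, but WER can be computed directly: the synthesized audio is transcribed by a speech recognizer and the transcript is compared with the input text. Below is a minimal sketch of that comparison, implementing WER as word-level edit distance; this is the standard formula, not code from the paper.

    # Word error rate: (substitutions + insertions + deletions) / reference
    # length, computed with a dynamic-programming edit distance over words.
    def word_error_rate(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edit distance between ref[:i] and hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution
        return dp[len(ref)][len(hyp)] / len(ref)

    # One substitution ("the" -> "a") over six reference words: WER ~ 0.167
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))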

The researchers also compared the model with two smaller versions: one with 400 million parameters trained on 10,000 hours of speech, and one with 150 million parameters trained on 1,000 hours. They found that performance improved significantly as size and data grew, and that the emergent abilities began to appear at around 400 million parameters and 10,000 hours of speech data.

The researchers said that BASE TTS is not intended to be released to the public, due to ethical and social concerns, such as the potential misuse of synthetic speech. Instead, they said that the model is a research tool that can help them understand the limits and possibilities of TTS technology, and inspire new directions for future work.

The researchers published their paper on the arXiv preprint server, and also shared some audio samples of the model’s speech on their website.

Here is a table that summarizes the main features and results of BASE TTS and its smaller versions:

Model            Parameters  Data (hours)  MOS  WER   Emergent abilities
BASE TTS         980M        100K          4.5  3.8%  Yes
BASE TTS-medium  400M        10K           4.2  4.5%  Yes
BASE TTS-small   150M        1K            3.8  5.2%  No
Tacotron 2       ~28M        24            3.6  6.1%  No
