NVIDIA has introduced a new experimental generative AI model, which it calls a “universal tool for working with sound”.
The model, known as Foundational Generative Audio Transformer Opus 1 (or Fugatto), can generate audio from text prompts and modify existing music, voice, and sound files. It was developed by an international team of AI researchers, which, according to NVIDIA, strengthened its “multi-accent and multilingual capabilities”.
Rafael Valle, one of the project’s researchers and manager of applied audio research at NVIDIA, noted: “We wanted to create a model that understands and generates sound just as humans do.”
The company offered several examples of where Fugatto could be useful. Music producers, for instance, could quickly create song prototypes and then edit them by changing styles, voices, and instruments.
People could use Fugatto to create language-learning materials in a chosen voice, and video game developers could generate different versions of pre-recorded sounds to match in-game changes driven by players’ choices and actions.
Furthermore, the researchers found that the model could perform tasks it was not trained for with little additional tuning. For example, it can combine separately learned instructions to generate an angry voice with a specific accent, or the sound of birds singing during a thunderstorm. The model can also create sounds that change over time, such as the sound of approaching rain.
NVIDIA has not announced whether it will provide public access to Fugatto. It is not, however, the first generative model capable of creating sounds from text prompts: Meta previously released an open-source AI toolkit that generates sounds from text descriptions, and Google has its own model, MusicLM, which converts text into music.