Another novelty in the field of artificial intelligence


Microsoft has introduced VALL-E, an artificial intelligence model that can simulate a person's voice from a sample only three seconds long. The imitation is highly faithful, preserving both the timbre and the emotional coloring of the original.

Unlike other text-to-speech methods, which often synthesize speech by manipulating waveforms, Microsoft’s model analyzes how a person sounds, breaks that information into discrete “tokens”, and uses what it learned from its training data to predict how the voice would sound speaking other phrases. VALL-E was trained on 60 thousand hours of English speech from more than 7 thousand speakers, which Microsoft says is hundreds of times more data than existing systems use.
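To make the idea of audio "tokens" more concrete, here is a minimal sketch that turns a waveform into a sequence of integers and back. It uses classical mu-law quantization purely as a simple stand-in: VALL-E itself relies on a learned neural audio codec, and the function names, the 16 kHz sample rate, and the synthetic sine-wave "prompt" below are illustrative assumptions, not part of Microsoft's system.

```python
import math

LEVELS = 256          # size of the toy "vocabulary" of audio tokens (assumption)
SAMPLE_RATE = 16_000  # samples per second (assumption for this sketch)


def mu_law_encode(sample: float, levels: int = LEVELS) -> int:
    """Map one sample in [-1, 1] to an integer token in [0, levels - 1]."""
    mu = levels - 1
    sample = max(-1.0, min(1.0, sample))  # clip out-of-range input
    compressed = math.copysign(
        math.log1p(mu * abs(sample)) / math.log1p(mu), sample
    )
    return int(round((compressed + 1) / 2 * mu))


def mu_law_decode(token: int, levels: int = LEVELS) -> float:
    """Approximately invert mu_law_encode."""
    mu = levels - 1
    compressed = 2 * token / mu - 1
    return math.copysign(
        math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed
    )


def tokenize(waveform):
    """Turn a list of samples into a list of discrete tokens."""
    return [mu_law_encode(s) for s in waveform]


# A synthetic three-second "voice sample": a 220 Hz sine wave at half amplitude.
prompt = [
    0.5 * math.sin(2 * math.pi * 220 * n / SAMPLE_RATE)
    for n in range(3 * SAMPLE_RATE)
]
tokens = tokenize(prompt)
reconstructed = [mu_law_decode(t) for t in tokens]
max_error = max(abs(a - b) for a, b in zip(prompt, reconstructed))
```

Once audio is a sequence of integers like `tokens`, a language model can be trained to predict the next token in much the same way as with text. A real neural codec produces far fewer tokens per second than this per-sample toy does, and each of its tokens carries much more information.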

Most notably, VALL-E needs only a few seconds of audio to clone a voice, and the result also reproduces the speaker's emotional tone and even the background noise of the sample. The model is still quite new, yet it already has a clear advantage over other systems, and further improvements are expected to make the speech even more human-like.

Google demonstrated its Duplex AI, which can also speak almost indistinguishably from a human, back in 2018. The point of Microsoft’s development, however, is not natural-sounding speech as such but the ability to imitate different voices.

As with other AI models, there is concern about the misuse of VALL-E, for example to imitate the voices of public figures, politicians, or celebrities. Criminals could also obtain confidential data by making a person believe they are talking to family, friends, or officials. In addition, some security systems rely on voice identification, which a cloned voice might fool.

However, the VALL-E researchers claim that such risks can be reduced with a separate model: it is possible to build a detector that determines whether an audio clip was generated by VALL-E. How this would work in practice, for example during a live telephone conversation, is not entirely clear.

For now, the model is not open for public use, so we cannot try VALL-E ourselves. We cannot verify how well it imitates intonation and shifts in emotion, or how fast it runs. We can, however, listen to audio samples already generated by VALL-E.

Sources:

https://www.business-standard.com/article/technology/microsoft-s-new-ai-tool-vall-e-can-replicate-any-voice-in-just-3-seconds-123011000696_1.html

https://valle-demo.github.io/
