by Paul Lilly — Originally published by HOTHARDWARE.com
Close your eyes for a moment and imagine a world where it’s nearly impossible to tell what is real from what is computer generated. The truth is, we’re already there to an extent, and it has enormous implications. We have already seen several examples of spoofed images and videos, known as deepfakes, but apparently artificial intelligence schemes can do a damn good job of replicating someone’s voice as well.
Researchers from Dessa, a startup based in Toronto that “helps the world’s largest and most complex organizations build real-world value with advanced AI,” demonstrated this with a believable audio clip of what sounds like Joe Rogan, former host of Fear Factor, stand-up comedian, mixed martial arts enthusiast, and the guy behind one of the most popular podcasts on the planet.
The team behind the demo created and produced the clip using a text-to-speech deep learning system they developed called RealTalk. This frighteningly amazing piece of technology generates life-like speech using only text inputs.
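Dessa hasn’t published RealTalk’s architecture, but modern neural text-to-speech systems typically work in two stages: an acoustic model converts text into a mel spectrogram, and a neural vocoder converts that spectrogram into a waveform (the pattern popularized by Tacotron 2 and WaveNet). The sketch below is purely illustrative; random stand-ins replace the trained models, and every function name here is hypothetical.

```python
import numpy as np

# Hypothetical two-stage neural TTS pipeline. RealTalk's internals are not
# public; this only illustrates the generic text -> spectrogram -> waveform
# structure used by systems such as Tacotron 2 + WaveNet.

def text_to_ids(text, vocab="abcdefghijklmnopqrstuvwxyz '"):
    # Map each character to an integer id; unknown characters map to 0.
    return [vocab.find(c) + 1 for c in text.lower()]

def acoustic_model(ids, n_mels=80, frames_per_token=5):
    # Stand-in for a trained seq2seq acoustic model: emits a mel
    # spectrogram with a few frames per input token.
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(ids) * frames_per_token, n_mels))

def vocoder(mel, hop=256):
    # Stand-in for a neural vocoder: turns each spectrogram frame
    # into `hop` audio samples.
    rng = np.random.default_rng(1)
    return rng.standard_normal(mel.shape[0] * hop).astype(np.float32)

ids = text_to_ids("this voice is not real")
mel = acoustic_model(ids)       # (frames, mel bins)
audio = vocoder(mel)            # raw waveform samples
```

In a real system, the two stand-in functions would be large neural networks trained on many hours of a speaker’s recorded audio, which is why Dessa notes that sufficient data is the prerequisite for cloning anyone’s voice.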
“It’s surreal for our engineers to be able to say they’ve legitimately created a life-like replica of Joe Rogan’s voice using AI. Not to mention the fact that the model would be capable of producing a replica of anyone’s voice, provided that sufficient data is available. As AI practitioners building real-world applications, we’re especially cognizant of the fact that we need to be talking about the implications of this,” Dessa explains.
Dessa is fully aware that this is both frightening and “pretty f*cking amazing.” In the wrong hands, this sort of technology could be used to impersonate family members to obtain personal information or trick someone out of their money, thwart security checkpoints, or spoof a politician to manipulate an election, among other nefarious scenarios.
It’s not all bad, though. There are legitimate uses for something like this, such as automating voice dubbing for any media in any language, improving accessibility options for people who communicate through text-to-speech devices (such as people with Lou Gehrig’s disease), and so forth.
“We won’t pretend to have all the answers about how to build this technology ethically. That said, we think it will be inevitably built and increasingly implemented into our world over the coming years. So in addition to raising awareness and acknowledging these issues, we also want to show this work as a way of starting a conversation on speech synthesis that must be had,” Dessa added.
Dessa is not making this technology open source, nor is it releasing its research, model, or datasets publicly just yet. The company is taking what it says is a responsible approach: first making the public aware of what it’s created, then seeing where the discussion goes from there.
Dessa created a website to test your ability to discern Joe Rogan’s real voice from the computer generated one. There are eight clips; some are actually Joe Rogan, while others are not. For each clip, you can select “Joe Rogan” or “Faux Rogan,” and you’ll immediately be given a response of “Correct!” or “Nope!”
Think you can tell the difference? Click the image below and try the test for yourself.
I correctly identified six of the eight clips, and had I gone with my gut on the two that I missed, I could have gotten a perfect score. But I didn’t. I also had to listen to several of them multiple times before being confident in my selections.
This tells me there is definitely room for improvement; the technology doesn’t quite capture the nuances of speech. However, this is mostly noticeable when comparing real and fake clips side by side. In a vacuum, such as on social media, it would be easy to be fooled by a fake clip.