How to Make Sure a Voice Is AI-Generated? It's Not Simple But There's a Way

Photo by Erwi / Unsplash

AI-assisted and fully AI-generated content is flooding the Internet, and content-sharing and streaming platforms are the ones that suffer most. Since the fight against it is most probably a losing one, the platforms want at least to know how to keep it under control. YouTube has recently rolled out an update aimed at labelling AI-generated videos, while OpenAI's Sora will use watermarking to help us distinguish between the real and the fake.

Tools that can identify if an image is AI-generated also exist now, but what about audio and voices? Can we be 99% sure that a song we're listening to on Spotify is actually sung by a human? The answer is yes, and Pex is one of the companies that can do just that.

But what's the issue if a song is sung by AI—who cares? Well, voice cloning and swapping raise questions of attribution and compensation for the original creators.

AI-generated voices have entered the music industry for good. The "fake Drake", an anonymous TikToker who released a song imitating Drake and The Weeknd last year, shook up the industry and inspired thousands of copycats and AI startups along the way. Some artists lend their voices to be cloned voluntarily, though. Grimes is one of them: the singer created an entire platform that allows her fans to use her voice, make their own songs with it, and even get paid for that! Other artists, while not as all-in, collaborate with AI companies and let them use their voices to train AI models, thus contributing to a more ethical use of the technology.

But all this hasn't come without consequences. The proliferation of AI voices mimicking famous singers has created a need for reliable methods to identify such content and ensure proper licensing and credit. Watermarking and artifact detection are two of those. Both have their weaknesses, however.

Watermarking, while effective in theory, can be circumvented by removing or modifying the watermark, which makes it unreliable. Artifact detection has a similar weakness: although initially promising, it struggles to keep pace with the rapid advances in AI technology.

Pex offers a better option: ACR (Automated Content Recognition) technology, which compares the unique qualities of one piece of content with those of another and detects a match, if there is one. Unlike watermarking and artifact detection, this approach examines entire files and looks for overlapping qualities to pinpoint AI-made content.

"ACR works by comparing the qualities of one piece of content against those of another, and determining if there is a match. It does not look for artifacts or watermarks, but examines entire files to find overlapping qualities," Pex says in their blog post.
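To make the idea of "overlapping qualities" more concrete, here is a minimal Python sketch. It is not Pex's algorithm, which is not described in the post. It simply assumes that each recording has already been reduced to a sequence of numbers (a hypothetical quantised melody contour) and measures how many short windows of the query also occur in the reference. The function names, the window size, and the sample data are all illustrative assumptions.

```python
def shingles(features, size=3):
    """Return the set of overlapping windows ("shingles") of a feature sequence."""
    return {tuple(features[i:i + size]) for i in range(len(features) - size + 1)}


def overlap_score(query, reference, size=3):
    """Fraction of the query's shingles that also occur in the reference (0.0 to 1.0)."""
    query_shingles = shingles(query, size)
    if not query_shingles:
        return 0.0  # query too short to compare
    shared = query_shingles & shingles(reference, size)
    return len(shared) / len(query_shingles)


# Hypothetical melody contours, invented for illustration (not real data).
original = [0, 2, 4, 5, 7, 5, 4, 2, 0, 0]
ai_cover = [9, 9, 0, 2, 4, 5, 7, 5, 4, 2]   # same melody, shifted in time
unrelated = [1, 1, 3, 1, 6, 6, 1, 3, 8, 8]

print(overlap_score(ai_cover, original))    # 0.75
print(overlap_score(unrelated, original))   # 0.0
```

Because the comparison looks at the content itself rather than at an embedded mark, there is nothing for an uploader to strip out, which is the point the quote makes about watermarks and artifacts.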

Which parts of the audio does it analyse, though? Pretty much all of them. Melody, phonetic, and voice matching are just a few of the facets that contribute to the analysis, which lets Pex detect AI-generated covers, vocal swaps in recordings, and unreleased tracks that mimic an artist's voice artificially.

Voice ID is another tool that protects the originality of a voice, with singer matching being one of the two capabilities the tech provides.

Singer matching determines whether one or multiple recordings feature the same singer or singers, regardless of musical style or language. It can accurately identify matching segments, even if they are only ten seconds long, and compare original voices with AI-generated ones.
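Segment-level matching of this kind can be sketched as follows. This is an illustration under stated assumptions, not Pex's implementation: it assumes each ten-second segment has already been turned into a numeric voice embedding by some model, and it compares segments with cosine similarity. The function names, the 0.9 threshold, and the three-dimensional vectors are all made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def matching_segments(segments_a, segments_b, threshold=0.9):
    """Return (index_a, index_b, score) for segment pairs that sound like one singer."""
    matches = []
    for i, embedding_a in enumerate(segments_a):
        for j, embedding_b in enumerate(segments_b):
            score = cosine_similarity(embedding_a, embedding_b)
            if score >= threshold:  # assumed cut-off, chosen for illustration
                matches.append((i, j, round(score, 3)))
    return matches


# Hypothetical per-segment voice embeddings for two recordings.
recording_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
recording_b = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]

print(matching_segments(recording_a, recording_b))
```

Here only the first segment of each recording is reported as a match. Note that the result says the two segments share a voice, not whether that voice is human or synthetic.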

Singer identification, the second Voice ID capability, determines the specific identity of the singer or singers in an audio file, which requires a reference database containing the voices and identities of known singers. Voice ID extracts biometric vocal traits and turns them into a digital fingerprint that allows singers to be identified without retaining the original audio. This makes it possible to identify known singers in any audio recording, including those featuring AI-generated vocals.
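A lookup against such a reference database might look roughly like the sketch below. Everything in it is an assumption made for illustration: the singer names, the three-number fingerprints, the distance threshold, and the use of a nearest-neighbour search are not taken from Pex. What it does show is the property described above, namely that the database holds only fingerprints and identities, never the audio.

```python
import math

# Hypothetical reference database: singer name -> stored vocal fingerprint.
# Only the numeric fingerprint is kept, not the audio it was derived from.
REFERENCE_DB = {
    "Singer A": [0.9, 0.1, 0.3],
    "Singer B": [0.2, 0.8, 0.5],
    "Singer C": [0.4, 0.4, 0.9],
}


def identify_singer(fingerprint, database, max_distance=0.25):
    """Return the closest known singer, or None if nobody is close enough."""
    best_name, best_distance = None, float("inf")
    for name, reference in database.items():
        distance = math.dist(fingerprint, reference)  # Euclidean distance
        if distance < best_distance:
            best_name, best_distance = name, distance
    return best_name if best_distance <= max_distance else None


print(identify_singer([0.85, 0.15, 0.35], REFERENCE_DB))  # Singer A
print(identify_singer([0.0, 0.0, 0.0], REFERENCE_DB))     # None
```

A cloned voice that sounds like Singer A would land near Singer A's fingerprint too, which is why identification works on AI-generated vocals but cannot, on its own, tell them apart from the real thing.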

Voice ID alone cannot differentiate between real and AI-generated vocals, but ACR complements it and makes the detection of AI-generated music possible. The combination comes in handy, for instance, when creators release an entirely new song using AI voices that sound like the vocals of popular artists.

"We identified this AI song using vocals that impersonate popular artists Drake, Travis Scott, and 21 Savage. Each artist has a distinct vocal style represented at various points in the track, and our advanced technology is able to identify each voice and the specific spots in the song in which they are singing or rapping," Pex shares in their article. "We also used a combination of our Voice ID technology and ACR to identify this AI cover of Johnny Cash singing “Barbie Girl” by Aqua. The vocal styles and genres could not be more different here, but thanks to the underlying melody and lyrics, we were still able to find this cover. We then identified the voice as Johnny Cash to further determine that this song was created with AI."
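The "Barbie Girl" example suggests one way the two signals could be combined: ACR says which known work a recording matches, Voice ID says whose voice it is, and a mismatch between the two is a reason to look closer. The Python sketch below encodes that rule. It is a guess at the reasoning, not Pex's actual decision logic, and `AcrMatch`, `flag_for_review`, and the credited-artist sets are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AcrMatch:
    """Hypothetical ACR result: the known work a recording was matched to."""
    work_title: str
    credited_artists: frozenset


def flag_for_review(acr_match: Optional[AcrMatch],
                    identified_singer: Optional[str]) -> bool:
    """Flag a recording when the identified voice is not credited on the matched work."""
    if acr_match is None or identified_singer is None:
        return False  # one of the two signals is missing, so nothing to compare
    return identified_singer not in acr_match.credited_artists


barbie_girl = AcrMatch("Barbie Girl", frozenset({"Aqua"}))

print(flag_for_review(barbie_girl, "Johnny Cash"))  # True
print(flag_for_review(barbie_girl, "Aqua"))         # False
print(flag_for_review(None, "Johnny Cash"))         # False
```

A genuine human cover by a known singer would trip the same rule, so a flag like this is better read as a prompt for further checks than as proof that a track was made with AI.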

But Pex isn't the only company that says it can identify an AI-generated voice. It's likely, though, that many of these tools only claim they can, because pinpointing AI-made vocals, and audio in general, is pretty complicated. Earlier this year, NBC News reported that "while dozens of tools and products have popped up to try to detect AI-generated audio, those programs are inherently limited and won't provide a surefire way for anyone to quickly and reliably determine whether audio they hear is from a real person." Reality Defender's CEO told NBC News that they "never say 100%," and their "highest probability is 99% because [they] never have the ground truth."

A study conducted by Kimberly Mai at University College London shows that people aren't good at detecting artificial voices, either. According to New Scientist, over 500 people took part in Mai's research and were tasked with telling authentic speech from manipulated speech in a series of audio clips. Some of the recordings featured a genuine female speaker reading standard sentences in English or Mandarin, while others were deepfakes synthesised by AI models trained on female voices. The takeaway is that people correctly identified AI-made and authentic voices only about 70% of the time for both the English and Mandarin voice samples.

Although it's complicated to identify synthetic voices in songs and videos, it's not impossible. The only caveat is that AI cloning tools are advancing faster than the technology that can hunt them down.