Spotify’s AI Vocal Coach

Spotify wants to make AI audiobooks more realistic.

Photo by Connor Lin/The Daily Upside


Spotify wants to add AI emotion to your listening experience. 

The streaming company is seeking to patent a system for training and executing “text-to-speech synthesis.” Spotify’s tech takes a passage of text and converts it into audio, aiming to do so in a way that actually sounds human and portrays the intent of the text. 

Here’s how it works: The system feeds text into a synthesizer built around an AI prediction network that converts the text into speech data. That speech data is then fed to a neural network-based vocoder, or another synthesizer built specifically for vocal data, which generates the speech while adding in attributes conveyed in the original text, such as emotion, intention, projection, pace and accent.
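To make the two-stage flow concrete, here is a minimal Python sketch. It is not Spotify's implementation: the names `SpeechAttributes`, `prediction_network`, `vocoder` and `synthesize` are hypothetical, and the arithmetic inside each stage is a toy stand-in for what would be a trained neural network. It only illustrates the hand-off from text, to intermediate speech data, to attribute-shaped audio samples.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SpeechAttributes:
    """Hypothetical container for the attributes the article lists."""
    emotion: str = "neutral"
    intention: str = "plain"
    projection: str = "normal"  # "whisper", "normal" or "shout"
    pace: float = 1.0           # larger means faster speech
    accent: str = "unspecified"


def prediction_network(text: str) -> List[float]:
    """Stage 1 stand-in: map each character to one value of 'speech data'."""
    return [(ord(ch) % 32) / 32.0 for ch in text]


def vocoder(frames: List[float], attrs: SpeechAttributes) -> List[float]:
    """Stage 2 stand-in: expand frames into samples shaped by the attributes."""
    gain = {"whisper": 0.3, "normal": 1.0, "shout": 1.8}[attrs.projection]
    # Faster pace means fewer samples per frame (toy assumption).
    samples_per_frame = max(1, round(4 / attrs.pace))
    samples: List[float] = []
    for value in frames:
        samples.extend([value * gain] * samples_per_frame)
    return samples


def synthesize(text: str, attrs: SpeechAttributes) -> List[float]:
    """Chain the two stages: text -> speech data -> audio samples."""
    return vocoder(prediction_network(text), attrs)


normal = synthesize("hi", SpeechAttributes())
whisper = synthesize("hi", SpeechAttributes(projection="whisper"))
fast = synthesize("hi", SpeechAttributes(pace=2.0))
```

In this toy version only projection and pace change the output; a real system would presumably learn how all of the attributes shape the audio rather than apply fixed rules.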

Spotify said the system can create speech that can convey emotions such as anger, happiness or sadness, intentions such as sarcasm, projections like whispering or shouting, and accents like French or British. The two-model process makes it sound more “natural, realistic and human-like,” Spotify notes. 

The models are trained on datasets of audio samples paired with corresponding text that represent different speech attributes. Training continues until a “performance metric” reaches a certain threshold, in other words, until the output sounds real enough.
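The "train until a metric reaches a threshold" idea can be sketched in a few lines of Python. This is an illustration under loud assumptions: the article does not say which metric or model the patent uses, so the sketch swaps in a one-parameter model, plain gradient descent, and mean squared error as the metric. The function name `train_until_threshold` and the number pairs standing in for text and audio samples are hypothetical.

```python
from typing import List, Tuple


def train_until_threshold(
    pairs: List[Tuple[float, float]],
    threshold: float = 0.01,
    learning_rate: float = 0.1,
    max_steps: int = 10_000,
) -> Tuple[float, float, int]:
    """Fit y = w * x by gradient descent, stopping once the error metric
    (mean squared error, an assumed stand-in) is at or below the threshold."""
    w = 0.0
    n = len(pairs)
    for step in range(1, max_steps + 1):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in pairs) / n
        w -= learning_rate * grad
        metric = sum((w * x - y) ** 2 for x, y in pairs) / n
        if metric <= threshold:
            return w, metric, step
    raise RuntimeError("metric never reached the threshold")


# Toy stand-in for (text, audio) training pairs: y is always 2 * x.
toy_pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
weight, final_metric, steps_taken = train_until_threshold(toy_pairs)
```

The stopping rule is the part that mirrors the article: the loop does not run for a fixed number of steps, it ends as soon as the measured quality is good enough.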

Photo via the U.S. Patent and Trademark Office.

This isn’t the first time we’ve seen Spotify take an interest in AI voice technology. Last June, the company acquired Sonantic, an AI voice platform, for an undisclosed sum. The company said in its announcement that the acquisition would allow it to “create high-quality experiences for our users by building on our existing technical capabilities.”

“This integration will enable us to engage users in a new and even more personalized way,” Ziad Sultan, Spotify’s VP of personalization, said at the time. Since then, the company debuted an AI-powered personal DJ backed by Sonantic’s technology. 

But the tech in this patent has the potential to speak more than just a few sentences in between songs. If it can effectively communicate emotion, this tech could, for example, “generate audiobooks on the fly that are really engaging and interesting,” said Jake Maymar, VP of Innovation at The Glimpse Group.

Spotify launched its audiobooks offering in September in the US with more than 300,000 titles, and has since expanded to several other countries. Given that the sector is growing by 20% year-over-year according to Spotify, the company may be looking to expand. Another option, Maymar noted, is that this tech could apply to its creator studio, Soundtrap. “Emotion is a really hard problem to solve,” said Maymar. “If you build it right, it’s in demand.”

There are, of course, downsides to this kind of tech, said Maymar. Any AI voice simulator that can create a realistic voice opens itself up to being used for deepfakes, he said. Plus, using AI to simulate voices could spell trouble for voice actors who literally rely on their voices to make a living.

“If you can create a very realistic sounding voice that’s emotional, it has some negative use cases,” he said. “But with great power (comes) great responsibility.”
