OpenAI DevDay: Announcements That Shape Audio & Video

On November 6 in San Francisco, OpenAI, which kicked off the generative AI boom with the release of ChatGPT almost a year earlier, hosted its first-ever developer conference. The event was packed with announcements of new models and developer products.

We won't cover every announcement in detail. Instead, we'll focus on those likely to have a direct impact on the future of audio and video.

New text-to-speech (TTS) model & API

Developers have gained the ability to create high-quality human-like speech from text using a text-to-speech API. The current text-to-speech model provides a choice of six preset voices and two model variants: tts-1 and tts-1-hd.

The tts-1 model is designed for real-time applications, while tts-1-hd prioritizes speech quality, offering a more natural, human-like sound at the expense of slightly higher latency. OpenAI's text-to-speech system also supports real-time audio streaming, and pricing starts at $0.015 per 1,000 input characters.
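To make the choice between the two model variants concrete, here is a minimal sketch of a request to the speech endpoint using only the Python standard library. The endpoint path, the "alloy" voice name, and the OPENAI_API_KEY environment variable reflect our understanding of OpenAI's public API rather than anything stated at the event, and the helper names are hypothetical.

```python
import json
import os
import urllib.request

# Assumed endpoint for OpenAI's text-to-speech API.
SPEECH_URL = "https://api.openai.com/v1/audio/speech"
# The two model variants named in the announcement.
MODELS = {"tts-1", "tts-1-hd"}


def build_speech_request(text, voice="alloy", model="tts-1", api_key=None):
    """Build (but do not send) a POST request for the speech endpoint.

    build_speech_request is a hypothetical helper, not part of any SDK.
    """
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    key = api_key or os.environ.get("OPENAI_API_KEY", "")
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(text, path, **kwargs):
    """Send the request and write the returned audio bytes to `path`."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request) as response:
        with open(path, "wb") as out:
            out.write(response.read())
```

With a valid API key set, calling synthesize_to_file("Hello from DevDay", "hello.mp3", model="tts-1-hd") should save the spoken audio to disk. Switching the model argument back to tts-1 trades some quality for lower latency.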

To put this in perspective, the leading text-to-speech service, ElevenLabs, is currently priced at $0.165 per 1,000 characters, making it more than ten times more expensive than OpenAI's offering. The API is now available for all developers.
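The price gap is easy to check with a few lines of arithmetic. The sketch below uses the two per-1,000-character prices quoted above; the 50,000-character script length is a made-up figure chosen only for illustration, and tts_cost is a hypothetical helper.

```python
# Prices per 1,000 characters, as quoted in this article.
OPENAI_PRICE_PER_1K = 0.015
ELEVENLABS_PRICE_PER_1K = 0.165


def tts_cost(num_chars, price_per_1k):
    """Return the cost in dollars of synthesizing `num_chars` characters."""
    return num_chars / 1000 * price_per_1k


# Hypothetical workload: a 50,000-character script.
chars = 50_000
openai_cost = tts_cost(chars, OPENAI_PRICE_PER_1K)
elevenlabs_cost = tts_cost(chars, ELEVENLABS_PRICE_PER_1K)

print(f"OpenAI:     ${openai_cost:.2f}")
print(f"ElevenLabs: ${elevenlabs_cost:.2f}")
print(f"Ratio:      {elevenlabs_cost / openai_cost:.1f}x")
```

Because both services are priced linearly here, the ratio of roughly eleven to one holds at any volume under these quoted rates.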

Combined with OpenAI's new custom GPTs, which were also announced during the event, this points to a wave of OpenAI-powered voice assistants in the months to come.

Whisper v3: Advanced speech recognition

Whisper, OpenAI's advanced Automatic Speech Recognition (ASR) model, made a significant mark in the speech-to-text industry after its open-source release in September 2022.

This release empowered numerous speech-driven applications. Now, OpenAI has unveiled Whisper large-v3, the next-generation ASR model that shows enhanced performance across multiple languages.

The model is already available via the Whisper package on GitHub. API access will come in the "near future."
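For readers who want to try the new model locally, the following is a minimal sketch that assumes the open-source whisper Python package is installed and that the new checkpoint is exposed under the name "large-v3". The audio path and both helper functions are hypothetical; the segment fields (start, end, text) follow the structure the package returns from transcribe.

```python
def transcribe(path, model_name="large-v3"):
    """Transcribe an audio file with the open-source Whisper package.

    The import is done lazily so the formatting helper below can be
    used without the package (and its model weights) installed.
    """
    import whisper  # assumed installed from the Whisper GitHub repository

    model = whisper.load_model(model_name)
    return model.transcribe(path)


def format_segments(segments):
    """Turn Whisper-style segments into timestamped lines of text."""
    lines = []
    for seg in segments:
        start, end = seg["start"], seg["end"]
        lines.append(f"[{start:7.2f} -> {end:7.2f}] {seg['text'].strip()}")
    return lines
```

A typical use would be result = transcribe("interview.mp3") followed by printing each line of format_segments(result["segments"]). Note that the large checkpoints are sizeable downloads and run much faster on a GPU.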

DALL·E 3 API

Developers can now integrate DALL·E 3, which OpenAI recently rolled out to ChatGPT Plus and Enterprise users, directly into their apps and products through OpenAI's Images API.

Companies like Snap, Coca-Cola, and Shutterstock have harnessed the power of DALL·E 3 to automatically generate images and designs for their customers and marketing campaigns. Just like the previous DALL·E version, the API includes a moderation feature to assist developers in safeguarding their applications from potential misuse.

The DALL·E 3 API offers various options in terms of image format, quality, and resolution, ranging from 1024×1024 to 1792×1024. Prices for generating images start at $0.04 per image.

However, unlike the DALL·E 2 API, DALL·E 3 cannot be used to create edited versions of existing images or generate variations of pre-existing images.
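The options described above map onto a small set of request parameters. The sketch below builds a generation request with the standard library only; the endpoint path, the "dall-e-3" model identifier, the "standard" and "hd" quality values, and the 1024x1792 portrait size reflect our understanding of the Images API and are not taken from the announcement text. The helper names are hypothetical.

```python
import json
import os
import urllib.request

# Assumed endpoint for image generation in OpenAI's Images API.
IMAGES_URL = "https://api.openai.com/v1/images/generations"
# Sizes and quality levels we believe DALL-E 3 accepts (assumption).
SIZES = {"1024x1024", "1792x1024", "1024x1792"}
QUALITIES = {"standard", "hd"}


def build_image_request(prompt, size="1024x1024", quality="standard", api_key=None):
    """Build (but do not send) a DALL-E 3 generation request."""
    if size not in SIZES:
        raise ValueError(f"unsupported size: {size}")
    if quality not in QUALITIES:
        raise ValueError(f"unsupported quality: {quality}")
    key = api_key or os.environ.get("OPENAI_API_KEY", "")
    payload = {
        "model": "dall-e-3",
        "prompt": prompt,
        "size": size,
        "quality": quality,
        "n": 1,  # one image per request in this sketch
    }
    return urllib.request.Request(
        IMAGES_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def generate_image_url(prompt, **kwargs):
    """Send the request and return the URL of the generated image."""
    request = build_image_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["data"][0]["url"]
```

Validating size and quality on the client side gives a clearer error before any billable request is made. There is deliberately no edit or variation helper here, since those operations are not offered for DALL·E 3.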

Copyright Shield for businesses

One of the most important updates announced at the developer conference was a new initiative called Copyright Shield.

The program aims to protect businesses that rely on OpenAI's products from potential copyright disputes. OpenAI has promised to cover the legal expenses of customers who use its developer platform or ChatGPT Enterprise and face intellectual property lawsuits over content generated with OpenAI's tools.
