OpenAI Launches New Voice Intelligence Features in Its API

Published: 2026-05-07

Summary

OpenAI rolled out next-generation audio models in the API, including a new speech-to-speech model called GPT-Realtime. The updated Realtime API is now generally available (GA) and adds support for remote MCP servers, image inputs, and SIP-based phone calling. GPT-Realtime is positioned as cheaper and more reliable than prior top-tier voice offerings, enabling developers to build low-latency voice agents that can follow complex instructions, switch languages mid-sentence, and sound more natural. For many use cases, a single speech-to-speech model replaces the previous multi-step pipeline (STT → LLM → TTS), in which audio was transcribed, passed to a language model as text, and then synthesized back into speech.
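
The structural difference between the two designs can be sketched in a few lines of Python. This is an illustration only: the stage functions are hypothetical string-transforming stubs, not calls to the Realtime API or any OpenAI SDK, and the names `run_pipeline`, `cascaded`, and `direct` are assumptions introduced here. The sketch simply shows that a cascaded voice agent chains three model hops per turn, while a speech-to-speech agent makes one.

```python
from typing import Callable, List, Tuple

# A stage is a (name, function) pair. The functions are hypothetical stubs
# that stand in for model calls; they only wrap strings so the data flow is visible.
Stage = Tuple[str, Callable[[str], str]]


def run_pipeline(stages: List[Stage], payload: str) -> Tuple[str, List[str]]:
    """Pass the payload through each stage in order and record which stages ran."""
    trace: List[str] = []
    for name, run in stages:
        payload = run(payload)
        trace.append(name)
    return payload, trace


# Cascaded design: speech-to-text, then a language model, then text-to-speech.
cascaded: List[Stage] = [
    ("stt", lambda audio: f"text({audio})"),
    ("llm", lambda text: f"reply({text})"),
    ("tts", lambda text: f"audio({text})"),
]

# Speech-to-speech design: one model takes audio in and produces audio out.
direct: List[Stage] = [
    ("speech_to_speech", lambda audio: f"audio(reply({audio}))"),
]

cascaded_out, cascaded_trace = run_pipeline(cascaded, "user_audio")
direct_out, direct_trace = run_pipeline(direct, "user_audio")

print(len(cascaded_trace), "hops:", " -> ".join(cascaded_trace))
print(len(direct_trace), "hop:", " -> ".join(direct_trace))
```

Each hop in the cascaded version is a separate model invocation with its own latency, and the intermediate text step discards tone and emphasis from the original audio, which is the usual argument for collapsing the chain into a single model where the use case allows it.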

Key Data Points

New model: GPT-Realtime (speech-to-speech)
API updates: Realtime API GA; remote MCP servers, image inputs, SIP phone calling
Benefits: Lower latency, lower cost, more natural and expressive speech, mid-sentence language switching
Use cases: Voice customer service bots, travel/booking agents, real-time assistants

Enrichment Snippets

OpenAI: “Today, we’re launching new speech-to-text and text-to-speech audio models in the API—making it possible to build more powerful, customizable …”
Inc: “OpenAI Announces GPT-Realtime, Its Best Voice AI Model Yet … cheapest voice AI model yet.”
Salesforce Ben: “The company’s Realtime API … supports remote MCP servers, image inputs, and phone calling through SIP, making voice agents more capable.”

Relevance

Impact: MEDIUM — Major developer platform update; lowers barrier to production voice agents and expands agentic AI modalities.