OpenAI and LiveKit partner to turn Advanced Voice into an API

LiveKit and OpenAI are partnering to help you build your own apps using the same technology powering ChatGPT’s new Advanced Voice feature.

There’s a philosophy at OpenAI that I love:

Give developers access to the same tools and APIs that we use in our own apps.

It’s with that spirit that today, we’re announcing a partnership between LiveKit and OpenAI to give you the tools to build your own apps using the same end-to-end technology powering ChatGPT’s new Advanced Voice feature.

We are releasing a new Multimodal Agent API which has built-in support for OpenAI’s brand new Realtime API. You can now build apps with GPT-4o that listen and speak to users in realtime, just like ChatGPT.

You can start experimenting with these new capabilities in our open source playground or jump into our guide to start building your first realtime voice AI app. Read on to learn how we designed and built this new API with OpenAI.

How Advanced Voice works


When you use Advanced Voice, ChatGPT understands not just what you say but how you say it, and can respond in ~300ms (roughly the latency of natural human conversation) while expressing a range of human emotions. Here’s how the feature works under the hood:

High-level architecture diagram of Advanced Voice

  1. A user's speech is captured by a LiveKit client SDK in the ChatGPT app.
  2. Their speech is streamed over LiveKit Cloud to OpenAI’s voice agent.
  3. The agent relays the speech prompt to GPT-4o.
  4. GPT-4o runs inference and streams speech packets back to the agent.
  5. The agent relays generated speech over LiveKit Cloud back to the user’s device.

Facilitating the workflow above requires a different architecture than the traditional HTTP request-response model used when you text with ChatGPT.

For communication between their voice agent and GPT-4o, OpenAI uses a WebSocket, which they are now exposing to developers in the Realtime API. The voice agent constantly streams audio input to GPT-4o, while simultaneously receiving audio output from the model.
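
To make the contrast with HTTP concrete, here’s a rough sketch of what streaming to GPT-4o directly over that WebSocket looks like in Python. The endpoint, headers, and event names follow OpenAI’s public Realtime API docs at the time of writing, so treat the specifics as assumptions and check the current reference:

import asyncio
import json
import os

import websockets  # pip install websockets

async def main():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: websockets versions before 14 call this kwarg `extra_headers`.
    async with websockets.connect(url, additional_headers=headers) as ws:
        # Continuously append base64-encoded PCM16 audio captured from the user...
        await ws.send(json.dumps({"type": "input_audio_buffer.append", "audio": "..."}))
        # ...while simultaneously reading server events, which carry the model's
        # streamed audio output (among other event types).
        async for message in ws:
            event = json.loads(message)
            if event.get("type") == "response.audio.delta":
                audio_chunk = event["delta"]  # base64-encoded audio to decode and play
                ...

asyncio.run(main())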

For server-to-server applications where there is little to no packet loss, a WebSocket is sufficient. However, for an end-user application like Advanced Voice, we need to capture audio input from and stream audio output to a client device. WebSocket isn’t the best choice here because of how it handles packet loss: the vast majority of packet loss occurs between a server and a client device, and WebSocket gives you no programmatic way to intervene in lossy network environments like WiFi or cellular. Packet loss leads to higher latency and choppy or garbled audio.

To overcome this limitation, OpenAI uses another protocol named WebRTC. WebRTC was specially designed for transferring audio with ultra-low latency between servers and clients. Protocol implementations have built-in codec support, adapt bitrates to network conditions, and are available across most platforms. However, using it directly comes with a lot of complexity and scaling challenges.

LiveKit is open source infrastructure that simplifies WebRTC and LiveKit Cloud is a global network of servers optimized to reliably route audio with the lowest latency possible, at large scale.

To send and receive audio from the end user’s device, OpenAI integrates a LiveKit client SDK into the ChatGPT app. On the backend, another LiveKit SDK, designed for using WebRTC in server environments, receives streaming audio from the user and streams audio back to them. Both user and agent connect to one another using WebRTC over LiveKit Cloud.
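
As a simplified sketch of that server-side pattern, here’s roughly what receiving a user’s audio looks like with the LiveKit Python SDK. The Agents framework introduced below wraps this plumbing, plus the Realtime API connection, for you:

from livekit import rtc  # pip install livekit

async def run_backend_agent(url: str, token: str) -> rtc.Room:
    room = rtc.Room()

    @room.on("track_subscribed")
    def on_track_subscribed(
        track: rtc.Track,
        publication: rtc.RemoteTrackPublication,
        participant: rtc.RemoteParticipant,
    ):
        if track.kind == rtc.TrackKind.KIND_AUDIO:
            # Read raw audio frames from the user and forward them to the model;
            # audio generated by the model is published back into the same room.
            audio_stream = rtc.AudioStream(track)
            ...

    await room.connect(url, token)
    return room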

Using LiveKit with OpenAI’s Realtime API


Today, we’re launching a new API in the Agents framework and new frontend hooks and components that make it easy for anyone to build a product like Advanced Voice.

Backend


The new Multimodal Agent API, available in Python and Node, completely wraps OpenAI’s Realtime API, abstracting away the raw wire protocol and providing a clean “pass-through” interface to GPT-4o.

The MultimodalAgent class dynamically handles both streaming text and audio modalities. You can send either modality as input and receive output in either modality. When using voice, the API automatically time-aligns text transcriptions with audio as it’s played back on a user’s device. This comes in handy for features like closed captioning. If your user interrupts GPT-4o by speaking during playback, the API will detect it and synchronize state with GPT-4o, ensuring its context window is rolled back to the point of interruption.

All Realtime API parameters like voice selection, temperature, and turn detection config are transparently supported. Regarding turn detection, we’ll soon provide agent-side turn detection with the same defaults as OpenAI and expose even more settings.

MultimodalAgent also integrates with existing features of the Agents framework:

  • Buffered playback. GPT-4o generates audio faster than it can be played back. Our SDKs automatically buffer, stream, handle user interruptions, and play back audio with the correct timing.
  • Function calling. You can define parameterized functions with invocation triggers specified in natural language. The Agents framework will map a GPT-4o tool call to your function and invoke it with the proper parameters (see the sketch after this list).
  • Load balancing. When you deploy your agents, they register with LiveKit servers. Once your user connects, a LiveKit server will dispatch a nearby agent, monitor its health, and handle failover and reconnections.
  • Integrated telephony. LiveKit has a telephony stack integrated with the Agents framework. Once you provision and configure a phone number with a provider like Twilio or Telnyx, your users can talk to GPT-4o over the phone.
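
Here’s a hedged sketch of how function calling fits together. The llm.FunctionContext, ai_callable, and fnc_ctx names are taken from the Agents framework’s function-calling docs at the time of writing, so treat them as assumptions and check the current reference:

from typing import Annotated

from livekit.agents import llm

class AssistantFnc(llm.FunctionContext):
    # The docstring and parameter descriptions are the natural-language triggers
    # GPT-4o uses to decide when, and with what arguments, to call the function.
    @llm.ai_callable()
    async def get_weather(
        self,
        location: Annotated[
            str, llm.TypeInfo(description="The city to look up the weather for")
        ],
    ):
        """Called when the user asks about the weather in a given location."""
        return f"It is sunny in {location} today."

# Pass the function context to the agent alongside the Realtime model, e.g.
# MultimodalAgent(model=RealtimeModel(...), fnc_ctx=AssistantFnc())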

Here’s a simple example using a MultimodalAgent in Python:

from livekit.agents import MultimodalAgent, AutoSubscribe, JobContext, WorkerOptions, WorkerType, cli
from livekit.plugins.openai.realtime import RealtimeModel

async def entrypoint(ctx: JobContext):
    # Connect to the room and subscribe only to audio tracks.
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
    agent = MultimodalAgent(
        model=RealtimeModel(
            instructions="...",
            voice="alloy",
            temperature=0.8,
            modalities=["text", "audio"],
        )
    )
    agent.start(ctx.room)

if __name__ == "__main__":
    # Run one agent instance per room.
    cli.run_app(WorkerOptions(
        entrypoint_fnc=entrypoint,
        worker_type=WorkerType.ROOM,
    ))

Load balancing is built directly into LiveKit’s media server, so you can run this agent the exact same way on localhost, a self-hosted deployment, or on LiveKit Cloud.
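
To try this locally, run the worker with the CLI that cli.run_app wires up. At the time of writing it exposes a dev mode (python agent.py dev) that connects the agent to your LiveKit project using the LIVEKIT_URL, LIVEKIT_API_KEY, and LIVEKIT_API_SECRET environment variables; check the Agents framework docs for the current commands.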

Frontend


On the frontend, we have new hooks, mobile components, and visualizers that simplify wiring up your client application to your agent. Here’s an example in Next.js:

import { LiveKitRoom, RoomAudioRenderer, BarVisualizer, useVoiceAssistant } from '@livekit/components-react'

function MultimodalAgent() {
  const { audioTrack, state } = useVoiceAssistant()
  return (
    <div>
      <BarVisualizer trackRef={audioTrack} state={state} />
    </div>
  )
}

export default function IndexPage({ token }) {
  return (
    <LiveKitRoom serverUrl={process.env.NEXT_PUBLIC_LIVEKIT_URL} connect={true} audio={true} token={token}>
      <MultimodalAgent />
      <RoomAudioRenderer />
    </LiveKitRoom>
  )
}
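
The token passed to LiveKitRoom above is a standard LiveKit access token, minted by your backend rather than in the browser. Here’s a rough sketch using the livekit-api Python package (helper names follow LiveKit’s server SDK docs at the time of writing, so verify them against the current reference):

import os

from livekit import api  # pip install livekit-api

def create_token(identity: str, room_name: str) -> str:
    # Grant the user permission to join a specific room, then sign a JWT.
    return (
        api.AccessToken(os.environ["LIVEKIT_API_KEY"], os.environ["LIVEKIT_API_SECRET"])
        .with_identity(identity)
        .with_grants(api.VideoGrants(room_join=True, room=room_name))
        .to_jwt()
    )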

For examples of code similar to the above running in production, check out LiveKit’s homepage or the open-source Realtime API playground. We also have a more detailed end-to-end guide on building your own agent using the Multimodal Agent API.

Use cases


With the capabilities of Advanced Voice now available as an API, we’re going to see developers build incredible AI-native applications. GPT-4o’s joint training across multiple modalities significantly reduces inference latency and imbues it with the ability to understand and communicate with human emotion across languages. There are a few areas where I think we’ll see this new technology immediately applied:

  • Customer support and telehealth. Voice (over the phone) is the default modality in these industries, and a flexible, more empathetic automated service provider will be a big UX upgrade over existing phone trees and IVR systems.
  • Language learning. The fastest way to learn a new language is to be immersed in it, and languages around the world vary in both structure and how they’re spoken. Before now, there was no way for AI to approximate a native speaker.
  • Video game NPCs. Imagine rich, dynamic storylines throughout a game world, brought to life by humanlike characters you encounter and converse with in realtime.
  • Therapy and meditation. Our internal thoughts are a quintessential part of the human experience. Being able to engage with an AI, without fear of judgement, to work on your mental health will have a broad positive impact on society.

Whether you’re working on one of these or other ideas using OpenAI’s Realtime API and LiveKit, we’d love to hear from you and help however we can. Come say ‘hi’ in our Slack community or DM/mention @livekit on X. We can’t wait to see what you build!