Bringing AI avatars to voice agents
Introducing integration with Tavus
Video avatars have moved well past the gimmick stage: they're tools that developers and businesses genuinely want. We've been hearing this from customers across education, healthcare, mental wellness, and marketing, all looking to make their voice interactions more visual and engaging.
Why avatars?
While voice is the primary medium of human communication, visual cues also play an important role. Facial expressions, gestures, and eye contact add context that helps convey intent, emotion, and engagement. When integrating AI voice agents into applications, adding visual avatars can improve the user experience by making interactions feel more natural and engaging.
Integrating avatars into the Agents Framework
We're introducing a simple way to add AI avatars to your LiveKit agents, starting with support for Tavus, a leading provider of realtime AI avatar models.
The integration is designed as a plugin that captures an agent’s audio output and forwards it to the avatar model for video synthesis. This approach minimizes boilerplate while preserving full control over your agent logic.
You can attach a Tavus avatar to any LiveKit agent with just a few lines of code.
from livekit.plugins import tavus

avatar = tavus.AvatarSession(
    replica_id="r4c41453d2",  # the Tavus replica (visual likeness) to render
    persona_id="p2fbd605",    # the Tavus persona configuration to use
)
await avatar.start(session, room=ctx.room)
On the frontend, avatars are published to the room just like any other video track on LiveKit. This means they can be rendered with any of LiveKit’s client SDKs. We've also added support for avatar rendering in our voice AI starter app. Below is the full code example for rendering either an agent avatar or an audio visualizer in the same UI:
import { BarVisualizer, useVoiceAssistant, VideoTrack } from '@livekit/components-react';

function AgentVisualizer() {
  const { state: agentState, videoTrack, audioTrack } = useVoiceAssistant();

  // If the agent publishes an avatar video track, render it.
  if (videoTrack) {
    return (
      <div className="h-[512px] w-[512px] rounded-lg overflow-hidden">
        <VideoTrack trackRef={videoTrack} />
      </div>
    );
  }

  // Otherwise, fall back to an audio-only visualizer.
  return (
    <div className="h-[300px] w-full">
      <BarVisualizer
        state={agentState}
        barCount={5}
        trackRef={audioTrack}
        className="agent-visualizer"
        options={{ minHeight: 24 }}
      />
    </div>
  );
}
Behind the scenes
Avatar generation introduces additional processing, so we designed the system with a strong focus on minimizing latency. Even small delays can disrupt the natural flow of a conversation. To reduce overhead, we applied several optimizations—some of which take advantage of LiveKit’s unique capabilities.
Remote generation
Most avatar models require specialized GPUs and typically run on a separate machine from the agent code. To support this, the avatar system is designed to offload video synthesis to a remote server.
A naive implementation might follow this workflow:
- Capture the agent’s audio output and send it to a remote server via websocket
- Receive the generated video from the remote server
- Publish both audio and video via WebRTC
In this setup, step 2 becomes a bottleneck for two reasons:
- Increased latency: Waiting to receive video frames before republishing introduces a delay.
- Additional encoding overhead: Transmitting the video over the network requires encoding and decoding, which further increases latency and can degrade quality.
To avoid incurring additional latency, we took a more efficient approach:
- The avatar generation server joins the same room as a participant
- The agent sends audio output to the generation server via a ByteStream
- The generation server directly publishes synchronized audio and video into the room
This setup avoids redundant encoding and transmission delays, while maintaining real-time performance and quality.
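For the curious, here is a minimal sketch of what that audio hand-off could look like, assuming a byte-stream API along the lines of stream_bytes in the LiveKit Python SDK. The stream name, the avatar participant identity, and the framing are illustrative; the Tavus plugin handles this wiring for you.

from livekit import rtc

AVATAR_IDENTITY = "tavus-avatar"  # illustrative identity for the avatar participant

async def forward_agent_audio(room: rtc.Room, audio_frames) -> None:
    # Open a byte stream addressed to the avatar generation server.
    writer = await room.local_participant.stream_bytes(
        name="agent-audio",
        destination_identities=[AVATAR_IDENTITY],
    )
    try:
        # audio_frames: an async iterator of rtc.AudioFrame objects
        # produced by the agent's TTS output.
        async for frame in audio_frames:
            await writer.write(bytes(frame.data))
    finally:
        await writer.aclose()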
Handling interruptions
Interruption handling is critical in conversational AI. Users may interrupt an agent at any time—to ask a question, change topics, or clarify something. LiveKit Agents natively support interruptions, but introducing a remote avatar server adds complexity: how does the server detect interruptions, and how can it maintain smooth playback when pre-synthesized frames need to be discarded?
To address this, we use LiveKit’s RPC system, which allows participants to invoke remote functions on other participants in the room. This lets the agent notify the avatar server of interruption events in real time, so it can cancel or adjust ongoing video synthesis as needed.
When the avatar server receives an interruption event, it discards any pre-synthesized frames and begins generating new frames that reflect the avatar’s listening state. This transition must be precisely timed to avoid visible glitches or jerky video playback.
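A rough sketch of that flow, using the RPC primitives in the LiveKit Python SDK (register_rpc_method and perform_rpc); the method name, the participant identity, and the video_pipeline hooks are made up for illustration:

from livekit import rtc

AVATAR_IDENTITY = "tavus-avatar"  # illustrative identity for the avatar participant

def register_interrupt_handler(room: rtc.Room, video_pipeline) -> None:
    # Avatar-server side: expose an RPC method the agent can call on interruption.
    async def _on_interrupt(data: rtc.RpcInvocationData) -> str:
        # Drop frames synthesized for the cancelled response and return
        # the avatar to its listening state (hypothetical pipeline hooks).
        video_pipeline.discard_pending_frames()
        video_pipeline.enter_listening_state()
        return "ok"

    room.local_participant.register_rpc_method("avatar.interrupt", _on_interrupt)

async def notify_interruption(room: rtc.Room) -> None:
    # Agent side: called when the user interrupts the current response.
    await room.local_participant.perform_rpc(
        destination_identity=AVATAR_IDENTITY,
        method="avatar.interrupt",
        payload="",
    )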
Coordinating playback state
In voice-based interactions, it's important for agents to know when a response has been fully heard. Unlike text, where content is delivered instantly and cannot be interrupted, audio playback happens in real time—making timing a critical part of the conversation flow. Playback completion also signals when content should be committed to the chat context.
To maintain compatibility with our SpeechHandle APIs—which allow agents to await audio playout—we need to track when the remote synthesis server finishes playback.
To do this, we again use LiveKit’s RPC system—this time in reverse. When playback completes, the avatar server sends an RPC to the agent. The agent receives this signal and uses it to resolve internal state and APIs accordingly.
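Sketched with the same RPC primitives, this time initiated from the avatar server (again, the method name, identities, and the playout-done hook are illustrative, not the plugin's actual API):

from livekit import rtc

AGENT_IDENTITY = "agent"  # illustrative identity for the agent participant

async def report_playback_finished(room: rtc.Room) -> None:
    # Avatar-server side: signal that the synthesized audio has fully played out.
    await room.local_participant.perform_rpc(
        destination_identity=AGENT_IDENTITY,
        method="agent.playback_finished",
        payload="",
    )

def register_playback_handler(room: rtc.Room, speech_handle) -> None:
    # Agent side: resolve the awaited speech handle once remote playout completes.
    async def _on_playback_finished(data: rtc.RpcInvocationData) -> str:
        speech_handle.mark_playout_done()  # hypothetical hook on the agent's internal state
        return "ok"

    room.local_participant.register_rpc_method(
        "agent.playback_finished", _on_playback_finished
    )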
Wrapping up
Adding avatar support to LiveKit Agents presented unique challenges, especially around latency, interruptions, and synchronization. We had quite a bit of fun building this integration to handle those complexities for you, making it easier to create responsive, realtime avatar experiences.
Give it a try with Tavus, and let us know how it works for your use case!