LiveKit SDK for ESP32: bringing voice AI to embedded devices

Since the launch of the LiveKit Agents framework, we’ve seen developers build voice AI experiences on web pages, mobile apps, and even embedded Linux devices like the Raspberry Pi Zero 2W. But we kept getting asked: can LiveKit run on even smaller microcontrollers like the ESP32? Can you build a microcontroller AI agent?

We listened. Over the past several months, we’ve worked closely with Espressif Systems to bring a full-featured LiveKit SDK to the ESP32 platform. Built on top of Espressif’s hardware-optimized WebRTC and media components, the SDK ensures full compatibility with other platforms and reliable performance. With our latest release, ESP32 developers can build voice AI interactions with the same features and functionality available in our other client SDKs.

Challenges with WebRTC for embedded devices

WebRTC, as its name suggests, was originally designed for the web, targeting devices with ample memory and processing power. On embedded platforms, however, tight memory constraints and the absence of hardware acceleration for media processing made deploying WebRTC impractical. Over the past year, Espressif has released several ready-to-use components, such as esp_capture and esp_peer, that make it easy to add media capture and WebRTC publishing to your embedded ESP32 project.

Building on these advances, our ESP32 WebRTC implementation has been carefully optimized to minimize the memory required to establish and maintain a room connection. At the protocol level, Protobuf encoding and decoding are tightly managed to reduce overhead: partial decoding skips unused fields, while dynamically sized fields (e.g., strings and repeated fields) use stack allocation whenever possible.

Why ESP32?

With over a billion ESP32-powered devices worldwide, the platform already has a thriving ecosystem of developers and projects. Espressif has invested heavily in robust, well-supported libraries for everything from WebRTC media streaming and audio processing algorithms to audio/video encoding. And with the latest-generation ESP32-P4, realtime IoT voice AI is now possible on devices powered by a chip costing less than ten dollars.

ESP32 + LiveKit Use Cases

  • Smart voice assistant device: Low-cost portable voice assistant hardware that connects users with AI agents in the cloud.
  • Video interface with AI avatar: Devices with cameras and displays that connect users to an AI video avatar, for applications like callboxes, drive-thru ordering, and virtual doormen.
  • Smart security camera: Devices that stream video to an AI agent in the cloud for real-time scene understanding and semantic metadata extraction.

LiveKit SDK for ESP32-S3 & ESP32-P4 Features

With this release of the LiveKit SDK for ESP32, we have enabled the core set of features needed for voice agents, including:

  • Support for popular ESP32-S3 & ESP32-P4 devkits
    • Support for a wide range of cameras, audio codecs, and displays
  • Bi-directional audio streaming with Opus encoding
  • Video streaming with hardware H.264 encoding on ESP32-P4 boards
  • Support for the same providers & models as our standard SDKs
  • Data message publishing and receiving
  • RPC method registration & calling

Coming soon

  • Bi-directional video streaming
  • Video avatar support

Adding support for the ESP32 was a challenge, especially ensuring that everything remained reliable and performant in a highly resource-constrained environment. We’re excited to see what amazing projects will be built now that low-cost embedded devices can easily connect to multi-modal AI agents in the cloud.

If you haven’t already, join our Slack community and use the #robotics channels to ask questions and share feedback.

Getting Started