Jarvis. Samantha. Joi. HAL.
While we’ve enjoyed exchanging texts with ChatGPT, the LiveKit team thought it would be more fun to see and speak with it. Conveniently, we work on a suite of developer tools for building real-time video and audio applications, which made building this much easier.
Meet KITT (Kaleidoscopic, Interconnected Talking Transformer): an AI that you or your group can have live conversations with. KITT can do a lot of neat things, including:
- Answer questions like Siri, Alexa, or Google Assistant
- Take notes on or summarize what was discussed in a meeting
- Speak multiple languages and even act like a third-party translator
The rest of this post will dive into how KITT works under the hood, but if you’re eager to jump straight to the code, that’s here: https://github.com/livekit-examples/kitt
We wanted to keep the client as “thin” as possible. Ideally, KITT, or any other bot built by another developer, could hook into a LiveKit session and publish a video and/or audio track, analogous to a human user sharing their camera or microphone streams.
KITT also needs to pull down audio streams from every user in the session in order to convert that speech to text and potentially dispatch a prompt to GPT. We used LiveKit’s Go SDK, which is built on Pion, allowing us to behave like a WebRTC client and join sessions from the backend. Overall, the application architecture looks like this:
Whenever a new session starts and the first user joins, we use a webhook to have KITT join it too:
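A LiveKit webhook receiver is essentially an HTTP endpoint that parses signed JSON events. Here's a minimal sketch of the dispatch logic, assuming the payload fields shown below (verify them against your SDK version's webhook docs); actually connecting KITT to the room via the Go SDK is elided, and a production handler must also verify the webhook's signed Authorization header before trusting the payload:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// webhookEvent mirrors the two fields we care about in a LiveKit
// webhook payload (shape assumed here; check your SDK version's docs).
type webhookEvent struct {
	Event string `json:"event"`
	Room  struct {
		Name string `json:"name"`
	} `json:"room"`
}

// roomToJoin returns the room name KITT should join, if and only if
// the payload is a room_started event.
func roomToJoin(payload []byte) (string, bool) {
	var ev webhookEvent
	if err := json.Unmarshal(payload, &ev); err != nil {
		return "", false
	}
	if ev.Event != "room_started" {
		return "", false
	}
	return ev.Room.Name, true
}

func main() {
	payload := []byte(`{"event":"room_started","room":{"name":"demo"}}`)
	if name, ok := roomToJoin(payload); ok {
		// In the real service, this is where KITT connects to the
		// room as a bot participant using the LiveKit Go SDK.
		fmt.Println("KITT should join:", name)
	}
}
```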
Now that KITT is connected to the session, they need to subscribe to all audio tracks from each user:
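The actual subscription happens through the Go SDK's track-subscribed callback; the sketch below uses a simplified, hypothetical trackInfo type standing in for the SDK's types, and shows the bookkeeping of keeping one audio track per participant so each can feed its own STT stream:

```go
package main

import "fmt"

// trackInfo is a simplified stand-in for the metadata the LiveKit Go
// SDK delivers when a track is subscribed (field names assumed here).
type trackInfo struct {
	Participant string // participant identity
	Kind        string // "audio" or "video"
	SID         string // track ID
}

// audioRegistry remembers which audio track belongs to which user so
// each one can be piped into its own speech-to-text stream.
type audioRegistry struct {
	byParticipant map[string]string // participant identity -> track SID
}

func newAudioRegistry() *audioRegistry {
	return &audioRegistry{byParticipant: map[string]string{}}
}

// onTrackSubscribed ignores video tracks and records audio ones,
// returning whether the track was registered.
func (r *audioRegistry) onTrackSubscribed(t trackInfo) bool {
	if t.Kind != "audio" {
		return false
	}
	r.byParticipant[t.Participant] = t.SID
	return true
}

func main() {
	reg := newAudioRegistry()
	reg.onTrackSubscribed(trackInfo{Participant: "russ", Kind: "audio", SID: "TR_1"})
	reg.onTrackSubscribed(trackInfo{Participant: "russ", Kind: "video", SID: "TR_2"})
	fmt.Println(len(reg.byParticipant)) // 1: only the audio track is kept
}
```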
With audio streaming in, this is where things get interesting…
Optimizing for latency
We wanted conversations with KITT to feel human, so our primary concern before writing any code was latency. In particular, there needed to be as little delay as possible between when a user spoke and when KITT responded. Until someone (OpenAI?) builds an audio-to-audio model, there are three spots in our pipeline where latency might creep in: 1) speech-to-text (STT) 2) GPT 3) text-to-speech (TTS).
We evaluated a few services for STT including Google Cloud, DeepGram, Web Speech, and Whisper. In service of making interactions with KITT feel more human, we were willing to sacrifice some transcription accuracy for lower latency. DeepGram’s model seems accurate, but in our trials it was much slower than Google’s service. OpenAI’s cloud-hosted Whisper API doesn’t currently support streaming recognition, so that was a non-starter: while GPT needs to be fed full-text prompts, capturing incremental transcriptions is faster than sending one long speech segment.
If everyone was accessing KITT via Chrome, the Web Speech API would be a decent choice but it’s not standardized across browsers. Some browsers use an on-device model, which is faster than processing speech on the server, but accuracy suffers disproportionately. Client-side STT also goes against our modular design goal (i.e. fully server-side bots).
Ultimately, Google Cloud’s STT was fast, accurate, and supported streaming recognition. When sending audio to the STT service, Google recommends a 100ms frame size for balanced latency and accuracy. In practice, we found that a 20ms frame size (WebRTC’s default encoding) was sufficient for our use case.
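To make the frame-size trade-off concrete: WebRTC encodes audio with Opus at a 48kHz clock rate, so a frame duration maps directly to a PCM sample count per channel:

```go
package main

import "fmt"

// sampleRate is the Opus clock rate WebRTC uses for audio.
const sampleRate = 48000

// samplesPerFrame converts a frame duration in milliseconds to the
// number of PCM samples per channel at that sample rate.
func samplesPerFrame(ms int) int {
	return sampleRate * ms / 1000
}

func main() {
	fmt.Println(samplesPerFrame(20))  // 960 samples: WebRTC's default 20ms frame
	fmt.Println(samplesPerFrame(100)) // 4800 samples: Google's recommended 100ms frame
}
```

Sending 20ms frames as they arrive, rather than buffering five of them into a 100ms chunk, shaves up to 80ms off the first transcription result.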
You can check out our implementation here:
GPT takes in text prompts and spits out text responses, but each model has different attributes:
We went with GPT-3.5 because it favors speed, but we had to make some tweaks. GPT streams its output tokens, and as with STT, text-to-speech is faster when performed incrementally rather than on one large blob of text. Unlike the STT phase, however, this is the final stage in our pipeline, so we don’t have to block on the full TTS output; we can stream audio segments to each user in real time as they’re generated. To avoid choppy or uneven speech, we chose to delimit audio segments by sentence:
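A minimal sketch of such a sentence delimiter, consuming streamed tokens from a channel and emitting a complete sentence whenever a terminator appears (a real implementation would also need to handle terminators arriving mid-token, abbreviations, and decimals):

```go
package main

import (
	"fmt"
	"strings"
)

// sentences reads GPT tokens from in and emits a complete sentence on
// the returned channel each time a sentence terminator is seen, so TTS
// can start on the first sentence while the model is still generating.
func sentences(in <-chan string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		var buf strings.Builder
		for tok := range in {
			buf.WriteString(tok)
			if strings.ContainsAny(tok, ".!?") {
				if s := strings.TrimSpace(buf.String()); s != "" {
					out <- s
				}
				buf.Reset()
			}
		}
		// Flush any trailing partial sentence when the stream ends.
		if s := strings.TrimSpace(buf.String()); s != "" {
			out <- s
		}
	}()
	return out
}

func main() {
	in := make(chan string)
	go func() {
		for _, t := range []string{"Hi", " there", ".", " How", " are", " you?"} {
			in <- t
		}
		close(in)
	}()
	for s := range sentences(in) {
		fmt.Printf("TTS <- %q\n", s)
	}
}
```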
The problem with delimiting by sentence is that the model we’re using isn’t very concise in its responses, which increases TTS latency. To work around this, we prime GPT like so:
There are probably better prompts to elicit shorter sentences from GPT, but the above performs well in practice. From here, it’s relatively straightforward to run each sentence through Google’s TTS and transmit audio responses to each user in the session:
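The shape of that final stage can be sketched as follows, with hypothetical Synthesizer and Publisher function types standing in for Google's TTS call and LiveKit's audio-publish API:

```go
package main

import "fmt"

// Synthesizer abstracts the TTS backend (Google Cloud TTS in KITT);
// Publisher abstracts writing synthesized audio to the LiveKit track.
// Both are stand-ins so the streaming shape is clear without SDK details.
type Synthesizer func(sentence string) []byte
type Publisher func(audio []byte)

// speak synthesizes each sentence as it arrives and publishes the
// audio immediately, so listeners hear the first sentence while later
// ones are still being generated.
func speak(sentences <-chan string, tts Synthesizer, publish Publisher) {
	for s := range sentences {
		publish(tts(s))
	}
}

func main() {
	ch := make(chan string, 2)
	ch <- "Hello!"
	ch <- "How can I help?"
	close(ch)

	// Placeholder synthesis: echo the sentence bytes back as "audio".
	fakeTTS := func(s string) []byte { return []byte(s) }
	speak(ch, fakeTTS, func(a []byte) {
		fmt.Printf("published %d bytes\n", len(a))
	})
}
```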
The general principle we followed to optimize for latency was: stream all the things. By minimizing the time it takes to receive, process, and transmit data at each step, we were able to keep latency to a minimum and create a seamless conversational experience.
Designing the client interface
With a working KITT backend that could plug into any LiveKit session, we saved a lot of time by using LiveKit Meet for this project instead of building a completely new UI. Meet is a Zoom-inspired sample application we built to show developers how to use LiveKit and to internally dogfood our infra. In this demo, when you start a meeting, KITT will automatically join it too.
If it’s a 1:1 meeting (i.e. only one human user), KITT will assume anything you say is directed at them and respond appropriately. If there are multiple human users in the meeting, saying “KITT” or “Hey KITT” will let KITT know your subsequent prompt is intended for them — we also play a sound to let you know KITT's listening.
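That addressing logic reduces to a small check, sketched here with a hypothetical addressedToKITT helper (the real detection runs on the live transcription stream):

```go
package main

import (
	"fmt"
	"strings"
)

// addressedToKITT reports whether a transcript is directed at KITT:
// always true in a 1:1 meeting, otherwise only when the transcript
// starts with a wake phrase like "KITT" or "Hey KITT".
func addressedToKITT(transcript string, humanCount int) bool {
	if humanCount <= 1 {
		return true
	}
	t := strings.ToLower(strings.TrimSpace(transcript))
	return strings.HasPrefix(t, "kitt") || strings.HasPrefix(t, "hey kitt")
}

func main() {
	fmt.Println(addressedToKITT("what's the weather?", 1))     // true: 1:1 meeting
	fmt.Println(addressedToKITT("Hey KITT, take notes", 3))    // true: wake phrase
	fmt.Println(addressedToKITT("let's start the standup", 3)) // false: not for KITT
}
```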
Borrowing from assistant interfaces like Siri and Google Assistant, whenever you speak to KITT, we show a live transcription of your prompt which helps you understand the input KITT received and contextualize their response:
For KITT’s visual identity, we’d ultimately like to dynamically generate video frames on the backend, but for this initial version we opted to build a client-side React component. KITT cycles through a few different states, depending on the state of the conversation and whether they're engaged:
Both live transcription and KITT’s states are transmitted to each user using LiveKit’s data messages.
Putting it all together
With everything working end-to-end, the result is pretty magical. ✨
Below are some areas that we’d like to explore as improvements to KITT’s implementation — if you’re interested in contributing towards any of these, we accept PRs! 🙂
Google’s STT is fast and accurate, but there are two downsides: 1) it’s an external service call and 2) it’s not cheap. We’d like to explore running the smallest (i.e. fastest) Whisper model ourselves to see if there’s a significant reduction in latency. The limitation of Whisper is that its language support isn’t as broad as Google’s.
Google’s TTS is also fast, but the voice can sound a bit robotic. We explored Tortoise, which sounds amazing, but it takes ~20s to generate a single sentence! It’s worth testing other real-time TTS models like Rime, or even ones supporting custom voices, which would be fun for end users to interact with.
GPT-4 and other models
GPT-4 is a more powerful model, capable of producing more humanlike responses at the cost of speed. It would be interesting to see if pre-prompting it (to be extra concise) could help reduce the increased latency. Additionally, there are really fast models like Claude, or possibly running LLaMA or Alpaca locally, which could potentially achieve even lower latency without a significant impact on response quality.
More and better prompting
In this demo, we’ve barely scratched the surface of what pre-prompting GPT can do. We added the ability for every user to specify their spoken language and we prepend the language code (e.g. en-US) to every user-initiated prompt. This allows KITT to respond to each user in the appropriate language and even act as a live translator between two users.
For every user query, we also pass GPT the entire conversation history (e.g. russ: ...\ntheo: ...) so KITT can respond to or reference any prompts which rely on historical context.
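Putting the language code and conversation history together, prompt assembly might look like the following sketch (the exact prompt wording isn't shown in the post, so the format here is illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// turn is one utterance in the meeting's history.
type turn struct {
	Speaker string
	Text    string
}

// buildPrompt renders the whole conversation as "name: text" lines and
// prefixes the current speaker's language code (e.g. en-US), so GPT can
// answer in the right language and reference earlier turns.
func buildPrompt(history []turn, lang string, query turn) string {
	var b strings.Builder
	for _, t := range history {
		fmt.Fprintf(&b, "%s: %s\n", t.Speaker, t.Text)
	}
	fmt.Fprintf(&b, "(%s) %s: %s", lang, query.Speaker, query.Text)
	return b.String()
}

func main() {
	history := []turn{
		{"russ", "KITT, what's WebRTC?"},
		{"kitt", "A standard for real-time media in browsers."},
	}
	fmt.Println(buildPrompt(history, "en-US", turn{"theo", "can you summarize that?"}))
}
```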
Some ideas for taking this further include framing the session to GPT as a meeting, including any notes from emails or calendar entries, and adding real-time events to the history like theo left the meeting at <timestamp> or live chat messages.
Giving KITT or another AI an expressive (and perhaps more human) visual representation will completely change how it feels to interact with them. The tricky part on the backend is setting up a pipeline which can take in an audio stream, supports compositing/animation/effects, and outputs video frames. One option is to record a browser instance like we do with LiveKit Egress. Another possibility is to use Unity or Unreal.
Right now we aren't doing anything with a user's video stream. While transcriptions help with accessibility, imagine running a separate model that performs ASL recognition! Other things like sentiment analysis or scene understanding could provide GPT additional context, too.
Some GPT responses include multimedia like videos, images, or code snippets. A neat feature would be to initiate a screen share or add some type of canvas to LiveKit Meet that KITT can use to display these types of assets to users.
It was insanely fun building KITT and the very first conversation with them gave us goosebumps. There’s definitely potential for a few standalone products to be built on top of this foundation. Consider this an open invitation to take our code and build something amazing with it. If you have any questions along the way, want to jam on ideas, or just share what you’ve built, hit us up in the LiveKit Community Slack!