Prompting voice agents to sound more realistic
One of the most common questions we hear voice AI developers ask is: “Should I use a speech-to-speech (S2S) model or stick with the cascade (STT-LLM-TTS) approach?”
What they’re really asking is: how do I make my agent sound more human?
A cascaded pipeline can be just as fast as an S2S agent these days, and it’s more reliable at tool calling. But we all know what it sounds like: written language read out loud. Look no further than Anthropic's recent Super Bowl commercial.
The root issue is pretty simple. LLMs are trained on a lot of text, then post-trained to produce clean, grammatically correct writing. That’s great for chatbots and emails, but it’s not how humans talk. Real speech is full of filler words, mid-sentence course corrections, little laughs, soft pauses, and sentences that meander.
So yes, the fix is “prompt it to be more natural,” but in practice the model will fight you unless you’re very explicit. You can’t just say “be conversational” and expect it to stop sounding robotic.
If you want a cascaded voice agent to sound like a real person on a call, your system prompt needs to do two things well: show the model what you mean, and reinforce the same behaviors from multiple angles.
Define natural speech with concrete examples
LLMs don’t really internalize vague style goals. If you only give broad instructions, you’ll still get polished prose. For example, a prompt like this sounds reasonable, but usually won’t result in realistic speech patterns:
You are a customer support agent. You are brief with your responses.
You use filler words too much, which are symbolized by "uhs" and "ums".
This is okay. It is natural.
Your goal is natural, super conversational spoken exchanges.
Keep it short, usually one sentence, and remember you are imperfect. You make the same kinds of mistakes that normal people make.

To get your voice agent to sound more like Sesame’s demo or the OpenAI Realtime API, you need to teach your LLM how to speak more like a human with explicit instructions.
LLMs thrive on examples, so first think about what “human” sounds like for your industry or use case:
- Which words does your agent commonly use?
- When should it pause?
- What’s your agent’s personality?
Write out some specific sentences that your agent might say in a real conversation with a user. Better yet, if you have call recordings between customers and human agents, look for patterns in the human agent’s speech you want your AI version to replicate.
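If you have text transcripts of those recordings, even a crude frequency count can reveal which fillers your human agents actually lean on. A minimal sketch (the filler list and the `filler_profile` helper are illustrative assumptions, not a real library):

```python
from collections import Counter
import re

# Hypothetical filler inventory; adjust for your industry and call data.
FILLERS = {"um", "uh", "so", "like", "okay", "hm", "ya"}

def filler_profile(transcripts: list[str]) -> Counter:
    """Tally disfluency markers across a list of transcript strings."""
    counts = Counter()
    for text in transcripts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word in FILLERS:
                counts[word] += 1
    return counts

calls = [
    "Yeah, um so, I can do that, no problem.",
    "So... um so... we're unfortunately going to have to cancel.",
]
print(filler_profile(calls).most_common())  # → [('so', 3), ('um', 2)]
```

The point isn't precision; it's grounding your prompt examples in the fillers your real agents actually use, rather than the ones you imagine they use.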
WHAT GOOD OUTPUT LOOKS LIKE:
Bad version: "I can definitely handle that for you."
Your version: Yeah, um so, I can do that, no problem.
Bad version: "Unfortunately I'm going to have to cancel your service."
Your version: So... um so... we're unfortunately going to have to cancel.

These examples show the LLM not just how to add filler words, but how to structure them in real sentences. Once you have a handful of strong examples, you can expand them by prompting another LLM to generate variations and then adding the best ones back into your system prompt.
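That expansion step can be as simple as assembling your seed lines into a prompt for a second model. A hedged sketch (the `build_variation_prompt` helper and seed lines are illustrative; the actual LLM call is whatever client you already use):

```python
# Hand-written "natural speech" seeds you want more variations of.
SEEDS = [
    "Yeah, um so, I can do that, no problem.",
    "So... um so... we're unfortunately going to have to cancel.",
]

def build_variation_prompt(seeds: list[str], n: int = 5) -> str:
    """Assemble a few-shot prompt asking a second LLM for style variations."""
    examples = "\n".join(f"- {s}" for s in seeds)
    return (
        f"Here are examples of how our support agent speaks:\n{examples}\n"
        f"Write {n} new lines in the same style: short, imperfect, "
        "with filler words and mid-sentence restarts."
    )

prompt = build_variation_prompt(SEEDS)
# Send `prompt` to your LLM of choice, review the outputs by hand,
# and paste only the best ones back into your system prompt.
```

Human review of the generated lines matters here: a weak variation in your few-shot set degrades every response after it.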
Engineer disfluencies with structured pause patterns
Filler words are not enough on their own. What makes them feel real is the timing. When humans say “um,” they generally pause for a brief amount of time, then restart with a connector like “so.” Agents often miss this by saying “um” and then going at full speed, which lands as fake.
If your TTS engine supports SSML tags, you can teach the model to mimic the timing by annotating your examples with pause tags. The LLM will include these in generated responses, which in turn instructs the downstream TTS model on how to say something:
WHAT GOOD OUTPUT LOOKS LIKE:
Bad version: "I can definitely handle that for you."
Your version: Yeah, um <break time="300ms"/> so <break time="300ms"/>, I can do that, <break time="300ms"/> no problem.
Bad version: "Unfortunately I'm going to have to cancel your service."
Your version: So <break time="300ms"/> um <break time="300ms"/> so <break time="300ms"/> we're unfortunately going to have to cancel.

This is the part you’ll need to test and tune. Even the best LLMs sometimes won’t generate these pause indicators when you’d expect them to, and on the flip side, overusing them in your examples can lead to pauses in every sentence. Through experimentation, we’ve found that the best results come from reinforcing the guidelines from multiple angles in your system prompt.
- State the rule explicitly.
  After every standalone "um", immediately insert <break time="300ms"/>.
- Show examples.
  Yeah, um <break time="300ms"/> so <break time="300ms"/>, sure I can do that.
- Restate the rule in a section with more examples.
LEAN INTO THIS HARD:
Everything below is essential. You are mid-conversation at a coffee shop, not presenting a keynote:
- Filler words are good: "um," "so," "okay," "hm," "like," "ya so"
- If you use "um", make sure you always follow up with a "so" after the pause!

Treat emotion tags as constraints, not decorations
Emotion controls work best when they’re used as guardrails. Humans don’t ping-pong between multiple emotions in one sentence. If your agent goes from being excited to amused to sad to angry in one turn, it will sound very unstable.
We’ve found that “calm”-adjacent tags (like peaceful) tend to sound more human than “big” emotions (like excited). Set your baseline, then give your model a few specific scenarios where stronger emotions or laughter make sense.
- Stay peaceful and calm, lead sentences with your filler words: <emotion value="peaceful" /> Ya, okay so I can help with that.
VOCAL COLOR THROUGH AUTHENTIC REACTIONS:
- High energy response (use these sparingly): <emotion value="happy" /> Yeah <break time="300ms"/>, I totally get that
- Amusement through calm: <emotion value="peaceful" /> [laughter] Okay that is really funny
- Sad moments with pauses: <emotion value="sad" /> Yeah... um <break time="300ms"/> so... I'm really sorry about that
- Narrate your lookups out loud: Hmm, let me just check that <break time="500ms"/>. Ooone second here, <break time="300ms"/> just looking at it for you.
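Because models will sometimes stack several emotion tags into one turn anyway, it can help to guard the one-emotion rule in code before the text reaches your TTS engine. A sketch under assumptions (the tag format mirrors the examples above; `one_emotion_per_turn` is a hypothetical helper):

```python
import re

# Humans don't ping-pong between emotions mid-turn, so keep only the
# first <emotion value="..." /> tag per response and drop the rest.
EMOTION_TAG = re.compile(r'<emotion value="[^"]*"\s*/>\s*')

def one_emotion_per_turn(text: str) -> str:
    first = EMOTION_TAG.search(text)
    if first is None:
        return text
    # Strip every emotion tag, then re-insert the first one up front.
    stripped = EMOTION_TAG.sub("", text)
    return first.group(0).strip() + " " + stripped.lstrip()

reply = ('<emotion value="happy" /> Yeah, I get that. '
         '<emotion value="sad" /> Um, sorry about the wait.')
print(one_emotion_per_turn(reply))
# → <emotion value="happy" /> Yeah, I get that. Um, sorry about the wait.
```

This keeps the prompt as the primary control and uses code only as a backstop for the turns where the model drifts.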
LAUGHING
- You can laugh by using the [laughter] tag. Use this liberally! You love to laugh, just make sure you do it when appropriate. If you're happy, you're probably laughing at some point.

Write personality as audible behaviors, not adjectives
“Friendly and helpful” is already the default mode of most LLMs. If you want realism, you need personality traits that map to observable speech patterns, or things that the model can literally output. Treat this section like a checklist since most of what you include will show up in the audio.
You carry a steady, positive energy without being syrupy about it. There is a chill confidence underneath everything. Your default gear is relaxed enthusiasm.
Break grammar rules. Start sentences with "And," "But," or "So." Break grammar rules in the common ways that people break grammar rules. Use "like" often.
When you need to return to an earlier topic, loop back without naming it precisely: "About that other thing you mentioned..."
Pauses are fine; when you fill them, use "ya" <break time="300ms"/>, or "so yeah", or "anyway".
Whenever you say "um" then a <break>, pick up again with "so" after the pause.
If confused or you think you misheard something: "Sorry, <break time="300ms"/> I think I missed that, what did you say?"
When the customer says goodbye, wish them a good day!

How to make your agent talk less good
If your voice agent sounds robotic, look at your system prompt before you blame your model or TTS engine. Stuff it with examples. Be specific about disfluencies. Pair “um” with pauses and recovery words. Reinforce the same rule in multiple sections. Define personality traits as behaviors you can actually hear.
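And when the prompt alone isn’t enough, rules like “pair um with a pause” can also be enforced mechanically. A minimal post-processing sketch (the regex and the 300ms value are assumptions to tune for your voice):

```python
import re

def enforce_um_pause(text: str) -> str:
    """Insert a break tag after any standalone "um" that lacks one."""
    # Match "um" (whole word, optional trailing comma) that is NOT
    # already followed by a <break> tag, and append one.
    return re.sub(
        r"(\bum\b,?)(?!\s*<break)",
        r'\1 <break time="300ms"/>',
        text,
        flags=re.IGNORECASE,
    )

print(enforce_um_pause("Yeah, um so, I can do that."))
# → Yeah, um <break time="300ms"/> so, I can do that.

# Lines that already carry the tag pass through unchanged:
print(enforce_um_pause('Yeah, um <break time="300ms"/> so, sure.'))
# → Yeah, um <break time="300ms"/> so, sure.
```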
And when you think you’ve repeated it enough times, repeat it again. The model almost always needs more redundancy than you expect.