How a PolyAI agent processes conversations from input to response.
[Figure: PolyAI Voice Agent Conversation Flow]
The agent’s initial greeting is hardcoded and sent directly, without running the LLM or processing any Rules. For voice, it is sent to TTS (Text-to-Speech); for webchat, it is displayed as text. Write the greeting in the language you expect users to hear or read. Rules and other logic apply only after the greeting.
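As a rough illustration of the greeting path, the sketch below returns the configured greeting without calling an LLM or evaluating any rules. The name `first_turn`, the `channel` values, and the `text_to_speech` callable are hypothetical and are not part of any PolyAI API.

```python
from typing import Callable, Union


def first_turn(
    greeting: str,
    channel: str,
    text_to_speech: Callable[[str], bytes],
) -> Union[str, bytes]:
    """Deliver the hardcoded greeting (illustrative sketch only)."""
    # No LLM call and no rule evaluation happens on this path.
    if channel == "voice":
        # Voice: the greeting text goes straight to speech synthesis.
        return text_to_speech(greeting)
    # Webchat: the greeting is shown as plain text.
    return greeting
```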

Processing stages

A conversation moves through the following stages:
  • User: The user provides input—speech (voice) or text (webchat/SMS).
  • Input capture: For voice, the audio stream is captured and sent for transcription. For webchat/SMS, text is received directly.
  • ASR Provider (voice only): The system receives the raw audio.
  • ASR Service (voice only): Converts the audio into text using automatic speech recognition.
  • ASR Processing (voice only): Searches for transcription issues and applies any relevant corrections.
  • Transcript/Text → Processed Input: The processed input is passed to Retrieval.
  • Retrieval: Retrieves relevant topics from the knowledge base using RAG (retrieval-augmented generation) to provide context for the response.
  • Compute Prompt: The system builds an LLM prompt using retrieved topics, system knowledge, and conversation history.
  • Run LLM: The LLM processes the request and determines whether to return:
    • Returned Text: A direct text response.
    • Returned Function: A function call.
  • Execute Function (if applicable): Runs the function and passes the result back to the LLM.
  • LLM Refinement: If a function result is returned, the LLM updates its response before proceeding.
  • Chunk LLM Output: The response is broken into chunks for delivery.
  • Postprocess Chunks: Applies rules such as stop keywords to remove unnecessary phrases.
  • Stream Partial Responses: The system sends chunks as soon as they are ready, rather than waiting for the full response.
  • TTS Service (voice only): Converts text chunks into speech using text-to-speech synthesis. Configure voices in voice settings.
  • Response delivery: For voice, synthesized speech is streamed to the user. For webchat/SMS, text responses are sent directly.
  • Live Handoff (if applicable): If escalation is needed, the agent triggers a live handoff. For voice, this transfers the call; for webchat, this can route to a live chat agent.
  • Conversation Logs: The system stores conversation history and logs for analytics.
  • Final Response: The user receives the response progressively as it streams, without waiting for the entire message to be generated.
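
The middle of this flow (Retrieval, Compute Prompt, Run LLM, Execute Function, LLM Refinement) can be sketched as a single turn handler. This is a minimal sketch under stated assumptions, not PolyAI's implementation: the types `ReturnedText` and `ReturnedFunction`, the helpers `compute_prompt` and `run_turn`, the prompt layout, and the `max_function_calls` limit are all hypothetical names and choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union


@dataclass
class ReturnedText:
    """The LLM answered with a direct text response."""
    text: str


@dataclass
class ReturnedFunction:
    """The LLM asked for a function to be executed."""
    name: str
    arguments: dict = field(default_factory=dict)


LLMOutput = Union[ReturnedText, ReturnedFunction]


def compute_prompt(topics: List[str], history: List[str], user_input: str) -> str:
    """Combine retrieved topics, conversation history, and the new input."""
    return "\n".join(
        ["Topics:", *topics, "History:", *history, f"User: {user_input}"]
    )


def run_turn(
    user_input: str,
    history: List[str],
    retrieve: Callable[[str], List[str]],
    llm: Callable[[str], LLMOutput],
    functions: Dict[str, Callable[..., str]],
    max_function_calls: int = 3,  # assumed safety limit, not from the docs
) -> str:
    topics = retrieve(user_input)                         # Retrieval
    prompt = compute_prompt(topics, history, user_input)  # Compute Prompt
    output = llm(prompt)                                  # Run LLM
    calls = 0
    while isinstance(output, ReturnedFunction):
        if calls >= max_function_calls:
            raise RuntimeError("too many function calls in one turn")
        # Execute Function: run it and hand the result back to the LLM.
        result = functions[output.name](**output.arguments)
        prompt += f"\nFunction {output.name} returned: {result}"
        output = llm(prompt)                              # LLM Refinement
        calls += 1
    return output.text
```

Passing `retrieve`, `llm`, and `functions` in as callables keeps the sketch independent of any particular provider, so each stage can be replaced by a stub when testing the control flow.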

Advanced: How response streaming works

PolyAI agents don’t wait for the full response before speaking. Instead, responses are processed and streamed in real time:
  • LLM Streaming: Words are generated and sent continuously.
  • Chunking: Responses are broken into chunks for controlled delivery.
  • Postprocessing: Stop keywords remove unnecessary phrases before delivery.
  • Response Streaming: For voice, users hear speech as soon as it’s processed via TTS. For webchat, text appears progressively as it’s generated.
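
The streaming steps above can be approximated with chained generators, so that each chunk is delivered as soon as it is complete. This is only a sketch: splitting chunks at sentence-ending punctuation and treating stop keywords as phrases deleted from each chunk are assumptions made for illustration, and `chunk_tokens`, `remove_stop_keywords`, and `stream_response` are hypothetical names.

```python
import re
from typing import Iterable, Iterator, List


def chunk_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM tokens into sentence-sized chunks."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Emit a chunk as soon as the buffer ends with sentence punctuation.
        if re.search(r"[.!?]\s*$", buffer):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        # Flush whatever is left when the LLM stream ends.
        yield buffer.strip()


def remove_stop_keywords(
    chunks: Iterable[str], stop_keywords: List[str]
) -> Iterator[str]:
    """Delete each stop keyword from every chunk; drop chunks left empty."""
    for chunk in chunks:
        for keyword in stop_keywords:
            chunk = re.sub(re.escape(keyword), "", chunk, flags=re.IGNORECASE)
        chunk = re.sub(r"\s{2,}", " ", chunk).strip()
        if chunk:
            yield chunk


def stream_response(
    tokens: Iterable[str], stop_keywords: List[str]
) -> Iterator[str]:
    """Chunk, postprocess, and yield a response piece by piece."""
    return remove_stop_keywords(chunk_tokens(tokens), stop_keywords)
```

Because every stage is lazy, a consumer such as a TTS client or a webchat renderer can iterate over `stream_response(...)` and handle the first chunk before the LLM has finished generating the rest.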

Watch it in action

This video visualizes the conversation flow, showing how responses are processed, chunked, and streamed: