How conversations flow
When a user connects to your PolyAI agent, the conversation passes through several key stages. The exact path depends on the channel: voice (telephony), webchat, or SMS.

1. Connection layer
The connection layer handles how users reach your agent. For voice, this is telephony; for webchat, it’s HTTP/WebSocket connections. PolyAI’s enterprise-ready infrastructure is built for reliability, scalability, and seamless failover across all channels.

Voice (Telephony)
Architecture highlights:
- Kamailio load balancer: Distributes incoming calls across multiple media servers for optimal performance
- Multiple Asterisk media servers: Redundant media processing ensures continuous service availability
- Automatic failover: If a primary service experiences issues, calls are automatically routed to secondary services
- Contact center transfer: Automatic transfer back to your contact center if the PolyAI service goes down, ensuring zero dropped calls
- Enterprise reliability: Designed for high-volume, mission-critical voice applications
Supported telephony integrations:
- Twilio
- Amazon Connect
- SIP-based systems
- Custom telephony integrations
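The failover behavior described above (try redundant media servers, then transfer back to the contact center rather than drop the call) can be sketched as follows. This is a minimal illustration, not PolyAI's implementation; the function names and call handling are hypothetical stand-ins.

```python
from typing import Callable

def route_call(call_id: str,
               media_servers: list,
               contact_center: Callable[[str], None]) -> str:
    """Try each media server in priority order; if all fail, transfer
    the call back to the contact center so it is never dropped."""
    for i, server in enumerate(media_servers):
        try:
            if server(call_id):            # server accepted and handled the call
                return f"handled-by-server-{i}"
        except ConnectionError:
            continue                       # failover: try the next media server
    contact_center(call_id)                # last resort: contact center transfer
    return "transferred-to-contact-center"
```

With this shape, an outage in the primary media server degrades to the secondary transparently, and a full PolyAI outage still lands the caller with a live agent.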
Webchat
For webchat interactions, users connect via HTTP/WebSocket:
- Instant connection: No telephony latency—conversations begin immediately
- Persistent sessions: Maintains conversation state across page reloads
- Customizable widget: Embed directly in your website or application
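Persistent sessions can be sketched as a store keyed by a session token that the widget presents on every connect (for example, read from a cookie), so conversation state survives page reloads. All names here are illustrative, not PolyAI's API.

```python
import uuid
from typing import Optional

class SessionStore:
    """Minimal sketch of persistent webchat sessions keyed by token."""

    def __init__(self):
        self._sessions = {}  # token -> list of conversation turns

    def connect(self, token: Optional[str] = None) -> str:
        # Unknown or missing token: start a fresh session
        if token is None or token not in self._sessions:
            token = uuid.uuid4().hex
            self._sessions[token] = []
        return token

    def append_turn(self, token: str, user_text: str, agent_text: str) -> None:
        self._sessions[token].append({"user": user_text, "agent": agent_text})

    def history(self, token: str):
        return self._sessions[token]
```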
SMS
SMS interactions are handled through integrated messaging providers, allowing agents to send and receive text messages. See also: SMS integration, Voice integrations

2. Input processing
How user input reaches the agent depends on the channel:
- Voice: Speech is converted to text using automatic speech recognition (ASR)
- Webchat/SMS: Text input is received directly—no ASR needed
Speech recognition (ASR) — Voice only
For voice interactions, the user’s speech is converted to text using automatic speech recognition (ASR). PolyAI’s platform integrates with multiple ASR providers to ensure reliability, accuracy, and flexibility across different use cases and languages.

Supported ASR providers:
- Google Cloud Speech-to-Text
- Amazon Transcribe
- Microsoft Azure Speech Services
- Deepgram
- Custom ASR integrations
Key ASR capabilities:
- Multiple languages and accents
- Industry-specific vocabulary
- Real-time transcription with low latency
- ASR biasing and keyphrase boosting for domain-specific terms
- Automatic provider failover for high availability
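The idea behind keyphrase boosting can be illustrated with a simple rescoring pass over the recognizer's n-best hypotheses: hypotheses containing domain-specific terms get a score boost, which resolves acoustically ambiguous transcriptions toward in-domain words. This is a conceptual sketch; real ASR biasing happens inside the provider's decoder, and the boost value here is arbitrary.

```python
def rescore_hypotheses(hypotheses, keyphrases, boost=0.2):
    """Pick the best hypothesis after boosting those that contain a
    domain keyphrase. `hypotheses` is a list of (text, confidence)
    pairs as returned by a recognizer's n-best output."""
    def boosted(item):
        text, score = item
        hits = sum(1 for kp in keyphrases if kp.lower() in text.lower())
        return score + boost * hits
    return max(hypotheses, key=boosted)[0]
```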
3. Agent service
The agent service is the core of the system, powered by PolyAI’s LLM-native architecture. It receives the transcribed user input and coordinates:
- Language understanding: Uses large language models (LLMs) to interpret what the user said, determine their intent, and extract entities in a conversational, context-aware manner
- Decision making (Policy engine): Determines the appropriate response based on your configured Managed Topics, flows, and rules by executing nodes in priority order
- Knowledge retrieval: Leverages RAG (Retrieval-Augmented Generation) to pull relevant information from both Managed Topics and Connected Knowledge sources
- Action execution: Triggers any necessary function calls or API integrations
- Context management: Maintains dialogue context and turn history throughout the conversation
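The policy engine's "executing nodes in priority order" can be sketched as follows: evaluate nodes from highest priority down, and let the first matching node handle the turn. The `Node` structure and field names are hypothetical, chosen only to illustrate the ordering behavior.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    """One policy node: fires when its condition matches the user input."""
    name: str
    priority: int
    condition: Callable[[str], bool]
    action: Callable[[str], str]

def run_policy(nodes: list, user_input: str) -> Optional[str]:
    """Evaluate nodes from highest priority down; the first match wins."""
    for node in sorted(nodes, key=lambda n: n.priority, reverse=True):
        if node.condition(user_input):
            return node.action(user_input)
    return None  # fall through to knowledge retrieval / generative answering
```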
4. Response generation
Based on the policy engine’s output, the system generates an appropriate response using your agent’s configured voice, tone, and knowledge. This may involve:
- Retrieving relevant information using RAG (Retrieval-Augmented Generation)
- Applying global rules and response control filters
- Generating contextually appropriate responses via the LLM
5. Response delivery
How responses reach the user depends on the channel:
- Voice: Text is converted to speech using TTS and streamed to the user
- Webchat/SMS: Text responses are delivered directly
Text-to-speech (TTS) — Voice only
For voice interactions, the generated response is converted to natural-sounding speech and played back to the user. PolyAI integrates with multiple TTS providers to deliver high-quality voices across languages and use cases.

Supported TTS providers:
- Google Cloud Text-to-Speech
- Amazon Polly
- Microsoft Azure Speech Services
- ElevenLabs
- Custom TTS integrations

Audio caching:
- Audio cache: Frequently used phrases (greetings, confirmations, transfer messages) are cached for instant playback, reducing latency and ensuring consistency
- Cache requirements: Audio is cached when the same utterance is generated at least twice within a 24-hour window
- Regeneration control: Edit cached audio directly in Agent Studio to adjust stability, clarity, and pronunciation
- UX optimization: Fine-tune voice quality for critical phrases without regenerating audio on every call
Additional voice capabilities:
- SSML markup for fine-grained control over pronunciation, pauses, and emphasis
- Custom pronunciations using IPA notation
- Multiple voice options and custom voice cloning
- Real-time audio streaming for low-latency responses
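The caching rule above (an utterance is cached once the same text has been synthesized at least twice within a 24-hour window) can be sketched as follows. This is an illustration of the stated rule only; the class and its internals are hypothetical.

```python
import time

class AudioCache:
    """Cache TTS audio for utterances seen at least twice in 24 hours."""
    WINDOW = 24 * 60 * 60  # seconds

    def __init__(self, synthesize, clock=time.time):
        self._synthesize = synthesize  # text -> audio bytes (a TTS call)
        self._clock = clock
        self._seen = {}                # text -> request timestamps
        self._cache = {}               # text -> cached audio

    def get_audio(self, text: str) -> bytes:
        if text in self._cache:
            return self._cache[text]   # instant playback from cache
        now = self._clock()
        hits = [t for t in self._seen.get(text, []) if now - t < self.WINDOW]
        hits.append(now)
        self._seen[text] = hits
        audio = self._synthesize(text)
        if len(hits) >= 2:             # second request inside the window
            self._cache[text] = audio
        return audio
```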
Data storage and synchronization
During and after a conversation, PolyAI captures, stores, and synchronizes several types of data to support analytics, compliance, and operational workflows.

Data types and retention
| Data type | Purpose | Nature | Retention |
|---|---|---|---|
| Dialogue context | Tracks the full dialogue history, state variables, and turn data for the current call | Real-time, in-memory during call | Duration of call |
| Turn data | Stores individual exchanges (user input, agent response, intents, entities) for analytics and review | Structured conversation logs | Configurable |
| Conversation metadata | Records conversation-level information (duration, variant, environment, handoff state) | Structured metadata | Configurable |
| Audio recordings | Full call recordings for quality assurance and compliance | Audio files (WAV/MP3) | Configurable |
| Transcripts | Complete text transcripts of conversations | Structured text | Configurable |
| Metrics and events | Records events for reporting and dashboards | Time-series data | Configurable |
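A configurable retention policy like the one in the table can be sketched as a per-type lookup plus an expiry check. The retention values below are placeholders for illustration; the actual values are set per deployment.

```python
from datetime import datetime, timedelta

# Hypothetical per-type retention settings (days); real values are
# configurable per deployment, as the table above notes.
RETENTION_DAYS = {
    "turn_data": 90,
    "conversation_metadata": 90,
    "audio_recordings": 30,
    "transcripts": 90,
    "metrics": 365,
}

def expired(record_type: str, created_at: datetime, now: datetime) -> bool:
    """True if a stored record has outlived its configured retention."""
    return now - created_at > timedelta(days=RETENTION_DAYS[record_type])
```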
Data synchronization and access
PolyAI provides multiple methods to access and synchronize conversation data with your systems:
- Studio transcripts: Review transcripts and recordings directly in the PolyAI platform
- Conversations API: Programmatically retrieve conversation metadata, transcripts, and recordings
- AWS S3 integration: Automatically sync call data to your AWS S3 bucket for long-term storage and compliance
- Handoff metadata: Share real-time conversation state with live agents during transfers
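Programmatic retrieval via the Conversations API typically means paging through results. The sketch below shows the pagination pattern only; `fetch_page` is a hypothetical stand-in for the real API call, whose actual endpoint and parameters are documented separately.

```python
def iter_conversations(fetch_page, page_size=100):
    """Page through a conversations endpoint. `fetch_page(offset, limit)`
    returns a list of conversation records; a short or empty page ends
    the iteration."""
    offset = 0
    while True:
        page = fetch_page(offset=offset, limit=page_size)
        for conv in page:
            yield conv
        if len(page) < page_size:
            return
        offset += page_size
```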
Key components you configure
As a builder in Agent Studio, you control how the agent behaves through:
- Managed Topics: Curated knowledge with fine-grained control over utterances and actions. Use for structured, stable information that requires precise agent behavior. Can trigger functions, flows, and other agentic actions.
- Connected Knowledge: Fast integration of external knowledge sources (URLs, files, Zendesk, Gladly, ServiceNow). Ideal for FAQ-style content and large volumes of continuously updated information. Cannot trigger actions or flows.
- Flows: Structured conversation paths for complex tasks
- Functions: Custom logic and external integrations
- Rules: Global behavior constraints
- Voice settings: How the agent sounds
Managed Topics vs. Connected Knowledge
Both expose information to your agent, but they serve different purposes.

When to use Managed Topics:
- You need to trigger actions, functions, flows, or SMS
- You want precise control over what the agent says and when
- Information is relatively stable and structured
- You need fine-grained control over agent behavior
When to use Connected Knowledge:
- Content lives in external systems (documentation sites, help desks)
- Information changes frequently and is maintained by other teams
- You want automatic updates from external sources
- You need a simple, fast way to expose FAQ-style content
Supported Connected Knowledge sources:
- Zendesk
- Gladly
- ServiceNow
- Salesforce (coming soon)
- Notion (coming soon)
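The decision criteria above reduce to a simple rule of thumb, sketched here as a helper. The function and its inputs are illustrative, not part of Agent Studio.

```python
def choose_knowledge_type(needs_actions: bool,
                          content_is_external: bool,
                          changes_frequently: bool) -> str:
    """Managed Topics when the agent must trigger actions or needs tight
    control; Connected Knowledge for externally maintained, frequently
    changing, FAQ-style content."""
    if needs_actions:
        return "managed_topics"      # Connected Knowledge cannot trigger actions
    if content_is_external or changes_frequently:
        return "connected_knowledge"
    return "managed_topics"          # default to precise control
```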
Processing a single turn
Each turn in a conversation follows this sequence:

Receive input
The system receives user input—speech is transcribed using ASR (voice), or text is received directly (webchat/SMS).
Understand intent
The LLM analyzes what the user wants and extracts entities in a context-aware manner.
Retrieve knowledge
Relevant information is fetched from Managed Topics and Connected Knowledge using RAG (Retrieval-Augmented Generation).
Generate response
The LLM composes a response based on all available context, applying global rules and response control filters.
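The four steps above can be sketched as a single orchestration function. Every argument is an injected stand-in for the real component (ASR, LLM, RAG retriever, generator); the signatures are hypothetical and chosen only to show the flow of one turn.

```python
def process_turn(user_input, transcribe, understand, retrieve, generate,
                 channel="webchat"):
    """One conversational turn, end to end."""
    # 1. Receive input: transcribe for voice, pass text through otherwise
    text = transcribe(user_input) if channel == "voice" else user_input
    # 2. Understand intent and extract entities
    intent = understand(text)
    # 3. Retrieve relevant knowledge via RAG
    context = retrieve(intent)
    # 4. Generate the response from intent plus retrieved context
    return generate(intent, context)
```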

