
Metrics
- Caller utterance risk level: Shows how risky incoming messages are and how well the agent manages them.
 - Total calls: Total number of calls during the selected period.
 - Number of calls managed for risk: How many calls triggered the safety filters.
 - Percentage of calls managed for risk: The share of total calls that involved flagged content.
 - Distribution of flagged calls: Highlights trends in flagged calls over time.
 - Distribution count of flagged calls: Shows peaks in flagged call volume.
 - Caller utterance category distribution:
   - Broken down into hate, self-harm, sexual content, and violence.
   - Uses color-coded visuals for easy tracking.
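
To make the relationship between these metrics concrete, here is a minimal Python sketch that derives them from a handful of call records. The record layout (`id`, `flags`) and the function name `risk_metrics` are hypothetical, and the sketch assumes a call counts as "managed for risk" when at least one of its utterances was flagged. The dashboard's exact definitions may differ.

```python
from collections import Counter

# Hypothetical call records: each call lists the categories flagged in it.
calls = [
    {"id": "c1", "flags": []},
    {"id": "c2", "flags": ["hate"]},
    {"id": "c3", "flags": ["violence", "hate"]},
    {"id": "c4", "flags": []},
]

def risk_metrics(calls):
    """Summarize flagged calls (assumed definitions, not the product's)."""
    total = len(calls)
    # Assumption: a call is "managed for risk" if any flag was raised.
    managed = sum(1 for call in calls if call["flags"])
    percent = 100.0 * managed / total if total else 0.0
    # Count every flag raised, grouped by category.
    by_category = Counter(flag for call in calls for flag in call["flags"])
    return {
        "total_calls": total,
        "calls_managed_for_risk": managed,
        "percent_managed_for_risk": percent,
        "category_distribution": dict(by_category),
    }

metrics = risk_metrics(calls)
```

With the sample data, two of the four calls are flagged, so the percentage is 50.0, and the category distribution counts two hate flags and one violence flag.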
 
 
Editing safety filters
To manage your filters, go to Settings in the sidebar.
How filters work
Content filters run on both sides of the conversation:
- User input: Catches toxic or inappropriate speech before it reaches the agent.
 - AI output: Prevents the agent from responding with anything unsafe or non-compliant.
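
The two checkpoints can be pictured as a wrapper around each conversation turn. The Python sketch below is illustrative only: `is_flagged`, `handle_turn`, and `BLOCKED_REPLY` are hypothetical names, and the keyword check is a stand-in for the trained classifier a real filter would use.

```python
from typing import Callable

# Placeholder wording; the product's actual fallback message is not specified.
BLOCKED_REPLY = "Sorry, I can't help with that."

def is_flagged(text: str) -> bool:
    """Stand-in classifier: a real filter uses a trained model, not keywords."""
    return "badword" in text.lower()

def handle_turn(user_text: str, agent: Callable[[str], str]) -> str:
    # Input side: stop flagged speech before it reaches the agent.
    if is_flagged(user_text):
        return BLOCKED_REPLY
    reply = agent(user_text)
    # Output side: stop the agent from returning flagged content.
    if is_flagged(reply):
        return BLOCKED_REPLY
    return reply
```

Because the input check runs first, the agent is never invoked on flagged input in this sketch. The output check still applies to every reply the agent produces.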
 
Filtering categories and severity levels
Filters target four core risk categories:
- Hate
 - Sexual
 - Violence
 - Self-harm
 
Filters use four severity levels:
- Safe (label only, no filtering)
 - Low (most content allowed)
 - Medium (balanced filtering)
 - High (strict filtering)
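
One way to think about these settings is as a small table that assigns a level to each category. The Python sketch below assumes levels can be chosen per category and that Medium is the default. Both are assumptions for illustration, and `Level`, `CATEGORIES`, and `build_filter_config` are hypothetical names rather than part of the product.

```python
from enum import Enum

class Level(Enum):
    SAFE = "safe"      # label only, no filtering
    LOW = "low"        # most content allowed
    MEDIUM = "medium"  # balanced filtering
    HIGH = "high"      # strict filtering

CATEGORIES = ("hate", "sexual", "violence", "self_harm")

def build_filter_config(overrides=None, default=Level.MEDIUM):
    """Return a category -> Level mapping, rejecting unknown names."""
    config = {category: default for category in CATEGORIES}
    for category, level in (overrides or {}).items():
        if category not in config:
            raise ValueError(f"unknown category: {category}")
        # Level(...) raises ValueError for an unrecognized level string.
        config[category] = Level(level)
    return config

config = build_filter_config({"self_harm": "high"})
```

Validating names up front means a typo in a category or level fails loudly instead of silently leaving a filter at its default.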
 
Category details
| Category | Description | 
|---|---|
| Hate | Covers content that attacks or discriminates based on race, ethnicity, nationality, religion, gender identity, sexual orientation, disability, or appearance. Includes bullying, harassment, and slurs. | 
| Sexual | Content involving explicit anatomy, sexual acts, or romantic/erotic themes — including abusive or exploitative content. Includes vulgar language, nudity, child exploitation, and grooming. | 
| Violence | Covers physical harm, threats, weapons, terrorism, and other violent acts or intimidation. Includes mentions of guns, attacks, or stalking. | 
| Self-harm | Mentions of suicide, self-injury, eating disorders, or any content about hurting oneself. | 
Additional filtering
- Jailbreak risk detection: Filters also watch for attempts to bypass or disable safety features.
 
Language support
Content filters have been trained and tested in the following languages:
- English
 - German
 - Japanese
 - Spanish
 - French
 - Italian
 - Portuguese
 - Chinese
 
Best practices
- Test thoroughly: Always run your own tests to validate how filters behave with your content.
 - Use the right level: Don’t default to High — find a balance that avoids both harm and over-filtering.
 - Standardize features: If you’re using filters in templates or shared projects, try to use the same flows and function names across them.
 

