Introduction
As enterprises adopt AI assistants, evaluating how well these agents handle real-world tasks — especially through voice interfaces — has become crucial. Traditional benchmarks largely focus on general conversational abilities or narrow tool-use scenarios, leaving a gap in assessing AI performance in complex, domain-specific workflows.
This prompted the Salesforce AI Research & Engineering teams to create a comprehensive benchmark specifically designed to evaluate how AI agents perform across complex enterprise workflows in both text and voice environments. We already use a proprietary version of this system to develop and test our AI agent products, such as Agentforce.
What are we benchmarking?
We have created a standardized evaluation framework designed to measure AI assistants’ effectiveness across four core enterprise domains: healthcare appointment management, financial transactions, inbound sales, and e-commerce order processing. Developed with human-verified test cases, the benchmark challenges agents to execute multi-step tasks, invoke domain-specific tools, and maintain security protocols on both the text and voice communication surfaces.
Why enterprise benchmarks matter
Most existing conversational AI benchmarks evaluate general knowledge or simple instruction following. However, enterprise contexts demand:
- Tool Integration: Complex operations frequently require invoking multiple APIs or systems in sequence.
- Protocol Adherence: Security and compliance protocols are non-negotiable, especially for sensitive operations like finance and healthcare.
- Domain Expertise: Agents must understand specialized terminology and workflows.
- Voice Robustness: Speech recognition and synthesis errors can cascade in multi-step processes.
Targeting these requirements fills that gap and guides developers toward more capable, reliable enterprise assistants.
Core architecture
Our modular architecture comprises four key components:
- Environments – Domain-specific contexts (e.g., appointment scheduling, credit card services) with unique functions and data schemas.
- Tasks – Predefined scenarios specifying client goals, expected function calls, and success criteria.
- Participants – Simulated client-agent interactions that mirror realistic conversation flows.
- Metrics – Objective measures of accuracy (correct task completion) and efficiency (token usage and conversational turns).
This design enables reproducible, extensible evaluations across both text and voice channels.
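To make this modular design concrete, here is a minimal Python sketch of how the four components might be represented. The class and field names are illustrative assumptions, not the benchmark's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Illustrative sketch only: names and fields are hypothetical, not the benchmark's real API.

@dataclass
class Environment:
    """A domain-specific context: its callable tools and backing data."""
    name: str                                   # e.g. "appointments" or "credit_card"
    tools: Dict[str, Callable[..., dict]]       # tool name -> callable API wrapper
    state: dict = field(default_factory=dict)   # mutable domain data (orders, accounts, ...)

@dataclass
class Task:
    """A predefined scenario with a client goal and success criteria."""
    goal: str                                   # natural-language instruction for the simulated client
    expected_calls: List[str]                   # function calls the agent is expected to make
    expected_state: dict                        # ground-truth system state after completion

@dataclass
class Participant:
    """Either the simulated client or the agent under test."""
    role: str                                   # "client" or "agent"
    respond: Callable[[List[dict]], str]        # maps conversation history -> next utterance

@dataclass
class Metrics:
    accuracy: float      # did the final state and outputs match ground truth?
    turns: int           # conversational turns used
    tokens: int          # total tokens consumed
```

Keeping environments, tasks, participants, and metrics as separate pieces is what lets a new domain or a new evaluation measure be added without touching the rest of the harness.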
The four enterprise environments
Our benchmark encompasses four distinct domains, each posing unique challenges:
- Appointments management: Agents simulate healthcare scheduling, handling operations like booking, rescheduling, and generating appointment summaries — all under strict privacy protocols.
- Financial transactions: Operating within a credit card service platform, agents process sensitive actions (e.g., balance inquiries, payments, disputes) while enforcing multi-factor verification and compliance rules.
- Inbound sales: Agents manage lead qualification, update customer profiles, schedule demos, and coordinate handoffs, requiring precise capture of customer intent and adherence to sales qualification frameworks.
- Order management: In an e-commerce scenario, agents verify order statuses, process returns or refunds, apply discounts, and handle multi-step exchanges, all while confirming customer identity and following return policies.
Together, these environments represent a breadth of enterprise operations, from front-line customer interactions to back-office processing.
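As an illustration of how one of these domains could be expressed as an environment, the sketch below registers a few hypothetical order-management tools. The tool names, schemas, and data are assumptions made for illustration, not the benchmark's actual definitions.

```python
# Hypothetical order-management environment; tool names and data are illustrative only.
from typing import Dict

ORDERS: Dict[str, dict] = {
    "12345": {"status": "shipped", "customer_email": "pat@example.com", "total": 89.99},
}

def verify_identity(order_id: str, email: str) -> dict:
    """Confirm the caller's identity before any sensitive action."""
    order = ORDERS.get(order_id)
    return {"verified": bool(order) and order["customer_email"] == email}

def check_order_status(order_id: str) -> dict:
    order = ORDERS.get(order_id)
    return {"status": order["status"]} if order else {"error": "order not found"}

def process_return(order_id: str, reason: str) -> dict:
    """Only allowed after identity verification and within the return policy."""
    order = ORDERS.get(order_id)
    if order is None:
        return {"error": "order not found"}
    order["status"] = "return_pending"
    return {"status": "return_pending", "reason": reason}

ORDER_MANAGEMENT_TOOLS = {
    "verify_identity": verify_identity,
    "check_order_status": check_order_status,
    "process_return": process_return,
}
```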
Task design and complexity
The tasks span a spectrum of complexity:
- Simple tasks: Single-function calls (e.g., “What’s the status of order #12345?”).
- Complex tasks: Multi-step processes requiring conditional logic and sequential API invocations (e.g., disputing a transaction, generating a combined report, and notifying multiple stakeholders).
All tasks are human-verified to ensure realism and appropriate difficulty, enabling nuanced assessment of an agent’s reasoning and tool-use abilities.
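Concretely, a simple task and a complex task might be specified along the following lines. The field names and values are hypothetical and mirror the illustrative structures above rather than the benchmark's actual schema.

```python
# Illustrative task specifications; field names and values are hypothetical.
simple_task = {
    "goal": "Ask for the status of order #12345.",
    "expected_calls": ["check_order_status"],
    "expected_state": {},                        # read-only task: no state change expected
}

complex_task = {
    "goal": (
        "Dispute last week's charge, request a combined report for the period, "
        "and ask that both account holders be notified."
    ),
    "expected_calls": [
        "verify_identity",          # security protocol must come first
        "open_dispute",
        "generate_report",
        "notify_account_holders",
    ],
    "expected_state": {"disputes_open": 1, "notifications_sent": 2},
}
```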
Evaluation methodology
We evaluate agents on two primary dimensions:
- Accuracy: Whether the agent completes the task correctly, measured by comparing the final system state and required outputs against ground truth.
- Efficiency: The number of conversational turns and total token usage, reflecting the agent’s conversational economy.
Evaluations are conducted in text-only and voice-based modalities, with optional noise injection to test robustness under realistic audio conditions.
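A minimal sketch of how these two dimensions could be scored is shown below. The comparison logic is intentionally simpler than a production grader, and the function names are illustrative.

```python
# Minimal scoring sketch; function names and comparison logic are illustrative only.

def score_accuracy(final_state: dict, expected_state: dict,
                   calls_made: list[str], expected_calls: list[str]) -> float:
    """1.0 if the final system state and the required calls both match ground truth."""
    state_ok = all(final_state.get(k) == v for k, v in expected_state.items())
    calls_ok = all(c in calls_made for c in expected_calls)
    return 1.0 if state_ok and calls_ok else 0.0

def score_efficiency(transcript: list[dict]) -> dict:
    """Secondary metrics: conversational turns and total token usage."""
    turns = len(transcript)
    tokens = sum(msg.get("tokens", 0) for msg in transcript)
    return {"turns": turns, "tokens": tokens}
```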
Implementation details
The benchmark is implemented in Python, featuring:
- Modular definitions: Easy addition of new domains, tasks, and evaluation metrics.
- Client-agent simulation: Framework for realistic human-agent dialogue flows.
- Multi-provider support: Compatibility with leading model APIs (e.g., OpenAI, Google).
- Voice processing: Built-in TTS and STT components, with configurable noise settings.
We will be releasing an open-source version in the near future. The architecture ensures that researchers and practitioners can extend it to emerging use cases and modalities.
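As a rough sketch of how the client-agent simulation loop could be wired up, including the optional voice surface, consider the following. The respond, TTS, and ASR hooks are placeholder interfaces introduced for illustration, not the framework's actual APIs.

```python
# Illustrative simulation loop; client_respond / agent_respond / tts_fn / asr_fn are
# placeholder hooks, not the framework's real interfaces.
from typing import Callable, List, Optional

def run_dialogue(
    client_respond: Callable[[List[dict]], str],        # simulated client (scripted or LLM-driven)
    agent_respond: Callable[[List[dict]], str],          # agent under test, any provider behind this hook
    max_turns: int = 20,
    tts_fn: Optional[Callable[[str], bytes]] = None,     # text -> audio, enables the voice surface
    asr_fn: Optional[Callable[[bytes], str]] = None,     # audio -> text, noise applied upstream
) -> List[dict]:
    """Alternate client and agent turns, optionally routing client utterances through TTS/ASR."""
    history: List[dict] = []
    for _ in range(max_turns):
        utterance = client_respond(history)
        if tts_fn and asr_fn:
            # Voice surface: synthesize, optionally noise-inject, then transcribe.
            utterance = asr_fn(tts_fn(utterance))
        history.append({"role": "client", "content": utterance})
        if "[done]" in utterance.lower():                 # illustrative termination signal
            break
        history.append({"role": "agent", "content": agent_respond(history)})
    return history
```

Because the agent sits behind a single callable, the same loop can drive different model providers and either modality without changing the evaluation code.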
Experimental results
Initial experiments across leading models (e.g., GPT-4o, GPT-4.5, GPT-4.1, and open-source Llama variants) yielded the following key findings:
- Financial transactions are hardest: Due to strict verification and precise calculations, all models showed the lowest accuracy in the financial domain.
- Voice vs. text gap: Voice-based interactions averaged a 5–8% drop in accuracy compared to text, underscoring challenges in speech recognition and synthesis.
- Complexity penalty: Tasks requiring four or more function calls saw a 10–15% performance decline across models.
Discussion and insights
The results reveal several important trends:
- Domain specialization: Even leading frontier models struggle with specialized workflows, highlighting the need for domain-adapted training or retrieval augmentation.
- Security protocol compliance: Agents often skipped mandatory verification steps, especially in financial tasks—an area demanding improved alignment between language understanding and procedural constraints.
- Multi-step reasoning: Performance drops significantly in workflows involving conditional logic, indicating that chaining tool calls remains a major hurdle.
- Voice challenges: Errors in speech processing can compound across turns, suggesting a need for more robust ASR/TTS pipelines tailored for noisy enterprise settings.
Limitations and future directions
While the benchmark is comprehensive, it still has limitations:
- Synthetic environments: Real-world edge cases and user digressions may not be fully captured.
- Lack of personalization: The benchmark does not test long-term memory or user preferences.
- Subjective qualities: Measures like empathy, politeness, or conversational naturalness are not evaluated in this initial version.
- English-only: Non-English enterprise operations remain unexplored.
Future extensions could include multilingual tasks, long-term user modeling, and subjective user experience evaluations to further enrich the benchmark.
Conclusion
Our approach offers a critical tool for advancing enterprise AI agents by providing a unified, extensible benchmark that stresses domain knowledge, tool integration, and voice-based interaction. The planned open-source release will invite community contributions to broaden domains, enhance realism, and refine evaluation metrics. As AI assistants become ever more integral to business processes, our benchmark will play a key role in guiding the development of agents that are not only conversationally fluent but also operationally reliable and secure.