BFCL Audio: A Benchmark for Audio-Native Function Calling

We are excited to announce a Salesforce AI Research and Berkeley collaboration: BFCL Audio—a new benchmark that extends BFCL to the audio domain!

A little Berkeley lore: back in 2022, we couldn’t find open-source models that handled zero-shot function calling reliably—so we trained our own. We released Gorilla OpenFunctions v1 (and later v2), then ran into the obvious next question: how do we measure whether models are actually good at function calling? That question became BFCL.

Since then, function-calling evaluation has turned out to be far richer—and full of open research questions—than anyone expected.

🧠 BFCL v1 introduced AST-based (Abstract Syntax Tree) evaluation—still the gold standard for zero-shot calls.

🤝 BFCL v2 turned into a community effort, with thousands of APIs and prompts contributed by hobbyists and enterprises.

🔁 BFCL v3 expanded into multi-turn, multi-step evaluation with state-based tracking.

🧭 BFCL v4 focuses on tool-calling in real-world, agentic settings (web search, memory, format sensitivity).

🚀 Today, BFCL is a foundational benchmark used across leading labs, with hundreds of contributors (and thousands more sharing APIs and data). We’re deeply grateful for the community’s trust—it has shaped BFCL’s evolution.

As models make their way into multimodal enterprise use cases, it’s time for a benchmark that measures how well they handle enterprise voice workloads.

💻 Run the benchmark: https://github.com/ShishirPatil/gorilla

Real products don’t live in pure text. Voice shows up wherever hands are busy or eyes are busy: phone support, in-car assistants, smart homes, wearables, voice note apps, and accessibility workflows. In these settings, the agent must balance natural, low-latency dialog with reliable, precise action execution—which means evaluation must consider both.

From the enterprise perspective, businesses seek to automate their customer support and call center operations. This often involves handling a high volume of diverse customer inquiries, scheduling appointments, resolving issues, and providing information. The need for precise function calling is paramount in these scenarios, as errors can lead to frustrated customers, inefficient operations, and lost revenue. For example, a misheard account number or an incorrect appointment time can severely impact customer satisfaction and operational efficiency. Furthermore, the ability to integrate with existing CRM and backend systems is crucial for seamless automation, making robust and reliable audio-native function calling a significant advantage.

Architectural Paths for Voice Agents

There are two common architectures:

  1. End-to-End (E2E) speech ↔ speech

A single model consumes audio natively and can produce audio directly.

Strengths:

  • Natural prosody and low latency (no cascaded hops).
  • Unified reasoning over acoustics + semantics (can sometimes recover what ASR would miss).

Trade-offs:

  • Tool-call precision can lag without extra structure.
  • Fewer knobs for domain adaptation (custom lexicons, per-domain biasing).
  • Very limited model availability (e.g., GPT-4o, Gemini 2.5).
  2. Cascaded (ASR → LLM → TTS)

Audio is transcribed to text (ASR), processed by a text LLM, then spoken via TTS.

Strengths:

  • Reuses mature text LLM stacks and evaluation tooling.
  • Swap components independently (ASR/TTS/LLM).
  • Easy to add guardrails (regex/constrained decoding/AST checks) on the text side.

Trade-offs:

  • ASR errors become a bottleneck—often the critical failure point.
  • Latency can add up across hops if not streaming.
  • The LLM never “hears” the waveform, so it can’t use acoustic cues to recover intent.

ASR is very good—but not perfect. While a small percentage of errors may be acceptable for simple transcription, they can be catastrophic for function calling, where precision is paramount. The link between a minor ASR error and a total task failure is direct. For instance, a user might be interacting with a financial application and state their Employer Identification Number. The ASR system might correctly transcribe most of the number but miss or substitute a single digit. Even if the overall text transcription appears largely correct to a human observer, the resulting function call will pass an invalid EIN to the backend system, causing the API call to fail. The rigidity of the API endpoint means there is no room for “close enough”.
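
To make this concrete, a cascaded stack can run a cheap text-side format check before issuing the call. Below is a minimal sketch (the EIN pattern and the lookup_business call are hypothetical, not part of BFCL Audio). Notice that a single substituted digit still passes the format check, which is exactly why the agent should confirm the value with the user:

import re

def validate_ein(ein: str) -> bool:
    # EINs are nine digits, conventionally written XX-XXXXXXX.
    return re.fullmatch(r"\d{2}-?\d{7}", ein) is not None

intended = "12-3456789"   # what the user actually said
heard = "12-3456786"      # ASR transcript with one substituted digit

if validate_ein(heard):
    # The format check passes, so the pipeline would happily issue
    # lookup_business(ein=heard) -- a hypothetical backend call that is
    # rejected only after the fact, with the user already frustrated.
    print("format OK, calling backend with", heard)
else:
    # Only grossly garbled values land here; a single wrong digit is
    # invisible to format validation, which is why the agent should read
    # the value back to the user before acting.
    print("asking the user to repeat the EIN")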

Compared with typed input, audio introduces systematic shifts:

  • Conversational fillers: “uh”, “hmm”, “you know”.
  • Acoustic artifacts and issues absent from text corpora.
  • Accents, background noise, and cross-talk degrade recognition.
  • Homophones & named entities get misheard:
    • John vs Jon
    • `final_report.pdf` vs `final report.pdf` vs `finalReport.pdf`
  • Even strong ASR systems still propagate non-trivial word error rates, and crucially, the text LLM never sees the raw audio to recover intent.

Benchmark Construction

We construct BFCL Audio from existing BFCL tasks in three steps.

1) Natural Paraphrasing

We take existing BFCL queries (single-turn non-live and multi-turn) and rewrite them into conversational-style speech.

Original (text BFCL):
“I need to send a letter to Liam Neeson. Find his contact information for me.”

Paraphrased (audio BFCL):
“Um, can you get Liam Neeson, that’s L-I-A-M N-E-E-S-O-N, Liam Neeson’s contact info, oh, so I can send him a letter?”
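
One plausible way to produce such paraphrases, shown purely as a sketch (the prompt wording and gpt-4o model choice are illustrative, not necessarily the exact pipeline we used), is to prompt a text LLM to rewrite each query in a spoken register while keeping entity values intact:

from openai import OpenAI

client = OpenAI()

PARAPHRASE_PROMPT = (
    "Rewrite the following request as natural, conversational speech. "
    "Add fillers and self-corrections where they sound natural, spell out "
    "names that could be misheard, and keep every entity value unchanged.\n\n"
    "Request: {query}"
)

def paraphrase(query: str, model: str = "gpt-4o") -> str:
    # Returns a spoken-style rewrite of a text BFCL query.
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PARAPHRASE_PROMPT.format(query=query)}],
    )
    return resp.choices[0].message.content

print(paraphrase("I need to send a letter to Liam Neeson. Find his contact information for me."))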

2) Synthetic Audio Generation

We then synthesize audio from the paraphrases using a variety of TTS engines (Qwen, OpenAI, Gemini, ElevenLabs, Cartesia). Each engine has its own style and prosody; we sample them to diversify inputs.

  • For E2E models, the audio snippet is the input.
  • For cascaded models, we provide transcripts (below).
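
As a rough sketch, synthesizing one paraphrase with one of these TTS engines might look like the following (assuming the OpenAI speech endpoint; the model and voice names are illustrative):

from openai import OpenAI

client = OpenAI()

def synthesize(text: str, out_path: str, voice: str = "alloy") -> None:
    # Render one paraphrased utterance to an audio file with one TTS engine;
    # sampling across engines and voices diversifies prosody and style.
    resp = client.audio.speech.create(model="tts-1", voice=voice, input=text)
    with open(out_path, "wb") as f:
        f.write(resp.read())  # resp.read() returns the raw audio bytes

synthesize(
    "Um, can you get Liam Neeson, that's L-I-A-M N-E-E-S-O-N, "
    "Liam Neeson's contact info? Oh, so I can send him a letter?",
    "liam_neeson_query.mp3",
)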

3) Three-Tier ASR Transcription (for Pipelined Setups)

Because the pipelined systems can’t access the waveform, we pre-transcribe every audio sample using three ASR systems (OpenAI, ElevenLabs, Deepgram) and evaluate models separately on each transcript to expose sensitivity to ASR choices.

OpenAI: "Um, can you get Liam Neeson-that's L-I-A-M N-E-E-S-O-N- Liam Neeson's contact info? Oh, so I can send him a letter?"

ElevenLabs: "Um, can you get Liam Neeson, that's L-I-A-M N-E-E-S-O-N, Liam Neeson's contact info? Oh, so I can send him a letter?"

DeepGram: "Can you get Liam Neeeson? That's l I a m n e e s o n, Liam Neeeson's contact info. Oh, so I can send him a letter?" 

(Also, notice the extra e in Neeeson for DeepGram output)
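
For the pipelined setups, here is a minimal sketch of the pre-transcription step for one provider (assuming the OpenAI transcription endpoint; the ElevenLabs and Deepgram transcripts are produced through their respective APIs):

from openai import OpenAI

client = OpenAI()

def transcribe_openai(audio_path: str) -> str:
    # Pre-transcribe one audio sample; cascaded models are evaluated on these
    # transcripts and never see the raw waveform.
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(model="whisper-1", file=f)
    return result.text

# Each sample gets three transcripts (OpenAI, ElevenLabs, Deepgram), and every
# pipelined model is scored separately on each one.
transcripts = {"openai": transcribe_openai("liam_neeson_query.mp3")}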

> Note: Only user messages undergo these transformations. Any system messages remain in their original text form.

Evaluation Protocol & Metric Changes

To inform models that they are in an audio setting, we prepend a short system prompt to each conversation:

You are a voice assistant that interacts with the user exclusively through spoken conversation. You receive user utterances as text transcribed by an upstream ASR system and your replies are delivered to the user through a TTS system. Follow the rules below at all times:

1. Language

* Mirror the user's language. Respond in the same language detected in the transcription.

2. Robustness to ASR Errors (Important)

* Although the upstream ASR system is designed to be robust, it may still make mistakes.
* Do not trust the transcription text blindly, especially on important information. You should assume the transcript may contain recognition mistakes.
* If the text appears garbled, double check with the user instead of guessing.

3. Clarity for TTS

* When responding to the user, you should **spell out acronyms** as separate letters with spaces (“A I M L”), and **chunk long numbers** into 2- or 3-digit groups, separated by short pauses (“one-two-three, four-five-six”).
* Favor spoken-language style: short sentences, everyday vocabulary, and natural contractions.

Turn Semantics (Why Audio is Different)

In text-BFCL, each turn continues as long as the model keeps emitting valid non-empty tool calls (decoded by decode_exec). The turn ends the moment the model emits any non-tool message.

That’s not ideal in a voice setting. Because of homophones and ASR issues, a good audio agent should proactively clarify spellings or key values before acting. Penalizing that behavior would encourage reckless tool calls.

Clarification Mechanism

We add an LLM judge plus a simulated user to support spelling/disambiguation clarifications without rewarding chitchat.

  • If the model asks for a spelling-related clarification (as judged by the LLM), we generate a concise user reply using a whitelist of allowed clarifications for that query (e.g., person names, file names, IDs).
  • Only spelling/format confirmations count.
  • Effect: every task can become multi-step, but only allowed clarifications are honored.
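
Putting these pieces together, here is a simplified sketch of the audio turn loop (the respond/execute/judge callables are illustrative stand-ins for the harness internals, not its actual API):

from typing import Callable

def run_audio_turn(respond: Callable, execute: Callable, judge: Callable, history: list):
    """Drive one audio-BFCL turn (illustrative sketch, not the harness code).

    respond(history) -> either a list of decoded tool calls or a plain text reply
    execute(calls)   -> tool results to append to the history
    judge(reply)     -> {"allowed": bool, "message": str} from the LLM judge
    """
    while True:
        reply = respond(history)
        if isinstance(reply, list) and reply:
            # Valid, non-empty tool calls keep the turn alive, as in text-BFCL.
            history.append({"role": "tool", "content": execute(reply)})
            continue
        verdict = judge(reply)
        if verdict["allowed"]:
            # Approved spelling clarification: the simulated user answers from
            # the whitelist and the turn continues instead of ending.
            history.append({"role": "user", "content": verdict["message"]})
            continue
        # Any other non-tool message ends the turn.
        return reply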

Example Interaction (Idealized):

Per-message whitelist example:

Message: "Um, can you get Liam Neeson, that's L-I-A-M N-E-E-S-O-N, Liam Neeson's contact info? Oh, so I can send him a letter?"

Allowed clarifications:
{
    "person_name": "Liam Neeson"
}

Judge prompt:

  • The judge sees the intended request, the ASR text, the assistant’s message, and the allowed clarification keys.
  • It approves only if the assistant is explicitly confirming spellings/values that appear in the whitelist. Otherwise, it rejects.

You are a judge for an audio-chat scenario where a user speaks and an ASR system transcribes their speech for the assistant. The assistant only sees text (the ASR transcript), which is likely to contain transcription errors.

You are given:
- intended_request: the user's original, ground-truth intent.
- asr_text: the ASR-transcribed text the assistant saw.
- allowed_clarifications: a set of fields with canonical spellings/values the user can confirm (e.g., names, IDs, emails, dates, numbers).
- assistant_message: the assistant's latest message.

Your job: decide whether assistant_message is a clarifying question specifically about spelling/verification of intent or exact strings/values that could plausibly be misheard (e.g., names, organizations, emails, serials/IDs, numbers, dates, addresses, SKUs). Do not allow general follow-ups (preference, steps to proceed, etc.).

Decision rules:
1. Classify the message as a spelling confirmation only if it explicitly asks to verify the exact spelling/format/value of one or more items (e.g., “Is it Mikaela or Michaela?”, “Can you spell the email?”, “Is the order number A1B-52?”).
2. The request must be reasonable given the ASR risk (i.e., the item is a proper noun, key value, or easily misheard token relevant to the task).
3. To approve (allowed=true), all the topics the assistant asks to confirm must be present in allowed_clarifications. If any requested item is absent or ambiguous, set allowed=false.
4. Output only a JSON object with two fields:
- allowed: boolean
- message: string (a concise simulated user reply only when allowed=true; otherwise empty "").
5. When allowed=true, compose message by supplying only the requested values with correct spelling/format from allowed_clarifications. Keep it brief (one short sentence or a compact list). Do not include extra commentary, JSON, or fields the assistant didn't request.
6. If the assistant's message is not a confirmation request, touches topics outside spelling/format/intent verification, or requests values not available in allowed_clarifications, return allowed=false with message="".

Edge cases:
- If the assistant mixes spelling confirmation with unrelated questions, treat it as not allowed unless the spelling part stands alone and you can fully answer it from allowed_clarifications.
- Treat homophones and near-matches as spelling checks (e.g., “Brian/Bryan”, “Steven/Stephen”, letters vs. digits).
- Normalize case/diacritics but preserve canonical spelling in the final answer.
- Never reveal intended_request verbatim; only return the specific confirmed values.

The user's original intended request is: {the original text bfcl question}

The ASR-transcribed output is: {the transcribed text from the audio, which is also the input to the model}

assistant_message: {the model's response}

allowed_clarifications (topic -> answer): {the allowed_clarifications}
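
As a rough sketch of how such a judge might be invoked (the gpt-4o model choice and the JSON response_format here are assumptions, not necessarily what the harness uses):

import json
from openai import OpenAI

client = OpenAI()

def judge_clarification(judge_instructions: str, filled_fields: str) -> dict:
    # judge_instructions: the judge prompt above; filled_fields: the per-example
    # block with intended_request, asr_text, assistant_message, and
    # allowed_clarifications substituted in.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": judge_instructions},
            {"role": "user", "content": filled_fields},
        ],
        response_format={"type": "json_object"},  # expect {"allowed": ..., "message": ...}
    )
    return json.loads(resp.choices[0].message.content)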

Evaluation metric:

We keep the same metrics as text-BFCL—AST for single-turn; state-based + response-based checks for multi-turn—and ignore the clarification turns when computing the final function-calling score. In other words, clarifications can enable a correct action, but they don’t directly inflate the score.
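
In implementation terms, one way to realize this is to flag injected clarification exchanges and drop them before the standard checkers run; a minimal sketch (the is_clarification flag is illustrative):

def strip_clarification_turns(conversation: list) -> list:
    # Remove judge-approved clarification exchanges (flagged when injected) so
    # the remaining trace is scored exactly like text-BFCL: AST checks for
    # single-turn, state- and response-based checks for multi-turn.
    return [msg for msg in conversation if not msg.get("is_clarification", False)]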

Results

We present results on BFCL Audio.

When using the pipelined approach, we typically see a ~10–20% drop relative to text-mode BFCL, largely because models mishandle dictated entities (names, IDs, filenames) once they pass through ASR.

In the same audio setting, there is a measurable gap between E2E models and pipelined systems: end-to-end systems degrade even more on BFCL Audio relative to the original text baseline. We suspect this reflects weaker post-training for multimodal function calling. While E2E models are the strongest in naturalness and responsiveness during chitchat, they underperform pipelined systems on function calling.

Example Failure Mode

Blind ASR trust results in the wrong tool call!

User (ASR): Move final report.pdf from the document directory to the temp directory in document. Oh, and create the directory if it doesn't exist.

Allowed clarifications:
{
  "source_file": "final_report.pdf",
  "source_directory_name": "document",
  "destination_directory_name": "document/temp"
}

Model tool call:
mv("final report.pdf", destination="temp")

Tool response:
Error: no such file exists…

Model reply:
I created the 'temp' directory inside 'document', but couldn’t find “final report.pdf”. Where is that file?

What we wanted instead:

Ask for confirmation on the exact filename first (it’s on the whitelist), then proceed.

Contributors: Huanzhi Mao (Salesforce AI Research, UC Berkeley), Antonio A. Ginart (Salesforce AI Research), Joseph E. Gonzalez (UC Berkeley), John R. Emmons (Salesforce AI Research)
