
This post provides the basics to help understand function calling in LLMs, a key capability that enables LLM agents to automate complex workflows.
- What is function calling in LLM agents?
- Examples of agents interacting with tools
- Comparing the tool calling abilities of LLMs
- Benchmarks for tool calling in LLMs
- Conclusion
What is function calling in LLM agents?
Function calling is the core capability that enables LLMs to perform actions rather than just generate text, and it is currently a major focus of AI and agent development. Function calling is also known as tool calling; the two terms coexist for historical reasons. At first, LLMs only had access to a small selection of functions. Later, models became able to handle larger collections of external APIs, and the term "tool" (as in "toolbox") caught on.
An easy way to understand tool calling is that it lets LLMs actually “do stuff” instead of just “talk about stuff”. It enables LLMs to interact with external tools (with “tools” in this context referring to external functions or APIs made available to the LLM).
For example, if you ask an LLM without function calling ability about the current weather in Vienna, it may generate text that describes a weather situation. But the model has no actual knowledge of the current weather at the location specified in the prompt. The generated text may read like a weather report, yet it may have nothing to do with the actual weather in Vienna. With a tool that lets the LLM fetch the current weather data (like temperature and precipitation) for the provided location, however, the model can provide accurate, up-to-date weather information.
The tool in this context could be a Python (or any other callable) function with a fitting name, a description of what the tool does, and a description of its parameters (ideally with the data format defined, for example, via a JSON schema). When answering a user prompt, instead of just generating text, the LLM can scan its available tools and, having found a relevant one that can provide weather information, request a call to that function as a structured response.
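For illustration, here is a minimal sketch of such a tool and its description. The function name, parameters, and schema wrapper shown here are assumptions; each provider defines its own exact format:

```python
import json

# Hypothetical weather tool; a real implementation would call a weather API.
def get_current_weather(location: str, unit: str = "celsius") -> dict:
    return {"location": location, "temperature": 21, "unit": unit, "precipitation": "none"}

# Description handed to the LLM: name, purpose, and a JSON schema for the parameters.
weather_tool = {
    "name": "get_current_weather",
    "description": "Get the current weather (temperature, precipitation) for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name, e.g. 'Vienna'"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["location"],
    },
}

# Instead of free text, the model can then answer with a structured tool-call request:
requested_call = {"name": "get_current_weather", "arguments": {"location": "Vienna"}}
print(json.dumps(requested_call))
```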
The following diagram shows how this process works in detail. Essentially, the LLM has to ask the Agent Scaffolding (the system surrounding the LLM) to perform a tool call on its behalf:
This sounds complex, but implementing tool calling for an LLM is often much simpler in practice. Luckily, most of the magic (i.e. explaining to the LLM which tools it can use and which format it needs to adhere to) is handled either by the LLM provider's API or by modern agent frameworks.
As the diagram shows, how the LLM processes and invokes tools from the Agent Scaffolding has to be tailored to the specific LLM (and, again, is often handled by the provider's API or agent framework). However, how the Agent Scaffolding invokes tools on behalf of the LLM (the blue "tool call") has been de facto standardized by the Model Context Protocol (MCP).
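To make this concrete, here is a minimal, provider-agnostic sketch of such a scaffolding loop. The `call_llm` stub and the message format are placeholders, not any vendor's actual API:

```python
import json

# Hypothetical stand-in for a provider API or agent framework call.
# The canned logic simulates a model that first requests a tool call,
# then produces a final answer once it sees the tool result.
def call_llm(messages: list[dict], tools: list[dict]) -> dict:
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "get_current_weather",
                              "arguments": {"location": "Vienna"}}}
    return {"content": "It is currently 21 °C in Vienna with no precipitation."}

def get_current_weather(location: str) -> dict:
    return {"location": location, "temperature": 21, "precipitation": "none"}

TOOL_REGISTRY = {"get_current_weather": get_current_weather}  # name -> Python callable

def run_agent(user_prompt: str, tools: list[dict]) -> str:
    messages = [{"role": "user", "content": user_prompt}]
    while True:
        response = call_llm(messages, tools)
        if "tool_call" in response:   # the LLM asks the scaffolding to run a tool
            call = response["tool_call"]
            result = TOOL_REGISTRY[call["name"]](**call["arguments"])
            messages.append({"role": "tool", "name": call["name"],
                             "content": json.dumps(result)})
        else:                         # the LLM produced a final text answer
            return response["content"]

print(run_agent("What is the current weather in Vienna?", tools=[]))
```

The important part is the loop: the model never executes anything itself; it only emits a structured request, and the scaffolding runs the function and feeds the result back.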
Tools make our interactions with LLMs (and especially agents) more useful: they let users interact with complex systems through natural language prompts, while the LLM handles the underlying (function-calling) operations. This is a key feature in enabling complex automation tasks and agentic workflows, and parallel function calling further extends its usefulness for complex use cases.
Examples of agents interacting with tools
Being able to access APIs enables models to query databases for live data or to interact with external software. Examples include the following (a minimal sketch of such tools follows the list):
- A calculator function, so that the LLM calculates values rather than generating text
- Querying the current inventory of a certain item (for instance in an e-commerce environment)
- Retrieving data using a stock market API
- A currency calculator based on real-time data
- Connecting the model to office or collaboration applications like Gmail or GitHub (so the agent can send emails or open pull requests)
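As a sketch, two of the examples above could be implemented as plain Python functions. The names and data are made up for illustration; each function would additionally be registered with the LLM via a name, description, and parameter schema as shown earlier:

```python
# Hypothetical calculator tool: the LLM sends an expression instead of
# "guessing" the numeric result in generated text.
def calculate(expression: str) -> float:
    allowed = set("0123456789+-*/(). ")
    if not set(expression) <= allowed:
        raise ValueError("Only basic arithmetic is allowed")
    return eval(expression)  # restricted to the arithmetic characters above

# Hypothetical inventory lookup: a real tool would query the shop database.
def get_inventory_count(item_id: str) -> int:
    fake_inventory = {"sku-123": 42, "sku-456": 0}
    return fake_inventory.get(item_id, 0)

print(calculate("(3 + 4) * 12"))       # 84
print(get_inventory_count("sku-123"))  # 42
```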
Comparing the tool calling abilities of LLMs
Most providers of frontier LLMs, including Anthropic, Cohere, Google, Mistral, and OpenAI, offer models that natively support some form of tool calling in their APIs. However, the conventions for tool calls and the way tool schemas are formatted differ across providers. There are also differences in whether you can force the model to make a tool call. Finally, as a fallback, it is also possible to implement tool calling with models that are not fine-tuned to support it natively.
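For models without native tool calling, the usual fallback is to describe the available tools in the prompt and instruct the model to answer with structured JSON that the scaffolding then parses. A rough sketch (the prompt wording and parsing logic are illustrative assumptions, not any provider's API):

```python
import json

# Hypothetical tool description injected into the (system) prompt.
FALLBACK_PROMPT = """You can use the following tool:
get_temperature(city_name: str) -> float

If the tool is needed, reply ONLY with JSON of the form:
{"tool": "get_temperature", "arguments": {"city_name": "..."}}
Otherwise, answer in plain text."""

def parse_tool_call(model_output: str) -> dict | None:
    """Interpret the model output as a tool-call request, if possible."""
    try:
        parsed = json.loads(model_output)
        if isinstance(parsed, dict) and "tool" in parsed:
            return parsed
    except json.JSONDecodeError:
        pass
    return None  # treat the output as a normal text answer

# Parsing a (simulated) model reply:
print(parse_tool_call('{"tool": "get_temperature", "arguments": {"city_name": "Vienna"}}'))
```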
See the following resource from Analytics Vidhya for a comparative overview of tool-calling features across different LLM providers:

We also link to the documentation of some of the main LLM providers for details on the function calling abilities of their models: Anthropic, Cohere, Google, Meta, Mistral, OpenAI.
Benchmarks for tool calling in LLMs
Since tool calling is such a central factor in the real-life usability of agentic workflows with LLMs, models should be evaluated on how well they perform at function calling. After all, the model still has to understand which tools are available and generate correct instructions for the tools it wants to use. As LLMs are nondeterministic by design, there is no guarantee that tool calling works flawlessly every time. Here's an introduction to three of the most popular tool-calling evaluations for LLMs.
Berkeley Function Calling Leaderboard (BFCL)
The Berkeley Function-Calling Leaderboard evaluates the ability of models to accurately call functions. The benchmark contains real-world data, and the leaderboard is periodically updated to include the latest models.
BFCL evaluates LLMs based on different categories of function-calling:
Simple function calls
The model is prompted to call a single function based on a single function description. It just has to format the arguments properly to call the tool.
Example:
- Functions:
  - get_temperature(city_name)
- Query: “What is the temperature in Vienna?”
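Roughly speaking, BFCL checks the function call the model produces. For this category, a correct answer boils down to one well-formed call, sketched here as a generic name/arguments structure (the representation is illustrative, not BFCL's exact scoring format):

```python
# Expected behavior: exactly one call, with properly formatted arguments.
expected_call = {"name": "get_temperature", "arguments": {"city_name": "Vienna"}}
```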
Parallel function calls
A single user prompt requires multiple tool calls to the same tool. The model is expected to determine how many times (and with what arguments) the function needs to be invoked.
Example:
- Functions:
  - get_temperature(city_name)
- Query: “What are the temperatures in Vienna and Linz?”
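Using the same illustrative notation, a correct answer here requests the same tool twice in a single turn:

```python
# Expected behavior: two parallel calls to the same function, one per city.
expected_calls = [
    {"name": "get_temperature", "arguments": {"city_name": "Vienna"}},
    {"name": "get_temperature", "arguments": {"city_name": "Linz"}},
]
```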
Multiple function calls
A single call needs to be made, with 2-4 tools available. The model is expected to choose the right tool to use.
Example:
- Functions:
  - get_temperature(city_name)
  - get_sun_hours(city_name)
- Query: “What is the temperature in Vienna?”
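Here the point being scored is tool selection; in the same illustrative notation:

```python
# Expected behavior: the model picks get_temperature (and ignores get_sun_hours)
# and calls it exactly once.
expected_call = {"name": "get_temperature", "arguments": {"city_name": "Vienna"}}
```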
Parallel multiple function calls
Multiple functions are called in parallel. Several function descriptions are provided to the model, and each of them may need to be invoked zero or more times. The model is expected to determine which tools to use, how many times, and with what arguments.
Example:
- Functions:
  - get_temperature(city_name)
  - get_sun_hours(city_name)
- Query: “What are the temperatures in Vienna and Linz? And how many sun hours are there?”
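One plausible correct answer in the same notation (the query leaves it open which cities the sun-hours question refers to; here it is read as both):

```python
# Expected behavior: both tools are used, each once per city.
expected_calls = [
    {"name": "get_temperature", "arguments": {"city_name": "Vienna"}},
    {"name": "get_temperature", "arguments": {"city_name": "Linz"}},
    {"name": "get_sun_hours", "arguments": {"city_name": "Vienna"}},
    {"name": "get_sun_hours", "arguments": {"city_name": "Linz"}},
]
```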
Relevance detection
Scenarios where all of the provided functions are irrelevant and therefore should not be invoked.
Example:
- Functions:
  - get_sun_hours(city_name)
- Query: “What is the temperature in Vienna?”
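In the same notation, the correct behavior here is to make no call at all:

```python
# Expected behavior: none of the available tools can answer the question,
# so the model should not request any call.
expected_calls: list[dict] = []
```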
With the release of BFCL version 3, the benchmark was extended to support multi-turn and multi-step scenarios.
Single-Step / Single-Turn
There is only one initial user query, a single turn for the LLM to select the function(s), and a single final response. All of the scenarios outlined above are single-step and single-turn.
Multi-Step
The LLM has to execute multiple function calls sequentially, where later calls may depend on the results of earlier ones.
Example:
- Functions:
  - get_temperature(coordinate_x, coordinate_y)
  - get_coordinates(city_name)
- Query: “What is the temperature in Vienna?”
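A sketch of the chaining this requires; the stubs below are made up so the snippet runs, and the point is that the second call depends on the first call's result:

```python
# Hypothetical stubs standing in for the benchmark's functions.
def get_coordinates(city_name: str) -> tuple[float, float]:
    return {"Vienna": (48.21, 16.37)}.get(city_name, (0.0, 0.0))

def get_temperature(coordinate_x: float, coordinate_y: float) -> float:
    return 21.0  # dummy value; a real tool would look up the temperature

# Step 1: resolve the city name to coordinates.
x, y = get_coordinates(city_name="Vienna")
# Step 2: feed the intermediate result into the second call.
print(get_temperature(coordinate_x=x, coordinate_y=y))
```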
Multi-Turn
There are multiple rounds of interaction between the LLM and a simulated user, each of which can itself involve multiple steps.
Example:
- Functions:
  - get_temperature(coordinate_x, coordinate_y)
  - get_coordinates(city_name)
- Initial query: “What is the temperature outside?”
- Follow-up answer: “I am currently in Vienna.” (assuming that the model asks where the user is)
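One way to picture the resulting exchange is as a message transcript (roles and wording are illustrative, not BFCL's exact format):

```python
# The model first asks a clarifying question, then chains two tool calls
# once the simulated user has supplied the missing location.
conversation = [
    {"role": "user",      "content": "What is the temperature outside?"},
    {"role": "assistant", "content": "Where are you located?"},
    {"role": "user",      "content": "I am currently in Vienna."},
    {"role": "assistant", "tool_call": {"name": "get_coordinates",
                                        "arguments": {"city_name": "Vienna"}}},
    {"role": "tool",      "content": "(48.21, 16.37)"},
    {"role": "assistant", "tool_call": {"name": "get_temperature",
                                        "arguments": {"coordinate_x": 48.21,
                                                      "coordinate_y": 16.37}}},
    {"role": "tool",      "content": "21.0"},
    {"role": "assistant", "content": "It is currently about 21 °C in Vienna."},
]
```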
Nexus Function Calling Leaderboard (NFCL)
The Nexus Function Calling Leaderboard evaluates models based on their ability to perform function calling and API usage. The benchmark has no relevance-detection component, but it does evaluate multi-step function calls (where you need the output of one tool to query the next). NFCL is a single-turn function-calling evaluation. Some of the use cases evaluated aren't typical, which helps avoid overfitting. Unfortunately, only three models were ever benchmarked on NFCL, which limits its real-life usefulness.
NFCL contains 9 tasks ranging from cybersecurity to climate APIs, totalling 762 cases. It is a zero-shot benchmark (LLMs are prompted to handle tasks they have never seen before, without examples). The Nexus Function Calling Leaderboard covers both open-source and closed-source models.
τ-bench
τ-bench (TAU-bench for Tool-Agent-User) tests the interaction between agents and simulated users within specific domains (covering two domains at the time of writing: retail and airline).
The benchmark emulates dynamic conversations between LLM-simulated users and LLM agents, asking the agents to carry out complex tasks that require the use of API tools. The tasks in TAU-bench test the ability of agents to follow rules, gather information via external tools, reason about and remember information over long and complex contexts, and communicate with the user effectively in scenarios that mimic real-life conversations. The benchmark’s modular architecture enables its extension with new domains, tools, rules, and metrics.
The leaderboard is available on the benchmark’s GitHub page.
Conclusion
Function calling is a cornerstone of modern AI agents. It allows LLMs to invoke actions, but it is not 100% reliable yet. An important consideration is the choice of functions to expose to a model: while it might be tempting to give an agent lots of options, doing so also opens up new security risks.