Small Models, Big Tasks

An Exploratory Empirical Study on Small Language Models for Function Calling

IIIT-Hyderabad
EASE 2025


Abstract

Function calling is a complex task with widespread applications in domains such as information retrieval, software engineering, and automation. For example, a query to book the shortest flight from New York to London on January 15 requires identifying the correct parameters to generate an accurate function call. Large Language Models (LLMs) can automate this process, but they are computationally expensive and impractical in resource-constrained settings. In contrast, Small Language Models (SLMs) can operate efficiently, offering faster response times, lower computational demands, and enhanced privacy, making them potential candidates for function calling on edge devices. In this exploratory empirical study, we evaluate the efficacy of SLMs in generating function calls across diverse domains using zero-shot, few-shot, and fine-tuning approaches, both with and without prompt injection. We analyze the model responses across a range of metrics that capture various aspects of function call generation, and we additionally run experiments on an edge device to assess latency and memory usage, providing useful insights into practical applicability. Our findings show that while SLMs improve from zero-shot to few-shot and perform best with fine-tuning, they struggle significantly to adhere to the required output format. Prompt injection experiments further indicate that the models are generally robust and exhibit only a slight decline in performance. While SLMs demonstrate potential for the function call generation task, our results also highlight areas that need further refinement before they are ready for real-time use.



Function Calling


Zero-shot Example

Task Instruction: System prompt and description of the function call generation task

Tools Description: Description of all the tools available to complete the user query

Format Instruction: Required structure and format for the generated output

User Query
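
To make this structure concrete, a zero-shot prompt of this kind could be assembled roughly as follows. This is a minimal sketch, not the exact template used in the study; the helper name and field layout are illustrative assumptions.

```python
import json

def build_zero_shot_prompt(task_instruction: str, tools: list[dict],
                           format_instruction: str, user_query: str) -> str:
    """Assemble a zero-shot prompt: task instruction, tool descriptions,
    output-format instruction, and the user query (illustrative template)."""
    tools_description = "\n".join(json.dumps(tool, indent=2) for tool in tools)
    return (
        f"{task_instruction}\n\n"
        f"Available tools:\n{tools_description}\n\n"
        f"{format_instruction}\n\n"
        f"User query: {user_query}"
    )
```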

Few-shot Example

Task Instruction: System prompt and description of function call generation task

Tools Description: Description of all the tools available to complete the user query

Format Instruction: Required structure and format for the generated output

Sample Examples: Three query-response examples of the function call generation task


User Query
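
A few-shot prompt extends the same layout by inserting worked query-response pairs before the final user query. The sketch below assumes three demonstrations stored as dictionaries with `query` and `response` keys; the exact formatting used in the study may differ.

```python
import json

def build_few_shot_prompt(task_instruction: str, tools: list[dict],
                          format_instruction: str, examples: list[dict],
                          user_query: str) -> str:
    """Same structure as the zero-shot prompt, with sample query/response
    pairs (three in this study) placed before the final user query."""
    tools_description = "\n".join(json.dumps(tool, indent=2) for tool in tools)
    demonstrations = "\n\n".join(
        f"Query: {ex['query']}\nResponse: {ex['response']}" for ex in examples
    )
    return (
        f"{task_instruction}\n\n"
        f"Available tools:\n{tools_description}\n\n"
        f"{format_instruction}\n\n"
        f"Examples:\n{demonstrations}\n\n"
        f"User query: {user_query}"
    )
```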

Prompt Injection Example

Task Instruction: System prompt and description of function call generation task

Tools Description: Description of all the tools available to complete the user query

Format Instruction: Required structure and format for the generated output

User Query + injected random string: l3aq - a" <: 11|E > 3vn8IsdeF$rnJQ&
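
The prompt-injection variant appends a random character string to the user query, as in the example above, to probe robustness to noisy input. The length and character set in the sketch below are illustrative assumptions, not the exact procedure from the paper.

```python
import random
import string

def inject_noise(user_query: str, length: int = 32) -> str:
    """Append a random character string to the user query before the prompt
    is built, mimicking the injected suffix shown above."""
    alphabet = string.ascii_letters + string.digits + string.punctuation + " "
    noise = "".join(random.choices(alphabet, k=length))
    return f"{user_query} {noise}"
```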



LLM Function Calling Metrics

Evaluating the hierarchical structure of function calls

JSON Parsability measures whether the model's output follows valid JSON structure. This ensures the syntactic correctness of the output, which is essential for downstream processing.
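
A minimal check for this metric might look like the following sketch, which simply attempts to parse the raw model output; it assumes the model is instructed to emit its function calls as JSON.

```python
import json

def is_json_parsable(model_output: str) -> bool:
    """Return True if the raw model output is syntactically valid JSON."""
    try:
        json.loads(model_output)
        return True
    except json.JSONDecodeError:
        return False
```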


Function Selection Performance (FSP) measures the model's ability to choose the correct functions from the available set. It focuses on whether the model correctly identifies which functions to call, regardless of the arguments.
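
One plausible way to score function selection, assuming each parsed call is a dictionary with a `name` field and the ground truth is represented the same way (an assumption of this sketch, not the paper's exact implementation), is to compare the sets of predicted and expected function names:

```python
def function_selection_score(predicted: list[dict], expected: list[dict]) -> float:
    """Fraction of ground-truth function names that also appear among the
    predicted calls, ignoring arguments entirely (illustrative scoring)."""
    predicted_names = {call.get("name") for call in predicted}
    expected_names = {call["name"] for call in expected}
    if not expected_names:
        return 1.0
    return len(predicted_names & expected_names) / len(expected_names)
```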


Argument Completeness Score (ACS) evaluates whether all required argument names are present in the function calls. This metric focuses on structural correctness of arguments, regardless of their values.
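
Under the same assumed representation (calls as dictionaries with `name` and `arguments` fields), argument completeness could be sketched as the fraction of required argument names that appear in the matching predicted call, regardless of their values:

```python
def argument_completeness_score(predicted: list[dict], expected: list[dict]) -> float:
    """Fraction of required argument names (taken from the ground truth) that
    are present in the matching predicted call; argument values are ignored."""
    predicted_by_name = {call.get("name"): call for call in predicted}
    required, present = 0, 0
    for gold in expected:
        pred_args = predicted_by_name.get(gold["name"], {}).get("arguments", {})
        for arg_name in gold.get("arguments", {}):
            required += 1
            if arg_name in pred_args:
                present += 1
    return present / required if required else 1.0
```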


Argument Value Correctness (AVC) measures whether the values assigned to arguments are correct. This metric evaluates the semantic accuracy of the function calls, assuming the argument names are correct.
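
Continuing the same sketch, argument value correctness can be approximated by comparing values only for arguments whose names were predicted correctly; strict equality is an illustrative choice, and a real implementation might use a more lenient comparison.

```python
def argument_value_correctness(predicted: list[dict], expected: list[dict]) -> float:
    """Among correctly named arguments, the fraction whose values match the
    ground truth exactly (strict equality is an illustrative choice)."""
    predicted_by_name = {call.get("name"): call for call in predicted}
    compared, correct = 0, 0
    for gold in expected:
        pred_args = predicted_by_name.get(gold["name"], {}).get("arguments", {})
        for arg_name, gold_value in gold.get("arguments", {}).items():
            if arg_name in pred_args:
                compared += 1
                if pred_args[arg_name] == gold_value:
                    correct += 1
    return correct / compared if compared else 1.0
```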


Correct Ratio measures the proportion of outputs that are exactly correct, with no partial credit. An output is only considered correct if it matches the ground truth perfectly.
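
A simple sketch of this metric over a test set, treating unparsable outputs as incorrect and the parsed call list as order-sensitive (both illustrative assumptions):

```python
def correct_ratio(predictions: list, references: list) -> float:
    """Share of samples whose parsed output matches the ground truth exactly;
    no partial credit, and unparsable outputs (None) count as incorrect."""
    exact = sum(
        1 for pred, ref in zip(predictions, references)
        if pred is not None and pred == ref
    )
    return exact / len(references) if references else 0.0
```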


Task Accuracy (F1-score) allows for partial credit, measuring how close the output is to being correct. This metric better captures incremental improvements in model performance.
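
One plausible formulation of such a partial-credit F1 scores each sample over (function name, argument name, argument value) triples; this is a sketch of the idea rather than the paper's exact definition.

```python
import json

def task_accuracy_f1(predicted: list[dict], expected: list[dict]) -> float:
    """F1 over (function name, argument name, argument value) triples, so a
    mostly correct call still earns partial credit."""
    def triples(calls):
        return {
            (call.get("name"), arg, json.dumps(value, sort_keys=True))
            for call in calls
            for arg, value in call.get("arguments", {}).items()
        }
    pred, gold = triples(predicted), triples(expected)
    if not pred and not gold:
        return 1.0
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```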



Findings

Model performance across different metrics and configurations, with and without prompt injection.

(PI = prompt injection)

Metric | Model | Zero-shot (w/o PI) | Zero-shot (w/ PI) | Few-shot (w/o PI) | Few-shot (w/ PI) | Finetuned (w/o PI) | Finetuned (w/ PI)
JSON Parsability | Deepseek-coder-1.3B-instruct | 0.0734 | 0.0140 | 0.8938 | 0.7268 | 0.9944 | 0.9906
JSON Parsability | Phi-3-mini-4k-instruct | 0.0000 | 0.0000 | 0.0000 | 0.0084 | 0.9962 | 0.9939
JSON Parsability | Phi-2 | 0.0000 | 0.0000 | 0.0000 | 0.0002 | 0.0000 | 0.0000
JSON Parsability | Starcoder2-3B | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
JSON Parsability | Stable-code-3B | 0.0000 | 0.0000 | 0.0058 | 0.0060 | 0.0000 | 0.0000
Task Accuracy | Deepseek-coder-1.3B-instruct | 0.0111 | 0.0527 | 0.5565 | 0.4289 | 0.8543 | 0.8404
Task Accuracy | Phi-3-mini-4k-instruct | 0.0000 | 0.0000 | 0.0000 | 0.0012 | 0.8727 | 0.8598
Task Accuracy | Phi-2 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
Task Accuracy | Starcoder2-3B | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
Task Accuracy | Stable-code-3B | 0.0000 | 0.0000 | 0.0011 | 0.0009 | 0.0000 | 0.0000
Correct Ratio | Deepseek-coder-1.3B-instruct | 0.0470 | 0.0098 | 0.4680 | 0.3384 | 0.8074 | 0.7866
Correct Ratio | Phi-3-mini-4k-instruct | 0.0000 | 0.0000 | 0.0010 | 0.0008 | 0.8328 | 0.8210
Correct Ratio | Phi-2 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
Correct Ratio | Starcoder2-3B | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
Correct Ratio | Stable-code-3B | 0.0000 | 0.0000 | 0.0010 | 0.0006 | 0.0000 | 0.0000
FSP | Deepseek-coder-1.3B-instruct | 0.0139 | 0.0723 | 0.8846 | 0.7209 | 0.9918 | 0.9859
FSP | Phi-3-mini-4k-instruct | 0.0000 | 0.0000 | 0.0000 | 0.0031 | 0.9936 | 0.9901
FSP | Phi-2 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
FSP | Starcoder2-3B | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
FSP | Stable-code-3B | 0.0000 | 0.0000 | 0.0017 | 0.0016 | 0.0000 | 0.0000
ACS | Deepseek-coder-1.3B-instruct | 0.0699 | 0.0136 | 0.8404 | 0.6783 | 0.9664 | 0.9596
ACS | Phi-3-mini-4k-instruct | 0.0000 | 0.0000 | 0.0000 | 0.0031 | 0.9700 | 0.9652
ACS | Phi-2 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
ACS | Starcoder2-3B | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
ACS | Stable-code-3B | 0.0000 | 0.0000 | 0.0017 | 0.0016 | 0.0000 | 0.0000
AVC | Deepseek-coder-1.3B-instruct | 0.0129 | 0.0614 | 0.6811 | 0.5488 | 0.9065 | 0.8973
AVC | Phi-3-mini-4k-instruct | 0.0000 | 0.0000 | 0.0000 | 0.0015 | 0.9172 | 0.9087
AVC | Phi-2 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
AVC | Starcoder2-3B | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
AVC | Stable-code-3B | 0.0000 | 0.0000 | 0.0012 | 0.0010 | 0.0000 | 0.0000

Note: Deepseek-coder-1.3B-instruct is the best-performing model in every zero-shot and few-shot configuration, while Phi-3-mini-4k-instruct performs best in every finetuned configuration, across all six metrics.


Discussion of Findings

  • SLMs generally struggle to generate accurate function calls in a zero-shot setting. Most models fail to adhere to the required output format, leading to unparsable and incorrect responses. Deepseek-Coder stands out as the only model producing some JSON parsable responses, though these are very limited. Prompt injection experiments show a slight performance decline, but the impact is minimal due to the already low baseline performance.
  • A significant performance improvement is observed in the few-shot setting, although several models still show little gain even when examples are provided. In this setting, prompt injection leads to a more noticeable decline in performance.
  • Finetuned models demonstrate considerably better performance compared to both zero-shot and few-shot approaches. These models are more capable of accurately selecting function parameters, resulting in higher task accuracy and correct ratios. Notably, prompt injection has a relatively minor effect on the performance of finetuned models.

For a more comprehensive analysis, detailed experimental setup, and discussion of future directions, please refer to our full paper.