Small Models, Big Tasks

An Exploratory Empirical Study on Small Language Models for Function Calling

IIIT-Hyderabad
EASE 2025


Abstract

Function calling is a complex task with widespread applications in domains such as information retrieval, software engineering, and automation. For example, a query to book the shortest flight from New York to London on January 15 requires identifying the correct parameters to generate an accurate function call. Large Language Models (LLMs) can automate this process, but they are computationally expensive and impractical in resource-constrained settings. In contrast, Small Language Models (SLMs) can operate efficiently, offering faster response times, lower computational demands, and enhanced privacy, making them potential candidates for function calling on edge devices. In this exploratory empirical study, we evaluate the efficacy of SLMs in generating function calls across diverse domains using zero-shot, few-shot, and fine-tuning approaches, both with and without prompt injection. We analyze the model responses across a range of metrics that capture various aspects of function call generation, and we additionally run experiments on an edge device to assess latency and memory usage, providing useful insights into practical applicability. Our findings show that while SLMs improve from zero-shot to few-shot and perform best with fine-tuning, they struggle significantly to adhere to the required output format. Prompt injection experiments further indicate that the models are generally robust and exhibit only a slight decline in performance. While SLMs demonstrate potential for the function call generation task, our results also highlight areas that need further refinement before they are ready for real-time use.



Function Calling


Zero-shot Example

Task Instruction: System prompt and description of the function call generation task

Tools Description: Description of all the tools available to complete the user query

Format Instruction: Required structure and format for the generated output

User Query
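
To make this structure concrete, a zero-shot prompt of this kind could be assembled roughly as follows. This is a minimal sketch, not the exact template used in the study; the helper name and field layout are illustrative assumptions.

```python
import json

def build_zero_shot_prompt(task_instruction: str, tools: list[dict],
                           format_instruction: str, user_query: str) -> str:
    """Assemble a zero-shot prompt: task instruction, tool descriptions,
    output-format instruction, and the user query (illustrative template)."""
    tools_description = "\n".join(json.dumps(tool, indent=2) for tool in tools)
    return (
        f"{task_instruction}\n\n"
        f"Available tools:\n{tools_description}\n\n"
        f"{format_instruction}\n\n"
        f"User query: {user_query}"
    )
```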

Few-shot Example

Task Instruction: System prompt and description of function call generation task

Tools Description: Description of all the tools available to complete the user query

Format Instruction: Required structure and format for the generated output

Sample Examples: Three query-response examples of the function call generation task


User Query
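
A few-shot prompt extends the same layout by inserting worked query-response pairs before the final user query. The sketch below assumes three demonstrations stored as dictionaries with `query` and `response` keys; the exact formatting used in the study may differ.

```python
import json

def build_few_shot_prompt(task_instruction: str, tools: list[dict],
                          format_instruction: str, examples: list[dict],
                          user_query: str) -> str:
    """Same structure as the zero-shot prompt, with sample query/response
    pairs (three in this study) placed before the final user query."""
    tools_description = "\n".join(json.dumps(tool, indent=2) for tool in tools)
    demonstrations = "\n\n".join(
        f"Query: {ex['query']}\nResponse: {ex['response']}" for ex in examples
    )
    return (
        f"{task_instruction}\n\n"
        f"Available tools:\n{tools_description}\n\n"
        f"{format_instruction}\n\n"
        f"Examples:\n{demonstrations}\n\n"
        f"User query: {user_query}"
    )
```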

Prompt Injection Example

Task Instruction: System prompt and description of function call generation task

Tools Description: Description of all the tools available to complete the user query

Format Instruction: Required structure and format for the generated output

User Query + injected random string: l3aq - a" <: 11|E > 3vn8IsdeF$rnJQ&
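
The prompt-injection variant appends a random character string to the user query, as in the example above, to probe robustness to noisy input. The length and character set in the sketch below are illustrative assumptions, not the exact procedure from the paper.

```python
import random
import string

def inject_noise(user_query: str, length: int = 32) -> str:
    """Append a random character string to the user query before the prompt
    is built, mimicking the injected suffix shown above."""
    alphabet = string.ascii_letters + string.digits + string.punctuation + " "
    noise = "".join(random.choices(alphabet, k=length))
    return f"{user_query} {noise}"
```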



LLM Function Calling Metrics

Evaluating the hierarchical structure of function calls

JSON Parsability measures whether the model's output follows valid JSON structure. This ensures the syntactic correctness of the output, which is essential for downstream processing.
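
A minimal check for this metric might look like the following sketch, which simply attempts to parse the raw model output; it assumes the model is instructed to emit its function calls as JSON.

```python
import json

def is_json_parsable(model_output: str) -> bool:
    """Return True if the raw model output is syntactically valid JSON."""
    try:
        json.loads(model_output)
        return True
    except json.JSONDecodeError:
        return False
```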


Function Selection Performance (FSP) measures the model's ability to choose the correct functions from the available set. It focuses on whether the model correctly identifies which functions to call, regardless of the arguments.
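
One plausible way to score function selection, assuming each parsed call is a dictionary with a `name` field and the ground truth is represented the same way (an assumption of this sketch, not the paper's exact implementation), is to compare the sets of predicted and expected function names:

```python
def function_selection_score(predicted: list[dict], expected: list[dict]) -> float:
    """Fraction of ground-truth function names that also appear among the
    predicted calls, ignoring arguments entirely (illustrative scoring)."""
    predicted_names = {call.get("name") for call in predicted}
    expected_names = {call["name"] for call in expected}
    if not expected_names:
        return 1.0
    return len(predicted_names & expected_names) / len(expected_names)
```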


Argument Completeness Score (ACS) evaluates whether all required argument names are present in the function calls. This metric focuses on structural correctness of arguments, regardless of their values.
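
Under the same assumed representation (calls as dictionaries with `name` and `arguments` fields), argument completeness could be sketched as the fraction of required argument names that appear in the matching predicted call, regardless of their values:

```python
def argument_completeness_score(predicted: list[dict], expected: list[dict]) -> float:
    """Fraction of required argument names (taken from the ground truth) that
    are present in the matching predicted call; argument values are ignored."""
    predicted_by_name = {call.get("name"): call for call in predicted}
    required, present = 0, 0
    for gold in expected:
        pred_args = predicted_by_name.get(gold["name"], {}).get("arguments", {})
        for arg_name in gold.get("arguments", {}):
            required += 1
            if arg_name in pred_args:
                present += 1
    return present / required if required else 1.0
```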


Argument Value Correctness (AVC) measures whether the values assigned to arguments are correct. This metric evaluates the semantic accuracy of the function calls, assuming the argument names are correct.
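
Continuing the same sketch, argument value correctness can be approximated by comparing values only for arguments whose names were predicted correctly; strict equality is an illustrative choice, and a real implementation might use a more lenient comparison.

```python
def argument_value_correctness(predicted: list[dict], expected: list[dict]) -> float:
    """Among correctly named arguments, the fraction whose values match the
    ground truth exactly (strict equality is an illustrative choice)."""
    predicted_by_name = {call.get("name"): call for call in predicted}
    compared, correct = 0, 0
    for gold in expected:
        pred_args = predicted_by_name.get(gold["name"], {}).get("arguments", {})
        for arg_name, gold_value in gold.get("arguments", {}).items():
            if arg_name in pred_args:
                compared += 1
                if pred_args[arg_name] == gold_value:
                    correct += 1
    return correct / compared if compared else 1.0
```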


Correct Ratio measures the proportion of outputs that are exactly correct, with no partial credit. An output is only considered correct if it matches the ground truth perfectly.
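
A simple sketch of this metric over a test set, treating unparsable outputs as incorrect and the parsed call list as order-sensitive (both illustrative assumptions):

```python
def correct_ratio(predictions: list, references: list) -> float:
    """Share of samples whose parsed output matches the ground truth exactly;
    no partial credit, and unparsable outputs (None) count as incorrect."""
    exact = sum(
        1 for pred, ref in zip(predictions, references)
        if pred is not None and pred == ref
    )
    return exact / len(references) if references else 0.0
```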


Task Accuracy (F1-score) allows for partial credit, measuring how close the output is to being correct. This metric better captures incremental improvements in model performance.
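
One plausible formulation of such a partial-credit F1 scores each sample over (function name, argument name, argument value) triples; this is a sketch of the idea rather than the paper's exact definition.

```python
import json

def task_accuracy_f1(predicted: list[dict], expected: list[dict]) -> float:
    """F1 over (function name, argument name, argument value) triples, so a
    mostly correct call still earns partial credit."""
    def triples(calls):
        return {
            (call.get("name"), arg, json.dumps(value, sort_keys=True))
            for call in calls
            for arg, value in call.get("arguments", {}).items()
        }
    pred, gold = triples(predicted), triples(expected)
    if not pred and not gold:
        return 1.0
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```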



Findings

Model performance across different metrics and configurations, with and without prompt injection.

(PI = prompt injection)

Metric | Model | Zero-shot (w/o PI) | Zero-shot (w/ PI) | Few-shot (w/o PI) | Few-shot (w/ PI) | Finetuned (w/o PI) | Finetuned (w/ PI)
JSON Parsability | Deepseek-coder-1.3B-instruct | 0.0734 | 0.0140 | 0.8938 | 0.7268 | 0.9944 | 0.9906
JSON Parsability | Phi-3-mini-4k-instruct | 0.0000 | 0.0000 | 0.0000 | 0.0084 | 0.9962 | 0.9939
JSON Parsability | Phi-2 | 0.0000 | 0.0000 | 0.0000 | 0.0002 | 0.0000 | 0.0000
JSON Parsability | Starcoder2-3B | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
JSON Parsability | Stable-code-3B | 0.0000 | 0.0000 | 0.0058 | 0.0060 | 0.0000 | 0.0000
Task Accuracy | Deepseek-coder-1.3B-instruct | 0.0111 | 0.0527 | 0.5565 | 0.4289 | 0.8543 | 0.8404
Task Accuracy | Phi-3-mini-4k-instruct | 0.0000 | 0.0000 | 0.0000 | 0.0012 | 0.8727 | 0.8598
Task Accuracy | Phi-2 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
Task Accuracy | Starcoder2-3B | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
Task Accuracy | Stable-code-3B | 0.0000 | 0.0000 | 0.0011 | 0.0009 | 0.0000 | 0.0000
Correct Ratio | Deepseek-coder-1.3B-instruct | 0.0470 | 0.0098 | 0.4680 | 0.3384 | 0.8074 | 0.7866
Correct Ratio | Phi-3-mini-4k-instruct | 0.0000 | 0.0000 | 0.0010 | 0.0008 | 0.8328 | 0.8210
Correct Ratio | Phi-2 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
Correct Ratio | Starcoder2-3B | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
Correct Ratio | Stable-code-3B | 0.0000 | 0.0000 | 0.0010 | 0.0006 | 0.0000 | 0.0000
FSP | Deepseek-coder-1.3B-instruct | 0.0139 | 0.0723 | 0.8846 | 0.7209 | 0.9918 | 0.9859
FSP | Phi-3-mini-4k-instruct | 0.0000 | 0.0000 | 0.0000 | 0.0031 | 0.9936 | 0.9901
FSP | Phi-2 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
FSP | Starcoder2-3B | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
FSP | Stable-code-3B | 0.0000 | 0.0000 | 0.0017 | 0.0016 | 0.0000 | 0.0000
ACS | Deepseek-coder-1.3B-instruct | 0.0699 | 0.0136 | 0.8404 | 0.6783 | 0.9664 | 0.9596
ACS | Phi-3-mini-4k-instruct | 0.0000 | 0.0000 | 0.0000 | 0.0031 | 0.9700 | 0.9652
ACS | Phi-2 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
ACS | Starcoder2-3B | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
ACS | Stable-code-3B | 0.0000 | 0.0000 | 0.0017 | 0.0016 | 0.0000 | 0.0000
AVC | Deepseek-coder-1.3B-instruct | 0.0129 | 0.0614 | 0.6811 | 0.5488 | 0.9065 | 0.8973
AVC | Phi-3-mini-4k-instruct | 0.0000 | 0.0000 | 0.0000 | 0.0015 | 0.9172 | 0.9087
AVC | Phi-2 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
AVC | Starcoder2-3B | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
AVC | Stable-code-3B | 0.0000 | 0.0000 | 0.0012 | 0.0010 | 0.0000 | 0.0000

Note: Deepseek-coder-1.3B-instruct is the best-performing model in every zero-shot and few-shot configuration, while Phi-3-mini-4k-instruct performs best in every finetuned configuration, across all six metrics.


Discussion of Findings

  • SLMs generally struggle to generate accurate function calls in a zero-shot setting. Most models fail to adhere to the required output format, leading to unparsable and incorrect responses. Deepseek-Coder stands out as the only model producing some JSON parsable responses, though these are very limited. Prompt injection experiments show a slight performance decline, but the impact is minimal due to the already low baseline performance.
  • A significant performance improvement is observed in the few-shot setting, although several models still show little gain even when examples are provided. In this setting, prompt injection leads to a more noticeable decline in performance.
  • Finetuned models demonstrate considerably better performance compared to both zero-shot and few-shot approaches. These models are more capable of accurately selecting function parameters, resulting in higher task accuracy and correct ratios. Notably, prompt injection has a relatively minor effect on the performance of finetuned models.

For a more comprehensive analysis, detailed experimental setup, and discussion of future directions, please refer to our full paper.