Enabling Small Language Models as Efficient and Capable Agents


Date

2026-05-21

Publisher

Virginia Tech

Abstract

Most agentic systems today are built around large language models (LLMs) accessed through proprietary APIs (for example, GPT, Claude, and Gemini), which raises concerns about cost, latency, and privacy. This thesis argues that small language models (SLMs), typically under 30 billion parameters, can serve as efficient and capable agents when paired with the right system design choices. The work proceeds as a sequence of five studies that together support this case. We begin with ThinkSLM, a study of 72 small models across 17 reasoning tasks, which shows that training methodology and data quality drive reasoning more than parameter count. This motivates Debate, Train, Evolve (DTE), a self-evolution framework that turns multi-agent debate traces into reinforcement learning signals, improving small-model reasoning without ground-truth supervision and matching or surpassing the multi-agent system at single-model inference cost. The limits we observe in DTE prompt a closer look at how models allocate compute, leading to our overthinking analysis (LLMThinkBench), which shows that reasoning-trained models often produce around 18 times more tokens on basic math while achieving lower accuracy. To investigate whether benchmark results reflect memorization, we develop BeyondBench, a contamination-resistant evaluation framework that algorithmically generates problems from combinatorial spaces of more than 10^15 instances across 44 tasks and 117 variations. Evaluating 101 models shows that hard-suite language-only performance remains low for many strong models, such as Gemini-2.5-Pro at 56.21%, while tool-augmented GPT-5 reaches 71.68% on the same suite, suggesting that agentic capabilities are a useful complement to raw scale. We synthesize these insights into EffGen, an open-source agentic framework built from the ground up for SLMs.
EffGen contributes prompt optimization that compresses context by 70–80%, complexity-based routing, task decomposition into parallel and sequential subgraphs, a unified three-tier memory system, and the first unified implementation of the MCP, A2A, and ACP protocols. Across 13 benchmarks, EffGen consistently outperforms LangChain, AutoGen, and Smolagents in success rate, latency, and memory use. Together, these results show that with the right system design, small models combined with tools, memory, and intelligent orchestration can perform competitively with much larger models on a meaningful range of tasks. The contribution of this thesis is to identify the regimes in which SLMs are effective, to characterize where they break, and to provide an open framework that lets practitioners deploy them responsibly.

Keywords

Small Language Models, Reasoning, Benchmarking, Agentic Systems, Language Model Evaluation, EffGen, BeyondBench, Self-Evolution, Multi-Agent Debate, Tool Use
