Enabling Small Language Models as Efficient and Capable Agents
| dc.contributor.author | Srivastava, Gaurav | en |
| dc.contributor.committeechair | Wang, Xuan | en |
| dc.contributor.committeemember | Thomas, Christopher Lee | en |
| dc.contributor.committeemember | Ramakrishnan, Narendran | en |
| dc.contributor.committeemember | Vu, Tu | en |
| dc.contributor.department | Computer Science & Applications | en |
| dc.date.accessioned | 2026-05-22T08:00:26Z | en |
| dc.date.available | 2026-05-22T08:00:26Z | en |
| dc.date.issued | 2026-05-21 | en |
| dc.description.abstract | Most agentic systems today are built around large language models (LLMs) accessed through proprietary APIs (for example, GPT, Claude, and Gemini), which raises concerns about cost, latency, and privacy. This thesis argues that small language models (SLMs), typically under 30 billion parameters, can serve as efficient and capable agents when paired with the right system design choices. The work proceeds as a sequence of five studies that together support this case. We begin with ThinkSLM, a study of 72 small models across 17 reasoning tasks, which shows that training methodology and data quality drive reasoning more than parameter count. This motivates Debate, Train, Evolve (DTE), a self-evolution framework that turns multi-agent debate traces into reinforcement learning signals, improving small-model reasoning without ground-truth supervision and matching or surpassing the multi-agent system at single-model inference cost. The limits we observe in DTE prompt a closer look at how models allocate compute, leading to our overthinking analysis (LLMThinkBench), which shows that reasoning-trained models often produce around 18 times more tokens on basic math while achieving lower accuracy. To investigate memorization further, we develop BeyondBench, a contamination-resistant evaluation framework that algorithmically generates problems from combinatorial spaces of more than 10^15 instances across 44 tasks and 117 variations. Evaluating 101 models shows that hard-suite language-only performance remains low for many strong models, such as Gemini-2.5-Pro at 56.21%, while tool-augmented GPT-5 reaches 71.68% on the same suite, suggesting that agentic capabilities are a useful complement to raw scale. We synthesize these insights into EffGen, an open-source agentic framework built from the ground up for SLMs. EffGen contributes prompt optimization that compresses context by 70-80%, complexity-based routing, task decomposition into parallel and sequential subgraphs, a unified three-tier memory system, and the first unified implementation of the MCP, A2A, and ACP protocols. Across 13 benchmarks, EffGen consistently outperforms LangChain, AutoGen, and Smolagents in success rate, latency, and memory use. Together, these results show that with the right system design, small models combined with tools, memory, and intelligent orchestration can perform competitively with much larger models on a meaningful range of tasks. The contribution of this thesis is to identify the regimes in which SLMs are effective, to characterize where they break, and to provide an open framework that lets practitioners deploy them responsibly. | en |
| dc.description.abstractgeneral | Artificial intelligence systems powered by language models have become highly capable, but the largest and most powerful systems are expensive to run, slow to respond, and raise privacy concerns because they often require sending personal data to outside servers. This thesis asks a simple question: can smaller, more affordable language models reason well enough to be useful, and if not, how can we build systems that make them effective? The work proceeds through five connected studies. We first measured how well 72 small models reason across a wide range of tasks and found that how a model is trained matters more than how large it is. Building on this finding, we developed a method that allows small models to improve their own reasoning by debating with copies of themselves and learning from the result, without requiring human-provided answers. We then observed that some models trained for deeper thinking overthink simple problems, generating much longer answers while in some cases producing incorrect ones. We next constructed a new way to test reasoning that prevents models from succeeding through recall of memorized answers. Across more than a hundred models, this test showed that pure language-based reasoning encounters a ceiling on hard problems, and that the gap is closed not by enlarging models but by providing them with tools such as calculators and code execution. Finally, we incorporated these findings into EffGen, an open-source software framework that enables small language models to function as capable AI assistants by combining them with tools, memory, and intelligent task management. Our work demonstrates that, with appropriate system design, smaller models combined with tools, memory, and intelligent task management can be competitive with much larger models across a useful range of tasks. The contribution of this thesis is to identify where small models perform well, where they fail, and how to build systems around them responsibly. | en |
| dc.description.degree | Master of Science | en |
| dc.format.medium | ETD | en |
| dc.identifier.other | vt_gsexam:46340 | en |
| dc.identifier.uri | https://hdl.handle.net/10919/143129 | en |
| dc.language.iso | en | en |
| dc.publisher | Virginia Tech | en |
| dc.rights | In Copyright | en |
| dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | en |
| dc.subject | Small Language Models | en |
| dc.subject | Reasoning | en |
| dc.subject | Benchmarking | en |
| dc.subject | Agentic Systems | en |
| dc.subject | Language Model Evaluation | en |
| dc.subject | EffGen | en |
| dc.subject | BeyondBench | en |
| dc.subject | Self-Evolution | en |
| dc.subject | Multi-Agent Debate | en |
| dc.subject | Tool Use | en |
| dc.title | Enabling Small Language Models as Efficient and Capable Agents | en |
| dc.type | Thesis | en |
| thesis.degree.discipline | Computer Science & Applications | en |
| thesis.degree.grantor | Virginia Polytechnic Institute and State University | en |
| thesis.degree.level | masters | en |
| thesis.degree.name | Master of Science | en |