Steps Toward Open-ended Reasoning and Discovery with Language Models

Date

2026-06-16

Publisher

Virginia Tech

Abstract

Scientific discovery, the process of distilling nature's complexity into compact, transferable knowledge, has historically relied on human creativity, expertise, and intuition. Recent advances in large language models (LLMs), trained on vast amounts of scientific literature, raise a fundamental question: can these systems move beyond recovering existing knowledge to participate meaningfully in discovery? This thesis investigates this question across four research directions, progressively developing the capabilities necessary for open-ended discovery. The first study shows that effective discovery systems require both broad scientific knowledge and systematic search. We introduce LLM-SR, a framework for scientific model discovery that combines LLM knowledge with evolutionary search, where LLMs guide the mutation and crossover of candidate hypotheses. Our results show that LLM-SR substantially outperforms state-of-the-art baselines. The second study examines limitations in current evaluations of LLM-driven discovery. We show that many benchmarks overestimate discovery capabilities because their tasks are contaminated by training data. To address this, we introduce LLM-SRBench, a multi-domain benchmark for scientific model discovery whose synthetic, novel components test models beyond memorization. Results on LLM-SRBench show a significant performance drop across existing methods, highlighting the importance of rigorous evaluation protocols for discovery. The third study investigates the role of adaptation. While humans continuously learn and adjust when facing unfamiliar environments, most existing LLM-based systems rely primarily on their pretrained knowledge during the search process. Motivated by recent advances in test-time training and reinforcement learning, we introduce DecAEvolve, a framework that enables models to adapt dynamically during evolutionary search using feedback obtained from the environment. We show that DecAEvolve substantially improves performance in out-of-distribution settings, establishing adaptation as a core requirement for discovery. The fourth and final study examines the role of exploration and diversity. We find that current LLM-based discovery systems often converge to narrow regions of the hypothesis space, limiting creativity and hindering the discovery of stronger solutions in open-ended tasks. To address this, we introduce EvoDiverse, a framework that promotes diversity during evolutionary search. Across multiple scientific discovery tasks, EvoDiverse enables broader exploration and uncovers more promising regions of the search space, highlighting the importance of systematic exploration in open-ended discovery. Taken together, the results of this thesis suggest that LLMs can become effective engines of discovery when equipped with principled search, rigorous evaluation, continuous adaptation, and diversity-preserving exploration. We believe these four properties together define the path toward open-ended reasoning and discovery with language models.

Keywords

large language models, reasoning, open-endedness, scientific discovery, evolutionary search, adaptation, exploration, test-time training
