Evaluating the Impact of Automated Labeling on Retrieval Instability in Neural IR

Date

2025-07-13

Publisher

ACM

Abstract

Effective information retrieval (IR) depends on accurate relevance classification, but when the relevance criteria are subjective or underspecified, small variations in classification can cause consequential shifts in retrieval results. Such variability becomes critical when institutions use IR for research assessment: retrieval instability can cause relevant literature to be overlooked, hindering a comprehensive understanding of the research landscape and potentially undermining the validity of subsequent analyses and decisions. We investigate this problem in the context of the United Nations Sustainable Development Goals (SDGs), a global framework for addressing environmental, social, and economic challenges. Scholarly research is vital for understanding, implementing, and monitoring SDG progress. Universities report SDG-related research to demonstrate impact, and international rankings incorporate SDG alignment into their evaluations, influencing funding, policy, and institutional strategy. However, the nuanced nature of the SDGs makes it difficult to define what constitutes an SDG contribution [1]. The Boolean queries and controlled vocabularies commonly used for SDG retrieval cannot reliably distinguish substantive, semantically relevant contributions from mere term occurrences. In prior work, Large Language Models (LLMs) have been used to filter Boolean search results in systematic reviews by scoring documents for relevance to a specific information need [2], and other studies demonstrate that LLMs can generate high-quality relevance labels for IR evaluation [4]. This prompted an investigation into using LLMs to judge SDG contribution through relevance filtering, which revealed variability in the judgments made by different LLMs on the same set of documents [3]. This observation suggests that the classification behavior of LLMs is sensitive to the specific parameters inherent to each model. In this study, we prompt multiple LLMs to judge the SDG relevance of abstracts retrieved with Boolean queries. Abstracts judged relevant are used as positive training examples for fine-tuning multi-label SDG classifiers. We use these classifiers to simulate retrieval, applying fixed scoring functions so that fluctuations in ranking stability can be attributed to the differing LLM relevance judgments. Our goal is to analyze how upstream inconsistencies in LLM-derived relevance judgments manifest as variations in retrieval outcomes, providing a novel lens for investigating ranking stability under classification uncertainty. This research centers on three key questions. RQ1: How do different LLMs diverge in their filtering decisions, and what effect does this divergence have on ranking stability in retrieval systems trained on the filtered data? RQ2: Can divergence in labeling decisions be systematically explained or predicted from document content? RQ3: What distinguishes documents where LLMs disagree on relevance, and can these differences be predicted from lexical or surface-level features? Using SDG classification as a case study of subjective relevance, we evaluate retrieval stability under classification uncertainty and address broader concerns about the reproducibility of LLM-based classification pipelines and their downstream effects.
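
To make the filtering step concrete, the following is a minimal sketch and not the study's actual code: it assumes a generic chat-completion wrapper (ask_llm), an illustrative YES/NO prompt, and a hypothetical SDG label, keeping only the abstracts a given LLM judges relevant so they can serve as positive training examples.

    # Sketch only (assumptions throughout): filter Boolean-retrieved abstracts with an
    # LLM relevance judgment before using them as positives for classifier fine-tuning.
    # `ask_llm` stands in for any provider-specific chat-completion client; the prompt
    # wording and the SDG label below are illustrative, not taken from the study.

    PROMPT = (
        "Does the following abstract make a substantive contribution to {sdg}? "
        "Answer YES or NO.\n\nAbstract: {abstract}"
    )

    def ask_llm(model: str, prompt: str) -> str:
        """Hypothetical wrapper around a chat-completion API; returns the model's reply."""
        raise NotImplementedError("plug in the provider-specific client here")

    def filter_relevant(model: str, sdg: str, abstracts: dict[str, str]) -> set[str]:
        """Keep only the document IDs the given LLM judges relevant to the SDG."""
        kept = set()
        for doc_id, text in abstracts.items():
            reply = ask_llm(model, PROMPT.format(sdg=sdg, abstract=text))
            if reply.strip().upper().startswith("YES"):
                kept.add(doc_id)
        return kept

    # Positive sets from two different LLMs can then be compared, and each set used
    # to fine-tune its own multi-label SDG classifier:
    # positives_a = filter_relevant("model-a", "SDG 13 (Climate Action)", abstracts)
    # positives_b = filter_relevant("model-b", "SDG 13 (Climate Action)", abstracts)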
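
Once each LLM-filtered set has been used to fine-tune its own classifier and produce a ranking under the same fixed scoring function, ranking stability can be quantified by comparing the resulting runs. The sketch below is an assumed illustration of such a comparison, using top-k overlap and Kendall's tau; the metric choice, run format, and document IDs are not taken from the study.

    # Sketch only: quantify ranking stability between two retrieval runs that differ
    # solely in which LLM supplied the relevance labels used for fine-tuning.
    from scipy.stats import kendalltau

    def overlap_at_k(run_a, run_b, k=10):
        """Fraction of documents shared by the top-k of two ranked lists."""
        return len(set(run_a[:k]) & set(run_b[:k])) / k

    def rank_correlation(run_a, run_b):
        """Kendall's tau over the documents retrieved by both runs."""
        shared = [d for d in run_a if d in set(run_b)]
        ranks_a = [run_a.index(d) for d in shared]
        ranks_b = [run_b.index(d) for d in shared]
        tau, _ = kendalltau(ranks_a, ranks_b)
        return tau

    # Hypothetical ranked lists of document IDs from the two classifiers.
    run_llm_a = ["d3", "d1", "d7", "d2", "d9", "d5"]
    run_llm_b = ["d1", "d3", "d2", "d7", "d5", "d8"]

    print(overlap_at_k(run_llm_a, run_llm_b, k=5))   # agreement of the top-5 sets
    print(rank_correlation(run_llm_a, run_llm_b))    # agreement of the orderings

Low overlap or low tau between such runs would indicate that upstream disagreement in LLM relevance judgments propagates into unstable retrieval outcomes.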
