Optical Character Recognition for Pre-Digital Historical Documents using Large Language Models
Abstract
Multi-modal large language models (LLMs) with vision capabilities have been shown to be a promising and effective approach to extracting text from images, that is, to optical character recognition (OCR). However, OCR is challenging when applied to scanned pre-digital historical documents. Factors such as poor scan quality due to the age of the documents, varying levels of fading, scan skew (left or right), and abnormal or poor background-to-text contrast can all contribute to incorrect OCR results. Performing OCR on scanned pre-digital historical documents makes them machine-readable, thus enabling computational analysis and preservation. This can lead to a better understanding of the past, especially for significant events and time periods. Given the text extraction capabilities of LLMs with vision, we posit that they are a viable option for performing robust OCR on scanned pre-digital historical documents. To test this, we chose a set of foundational and capable OCR technologies to compare against LLMs with vision, and we curated a ground truth dataset of scanned pre-digital historical documents from the early twentieth century on which to compare them. We evaluated each OCR technology using the character error rate (CER), BLEU score, multiple ROUGE scores, and the Normalized Levenshtein Distance (NLD). Our experiments showed that LLMs with vision, specifically Mistral AI’s Mistral-Small-3.1-24B-Instruct-2503 model and Allen AI’s olmOCR-7B-0225-preview, perform OCR on our dataset very well. Mistral had better overall results but showed the potential for large error (2 cases out of 359 in our dataset), while olmOCR performed almost as well and was more consistent in avoiding high error. These results support the use of LLMs with vision to perform OCR on scans of pre-digital historical documents with challenging characteristics.
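To make two of the evaluation metrics named above concrete, the following is a minimal sketch of how CER and NLD are commonly computed from the Levenshtein edit distance. The abstract does not give the exact definitions used in the report, so the normalizations here are assumptions: CER divides by the length of the ground truth, and NLD divides by the length of the longer string. The function names are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate (assumed definition): edit distance divided
    by the ground-truth length. Can exceed 1.0 when the OCR output is
    much longer than the reference."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def nld(reference: str, hypothesis: str) -> float:
    """Normalized Levenshtein distance (assumed definition): edit
    distance divided by the longer string's length, so it lies in [0, 1]."""
    longest = max(len(reference), len(hypothesis))
    if longest == 0:
        return 0.0
    return levenshtein(reference, hypothesis) / longest
```

Under these assumed definitions, an OCR output that appends a long run of spurious text to an otherwise correct transcription can push CER above 1.0 while NLD stays bounded by 1.0, which is one reason to report both.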