Mathematical Expression Detection and Segmentation in Document Images

Author: Bruce, Jacob Robert
Dates: 2014-03-20; 2014-03-20; 2014-03-19
Identifier: vt_gsexam:2315
URI: http://hdl.handle.net/10919/46724
Type: Thesis (ETD)
Rights: In Copyright
Keywords: document layout analysis; optical character recognition; mathematical expression detection and segmentation; document image; type-specific layout analysis

Abstract: Various document layout analysis techniques are employed to enhance the accuracy of optical character recognition (OCR) in document images. Type-specific document layout analysis involves localizing and segmenting specific zones in an image so that they may be recognized by specialized OCR modules. Zones of interest include titles, headers/footers, paragraphs, images, mathematical expressions, chemical equations, musical notations, tables, and circuit diagrams, among others. False positive and false negative detections, oversegmentations, and undersegmentations made during the detection and segmentation stage will confuse a specialized OCR system and thus may result in garbled, incoherent output. In this work a mathematical expression detection and segmentation (MEDS) module is implemented and then thoroughly evaluated. The module is fully integrated with the open-source OCR software Tesseract and is designed to function as a component of it. Evaluation is carried out on freely available public domain images so that future and existing techniques may be objectively compared.