End-To-End Text Detection Using Deep Learning
dc.contributor.author | Ibrahim, Ahmed Sobhy Elnady | en |
dc.contributor.committeechair | Abbott, A. Lynn | en |
dc.contributor.committeemember | Huang, Bert | en |
dc.contributor.committeemember | Stilwell, Daniel J. | en |
dc.contributor.committeemember | Hussein, Mohamed E. | en |
dc.contributor.committeemember | Huang, Jia-Bin | en |
dc.contributor.department | Electrical and Computer Engineering | en |
dc.date.accessioned | 2017-12-20T09:00:16Z | en |
dc.date.available | 2017-12-20T09:00:16Z | en |
dc.date.issued | 2017-12-19 | en |
dc.description.abstract | Text detection in the wild is the problem of locating text in images of everyday scenes. It is a challenging problem due to the complexity of everyday scenes. This problem is of great importance to many trending applications, such as self-driving cars. Previous research in text detection has been dominated by multi-stage sequential approaches, which suffer from many limitations, including error propagation from one stage to the next. Another line of work is the use of deep learning techniques. Some of the deep methods used for text detection are box detection models and fully convolutional models. Box detection models suffer from the nature of the annotations, which may be too coarse to provide detailed supervision. Fully convolutional models learn to generate pixel-wise maps that represent the location of text instances in the input image. These models suffer from the inability to create accurate word-level annotations without heavy post-processing. To overcome these problems, we propose a novel end-to-end system based on a mix of novel deep learning techniques. The proposed system consists of an attention model, based on a new deep architecture proposed in this dissertation, followed by a deep network based on Faster-RCNN. The attention model produces a high-resolution map that indicates likely locations of text instances. A novel aspect of the system is an early fusion step that merges the attention map directly with the input image prior to word-box prediction. This approach suppresses but does not eliminate contextual information from consideration. Progressively larger models were trained in three separate phases. The resulting system has demonstrated an ability to detect text under difficult conditions related to illumination, resolution, and legibility. The system has exceeded the state of the art on the ICDAR 2013 and COCO-Text benchmarks with F-measure values of 0.875 and 0.533, respectively. | en |
dc.description.abstractgeneral | Text detection and recognition in the wild is the problem of locating and reading text in images of everyday scenes. Text detection refers to finding the bounding boxes that describe the location of text areas in an input image, while text recognition refers to generating a transcript from the detected text areas. Recognition can be viewed as simply Optical Character Recognition (OCR). OCR is an old problem for which the developed models are considered mature. Text detection and recognition are challenging problems due to the complexity of everyday scenes, compared to the simpler problem of recognizing text in scanned documents. This problem is of great importance to many trending applications that need to locate and read text in the wild, such as self-driving cars. Researchers tend to focus only on the text detection problem because research on text recognition is mature. Previous research in text detection has been dominated by multi-stage sequential approaches. Those methods suffer from many limitations, including error propagation from the earlier stages to the later stages of the pipeline. Another line of work is the use of deep learning techniques. Deep learning is the state of the art in machine learning. It has demonstrated great success in many domains, including computer vision. Some of the deep methods used for text detection are box detection models and fully convolutional models. Box detection models learn to generate bounding box coordinates for text instances that exist in the input image. Box detection models suffer from the nature of the annotations, which may be too coarse to provide detailed supervision. Fully convolutional models learn to generate pixel-wise maps that represent the location of text instances in the input image. These models suffer from the inability to create accurate word-level annotations without heavy post-processing.
To overcome these problems, we propose a novel end-to-end system based on a mix of novel deep learning techniques. The proposed system consists of an attention model followed by a network based on Faster-RCNN that has been conditioned to generate word-box predictions. The attention model produces a high-resolution map that indicates likely locations of text instances. A novel aspect of the system is an early fusion step that merges the attention map directly with the input image prior to word-box prediction. This approach suppresses but does not eliminate contextual information from consideration, and avoids the common problem of discarding small text regions. To facilitate training of the end-to-end system, progressively larger models were trained in three separate phases. The resulting system has demonstrated an ability to detect text under difficult conditions related to illumination, resolution, and legibility. The system has exceeded the state of the art on the well-known ICDAR 2013 and COCO-Text benchmarks. For the former, the system has produced results with an F-measure value of 0.875. For the more challenging COCO-Text dataset, the system has shown a dramatic increase in performance, with an F-measure value of 0.533, as compared to previously reported values in the range of 0.33 to 0.37. In order to build a powerful system, we introduced a novel deep learning architecture that achieved impressive performance on standard benchmarks. This architecture has been used as a backbone for the proposed attention model. A description of the proposed end-to-end system, as well as the implementation steps, will be detailed in the following sections. | en |
dc.description.degree | Ph. D. | en |
dc.format.medium | ETD | en |
dc.identifier.other | vt_gsexam:13267 | en |
dc.identifier.uri | http://hdl.handle.net/10919/81277 | en |
dc.publisher | Virginia Tech | en |
dc.rights | In Copyright | en |
dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | en |
dc.subject | Deep learning (Machine learning) | en |
dc.subject | Computer Vision | en |
dc.subject | Text Detection | en |
dc.title | End-To-End Text Detection Using Deep Learning | en |
dc.type | Dissertation | en |
thesis.degree.discipline | Computer Engineering | en |
thesis.degree.grantor | Virginia Polytechnic Institute and State University | en |
thesis.degree.level | doctoral | en |
thesis.degree.name | Ph. D. | en |