CENTER FOR MACHINE PERCEPTION
Scene text localization and recognition in images and videos
CZECH TECHNICAL UNIVERSITY IN PRAGUE
A Doctoral Thesis presented to the Faculty of Electrical Engineering of the Czech Technical University in Prague in fulfilment of the requirements for the Ph.D. Degree in Study Programme No. P2612 Electrical Engineering and Information Technology, branch No. 3902V035 - Artificial Intelligence and Biocybernetics, by
Ing. Lukáš Neumann
May 2017
Available at http://cmp.felk.cvut.cz/~neumalu1/thesis.pdf
ISSN 1213-2365
DOCTORAL THESIS
Thesis Advisor Prof. Ing. Jiří Matas, Ph.D.
Published by Center for Machine Perception, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University, Technická 2, 166 27 Prague 6, Czech Republic, fax +420 2 2435 7385, phone +420 2 2435 7637, www: http://cmp.felk.cvut.cz
Scene text localization and recognition in images and videos, Ing. Lukáš Neumann, May 2017
Abstract

Scene Text Localization and Recognition methods find all areas in an image or a video that would be considered as text by a human, mark the boundaries of the areas and output a sequence of characters associated with their content. They are used to process images and videos taken by a digital camera or a mobile phone and to “read” the content of each text area into a digital format, typically a list of Unicode character sequences, that can be processed in further applications.

Three different methods for Scene Text Localization and Recognition were proposed in the course of the research, each one advancing the state of the art and improving the accuracy. The first method detects individual characters as Extremal Regions (ER), where the probability of each ER being a character is estimated using novel features with O(1) complexity, and only ERs with locally maximal probability are selected across several image projections for the second stage, where the classification is improved using more computationally expensive features. The method was the first published method to address the complete problem of scene text localization and recognition as a whole - all previous work in the literature focused solely on different subproblems.

Secondly, a novel easy-to-implement stroke detector was proposed. The detector is significantly faster and produces significantly fewer false detections than the commonly used ER detector. The detector efficiently produces character stroke segmentations, which are exploited in a subsequent classification phase based on features effectively calculated as part of the segmentation process. Additionally, an efficient text clustering algorithm based on text direction voting is proposed, which, as well as the previous stages, is scale- and rotation-invariant and supports a wide variety of scripts and fonts.

The third method exploits a deep-learning model, which is trained for both text detection and recognition in a single trainable pipeline. The method localizes and recognizes text in an image in a single feed-forward pass, it is trained purely on synthetic data so it does not require obtaining expensive human annotations for training, and it achieves state-of-the-art accuracy in end-to-end text recognition on two standard datasets, whilst being an order of magnitude faster than previous methods - the whole pipeline runs at 10 frames per second.
Abstrakt

Reading text is among the current open problems of computer vision and artificial intelligence. Methods for text localization and recognition in images automatically find all text regions in a photograph or a video, determine their position and recognize their textual content as a sequence of characters. These methods process images and videos taken by an ordinary camera or a mobile phone and “read” the content of every detected text region into a digital format, which can easily be used in subsequent applications.

Three methods for text localization and recognition were proposed in the course of the research, each of which advanced the state of the art in this field and improved the overall recognition accuracy. The first method detects characters as Extremal Regions (ER), where the probability that a given region represents a character is efficiently estimated using features with constant computational complexity, and only regions with locally maximal probability are selected for the second stage of the algorithm, in which the classification is further improved using more computationally expensive features. The method is the first published method worldwide that addresses the complete problem of both text localization and recognition - all previously published methods focused only on separate subproblems.

The second method introduces a newly designed character detector, which is significantly faster and generates significantly fewer false detections than the commonly used ER detector. The detector also produces segmentations of individual characters, which are then directly exploited in the recognition. Individual characters are joined into words by an efficient text clustering algorithm, which, like all the previous steps, is invariant to scale and rotation and supports a wide range of languages and fonts.

The third method exploits a new deep neural network architecture trained for both text localization and recognition. The training uses synthetic data, which removes the need for expensive and time-consuming manual annotation. The method localizes and recognizes all text in an image in a single pass and achieves the best accuracy worldwide on two standard datasets. At the same time, it is an order of magnitude faster than previous methods - the algorithm localizes and recognizes text at 10 frames per second.
Acknowledgement

I am especially grateful to my advisor Jiří Matas for introducing me to computer vision and for attracting my interest towards science. I would not be able to pursue my PhD degree without his support, ideas and motivation. I would also like to thank Michal Bušta for his insights, for helping me with certain aspects of the code and for ultimately taking over the code maintenance and implementation.

I would like to thank my CMP colleagues Ondra Chum, Michal Perdoch, Tomáš Vojíř, Andrej Mikulík and Jan Šochman for their comments, ideas and for helping me through my PhD studies, and I would also like to thank Yash Patel for reviewing the text of the thesis on such a short notice. I also would like to acknowledge the support of the Google PhD Fellowship and the Google Research Award, which made my PhD studies possible.

Last but not least, I am grateful to my loving wife Martina (and to my son Kryštof, although he did not have much say about all this) for supporting me in the pursuit of my academic career and for putting up with the late nights, weekend work and the travel.
Contents

1 Introduction 3
  1.1 The Objective 3
  1.2 Contributions 5
  1.3 Publications 6
  1.4 Authorship 7

2 Related Work 8
  2.1 Text Localization 8
    2.1.1 Sliding-window Methods 10
    2.1.2 Region-based Methods 11
    2.1.3 Convolutional Neural Networks 13
  2.2 Text Segmentation 14
  2.3 Cropped Character Recognition 17
  2.4 Cropped Word Recognition 18
  2.5 Word Spotting 21

3 Datasets 24
  3.1 Chars74K Dataset 24
  3.2 ICDAR 2003 Dataset 24
  3.3 ICDAR 2011 Dataset 25
  3.4 ICDAR 2015 Dataset 26
  3.5 Street View Text Dataset 27
  3.6 COCO-Text Dataset 28
  3.7 MJSynth Dataset 28
  3.8 SynthText Dataset 29
  3.9 IIIT Datasets 30
  3.10 MSRA-TD500 Dataset 30
  3.11 KAIST Dataset 30
  3.12 NEOCR Dataset 30

4 Text Recognition by Extremal Regions 31
  4.1 Character Detection 31
    4.1.1 Extremal Regions 31
    4.1.2 Incrementally Computable Descriptors 32
    4.1.3 Sequential Classifier 35
  4.2 Text Line Formation 39
    4.2.1 Character Recognition 42
    4.2.2 Sequence Selection 43
  4.3 Experiments 47
    4.3.1 Character Detector 47
    4.3.2 ICDAR 2013 Dataset 47
    4.3.3 Street View Text Dataset 51
    4.3.4 ICDAR 2015 Competition 53

5 Efficient Unconstrained Scene Text Detector 56
  5.1 FASText Keypoint Detector 56
  5.2 Keypoint Segmentation 59
  5.3 Segmentation Classification 59
    5.3.1 Character Strokes Area 60
    5.3.2 Approximate Character Strokes Area 62
  5.4 Text Clustering 64
  5.5 Experiments 65
    5.5.1 Character Detection 65
    5.5.2 Text Localization and Recognition 67

6 Single Shot Text Detection and Recognition 70
  6.1 Fully Convolutional Network 71
  6.2 Region Proposals 71
  6.3 Bilinear Sampling 73
  6.4 Text Recognition 74
  6.5 Training 75
  6.6 Experiments 77
    6.6.1 ICDAR 2013 dataset 79
    6.6.2 ICDAR 2015 dataset 80

7 Results 81
  7.1 Applications 81
    7.1.1 TextSpotter 81
    7.1.2 Mobile Application for Translation 81
  7.2 Available Code 82
    7.2.1 FASText Detector 82
    7.2.2 ER Detector 82

8 Conclusion 84

Bibliography 85
1 Introduction

1.1 The Objective

Methods for Scene Text Localization and Recognition find all areas in an image or a video that would be considered as text by a human, mark the boundaries of the text areas and output a sequence of characters associated with their content (see Figure 1.1). They are used to process images and videos taken by a digital camera or a mobile phone and to “read” the content of each text area into a digital format, typically a list of Unicode character sequences, that can be processed in further applications.
Figure 1.1 The Scene Text Localization and Recognition problem. All legible areas in an image with text are output as sequences of characters, which can be easily processed by a computer
Scene Text Localization and Recognition is an open and complex problem of computer vision, because text typically occupies only a small fraction of the image, it has a non-uniform background, it suffers from noise, blur, occlusions and reflections, and perspective effects need to be taken into account. Moreover, real-world texts are often short snippets written in different fonts and languages, text alignment does not follow the strict rules of printed documents and many words are proper names, which prevents an effective use of a dictionary. All the aforementioned factors make the Scene Text Localization and Recognition problem significantly harder, and standard printed document recognition (OCR) methods and applications cannot be used [99, 37] (see Figure 1.2).

Standard computer vision object detection methods cannot be directly exploited for text either, because their output is an object class and position (e.g. a “plane” or a “cow”), but detecting “text” as a class without its content is not very useful (although it is already quite a challenging problem, see Section 2.1). One could try to overcome this issue by representing each possible character sequence as its own separate class, but such an approach can only be applied when all the sequences (words) to be detected are known in advance [134, 49]. Moreover, this approach does not generalize to arbitrary text detection and recognition, because the space of text content is exponential - given an alphabet A and a maximal text length L, there can be up to |A|^L different text classes.

The goal of the thesis is to advance the state of the art by introducing methods specialized in text localization and recognition, without constraining the methods to a particular domain of text, a font, a script, or a type of scene the text is captured
Figure 1.2 The difference between printed document OCR (left) and Scene Text Localization and Recognition (right)
Figure 1.3 An aid for visually impaired people. Any detected text is automatically read out loud using a speech synthesizer
in. Methods for scene text localization and recognition are very useful, because text captures important information and cues about the scene and they are also an important component of general image understanding, as text is one of the most frequent classes in the real world. For example, in the COCO dataset for object detection, there are over 173 000 text instances, which makes it the most frequent class [129]. There are also many possible applications of scene text localization and recognition, with some particular examples listed below.

Helping the visually impaired. It is hard or even impossible for visually impaired or elderly people to read text, but text often provides crucial information which sometimes cannot be obtained in any other way - for example, the only way to distinguish between two different types of drugs can be just by reading the label on their packaging. Linking the scene text recognition algorithm with a speech synthesizer and deploying it on a mobile phone gives an easy-to-use and affordable assistive tool, which is able to help by automatically detecting any text in the video stream and reading it out loud (see Figure 1.3). The method must not only be accurate but also efficient, because text has to be automatically detected directly from the live camera stream, as it is not reasonable to expect that visually impaired people would take pictures of text.

Urban navigation. Navigation in urban environments or inside buildings cannot rely on position information from GPS, so alternative methods have to be exploited. One of the options is using business labels and other signs as one of the cues to determine the position, and a scene text recognition algorithm is therefore required to transform the surrounding textual information into a form which can be further processed. The same applies to self-driving cars, where GPS and maps are available, but they may not include temporary signs, which are often text-based.
Automated translation. One of the issues in machine translation applications is the user input. Traditionally, a user has to type in the word/phrase that should be translated; however, this can be quite slow, error-prone and sometimes close to impossible, if the user is not familiar with the alphabet of the text which he is trying to translate (for example a European tourist in China trying to translate Chinese text). Automated scene text recognition overcomes the issue by automatically recognizing text in a picture taken by the user, possibly with guidance from the user to select which areas of the image to translate, effectively eliminating the need for any manual input (see Section 7.1.2).

Indexing and searching image databases by textual content. Arbitrary image or video databases can be automatically indexed by their textual content so that users can search by text queries, which is the most common input to search engines. A user could find his favorite restaurant in a database such as Google Street View using just a restaurant’s name, he could find a movie poster on Flickr or he could find an interview with a certain person just by searching a TV news video archive.
1.2 Contributions

The main contributions of this thesis towards extending the state of the art in scene text localization and recognition are

1. An end-to-end method joining both text localization and text recognition into a single pipeline is proposed (see Chapter 4). Our method [88] was the first one to address this problem as a whole, thus allowing practical applications - all previous work focused solely on different subproblems.

2. Character detection is posed as an efficient sequential selection from the set of Extremal Regions (see Section 4.1.1). The ER detector is robust to blur, illumination, color and texture variation and handles low-contrast text better than the standard MSER detector [80]. The newly introduced features, which are incrementally computed (see Section 4.1.2), allow the method to run in real time [91], which was a significant improvement over previous methods.

3. We propose a novel easy-to-implement stroke detector (see Section 5.1), which is significantly faster and produces significantly fewer false detections than the detectors commonly used by scene text localization methods. Following the observation that text in virtually any script is formed of strokes, stroke keypoints are efficiently detected and then exploited to obtain stroke segmentations.

4. The concept of text fragments is introduced (see Section 5.3), where a text fragment can be a single character, a group of characters, a whole word or a part of a character, which allows dropping the common assumption of region-based methods that one region corresponds to a single character. A novel Character Strokes Area is introduced, effectively approximating the “strokeness” of text fragments, which plays an important role in the discrimination between text and background clutter.

5. A lexicon-free text recognition method trained purely on synthetic data is presented (see Section 4.2.1). The method does not rely on any prior knowledge of the text to be detected (lexicon), unlike the majority of the methods in the literature (see Section 2.5). The proposed method also showed that synthetic data can be successfully exploited for training scene text recognition, which is an idea that has been successfully built upon and extended by several authors [45, 41].

6. The proposed method is amongst the first ones to detect and recognize text in scene videos. Combined with the standard FoT tracker [131], it set a baseline for end-to-end video text recognition (see Section 4.3.4) and it outperformed all other participants. To our knowledge, no other method has published better results in end-to-end video text recognition.

7. A deep model which is trained for both text detection and recognition in a single learning framework is presented (see Chapter 6). The model is trained for both text detection and recognition in a single end-to-end pass and it outperforms the combination of state-of-the-art localization and state-of-the-art recognition methods.
1.3 Publications

This thesis builds on the results previously published in the following publications:

• A method for text localization and detection, ACCV 2010 [88] - the paper introduces a first version of the end-to-end text detection and localization pipeline, initially based on the MSER [80] detector.

• Real-Time Scene Text Localization and Recognition, CVPR 2012 [91] - an efficient Extremal Region classifier, detailed in Section 4.1.3, is introduced, allowing the whole text detection and recognition pipeline to run in real time.

• On Combining Multiple Segmentations in Scene Text Recognition, ICDAR 2013 [92] - the paper describes the sequence selection algorithm to generate the final output, detailed in Section 4.2.2. The paper received the ICDAR 2013 Best Student Paper Award.

• Efficient Scene Text Localization and Recognition with Local Character Refinement, ICDAR 2015 [94] - the paper introduces the concept of text fragments and the Character Strokes Area feature, described in Section 5.3.1. The paper received the ICDAR 2015 Best Paper Award.

• Real-time Lexicon-free Scene Text Localization and Recognition, TPAMI [95] - the journal paper describes the text detection and recognition pipeline of Chapter 4.

• FASText: Efficient Unconstrained Scene Text Detector, ICCV 2015 [17] - the paper introduces the FASText scene text detector, described in Chapter 5.

• Deep TextSpotter: An End-to-End Trainable Scene Text Localization and Recognition Framework [18] - the paper presents the Single Shot Text Detection and Recognition pipeline, described in Chapter 6.
The following publications were not included in the thesis, in order to keep the thesis more focused and easier to follow:

• Estimating hidden parameters for text localization and recognition, CVWW 2011 [89]

• Text Localization in Real-world Images using Efficiently Pruned Exhaustive Search, ICDAR 2011 [90]

• Scene Text Localization and Recognition with Oriented Stroke Detection, ICCV 2013 [93]

• ICDAR 2015 competition on robust reading, ICDAR 2015 [52]

• COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images, WACV 2016 [129]
1.4 Authorship

I hereby certify that the results presented in this thesis were achieved during my own research, in cooperation with my thesis advisor Jiří Matas. The work presented in Chapter 5, Chapter 6 and Section 7.1.2 is a joint work with Michal Bušta, who contributed to the ideas and experiments presented in these sections.
2 Related Work

In the research community, the problem of Scene Text Recognition was originally broken down into several sub-tasks, because the complete scene text recognition problem was deemed too complex to solve. The split into these subtasks was also motivated by the traditional printed document OCR pipeline, which had already been successful in printed text recognition [72], so the idea was to only replace the initial stages of the printed document pipeline to make it work for scene text as well.
2.1 Text Localization

Historically, the main focus of the scene text research community was given to the text localization problem, because it is the first stage of any scene text recognition pipeline and it was shown to be already very challenging, even without a subsequent recognition phase. In the text localization task, given an image I, a method must find all rectangular areas (x, y, w, h) which a human (annotator) would consider as text

I → {(x_i, y_i, w_i, h_i)},  x_i, y_i, w_i, h_i ∈ ℕ    (2.1)
where x and y denote the top-left corner co-ordinates and w (h) denotes the width (resp. the height) of the rectangle. The quality of the output is measured by comparing the set of detected rectangles D = {D_i} = {(x_i, y_i, w_i, h_i)} with the set of rectangles provided by a human annotator G = {G_i}. The standard evaluation protocol, as proposed by Wolf and Jolion [139], defines the “recall” r and the “precision” p as

r = ( Σ_{D_i ∈ D} m(D_i, G) ) / |G|    (2.2)

p = ( Σ_{G_i ∈ G} m(G_i, D) ) / |D|    (2.3)
where m(r, R) is a “match function” whose value depends on whether the rectangle r is matched to one or more rectangles in the rectangle set R. In the ICDAR Robust Reading competition [118, 53, 52], text is annotated on the word level¹, so methods are expected to output bounding boxes of individual words, but detecting multiple words with a single rectangle or detecting a single word with multiple rectangles should not be penalized as heavily as not detecting the word at all. This requirement therefore resulted in the following match function definition

m(r, R) =
  1    if ∃! t ∈ R : σ(r, t) ≥ 0.8 ∧ τ(r, t) ≥ 0.4                                    (one-to-one match)
  0.8  if ∃ S′ ⊂ R : Σ_{s_i ∈ S′} σ(r, s_i) ≥ 0.8 ∧ ∀ s_i ∈ S′ : τ(r, s_i) ≥ 0.4       (one-to-many match)
  0.8  if ∃ S′ ⊂ R : ∀ s_i ∈ S′ : σ(r, s_i) ≥ 0.8 ∧ Σ_{s_i ∈ S′} τ(r, s_i) ≥ 0.4       (many-to-one match)
  0    otherwise
                                                                                       (2.4)

¹ Even though the competition and the dataset work solely with word bounding boxes, the definition of a word is not given. For example, it is not clear whether a phone number is a single word or multiple words, or whether hyphens should break a word into two.
Figure 2.1 Text localization output example. Given an image I (left), two rectangular areas (4, 522, 207, 583), (230, 521, 471, 582) are detected (right).
Figure 2.2 The standard text localization protocol [139] does not penalize loose bounding boxes (a) or partial detections which could change the meaning of the word (b) - all detections illustrated above would achieve 100% recall and 100% precision. Detections denoted in red, ground truth in green
Figure 2.3 The text localization protocol [139] penalizes errors more in the vertical direction - the detection on the left could still be correctly recognized in the OCR stage, yet it achieves 0% recall and precision, whilst the detection on the right achieves 100% recall and precision, but it is impossible to get a correct recognition since a part of the word is missed completely. Detections denoted in red, ground truth in green
where σ and τ denote the rectangle “recall” and “precision” area ratios

σ(x, y) = area(x ∩ y) / area(y)    (2.5)

τ(x, y) = area(x ∩ y) / area(x)    (2.6)
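To make the protocol concrete, the following is a minimal Python sketch (an illustration, not the official evaluation code) of the area ratios σ and τ and of the recall and precision of equations (2.2)-(2.6); rectangles are assumed to be axis-aligned (x, y, w, h) tuples and, for brevity, only the one-to-one case of the match function (2.4) is implemented.

```python
def intersection_area(a, b):
    """Overlap area of two axis-aligned rectangles given as (x, y, w, h)."""
    iw = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    return iw * ih

def sigma(r, t):                       # eq. (2.5)
    return intersection_area(r, t) / float(t[2] * t[3])

def tau(r, t):                         # eq. (2.6)
    return intersection_area(r, t) / float(r[2] * r[3])

def match(r, rect_set):
    """One-to-one case of the match function m(r, R) of eq. (2.4)."""
    hits = [t for t in rect_set if sigma(r, t) >= 0.8 and tau(r, t) >= 0.4]
    return 1.0 if len(hits) == 1 else 0.0

def recall_precision(detections, ground_truth):
    """Recall (2.2) and precision (2.3) of a set of detections."""
    r = sum(match(d, ground_truth) for d in detections) / len(ground_truth)
    p = sum(match(g, detections) for g in ground_truth) / len(detections)
    return r, p
```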
Although this protocol has been heavily used in the literature as well as in the recent Robust Reading competitions [118, 53, 52], there are several inherent problems in it. The protocol motivates methods to over-estimate the text area, since the precision threshold for a successful detection is quite permissive (τ(r, t) ≥ 0.4) and therefore the protocol does not penalize bounding boxes which are only loosely around text (see Figure 2.2a). Equally, the required area overlap of 80% with the ground truth (σ(r, t) ≥ 0.8) does not guarantee that the whole word is detected, and the undetected 20% of the word can completely change the word’s meaning (see Figure 2.2b). The dependence purely on the area ratio without any regard for the direction of text implies that for longer words the protocol penalizes errors more in the vertical direction, which is completely counter-intuitive - a tight word bounding box in the vertical direction can be interpreted as not a match, even though all characters inside it are still readable, whereas not detecting a part of the word in the horizontal direction is still considered a valid match (see Figure 2.3), as in the previous case.

These limitations of the standard protocol should be taken into account when interpreting text localization performance and when making comparisons between text localization methods, since even a method which would achieve 100% recall and 100% precision could still be impractical for subsequent scene text recognition, and vice versa, a method with lower localization recall and precision can still achieve better overall recognition performance.

Localizing text in an image can be a computationally very expensive task, as in general any of the 2^N pixel subsets can correspond to text (where N is the number of pixels). Existing methods for general text localization can be categorized into two major groups - methods based on a sliding window (Section 2.1.1) and methods based on connected components (Section 2.1.2). The CNN-based methods (Section 2.1.3) also in principle fall into the first group.
2.1.1 Sliding-window Methods

The “sliding-window” methods use a window which is moved over the image and the presence of text is estimated on the basis of local image features, an approach successfully applied in generic object detection [130]. While these methods are generally more robust to noise in the image as the features are aggregated over a larger image area, their computational complexity is high because of the need to search with many rectangles of different sizes, aspect ratios, rotations and perspective distortions, an effect that does not occur in generic object detection.

Chen and Yuille [20] use an AdaBoost classifier [116] combining mean intensity features, intensity variance features, derivative features, histogram features and features based on edge linking. A variant of Niblack’s adaptive binarization algorithm [97] is then used to obtain a segmentation. The method is computationally expensive (it takes 3 seconds to process a 2MPix image), it requires manual annotation of many subwindows for training purposes (see Figure 2.4) and its localization performance is not clear (a standard evaluation protocol was not used).
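The basic multi-scale sliding-window scheme shared by the methods in this section can be sketched in a few lines of Python; the window classifier below is a stand-in (a hypothetical scoring function, not any of the trained AdaBoost or SVM classifiers discussed here).

```python
import cv2
import numpy as np

def sliding_window_detections(gray, classify_window, win=24, stride=8,
                              scales=(1.0, 0.75, 0.5), threshold=0.5):
    """Run a window classifier over an image pyramid.

    `classify_window` is a stand-in for a trained text/non-text classifier;
    it maps a win x win patch to a score. Returns candidate boxes
    (x, y, w, h) in original image coordinates.
    """
    boxes = []
    for s in scales:
        scaled = cv2.resize(gray, None, fx=s, fy=s)
        for y in range(0, scaled.shape[0] - win + 1, stride):
            for x in range(0, scaled.shape[1] - win + 1, stride):
                patch = scaled[y:y + win, x:x + win]
                if classify_window(patch) >= threshold:
                    boxes.append((int(x / s), int(y / s),
                                  int(win / s), int(win / s)))
    return boxes

# Toy usage with a trivial contrast-based score instead of a trained classifier.
img = np.random.randint(0, 255, (120, 160), dtype=np.uint8)
score = lambda p: p.std() / 128.0
print(len(sliding_window_detections(img, score)))
```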
Figure 2.4 Positive training data samples used for sliding-window classifier training. Image taken from [20]

The method was improved by Pan et al. [103] by incorporating a combination of Histogram of Oriented Gradients (HOG) and multi-scale Local Binary Pattern features in the AdaBoost detection stage. Furthermore, a Markov Random Field (MRF) is employed to group segmented characters into words, in contrast to the heuristic rules applied in [20]. The method claims better localization performance than the winner of the ICDAR 2005 Text Locating Competition [76], which is not fully accurate because different evaluation units were used (words vs. lines of text). Additionally, the method suffers from high computational complexity (average processing time 1.5s on a 1MPix image).

More recently, Lee et al. [64] further improved the approach by incorporating more discriminative but also more computationally expensive features, which slightly improved text localization performance, but significantly slowed down the method (processing time is several minutes per image).

Coates et al. [23] use unsupervised machine-learning techniques independently for character detection and recognition. A 32-by-32 window is shifted over the image in multiple scales and each patch is classified as text or non-text using a linear SVM classifier. Features used by the classifier are generated automatically in the training stage using a variant of the K-means algorithm. Cropped characters are recognized by resizing them into a fixed 32-by-32 pixel window and applying the same process. The method however does not provide end-to-end text recognition as the characters are cropped manually.

Zhu and Zanibbi [149] build on top of this work by introducing multi-stage generation and validation of character detections using convolutional, geometric and contextual features, and by exploiting a confidence-weighted AdaBoost classifier to obtain a text/non-text saliency map.

2.1.2 Region-based Methods

Region-based methods exploit a bottom-up strategy for text localization. At first, certain local features are calculated for each pixel in the image and then pixels with similar feature values are grouped together using connected component analysis to form characters, assuming low variance of the used feature(s) within a single character. The methods can be scale-invariant and they inherently provide character segmentation, which can then be used in an OCR stage.
The biggest drawback is their sensitivity to noise and to low-resolution images, because they require low variance of the local features and a single pixel with a different feature value can cause the connected component analysis to fail.
Figure 2.5 Text confidence map. (a) the original image. (b) text confidence maps for the image pyramid. (c) the text confidence map for the original image. (d) the binarized image. Images taken from [104]

One of the first methods for general character localization was introduced by Ohya et al. [100]. In their method, a local adaptive thresholding in grey-scale images is used to detect candidate regions and regions with sufficient contrast are selected as characters. Li et al. [67] apply thresholding in a quantized color space and they group individual characters into text blocks by simple alignment rules. Both methods assume that characters are upright without any rotation, that the contrast is high and that the background is uniform, which may be sufficient for signs or licence plates, but not for general text localization.

Kim et al. [56] combine three independent detection channels (color continuity, edge detection and color variance) to find candidate regions. Candidate regions are grouped into blocks by size and position constraints and each block is then divided into overlapping 16 × 16 pixel subblocks, which are verified by an SVM classifier [24], exploiting a wavelet transform to generate features. If the ratio of subblocks marked as text is higher than a predefined threshold, the block is marked as text. The method is not scale-invariant because of the constant size of the subblocks used for classification and the precision is relatively bad because many small text-like patches are detected.

Takahashi and Nakajima [123] use a graph scheme, where vertices represent characters and edges represent a predecessor-successor relation in a block of text. Candidate regions are found using the Canny edge detector [21] in the CIELUV color space and regions that pass a set of heuristic constraints are considered vertices of the graph. Edges are created between neighboring vertices and each edge is assigned a weight based on spatial distance, shape and area similarity. Finally, a minimum spanning tree (MST) algorithm is applied and edges with distance or angle over a predefined threshold are removed. The method cannot handle illumination variations in foreground or background and the edge weight remains an open question.

Pan et al. [104, 105] create a text confidence map (see Figure 2.5) on a grey-scale image pyramid using a calibrated Waldboost [122] classifier with Histogram of Oriented Gradients (HOG) features. Candidate regions are detected independently using Niblack’s binarization algorithm [97] and a Conditional Random Field (CRF) [59] is employed to label the regions as text or non-text, considering the text confidence as one of the unary features. Then, a simple graph energy minimization approach is applied to form blocks of text. The method is computationally expensive because of the image pyramid and the CRF inference, and its localization performance cannot be compared to other methods (a proprietary evaluation metric was used).
Figure 2.6 Stroke Width Transform (SWT). (a) the original image converted to grey-scale. (b) stroke width estimation for each pixel. Images taken from [29]

An image operator, the Stroke Width Transform (SWT), was introduced by Epshtein et al. [29]. The SWT method finds edges using the Canny detector [19] and then estimates the stroke width for each pixel in the image (see Figure 2.6). A connected component algorithm is then applied to form pixels with similar stroke width into character candidates, which are merged into text blocks using several heuristic rules. The biggest limitation of the method is its dependency on successful edge detection, which is likely to fail on blurred or low-contrast images. The method was further improved by Yao et al. [141], where the heuristic rules for character candidate detection and text block formation are replaced by trained classifiers with rotationally-invariant features. A similar edge-based approach with a different connected component algorithm is presented in [148].

Many of the text localization methods rely on the Maximally Stable Extremal Regions (MSER) detector [80]. The idea to use MSERs for character detection (and recognition) was first exploited by Donoser et al. [28], where MSERs were classified using cross-correlation with training templates and then a standard search engine was used to select the most probable text hypothesis and correct any misspellings. Kang et al. [51] exploit Higher-Order Correlation Clustering [57] to group individual MSERs into text lines and to eliminate non-textual MSERs. Yin et al. [144], the winner of the ICDAR 2013 Robust Reading competition [53], uses a single-link clustering
algorithm with a trained distance. Later, the method was improved by using adaptive hierarchical clustering for grouping of MSERs and by supporting multi-oriented text [143]. Huang et al. [44] combine the MSER detector with a CNN classifier to distinguish between text and non-text regions and to split individual MSERs which correspond to multiple characters. Our work on text localization and recognition in Chapter 4 also builds on top of the MSER detector and its generalizations [81].
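Several of the region-based methods above can be prototyped on top of OpenCV's MSER implementation (assuming the OpenCV 3.x/4.x Python API); the sketch below extracts MSER bounding boxes as character candidates, with purely illustrative geometric filters in place of the trained classifiers used by the published methods.

```python
import cv2

def mser_character_candidates(image_bgr, min_area=30, max_area=10000):
    """Detect MSER regions and return bounding boxes of character candidates."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    mser = cv2.MSER_create()
    regions, bboxes = mser.detectRegions(gray)
    candidates = []
    for (x, y, w, h) in bboxes:
        area = w * h
        aspect = w / float(h)
        # Illustrative geometric filters only; real methods use trained classifiers.
        if min_area <= area <= max_area and 0.1 <= aspect <= 10.0:
            candidates.append((x, y, w, h))
    return candidates

# Usage: boxes = mser_character_candidates(cv2.imread("scene.jpg"))
```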
2.1.3 Convolutional Neural Networks

Tian et al. [124] combine the traditional sliding-window approach (see Section 2.1.1) with a CNN classifier [61]. Candidate regions are first detected by a sliding window, which is classified as text or non-text using a fast cascade boosting algorithm. The text line extraction is then formulated as a min-cost flow problem, where each node is a candidate region (patch) with an associated cost provided by a CNN classifier. The CNN classifier is trained on the image patches obtained by the character detector and it only has three convolutional layers and two fully-connected layers.

He et al. [42] train a CNN to classify 32 × 32 patches as text/non-text. The training exploits the multi-task learning [31] paradigm, as the optimized cost function is not simply the text/non-text label, but it also incorporates explicit pixel-level segmentation and the character label. In the testing phase, only regions detected as MSERs [80] in a contrast-enhanced image are considered for the classification by the CNN, resulting in a text saliency map.

Figure 2.7 The architecture of Tian et al. [125] employs a 3 × 3 sliding-window on the last convolutional layer as an input to a RNN, which jointly predicts the text/non-text score, the y-axis coordinates and the anchor side-refinement (a). Sample network output, color indicates the text/non-text score (b). Images taken from [125]

Gupta et al. [41] propose a fully-convolutional regression network, drawing inspiration from the You Only Look Once (YOLO) approach for object detection [108]. An image is divided into a fixed number of cells (14 × 14 in the highest resolution), where each cell is associated with 7 values directly predicting the position of text: bounding-box location (x, y), bounding-box size (w, h), bounding-box rotation (cos θ, sin θ) and text/non-text confidence (c). The values are estimated by 7 local translation-invariant predictors built on top of the first 9 convolutional layers of the popular VGG-16 architecture [120], trained on synthetic data (see Section 3.8).

Tian et al. [125] adapt the Region Proposal Networks architecture [110] by horizontally sliding a 3 × 3 window through the last convolutional map of the VGG-16 [120] and applying a Recurrent Neural Network to jointly predict the text/non-text score, the y-axis coordinates and the anchor side-refinement (see Figure 2.7). Note that the architecture expects that text is only horizontal, unlike the method of Gupta et al. [41].

Similarly, Liao et al. [70] adapt the SSD object detector [74] to detect horizontal bounding boxes. Ma et al. [78] adapt the Faster R-CNN architecture and extend it to detect text of different orientations by adding anchor boxes of 6 hand-crafted rotations and 3 aspects.
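To illustrate the dense regression formulation of Gupta et al. [41] described above, the NumPy sketch below decodes a 7-channel output grid (offsets, size, rotation and confidence per cell) into rotated boxes; the exact channel layout and normalization are assumptions made for illustration, not the parametrization of [41].

```python
import numpy as np

def decode_grid(pred, conf_thresh=0.5, img_w=448, img_h=448):
    """Decode a (7, H, W) prediction grid into rotated text boxes.

    Channels are assumed to be: x-offset, y-offset, width, height,
    cos(theta), sin(theta), text confidence - one prediction per grid cell.
    Offsets and sizes are assumed to be normalized to [0, 1].
    """
    _, grid_h, grid_w = pred.shape
    cell_w, cell_h = img_w / grid_w, img_h / grid_h
    boxes = []
    for gy in range(grid_h):
        for gx in range(grid_w):
            dx, dy, w, h, cos_t, sin_t, conf = pred[:, gy, gx]
            if conf < conf_thresh:
                continue
            cx = (gx + dx) * cell_w           # box centre in image coordinates
            cy = (gy + dy) * cell_h
            theta = np.arctan2(sin_t, cos_t)  # recover the rotation angle
            boxes.append((cx, cy, w * img_w, h * img_h, theta, float(conf)))
    return boxes

# Toy usage on a random 14x14 grid (the highest resolution mentioned above).
boxes = decode_grid(np.random.rand(7, 14, 14))
```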
to identify a sequence of characters than an individual one, especially when a character extremely ambiguous. These limitations often result in a large num2.2 Textis Segmentation ber of non-text components in character detection, causing main difficulties for The task segmentation task formulation was introduced the ICDAR 2013are Robust handling them in following steps. Furthermore, theseinfalse detections easReading competition [53], in order to obtain more fine-grained evaluation than in the ily accumulated sequentially in bottom-up pipeline, as pointed out in [28]. To text localization task (Section 2.1) andstrong in order to address of the aforementioned address these problems, we exploit deep featuressome for detecting text inforproblems with the text localization protocol. In the text segmentation task, a method mation directly in convolutional maps. We develop text anchor mechanism that must label all pixels of an image I so that pixels which a human would consider as accurately predicts text locations in fine scale. Then, an in-network recurrent belonging to text are labelled as 1 (andthese the other pixels text are labelled as 0). architecture is proposed to connect fine-scale proposals in sequences, allowing them to encode rich context information. I → {0, 1}w×h (2.7) Deep Convolutional Neural Networks (CNN) have recently advanced general object substantially [25,5,6]. The state-of-the-art method is Faster where w (h) detection is the image width (height). Region-CNN (R-CNN) system [25] where a Region Proposal[22], Network (RPN) is The evaluation protocol, proposed by Clavelli and Karatzas uses “atoms” as units for the evaluation. Atom is a connected component (see Figure 2.9), which typically corresponds to a single character, but can also correspond to a part of a character 14
2.2 Text Segmentation
Figure 2.8 Text segmentation. Given an image I (left), pixels labelled as belonging to text are marked black (right).
Figure 2.9 Text segmentation ground truth. Each atom denoted by a different color (right).
in case the character is broken into multiple segments, or to multiple characters, when characters are joined together. Given a method output, each atom in the ground truth is labelled as either Well-Segmented, Merged, Broken, Broken and Merged or Lost, based on a matching heuristic [22]. Additionally, detected components which cannot be matched to any ground truth region are marked as False Positive. The heuristic labels an atom as Well-Segmented if at least T_min = 0.5 of the ground truth skeleton pixels are part of the atom (Minimal Coverage condition), and if none of the atom pixels are further than T_max = min(5, 0.9 s_w) from the ground truth edge (Maximal Coverage condition), where s_w denotes the ground truth stroke width. The image recall r and precision p are then calculated as

r = |WS| / |G|    (2.8)

p = |WS| / |D|    (2.9)
where WS are the Well-Segmented atoms, G are the ground truth atoms and D are the detected atoms. The main drawback of the text segmentation task (and its evaluation protocol) is the dependency on pixel-level comparison, which may not always correspond to the quality one would obtain from a subsequent OCR stage. For instance, missing 30% of pixels
Figure 2.10 The text segmentation protocol [22] may not always provide good clues whether characters could be successfully processed by a subsequent OCR stage - source image (left) and sample segmentation output (right). Well-Segmented atoms in green, Broken atoms in orange
Figure 2.11 Toggle Mapping morphological segmentation [32] sample results, each connected component denoted by a different color. Images taken from [32]
(missing 30% of the skeleton) can significantly change the OCR class, as for example the character “O” can change to “U”, yet the evaluation protocol would still label the atom as Well-Segmented (see Figure 2.10). On the contrary, missing 60% of pixels when compared to the ground truth may still leave the OCR label intact. Moreover, the text segmentation task enforces certain pipeline architectures (architectures similar to traditional printed document OCR), but the segmentation step may be successfully omitted in the end-to-end setup (see Section 2.5).

All the methods listed in Section 2.1.2 inherently provide some form of text segmentation, due to their reliance on connected components. There are however several methods which focus purely on text segmentation.

Fabrizio et al. [32] use the Toggle Mapping morphological operator [117], which is a generic operator that maps a function into a set of n functions. In their method, the image (function) is mapped to a set of 2 functions (morphological erosion and morphological dilation) and the mapping is further thresholded by requiring a minimal contrast to eliminate noise in homogeneous regions.

Kumar and Ramakrishnan [58] employ Otsu binarization [101] in each of the RGB channels independently to find connected components. Then, each connected component in its respective color plane is classified as text or background based on its thinned representation, and text components are formed into text lines by a horizontal clustering algorithm which spans all three color planes.

Mancas-Thillou and Gosselin [79] study cropped word binarization by clustering in the RGB color space using the Euclidean and Cosine distances. Kim et al. [55] exploit user interaction on a mobile device to find the initial location of text and then color clustering in the HCL color space [115] is used to find initial candidate regions. The regions are then expanded in the horizontal direction using a set of heuristic rules to obtain blocks of text.

Let us note that in the last text segmentation competition [53], all methods specialized in text segmentation [32, 58] were outperformed by a generic text localization method [144], where the character segmentations are simply the output of the MSER detector [80].
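The Well-Segmented decision and the atom-level recall/precision of equations (2.8) and (2.9) reduce to a few lines of NumPy; the sketch below is a simplified illustration of the protocol of [22] (the Merged, Broken and Lost labels are omitted) and assumes the skeleton mask and the edge-distance map are precomputed.

```python
import numpy as np

def is_well_segmented(atom_mask, gt_skeleton, gt_edge_dist, stroke_width,
                      t_min=0.5):
    """Check the two Well-Segmented conditions of [22] for a single atom.

    atom_mask    - boolean mask of the detected component
    gt_skeleton  - boolean mask of the ground-truth atom skeleton
    gt_edge_dist - per-pixel distance to the closest ground-truth edge pixel
    stroke_width - ground-truth stroke width s_w
    """
    t_max = min(5.0, 0.9 * stroke_width)
    # Minimal Coverage: at least t_min of the skeleton pixels lie in the atom.
    skeleton_covered = np.logical_and(atom_mask, gt_skeleton).sum()
    minimal_ok = skeleton_covered >= t_min * gt_skeleton.sum()
    # Maximal Coverage: no atom pixel is further than t_max from the GT edge.
    maximal_ok = gt_edge_dist[atom_mask].max() <= t_max if atom_mask.any() else False
    return minimal_ok and maximal_ok

def atom_recall_precision(n_well_segmented, n_gt_atoms, n_detected_atoms):
    """Image recall and precision of equations (2.8) and (2.9)."""
    return (n_well_segmented / n_gt_atoms, n_well_segmented / n_detected_atoms)
```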
2.3 Cropped Character Recognition

The cropped character recognition is the simplest text recognition task formulation - given an image of a character, a method outputs a single character of the output alphabet A:

I → c : c ∈ A
(2.10)
Yokobayashi and Wakahara [146] binarize character images using local adaptive thresholding in one of the Cyan/Magenta/Yellow (CMY) color planes, depending on the breadth of their histogram. Then they use a normalized cross-correlation as a matching measure between the test image and a list of templates, which were synthetic images of a single font. To our knowledge they were the first ones to use synthetic data as training samples for scene text, an idea which we also exploited in Chapter 4. Li and Tan [68] calculate a Cross Ratio spectrum over the boundary of a region. The Cross Ratio is a ratio between distances of four collinear points and it remains unchanged under any perspective transformation. Comparing two Cross Ratio Spectra employs the Dynamic Time Warping algorithm and therefore has a quadratic complexity for a single comparison, which is the main drawback of the method - classifying a single character can take several minutes on a standard PC.
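As an illustration of this family of template-matching recognizers, a minimal Python sketch of normalized cross-correlation against a set of synthetic font templates follows. It is only a schematic approximation of the approach described above; the function names and the assumption of equal-sized, preprocessed inputs are hypothetical.

import numpy as np

def normalized_cross_correlation(a, b):
    # Zero-mean normalized cross-correlation of two equally sized images.
    a = a.astype(np.float64) - a.mean()
    b = b.astype(np.float64) - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def classify_character(char_img, templates):
    # `templates` maps a character label to a template image of the same size
    # as `char_img` (binarization/resizing is assumed to have happened already).
    scores = {label: normalized_cross_correlation(char_img, tmpl)
              for label, tmpl in templates.items()}
    return max(scores, key=scores.get)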
Figure 2.12 Cropped character recognition. Given an image of a character I (left), a single character is selected from the output alphabet (right).
De Campos et al. [26] evaluate six different local features - Shape Contexts [12], Geometric Blur [13], Scale Invariant Feature Transform [75], Spin Image [60], Maximum Response of Filters [127] and Patch Descriptor [128] - for individual character recognition. Together with the published Chars74k dataset (see Section 3.1), their work represents a useful baseline in the scene text recognition. Newell and Griffin [96] extend the work of De Campos et al. [26] by exploiting Histogram of Oriented Gradients (HOG) [25] in a scale-space pyramid. Lee et al. [62] learn a set of discriminative features which exploit the most informative sub-regions of each character within a multi-class classification framework.
2.4 Cropped Word Recognition

In the cropped word recognition, a method outputs a sequence of characters based on an image I of a single word which was manually cropped out by a human annotator.

I → (c1, c2, . . . , cn) : ci ∈ A
(2.11)
where A is the output alphabet (typically “A” to “Z”, sometimes lowercase “a” to “z” and “0” through “9” are added). In the evaluation, a standard case-sensitive Levenshtein distance [66] between the method output and the ground truth is calculated for each word (image). The overall method accuracy is then given by the sum of the Levenshtein distances over the whole test set.
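For illustration, a minimal Python sketch of this evaluation protocol is given below; the function names are assumptions made for the sketch, not part of any official evaluation script.

def levenshtein(a, b):
    # Standard edit distance (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def total_edit_distance(predictions, ground_truths):
    # Sum of case-sensitive edit distances over the whole test set (lower is better).
    return sum(levenshtein(p, g) for p, g in zip(predictions, ground_truths))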
Figure 2.13 Cropped word recognition. Given an image of a single word I (left), a sequence of characters is produced (right).
The cropped word recognition task gives an upper bound on the performance currently achievable in scene text recognition, but in fact it assumes that there exists a text localization method with a 100% accuracy, which is currently far from true. Moreover, since the text was localized by a human, it is not clear that such text localization is even possible without the recognition, because the human annotator could have used the actual content of the text to create the annotation for localization. In other words, it may be impossible to correctly localize text without the recognition phase. Also, there are words which cannot be correctly recognized without further context, which is unavailable when the word has already been cropped out.
Weinman et al. [135] combine lexicon, similarity and appearance information into a joint model and use Sparse Belief Propagation [102] to infer the most probable string content.
Figure 2.14 The cropped word recognition problem is formulated by Novikova et al. as a maximum a posteriori (MAP) inference in a weighted finite-state transducer [99]. Images taken from [99]

Similarly, Mishra et al. [84] use a joint CRF model [59] to combine individual character detection results with a language model (lexicon).
Novikova et al. [99] propose a probabilistic model which combines local likelihood and pairwise positional consistency priors with higher order priors that enforce consistency of characters and their attributes, such as font and colour. The word recognition is then formulated as a maximum a posteriori (MAP) inference and weighted finite-state transducers are exploited to find the optimal solution. A weighted finite-state transducer is a directed multigraph, where each vertex corresponds to a state, and each edge has a weight, an input character and an output character. To get the optimal solution, the shortest path in the transducer that accepts the optimal location sequence and produces the optimal word has to be found (see Figure 2.14).
Yao et al. [142] learn a mid-level character representation (strokelets) in a partially supervised manner, adapting the generic patch discovery algorithm of Singh et al. [121]. In the recognition stage, a sliding window is shifted over the word image, activations of different types of strokelets are tracked at each position, and the strokelet responses are binned together to form a histogram feature vector, which is classified by a Random Forest [15].
The PhotoOCR method of Bissaco et al. [14], which won the ICDAR 2013 Robust Reading competition [53], exploits a deep neural network classifier trained on 2-level HOG features [25]. Each word is first over-segmented into characters (or their fragments), each segment is classified by the NN classifier and finally Beam Search [114] with a strong N-gram language model is applied to find the optimal character sequence. The method also benefits from a large private training dataset with 10^7 training samples.
Almazán et al.
[10] find a common embedding subspace for word images and their labels, so that they are close to each other in terms of Euclidean distance in the embedding space. Each label is first represented by a pyramidal histogram of characters representation, which embeds strings into an n-dimensional space. Then, an attribute model is trained to represent the word images and Canonical Correlation Regression is applied to find the common subspace (see Figure 2.15). Gordo [39] improves the embedding framework by exploiting mid-level features for the character representations in the attribute model.
Figure 2.15 Almazán et al. find a common subspace for word images and their labels, so that they are close to each other [10]. Image taken from [10]

Lee and Osindero [63] explore several Recurrent Neural Network (RNN) models with Long Short-Term Memory (LSTM) [43], built on top of a Recursive CNN [69] with “soft” attention modelling [140] to filter features which are fed into the RNN. The model can recognize word images without any supplied lexicon (unconstrained text recognition), however the language model is strongly incorporated in the RNN parameters, which is demonstrated by random predictions on images without any text (see Figure 2.16).

Figure 2.16 The model of Lee and Osindero [63] implicitly captures the language model of the training data, which is demonstrated by its output on images without any text. Images taken from [63]

Shi et al. [119] train a fully-convolutional network with a bidirectional LSTM using the Connectionist Temporal Classification (CTC), which was first introduced by Graves et al. [40] for speech recognition to eliminate the need for pre-segmented data. Unlike the method presented in Section 6, Shi et al. [119] only recognize a single word per image (i.e. the output is always just one sequence of characters), they resize the source image to a fixed-sized 100 × 32 pixels matrix regardless of how many characters it contains and the method is significantly slower because of the LSTM layer.
2.5 Word Spotting

Due to the complexity of the general scene text recognition problem, some methods focus on a more constrained scenario when a relatively small lexicon of words is given with each image and the aim is to localize only the words present in the lexicon. This constrained scenario still has many interesting applications, such as a local navigation system for the blind (where possible names of local businesses in the area are known based on the current GPS position, but their exact location needs to be determined through a vision system). This task formulation, which we refer to as word spotting, was first introduced by Wang and Belongie [133]; in the most recent ICDAR 2015 Robust Reading competition [52], the task is referred to as “End to End”, however in this thesis we will use the term “end to end” only in the context of the general scene text recognition problem.
In the word spotting formulation, each image I is accompanied with a set of words L called the lexicon and the goal is to find all rectangular areas (x, y, w, h) whose content in the image I corresponds to a word of the lexicon.

I −L→ D = {(xi, yi, wi, hi)},  xi, yi, wi, hi ∈ N ∪ {∅}     (2.12)
L = {(c1,1, c1,2, . . . , c1,n1), (c2,1, c2,2, . . . , c2,n2), . . . , (cm,1, cm,2, . . . , cm,nm)} : ci,j ∈ A     (2.13)
Note that not all words in the lexicon are present in the image (so-called “distractor” words, denoted as (∅, ∅, ∅, ∅)), and that not all words in the image have a corresponding content in the lexicon. In the evaluation, the image recall r and precision p are calculated as

r = |M| / |G|     (2.14)
p = |M| / |D|     (2.15)
M = {di ∈ D | m(di, G) = 1}     (2.16)
where m(di, G) is a match predicate which declares whether the detection di is matched to the image ground truth G. In the most recent ICDAR competition [52], a detection di is matched to the ground truth (m(di, G) = 1) if
• the character sequence is identical (using case-insensitive comparison), and
• the Intersection-over-Union [30] of the detected and the ground truth bounding box is above 50%.
Wang and Belongie [133] detect and recognize individual characters using a multiscale sliding window approach. Histogram of Oriented Gradients (HOG) features are calculated for each window position and a nearest neighbor classifier is used to measure the distance between the patch and all character templates in all classes. Then, each word in the lexicon is considered and the cost of its character configuration is estimated using pictorial structures [33] that penalize disagreement with recognized labels and layout deformation. The method was further improved in [132], where the nearest neighbor classifier was replaced by Random Ferns [15] and an SVM classifier [24] is applied for word re-scoring in order to incorporate higher-order features of word configuration (e.g. standard deviation of character spacing).
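Putting the evaluation protocol above together (the IoU-based match predicate and the recall and precision of eqs. 2.14-2.16), a minimal Python sketch could look as follows; the greedy one-to-one matching and the function names are assumptions of this sketch, not the official competition code.

def iou(box_a, box_b):
    # Intersection-over-Union of two (x, y, w, h) bounding boxes.
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def evaluate_word_spotting(detections, ground_truth, iou_threshold=0.5):
    # `detections` and `ground_truth` are lists of (box, text) pairs.
    # A detection matches a ground-truth word if the transcription is identical
    # (case-insensitive) and the bounding box IoU is above the threshold.
    matched_gt = set()
    matched = 0
    for box, text in detections:
        for k, (gt_box, gt_text) in enumerate(ground_truth):
            if k in matched_gt:
                continue
            if text.lower() == gt_text.lower() and iou(box, gt_box) > iou_threshold:
                matched_gt.add(k)
                matched += 1
                break
    recall = matched / len(ground_truth) if ground_truth else 0.0
    precision = matched / len(detections) if detections else 0.0
    return recall, precision

Each detection is matched to at most one unmatched ground truth word, which reproduces the set M of eq. (2.16).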
Figure 2.17 Word spotting output example. Given an image I with lexicon L (left), two words of the lexicon are localized in the image (right).
(a)
(b)
(c)
Figure 2.18 Three different output encodings proposed by Jaderberg et al. [45] - dictionary encoding (a), character-at-position encoding (b) and character N-gram encoding (c). Images taken from [45]

Similarly, the method of Wang et al. [134] uses
a sliding-window approach combined with convolutional neural network classifiers. It is demonstrated that the method performs well on noisy or otherwise distorted images, however the accuracy decreases significantly with increasing size of the lexicon.
Jaderberg et al. [49] train a character-centric Convolutional Neural Network (CNN) [61], which takes a 24 × 24 image patch and predicts a text/no-text score, a character and a bigram class. The input image is scanned by the network in 16 scales and a text saliency map is obtained by taking the text/no-text output of the network. Given the saliency maps, word bounding boxes are then obtained by the run length smoothing algorithm. Finally, each word is recognized independently by taking the cropped word image and finding the optimal label sequence through the character and bigram classifier output by dynamic programming.
The method is further improved in [47], where a word-centric approach is introduced. First, horizontal bounding-box proposals are detected by aggregating the output of the standard Edge Boxes [150] and Aggregate Channel Feature [27] detectors. Each proposal is then classified by a Random Forest [15] classifier to reduce the number of false positives and its position and size is further refined by a CNN regressor, to obtain
a more suitable cropping of the detected word image. Each cropped word image is then resized to a fixed size of 32 × 100 pixels and classified as one of the words in the dictionary (see Figure 2.18a), where in their setup the dictionary contains ∼ 90 000 English words (plus words of the testing set, see Section 3.7). The classifier is trained on a dataset of 9 million synthetic word images uniformly sampled from the dictionary. As the last step, duplicate and overlapping detections are eliminated by Non-Maxima Suppression and the bounding boxes are further refined by the CNN regressor in an iterative manner to obtain the final word positioning.
The requirement of a dictionary is relaxed in [46], where an unconstrained text recognition architecture is presented. The model either outputs a character class (or an empty label) for each of the 23 possible character positions in the word (see Figure 2.18b), or it outputs a set of character N-grams, where each N-gram is a substring of the word up to 4 characters in length (see Figure 2.18c).
3 Datasets

3.1 Chars74K Dataset

The Chars74K dataset [1] was collected by de Campos et al. [26]. It contains 7705 English character images (A-Z, a-z and 0-9, in total 64 classes) and 3345 Kannada character images (647 classes), which were manually segmented from 1922 scene text images (see Figure 3.1). Additionally, the dataset contains 1416 scene images with each word annotated by a polygon and its text transcription, however not every word in the dataset is annotated.
(a)
(b)
(c) Figure 3.1 Samples from the Chars74K dataset. Individual English characters (a), Kannada characters (b) and full images (c)
3.2 ICDAR 2003 Dataset

The ICDAR 2003 dataset [77] was created by Simon Lucas and his colleagues for the ICDAR 2003 Robust Reading competition [77]. The dataset was used in an unchanged
Figure 3.2 A sample of images used in the ICDAR 2005 and ICDAR 2011 datasets. Text in the images is mostly horizontal, it occupies a large portion of an image and it typically is present in the middle of an image.
form in the ICDAR 2005 Robust Reading competition [76], which is why in the literature sometimes the dataset is also referred to as the ICDAR 2005 dataset. The dataset contains 258 training and 251 testing images with words and characters annotated by bounding boxes and their text content. 1157 word and 6185 character images (1111 word and 5430 character images) were subsequently cropped from the training (respectively testing) image set, to be used in Cropped Word Recognition (Section 2.4) and Cropped Character Recognition (Section 2.3) evaluation. The dataset was captured by people who were specifically tasked to capture text in an outdoor environment, so as a result text in the dataset is mostly horizontal, it occupies a large portion of an image and it typically is present in the middle of an image [129], since the authors of the pictures tried to capture “nice” pictures of text (see Figure 3.2).
3.3 ICDAR 2011 Dataset

The ICDAR 2011 dataset [2] was created by taking all images from the ICDAR 2003 dataset (see Section 3.2), removing images with no text, adding several new images and splitting them again into a training and a testing subset. The dataset was first used in the ICDAR 2011 Robust Reading Competition [118] and then subsequently in the 2013 competition [53], which is why it is sometimes referred to as the ICDAR 2013 dataset. In the ICDAR 2015 Robust Reading competition [52], the dataset was used again in the Focused Scene Text challenge (Challenge 2). The dataset contains 229 training and 255 testing images, with corresponding 849 training and 716 testing cropped word images. As a result of the creation process, the testing subset of the ICDAR 2011 dataset contains the same images as the training
ICDAR 2003 training set
ICDAR 2011 testing set
Figure 3.3 The same images are present in the ICDAR 2003 training set (left) and the ICDAR 2011 testing set (right), which is why training on both ICDAR 2003 and 2011 training sets and then evaluating on the ICDAR 2011 testing set is not possible - a common mistake in the literature
subset of the ICDAR 2003 dataset. This unfortunately often leads to evaluation problems in the literature, where some methods are trained on both ICDAR 2003 and 2011 training sets, falsely assuming they are different datasets, and evaluated on the 2011 testing set - but the testing set contains many images from the joint training set, and therefore the accuracy evaluation is heavily skewed (see Figure 3.3).
3.4 ICDAR 2015 Dataset

The ICDAR 2015 dataset [2] was introduced in the ICDAR 2015 Robust Reading Competition [52] to address the problems of the ICDAR 2003/2011 datasets (see Sections 3.2 and 3.3). The dataset is used in the Incidental Scene Text challenge (Challenge 4).
Figure 3.4 Sample images from the ICDAR 2015 dataset (also known as Incidental Scene Text challenge). Many realistic effects such as occlusion, perspective distortion, blur or noise are present.
The images were collected by people wearing Google Glass devices [138] while walking in Singapore, and subsequently only images with text were selected and annotated.
The images in the dataset were taken “not having text in mind”, and therefore contain a high variability of text fonts and sizes and they include many realistic effects - e.g. occlusion, perspective distortion, blur or noise (see Figure 3.4). The dataset contains 1670 images with 17548 annotated words - 1500 images are publicly available, split into training and testing set, and the remaining 170 images represent a sequestered set for future use. Each word is annotated by a quadrilateral (4 points) and its Unicode transcription, thus supporting rotated and slanted text.
3.5 Street View Text Dataset
Figure 3.5 Sample images from the Street View Text dataset.
The Street View Text (SVT) dataset [3] was published by Wang and Belongie [133], where the data was collected by asking annotators to find images with local businesses in the Google Street View application. The annotators were instructed to find a representative text associated with the business in the image, then to move the view point in the application to minimize the skew of the text and finally to save the screen shot. The dataset therefore contains mostly business names and business signs (see Figure 3.5), and the business names can typically be obtained from publicly available dictionaries by looking up businesses close to the GPS position of the image. The words in the image picked by the annotators in this process are tagged by a horizontal bounding-box and a case-insensitive transcription. Note that, as a result, only a small fraction of the text in the images is labelled. Each image is also associated with a lexicon of 50 unique words, which contains the words tagged by the annotator, as well as words from business names present near the location the image was taken. In total, the dataset contains 350 images (100 training and 250 testing images) of 20 different cities and 725 labeled words. The word annotations were also exploited to create the dataset of cropped words SVT-50, which contains 647 word images, each with a lexicon of 50 words. There is also a lexicon of all test words (4282 words), which is referred to as SVT-FULL [134].
3.6 COCO-Text Dataset

Figure 3.6 Samples from the COCO-Text dataset and their legibility classification and text category. Images taken from [129]

The COCO-Text dataset [4, 129] with its 63 686 images and 173 589 annotated words is the largest scene text dataset to date. The COCO-Text dataset was, as the name suggests, based on the MSCOCO dataset [71] and each word is associated with a horizontal bounding box, a legibility classification (legible/illegible), a text category (machine-printed, hand-written, etc.), a script (Latin/other) and a Unicode transcription in case of a Latin script. The original object categories from MSCOCO are available as well. Since the images in the dataset were not collected with text in mind, the variety of text and its hidden parameters (font, style, script, positioning, etc.) is higher than in any of the existing datasets (see Figure 3.6).

3.7 MJSynth Dataset
In order to address the problem of the small volume of training data needed to apply deep-learning methods, Jaderberg et al. [45] created a synthetic dataset of 9 million word images called MJSynth [5] (also referred to as the Synth90k dataset). The dataset is based on a set of 50 000 English words from the Hunspell dictionary [6], augmented with all their suffixes and prefixes, and all the words from the testing subsets of the ICDAR 2011 (Section 3.3), SVT (Section 3.5) and IIIT5k (Section 3.9) datasets. The need to include words from the testing set is driven by the word-centric approach to classify a word image into 1 of 90 000 classes (only words from the training dictionary can be outputted [47]), however as a result there is not a clear separation between training and testing data, and in the testing phase the system trained on the synthetic data has already “seen” all the words in the testing set, although with a different appearance.
Each word from the aforementioned dictionary is then rendered 100 times into a synthetic word image, each time randomly applying a different font, border rendering,
base colouring, perspective distortion, natural data blending (using a random patch from the SVT and ICDAR 2003 datasets) and synthetic noise (see Figure 3.7a), thus generating 9 million word images. The rendered data were then randomly divided into 7.2M training, 900k validation and 900k testing images.

Figure 3.7 The rendering process of the MJSynth dataset [45] (a). Samples from the MJSynth dataset (b). Images taken from [45]

3.8 SynthText Dataset

The SynthText dataset [7] created by Gupta et al. [41] contains 800 000 synthetically created images with text. The background image is randomly taken from a set of 8 000 background images downloaded from Google Image Search, and text (words, up to 3 lines) extracted from the Newsgroup20 dataset is rendered into the image, respecting local texture and geometry cues (see Figure 3.8). The dataset therefore contains highly realistic scene text images with full annotations, although the placement of the text does not respect priors of the real world - for example, text does not typically appear on horse hair (Figure 3.8, bottom right).

Figure 3.8 Sample images from the SynthText dataset, the ground truth depicted by red rectangles. Images taken from [41]
3.9 IIIT Datasets

The IIIT 5k Word (IIIT5k) dataset [8, 82] contains 5 000 (2 000 training and 3 000 testing) word images cropped from images with text found on the Internet. Each word image has a case-insensitive transcription and a lexicon of 50 (IIIT5k-50) or 1 000 (IIIT5k-1k) words.
The IIIT Scene Text Retrieval dataset [83] consists of 10 000 images downloaded from Flickr. There are 50 text query words, and each word is associated with a list of 10 to 50 images, which contain the word. There are also many distractor images with no text at all. Note that this dataset does not contain any text localization information, so it can only be applied for text retrieval tasks.
3.10 MSRA-TD500 Dataset

The MSRA-TD500 dataset [141] consists of 500 scene text images, split into 300 training and 200 testing samples. The dataset contains English as well as Chinese text in different orientations and text is annotated on a text-line level, where each text line is associated with a rotated rectangular bounding box.
3.11 KAIST Dataset

The KAIST dataset [65] contains 3 000 images of indoor and outdoor scenes with text. Words and characters are annotated by a bounding box, and character segmentations are provided as well. The dataset contains text in English and Korean, however not all text instances are annotated.
3.12 NEOCR Dataset

The Natural Environment OCR dataset (NEOCR) was introduced by Nagy et al. [87] and it contains 659 real world images with 5238 text line annotations. Each text line is annotated by a quadrilateral, its Unicode content and several additional attributes such as language, type face, occlusion level or noise level.
4 Text Recognition by Extremal Regions

4.1 Character Detection

4.1.1 Extremal Regions

Let us consider an image I as a mapping I : D ⊂ N² → V, where V typically is {0, . . . , 255}³ (a color image). A channel C of the image I is a mapping C : D → S where S is a totally ordered set and fc : V → S is a projection of pixel values to a totally ordered set. Let A denote an adjacency (neighborhood) relation A ⊂ D × D. In our method we consider 4-connected pixels, i.e. pixels with coordinates (x ± 1, y) and (x, y ± 1) are adjacent to the pixel (x, y).
Region R of an image I (or a channel C) is a contiguous subset of D

∀pi, pj ∈ R  ∃ pi, q1, q2, . . . , qn, pj : pi A q1, q1 A q2, . . . , qn A pj
(4.1)
Outer region boundary ∂R is a set of pixels adjacent but not belonging to R ∂R = {p ∈ D \ R : ∃q ∈ R : pAq}
(4.2)
Extremal Region (ER) is a region whose outer boundary pixels have strictly higher values than the region itself ∀p ∈ R, q ∈ ∂R : C(q) > θ ≥ C(p)
(4.3)
where θ denotes threshold of the Extremal Region (see Figure 4.2). We consider RGB and HSI color spaces [21] and additionally an intensity gradient channel (∇) where each pixel is assigned the value of “gradient” approximated by the maximal intensity difference between the pixel and its neighbors (see Figure 4.1):

C∇(p) = max_{q ∈ D : pAq} |CI(p) − CI(q)|     (4.4)
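A minimal numpy/scipy sketch of the gradient channel of eq. (4.4) and of extracting Extremal Regions at a single threshold θ is shown below; the border handling by edge replication and the function names are assumptions of this sketch, and the actual method sweeps all thresholds incrementally rather than recomputing connected components at every threshold.

import numpy as np
from scipy import ndimage

def gradient_channel(intensity):
    # Eq. (4.4): each pixel is assigned the maximal absolute intensity difference
    # to its 4-connected neighbours (image border handled by edge replication).
    padded = np.pad(intensity.astype(np.int16), 1, mode='edge')
    centre = padded[1:-1, 1:-1]
    neighbours = [padded[:-2, 1:-1], padded[2:, 1:-1],
                  padded[1:-1, :-2], padded[1:-1, 2:]]
    return np.max([np.abs(centre - n) for n in neighbours], axis=0).astype(np.uint8)

def extremal_regions_at(channel, theta):
    # Extremal Regions at threshold theta (eq. 4.3): the 4-connected components
    # of the pixels whose channel value is <= theta.
    four_connectivity = np.array([[0, 1, 0],
                                  [1, 1, 1],
                                  [0, 1, 0]])
    labels, count = ndimage.label(channel <= theta, structure=four_connectivity)
    return labels, count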
In real-world images there are certain instances where characters are formed of smaller elements (see Figure 4.3 left) or a single element consists of multiple joint characters (see Figure 4.3 right). By pre-processing the image in a Gaussian pyramid, in each
(a)
(b)
(c)
Figure 4.1 Intensity gradient magnitude channel ∇. (a) Source image. (b) Projection output. (c) Extremal Regions at threshold θ = 24 (ERs bigger than 30% of the image area excluded for better visualization)
Figure 4.2 Extremal Region R is a region whose outer boundary pixels ∂R have strictly higher values than pixels of the region itself. θ denotes threshold of the Extremal Region (in this example θ = 32)
Figure 4.3 Processing with a Gaussian pyramid (the pyramid scale denoted by s). Characters formed of multiple small regions merge together into a single region (left column). A single region which corresponds to characters “ME” is broken into two regions and serifs are eliminated (right column)
level of the pyramid only a certain interval of character stroke widths is amplified: if a character consists of multiple elements, the elements are merged together into a single region and, furthermore, serifs and thin joints between multiple characters are eliminated. This does not represent a major overhead, as each level is 4 times faster than the previous one.
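A minimal sketch of such a coarse Gaussian pyramid follows (scipy is assumed; the blur width and the function name are illustrative choices, not the parameters used in the method).

import numpy as np
from scipy import ndimage

def gaussian_pyramid(image, levels=3, sigma=1.0):
    # Each level is blurred and subsampled by a factor of 2 in both dimensions,
    # so it contains 4x fewer pixels than the previous one.
    pyramid = [image.astype(np.float32)]
    for _ in range(levels - 1):
        blurred = ndimage.gaussian_filter(pyramid[-1], sigma)
        pyramid.append(blurred[::2, ::2])
    return pyramid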
4.1.2 Incrementally Computable Descriptors

The key prerequisite for fast classification of ERs is a fast computation of region descriptors that serve as features for the classifier.
An ER r at threshold θ is formed as a union of one or more (or none) ERs at threshold θ − 1 and pixels of value θ. This induces an inclusion relation (see Figure 4.4) amongst ERs where a single ER has one or more predecessor ERs (or no predecessor if it contains only pixels of a single value) and exactly one successor ER (the ultimate successor is the ER at threshold 255 which contains all pixels in the image). As proposed by
Figure 4.4 Extremal Region (ER) lattice induced by the inclusion relation.
Zimmerman and Matas [81], it is possible to use a particular class of descriptors and exploit the inclusion relation between ERs to incrementally compute descriptor values. Let Rθ−1 denote a set of ERs at threshold θ − 1. An ER r ∈ Rθ at threshold θ is formed as a union of pixels of regions at threshold θ − 1 and pixels of value θ,

r = ⋃_{u ∈ Rθ−1} u ∪ {p ∈ D : C(p) = θ}     (4.5)
Let us further assume that the descriptors φ(u) of all ERs u ∈ Rθ−1 at threshold θ − 1 are already known. In order to compute a descriptor φ(r) of the region r ∈ Rθ it is necessary to combine the descriptors of the regions u ∈ Rθ−1 and the pixels {p ∈ D : C(p) = θ} that formed the region r,

φ(r) = ⊕_{u ∈ Rθ−1} φ(u) ⊕ ⊕_{p ∈ D : C(p) = θ} ψ(p)     (4.6)
where ⊕ denotes an operation that combines descriptors of the regions (pixels) and ψ(p) denotes an initialization function that computes the descriptor for a given pixel p. We refer to such descriptors where ψ(p) and ⊕ exist as incrementally computable (see Figure 4.5).
It is apparent that one can compute descriptors of all ERs simply by sequentially increasing the threshold θ from 0 to 255, calculating descriptors ψ for pixels added at threshold θ and reusing the descriptors of regions φ at threshold θ − 1. Note that this property implies that it is only necessary to keep descriptors from the previous threshold in memory and that the ER method has a significantly smaller memory footprint when compared with MSER-based approaches. Moreover, if it is assumed that the descriptor computation for a single pixel ψ(p) and the combining operation ⊕ have constant time complexity, the resulting complexity of computing descriptors of all ERs in an image of N pixels is O(N), because ψ(p) is computed for each pixel just once and the combining function can be evaluated at most N times, because the number of ERs is bounded by the number of pixels in the image.
In this method we used the following incrementally computed descriptors:
Area a. Area (i.e. number of pixels) of a region. The initialization function is a constant function ψ(p) = 1 and the combining operation ⊕ is an addition (+).
Bounding box (xmin, ymin, xmax, ymax). Top-left and bottom-right corners of the region. The initialization function of a pixel p with coordinates (x, y) is a quadruple (x, y, x + 1, y + 1) and the combining operation ⊕ is (min, min, max, max) where each
(a) perimeter p, (b) Euler number η, (c) horizontal crossings ci
Figure 4.5 Incrementally computable descriptors. Regions already existing at threshold θ − 1 marked grey, new pixels at threshold θ marked red, the resulting region at threshold θ outlined with a dashed line
operation is applied to its respective item in the quadruple. The width w and height h of the region are calculated as xmax − xmin and ymax − ymin respectively.
Perimeter p. The length of the boundary of the region (see Figure 4.5a). The initialization function ψ(p) determines the change of the perimeter length caused by the pixel p at the threshold where it is added,

ψ(p) = 4 − 2|{q : qAp ∧ C(q) ≤ C(p)}|
(4.7)
and the combining operation ⊕ is an addition (+). The complexity of ψ(p) is O(1), because each pixel has at most 4 neighbors.
Euler number η. Euler number (genus) is a topological feature of a binary image which is the difference between the number of connected components and the number of holes. A very efficient yet simple algorithm [106] calculates the Euler number by counting 2 × 2 pixel patterns called quads. Consider the following patterns of a binary image (each 2 × 2 quad written row by row):

Q1 = { [1 0; 0 0], [0 1; 0 0], [0 0; 0 1], [0 0; 1 0] }     (4.8)
Q2 = { [0 1; 1 1], [1 0; 1 1], [1 1; 1 0], [1 1; 0 1] }     (4.9)
Q3 = { [0 1; 1 0], [1 0; 0 1] }     (4.10)

Euler number is then calculated as

η = ¼ (C1 − C2 + 2C3)
(4.11)
where C1, C2 and C3 denote the number of quads Q1, Q2 and Q3 respectively in the image. It follows that the algorithm can be exploited for incremental computation by simply counting the change in the number of quads in the image. The value of the initialization function ψ(p) is determined by the change in the number of the quads Q1, Q2 and Q3 caused by changing the value of the pixel p from 0 to 1 at the given threshold C(p) (see Figure 4.5b),

ψ(p) = ¼ (∆C1 − ∆C2 + 2∆C3)     (4.12)

The complexity of ψ(p) is O(1), because each pixel is present in at most 4 quads. The combining operation ⊕ is an addition (+).
Horizontal crossings ci. A vector (of length h) with the number of transitions between pixels belonging (p ∈ r) and not belonging (p ∉ r) to the region in the given row i of the region r (see Figures 4.5c and 4.8). The value of the initialization function is given by the presence/absence of the left and right neighboring pixels of the pixel p at the threshold C(p). The combining operation ⊕ is an element-wise addition (+) which aligns the vectors so that the elements correspond to the same rows. The computation complexity of ψ(p) is constant (each pixel has at most 2 neighbors in the horizontal direction) and the element-wise addition has constant complexity as well, assuming that a data structure with O(1) random access and insertion at both ends (e.g. a double-ended queue in a growing array) is used.
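For illustration, the per-pixel initialization functions ψ(p) of eqs. (4.7) and (4.12) can be sketched as follows; area and bounding box are analogous (ψ(p) = 1 combined by addition, and (x, y, x + 1, y + 1) combined element-wise by min/max). The code assumes that the pixel p has not yet been switched on in the binary mask of already-added pixels, and all names are hypothetical.

import numpy as np

def psi_perimeter(p, channel):
    # Eq. (4.7): change of the boundary length caused by adding pixel p
    # at its own threshold C(p).
    y, x = p
    h, w = channel.shape
    neighbours = [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
    already_on = sum(1 for ny, nx in neighbours
                     if 0 <= ny < h and 0 <= nx < w
                     and channel[ny, nx] <= channel[y, x])
    return 4 - 2 * already_on

def quad_counts(block):
    # Counts (C1, C2, C3) of the quad patterns Q1, Q2, Q3 in a single 2x2 block.
    s = int(block.sum())
    c1 = 1 if s == 1 else 0
    c2 = 1 if s == 3 else 0
    c3 = 1 if s == 2 and block[0, 0] == block[1, 1] else 0
    return np.array([c1, c2, c3])

def psi_euler(p, mask):
    # Eq. (4.12): change of the Euler number when pixel p is switched on in the
    # binary mask of the pixels already added at lower or equal thresholds.
    y, x = p
    padded = np.pad(mask.astype(np.uint8), 1)
    yy, xx = y + 1, x + 1
    delta = np.zeros(3, dtype=int)
    for dy in (0, 1):
        for dx in (0, 1):
            block = padded[yy - dy:yy - dy + 2, xx - dx:xx - dx + 2].copy()
            before = quad_counts(block)
            block[dy, dx] = 1            # switch the pixel p on inside this quad
            delta += quad_counts(block) - before
    return 0.25 * (delta[0] - delta[1] + 2 * delta[2])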
4.1.3 Sequential Classifier

In the proposed method, each channel is processed separately over a coarse Gaussian pyramid (in the original and inverted projections) and ERs are detected. In order to reduce the high false positive rate and the high redundancy of the ER detector, only distinctive ERs which correspond to characters are selected by a sequential classifier. The classification is broken down into two stages for better computational efficiency (see Figure 4.6).
In the first stage, a threshold is increased step by step from 0 to 255, incrementally computable descriptors (see Section 4.1.2) are computed in O(1) for each ER r and the descriptors are used as features for a classifier which estimates the class-conditional probability p(character|r). The value of p(character|r) is tracked using the inclusion relation of ERs across all thresholds (see Figure 4.7) and only the ERs which correspond to a local maximum of the probability p(character|r) are selected (if the local maximum of the probability is above a global limit pmin and the difference between the local maximum and local minimum is greater than ∆min).
A Real AdaBoost [116] classifier with decision trees was used with the following features (calculated in O(1) from the incrementally computed descriptors): aspect ratio (w/h), compactness (√a/p), number of holes (1 − η) and a horizontal crossings feature (ĉ = median{c_{1/6 w}, c_{3/6 w}, c_{5/6 w}}) which estimates the number of character strokes in horizontal projection - see Figure 4.8. Only a fixed-size subset of c is sampled so that the computation has a constant complexity. The output of the classifier is calibrated to a probability function p(character|r) using Logistic Correction [98]. The parameters were set experimentally to pmin = 0.2 and ∆min = 0.1 to obtain a high value of recall (95.6%) (see Figure 4.9).
In the second stage, the ERs that passed the first stage are classified into character and non-character classes using more informative but also more computationally expensive features. In our method, an SVM [24] classifier with the RBF kernel [86] was used; the parameter values σ and C were found by cross-validation on the training set.
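Putting the incrementally maintained descriptors together, the first-stage features can be derived in O(1) per ER, as in the following sketch (the relative sampling positions of the crossings vector and the function name are assumptions; the Real AdaBoost classifier and its Logistic Correction calibration are not shown).

import math

def first_stage_features(area, bbox, perimeter, euler, crossings):
    # `bbox` is (xmin, ymin, xmax, ymax), `crossings` the per-row crossings vector.
    xmin, ymin, xmax, ymax = bbox
    w, h = xmax - xmin, ymax - ymin
    aspect_ratio = w / h
    compactness = math.sqrt(area) / perimeter
    num_holes = 1 - euler
    # median of the crossings vector sampled at three fixed relative positions
    samples = sorted(crossings[int(len(crossings) * f)] for f in (1/6, 3/6, 5/6))
    c_hat = samples[1]
    return aspect_ratio, compactness, num_holes, c_hat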
Source 2MP image
Intensity channel extracted
ERs selected in O(N ) by the first stage of the sequential classifier
ERs selected by the second stage of the classifier
Text lines formed
Only ERs in text lines selected and labelled by a character classifier

                            run time    No. of ERs
Initial image                   -        6 × 10^6
Char. det. (1st stage)       820 ms        2671
Char. det. (2nd stage)       130 ms          40
Text line formation           20 ms          25
Character Recognition        110 ms          25
Sequence Selection            15 ms          12

Figure 4.6 Typical number of regions and timings in each stage (character detection in a single channel only) on a standard 2GHz PC
(a)
(b)
Figure 4.7 In the first stage of the sequential classification the probability p(character|r) of each ER is estimated using incrementally computable descriptors that exploit the inclusion relation of ERs. (a) A source image cut-out and the initial seed of the ER inclusion sequence (marked with a red cross). (b) The value of p(character|r) in the inclusion sequence, ERs passed to the second stage marked red
Figure 4.8 The horizontal crossings feature used in the 1st stage of ER classification
Figure 4.9 The precision-recall curve of the first stage of the sequential classifier obtained by cross-validation. The configuration used in the experiments marked red (recall 95.6%, precision 67.3%)
Figure 4.10 The features used by the sequential character classifier allow detection of characters in various scripts - Armenian (a), Russian (b) and Kannada (c). Note that the training set contains only Latin characters.
Figure 4.11 The number of boundary inflexion points κ. (a) Characters. (b) Non-textual content
The classifier uses all the features calculated in the first stage and the following additional features:

• Hole area ratio. a_h/a, where a_h denotes the number of pixels of the region holes. This feature is more informative than just the number of holes (used in the first stage), as small holes in a much larger region have lower significance than large holes in a region of comparable size.

• Convex hull ratio. a_c/a, where a_c denotes the area of the convex hull of the region.

• The number of outer boundary inflexion points κ. The number of changes between concave and convex angles between pixels around the region border (see Figure 4.11). A character typically has only a limited number of inflexion points (κ < 10), whereas regions that correspond to non-textual content such as grass or pictograms have boundaries with many spikes and thus more inflexion points.

Let us note that all features are scale-invariant, but not all are rotation-invariant, namely the aspect ratio and the horizontal crossings. The features also enable detecting characters of different scripts, even though the training set only contains the Latin alphabet (see Figure 4.10). They are also somewhat robust against small rotations (approx. ±15°), so text of multiple orientations can be detected by simply rotating the input image 6 times, at the cost of only a slightly lower precision (see Section 4.3.2).
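A minimal sketch of how the second-stage features could be computed from a binary region mask, assuming OpenCV and SciPy are available; the inflexion-point count is approximated here by sign changes of the turning direction along the outer contour, which is an assumption rather than the exact boundary-angle test of the thesis.

```python
import numpy as np
import cv2
from scipy.ndimage import binary_fill_holes

def second_stage_features(mask):
    """Sketch of the 2nd-stage features for a binary region mask (uint8, values 0/1)."""
    area = int(mask.sum())
    filled = binary_fill_holes(mask).astype(np.uint8)
    hole_area_ratio = float(filled.sum() - area) / area        # a_h / a

    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    outer = max(contours, key=cv2.contourArea)                 # outer boundary
    hull = cv2.convexHull(outer)
    convex_hull_ratio = cv2.contourArea(hull) / area           # a_c / a

    pts = outer[:, 0, :].astype(np.int64)                      # boundary points
    v = np.roll(pts, -1, axis=0) - pts                         # edge vectors
    w = np.roll(v, -1, axis=0)
    turn = np.sign(v[:, 0] * w[:, 1] - v[:, 1] * w[:, 0])      # convex / concave turns
    turn = turn[turn != 0]
    kappa = int(np.sum(turn[:-1] != turn[1:])) if len(turn) > 1 else 0

    return hole_area_ratio, convex_hull_ratio, kappa
```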
4.2 Text Line Formation

Let R denote the set of regions (character candidates) in all channels and scales detected in the previous stage. Even though the cardinality of R is linear in the number of pixels, the cardinality of the search space of all sequences is still exponential (the complexity has decreased from 2^(2^n) to 2^n only). In the proposed method, the space of all sequences is searched by effectively finding all region triplets that could correspond to a (sub-)sequence of characters. Such triplets are then formed into text lines by an agglomerative clustering approach, which exploits bottom line estimates and additional typographical constraints as the distance measure between individual clusters.

Data: a set of regions R
Result: a set of triplets T
T ← ∅
for r1 ∈ R do
    for r2 ∈ N(r1) do
        if v{r1, r2} = 0 then continue
        for r3 ∈ N(r2) do
            if v{r2, r3} = 0 then continue
            t ← {r1, r2, r3}
            if v'(t) = 1 then T ← T ∪ t
        end
    end
end
Algorithm 1: Exhaustive enumeration of region pairs and triplets to form initial text line candidates

In the first step of the text line formation (see Algorithm 1), character candidates r1 ∈ R are exhaustively enumerated and region pairs and triplets are formed by considering the neighbours r2 ∈ N(r1) of the region r1 and the neighbours of the neighbours r3 ∈ N(r2). In our method, a region r2 is considered a neighbour of r1 (r2 ∈ N(r1)) if r2 is amongst the K = 5 closest regions to r1, where the distance of two regions is measured as the distance of their centroids. Additionally, a left-to-right direction of the text is enforced by limiting the set N(r1) to regions r2 whose centroid is to the right of the centroid of r1, i.e. cx(r2) > cx(r1), where cx(r) denotes the x-coordinate of the centroid of the region r. In the exhaustive search, region pairs (r1, r2) and (r2, r3) and region triplets (r1, r2, r3) are pruned by the constraints v (respectively v'), which verify that the region pair (resp. region triplet) corresponds to the trained typographical model and which ensure that the exhaustive search does not combinatorially explode. In our method both constraints are implemented as an AdaBoost classifier [116]; the binary constraint v uses the height ratio and the region distance normalized by the region width as features, whilst the ternary constraint v' uses the distance from the bottom line normalized by the text line height and the region centroid angle. In our experiments, the classifiers were trained on the ICDAR 2013 Training set.
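A minimal Python sketch of this enumeration is shown below; the region objects with cx/cy centroid attributes and the pair_ok / triplet_ok callbacks standing in for the trained binary and ternary constraints v and v' are illustrative assumptions, not the thesis implementation.

```python
from math import hypot

def find_triplets(regions, pair_ok, triplet_ok, k=5):
    """Sketch of Algorithm 1: exhaustive enumeration of region triplets."""
    def neighbours(r):
        # K closest regions whose centroid lies to the right of r (left-to-right text).
        right = [s for s in regions if s is not r and s.cx > r.cx]
        right.sort(key=lambda s: hypot(s.cx - r.cx, s.cy - r.cy))
        return right[:k]

    triplets = []
    for r1 in regions:
        for r2 in neighbours(r1):
            if not pair_ok(r1, r2):
                continue
            for r3 in neighbours(r2):
                if not pair_ok(r2, r3):
                    continue
                if triplet_ok(r1, r2, r3):
                    triplets.append((r1, r2, r3))
    return triplets
```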
Data: a set of triplets T
Result: a set of text lines L
// Estimate parameters of each triplet and create an initial set of text lines
L ← ∅
for {r1, r2, r3} ∈ T do
    b ← bottom line estimate of {r1, r2, r3}
    x ← minimal x-coordinate of {r1, r2, r3}
    x̄ ← maximal x-coordinate of {r1, r2, r3}
    h ← maximal height of {r1, r2, r3}
    l ← ({r1, r2, r3}, b, x, x̄, h)
    L ← L ∪ l
end
// Agglomerative clustering of text lines
repeat
    // Find the two closest text lines
    d ← dmax
    for l ∈ L do
        for l' ∈ L \ l do
            if dist(l, l') < d then
                d ← dist(l, l'); m ← l; m' ← l'
            end
        end
    end
    if d < dmax then
        // Merge the two text lines
        M ← {r1, ..., rn ∈ m} ∪ {r1, ..., rn ∈ m'}
        estimate b, x, x̄, h for M
        L ← L ∪ (M, b, x, x̄, h)
        L ← L \ {m, m'}
    end
until d ≥ dmax

Algorithm 2: Text line formation
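A compact sketch of the agglomerative clustering loop of Algorithm 2, assuming hypothetical dist and estimate_params helpers that implement the distance measure and the bottom line / bounding-box estimation described below.

```python
def cluster_text_lines(triplets, dist, estimate_params, d_max=0.2):
    """Sketch of Algorithm 2: agglomerative clustering of triplets into text lines.

    Each text line is represented as (set of regions, parameters), where
    estimate_params(regions) is assumed to return the tuple (b, x_min, x_max, h)
    used by dist; both helpers are placeholders for the estimators in the text.
    """
    lines = [(set(t), estimate_params(t)) for t in triplets]
    while True:
        best, pair = d_max, None
        for i in range(len(lines)):
            for j in range(i + 1, len(lines)):
                d = dist(lines[i], lines[j])
                if d < best:
                    best, pair = d, (i, j)
        if pair is None:                       # no two text lines closer than d_max
            return lines
        i, j = pair
        merged = lines[i][0] | lines[j][0]     # merge the two region sets
        lines = [l for k, l in enumerate(lines) if k not in (i, j)]
        lines.append((merged, estimate_params(merged)))
```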
In the second step (see Algorithm 2), each triplet is first turned into a text line (of length 3) and an initial bottom line direction b is estimated by Least Median of Squares. In addition to the bottom line estimate, a horizontal bounding-box which contains all (3) text line regions is calculated and its co-ordinates x (left), x̄ (right) and h (height) are kept as well. Next, the text lines l and l' with the smallest mutual distance dist(l, l') are found, the two sets of their regions are merged together and a new bottom line direction and bounding-box co-ordinates are estimated from the merged set of text line regions. The distance between two lines dist(l, l') is defined as the normalized vertical distance between their bottom lines, measured at the beginning and the end of the bounding-box formed as the union of the two text lines' bounding-boxes

    dist(l, l') = max(|b_l(χ) − b_l'(χ)|, |b_l(χ') − b_l'(χ')|) / min(h_l, h_l')    (4.13)

where χ = min(x_l, x_l') and χ' = max(x̄_l, x̄_l').
The merging process continues as long as the smallest distance between any two text lines is below the threshold dmax (in our experiments, we set dmax = 0.2). As a final step of the text line formation, conflicting text lines are eliminated - two (or more) text lines are in conflict if they contain the same region, which is not permitted as we assume that a character can be present in one word only (see Section 4.2.2). Conflicting text lines are typically created in images where the text is arranged into multiple rows - in such a case, the same region is a member of two or more different triplets (this is quite common as there is no limitation in Algorithm 1 to ensure a unique assignment of a region to one triplet), but in the second step the triplets are not merged together into a single text line because their bottom line directions are completely different. In order to eliminate the conflicts, text lines which share at least one region with another text line are first grouped into clusters based on the presence of identical regions - two text lines are members of the same cluster if they have at least one region in common. This process forms isolated text clusters consisting of multiple text lines, which are (possibly transitively) "connected" to each other by shared region(s), and, on the contrary, each region is present in precisely one cluster. Each cluster is then processed individually in the following iterative process: first, the text line with the highest number of regions in the cluster is added to the output and all text lines which share a region with the selected text line (including the selected one) are removed from the cluster. The process then continues until the text cluster contains no text lines. This can be viewed as a voting process, where in each cluster the text lines vote for a text direction and the text line with the highest number of regions (i.e. the text direction of the longest text line) gets selected for the output, whilst all text lines which shared any region with the selected text line (and which must have a different text direction, otherwise they would have been merged into the selected text line in the previous process) are eliminated. Note that the algorithm is not affected by the order in which the text lines are processed, because in the first step each text cluster is a transitive closure (where the binary relation between two text lines is given by the existence of a shared region), and in the second step the ordering is given by the descending number of regions in each text line.
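The conflict-elimination step can be sketched as follows; each text line is assumed to be represented simply as a set of (hashable) regions, which is an illustrative simplification.

```python
def eliminate_conflicts(lines):
    """Sketch of the conflict elimination: text lines sharing a region are grouped
    into clusters and, within each cluster, the line with the most regions wins."""
    output, remaining = [], list(lines)
    while remaining:
        # Grow a cluster of (transitively) conflicting text lines.
        cluster = [remaining.pop(0)]
        shared = set(cluster[0])
        changed = True
        while changed:
            changed = False
            for line in remaining[:]:
                if shared & line:
                    cluster.append(line)
                    shared |= line
                    remaining.remove(line)
                    changed = True
        # Voting: repeatedly keep the longest line and drop the lines it conflicts with.
        while cluster:
            winner = max(cluster, key=len)
            output.append(winner)
            cluster = [line for line in cluster if not (line & winner)]
    return output
```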
Figure 4.12 Character recognition features: Input character (left). Features of the chain-code bitmap for each direction (right).
Figure 4.13 Random samples from the training set. The set contains 5580 characters from 90 fonts with no distortions, blurring or rotations
4.2.1 Character Recognition

Let R denote the set of all text lines' regions, i.e.

    R = ⋃_{l ∈ L} {r1, ..., rn ∈ l}    (4.14)

Each candidate region r ∈ R is labelled with a Unicode code(s) using an (approximative) nearest-neighbour classifier. The set of labels L̂(r) ⊆ A of a region r is defined as

    L̂(r) = { l(t) : t ∈ N_K(f(r)) ∧ ||f(t) − f(r)|| ≤ d̄(l(t)) }    (4.15)
where l(t) denotes the label of the training sample (template) t, N_K(f(r)) denotes the K nearest neighbours of the region r in the character feature space f, d̄(l) is a maximal distance for the label (class) l and A is the set of supported Unicode characters (the alphabet). In our experiments, the alphabet consisted of 26 uppercase and 26 lowercase Latin characters and 10 digits (|A| = 62). Let us also note that a region might not be assigned any label, in which case it is rejected in the following step (see Section 4.2.2).

The region is first normalized to a fixed-sized matrix of 35 × 35 pixels, while retaining the centroid of the region and the aspect ratio. Next, an 8-directional chain-code is generated [73] for the boundary pixels and each boundary pixel is inserted into a separate bitmap depending on which chain-code direction is assigned to it (there are 8 bitmaps of 35 × 35 pixels, one bitmap for each chain-code direction - see Figure 4.12). After Gaussian blurring (σ = 1.1) each bitmap is sub-sampled to a matrix of 5 × 5 pixels to generate 25 features for each direction. In total, 25 features × 8 directions generate 200 features per region.

In our experiments, the training set consists of images with a single black letter on a white background. In total there were 5580 training samples (62 character classes in 90 different fonts). Let us note that no further distortions, blurring, scaling or rotations were artificially introduced into the training set, in order to demonstrate the power of the feature representation. The method can also be easily extended to incorporate additional characters or even scripts (see Figure 4.14), however the recognition accuracy might be affected by the increased number of classes.

The nearest-neighbour classifier N_K was implemented by an approximative nearest-neighbour classifier [85] for performance reasons and K was set to 11. The values d̄(l) were estimated for each class and each feature representation by cross-validation on the training set as an average maximal distance over all folds, multiplied by a tolerance factor β. The value of β represents a trade-off between detecting more characters from fonts not in the training set and producing more false positives.
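A rough sketch of the 200-dimensional chain-code descriptor, assuming OpenCV and a binary character mask; the aspect-ratio-preserving normalization and the exact boundary traversal of the thesis are simplified here.

```python
import numpy as np
import cv2

def chain_code_features(mask):
    """Sketch of the chain-code descriptor: the mask is normalized to 35x35,
    boundary pixels are binned into 8 direction bitmaps, each bitmap is blurred
    (sigma = 1.1) and sub-sampled to 5x5, giving 8 x 25 = 200 features."""
    norm = cv2.resize(mask.astype(np.uint8), (35, 35), interpolation=cv2.INTER_NEAREST)
    contours, _ = cv2.findContours(norm, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
    # 8-neighbourhood offsets indexed by chain-code direction.
    dirs = [(1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1)]
    bitmaps = np.zeros((8, 35, 35), np.float32)
    for contour in contours:
        pts = contour[:, 0, :]                            # (x, y) boundary points
        for (x0, y0), (x1, y1) in zip(pts, np.roll(pts, -1, axis=0)):
            step = (int(np.sign(x1 - x0)), int(np.sign(y1 - y0)))
            if step == (0, 0):
                continue                                  # degenerate one-point contour
            bitmaps[dirs.index(step), int(y0), int(x0)] = 1.0
    features = [cv2.resize(cv2.GaussianBlur(b, (0, 0), 1.1), (5, 5),
                           interpolation=cv2.INTER_AREA) for b in bitmaps]
    return np.concatenate([f.ravel() for f in features])  # 200-dimensional vector
```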
Figure 4.14 The method can be also easily extended to incorporate additional characters or even whole scripts
Figure 4.15 Character boundaries are often fuzzy and it is not possible to locally determine the threshold value unambiguously. Note the binarization of letters “ITE” in the word ”SUITES” - as the threshold Θ is increased their appearance goes from “IIE” through “ITE” to “m”
In our experiments, we used the value β = 2.5, which yields the best performance on the training subset of the ICDAR dataset (see Section 4.3).
4.2.2 Sequence Selection

Let us consider a word as a character sequence. Given the set of regions R from the character detector with a set of Unicode labels L̂(r) for each region r, the method finds a set of words (i.e. a set of character sequences) for each text line, where in each sequence a region with a label corresponds to a character and the order of the regions in the sequence corresponds to the order of the characters in the word. Note that the solution might not be unambiguous, because the character detector will typically output several regions for the same character in the image (the same character can be detected with a different threshold or in a different projection - see Figure 4.15), so ultimately there can be several different regions that produce an identical character sequence.

Given regions r1 and r2 in a text line, r1 is an immediate predecessor of r2 (denoted r1 P r2) if r1 and r2 are part of the same text line and the character associated with r1 immediately precedes the one associated with r2 in the sequence of characters. The predecessor relation P induces a directed graph G for each text line, such that nodes correspond to labelled regions

    G = (V, E)    (4.16)
    V = { r^l ∈ R × A : l ∈ L̂(r) }    (4.17)
    E = { (r1^l1, r2^l2) : r1 P r2 }    (4.18)

where r^l denotes a region r with a label l ∈ A. Regions which do not have any label assigned by the character classifier (i.e. L̂(r) = ∅) are not part of the text line graph and are therefore eliminated in this process.
Figure 4.16 Threshold interval overlap τ (r1 , r2 ). A threshold interval is an interval of thresholds during which the region has not changed its OCR label (red). Note that as the threshold is increased the region grows or merges with other regions and the label changes
In the proposed method, the immediate predecessor relation P is modelled by ordering the regions' centroids in the direction of the associated text line, i.e. a region r1 is a predecessor of r2 if the centroid of r1 comes before the centroid of r2 in the text line direction, the regions' pixels do not overlap and there is no other region closer to r2 (measured as the distance of the centroids) which satisfies these conditions. Each node r^l and each edge (r1^l1, r2^l2) has an associated score s(r^l), respectively s(r1^l1, r2^l2)

    s(r^l) = α1 ψ(r) + α2 ω(r^l)    (4.19)
    s(r1^l1, r2^l2) = α3 τ(r1, r2) + α4 λ(l1, l2)    (4.20)
where α1 ... α4 denote weights which are determined in a training stage.

Region text line positioning ψ(r) is calculated as the negative sum of the squared Euclidean distances of the region's top and bottom points from the estimated positions of the top and bottom text lines, respectively. This unary term is incorporated to prefer regions which better fit on the text line.

Character recognition confidence ω(r^l) estimates the probability that the region r has the character label l, based on the confidence of the character classifier (see Section 4.2.1). The estimate is calculated by taking the sum of the (approximative) distances in the character feature space of at most K nearest templates from the training set with the label l, normalized by the distance of the nearest template dmin

    dmin(r) = min_{t' ∈ N_K(f(r))} d(t', r)    (4.21)

    ω(r^l) ≈ (1/K) Σ_{t ∈ N_K(f(r)) : l(t) = l} dmin(r) / d(t, r)    (4.22)

    d(x, y) = ||f(x) − f(y)||    (4.23)
Threshold interval overlap is a binary term which is incorporated to express preference for segmentations that follow one after another in a word to have a similar threshold. A threshold interval is an interval of thresholds during which the region has not changed its OCR label. A threshold interval overlap τ (r1 , r2 ) is the intersection of intervals of regions r1 and r2 (see Figure 4.16). Transition probability λ(l1 , l2 ) estimates the probability that the label (character) l1 follows after the label l2 in a given language model. Transition probabilities are calculated in a training phase from a dictionary for a given language. As a final step of the method, the directed graph is constructed with corresponding scores assigned to each node and edge (see Figure 4.17), the scores are normalized by width of the area that they represent (i.e. node scores are normalized by the width of the region and edge scores are normalized by the width of the gap between regions) and a standard dynamic programming algorithm is used to select the path with the highest score. The sequence of regions and their labels induced by the optimal path is then broken down into a sequence of words by calculating a median of spacing sm between individual regions in the whole sequence and by introducing a word boundary where the spacing is above 2sm . This process associates a sequence of words (or a single word if no word boundary is found) with each text line, which is the final output of the method.
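The final path search can be sketched as a longest-path dynamic program over the DAG; the node ordering (assumed to already be topological, e.g. by centroid position along the text line), the successor map and the score callbacks are illustrative assumptions, and the width normalization is assumed to already be folded into the scores.

```python
def best_sequence(nodes, successors, node_score, edge_score):
    """Sketch of the sequence selection: highest-scoring path in the DAG of
    labelled regions, found by dynamic programming over a topological order."""
    best = {n: node_score(n) for n in nodes}       # best path score ending at n
    prev = {n: None for n in nodes}
    for n in nodes:                                # nodes in topological order
        for m in successors.get(n, []):
            cand = best[n] + edge_score(n, m) + node_score(m)
            if cand > best[m]:
                best[m], prev[m] = cand, n
    end = max(nodes, key=lambda n: best[n])        # backtrack from the best end node
    path = [end]
    while prev[path[-1]] is not None:
        path.append(prev[path[-1]])
    return list(reversed(path))
```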
Figure 4.17 The final segmentation sequence is found as an optimal path in a directed graph where nodes correspond to labeled regions and edges represent a “is-a-predecessor” relation. Each node rc represents a region (segmentation) r with a label c (red). Node (edge) score s(rc ) resp. s(rc11 , rc22 ) denoted above the node resp. edge. Optimal path denoted in green
4.3 Experiments

4.3.1 Character Detector

An experimental validation shows that 85.6% of the characters in the ICDAR 2013 dataset [53] are detected as ERs in a single channel and that 94.8% of the characters are detected if the detection results are combined from all channels (see Table 4.1). A character is considered as detected if the bounding box of the ER matches at least 90% of the area of the bounding box in the ground truth. In the proposed method, the combination of the intensity (I), intensity gradient (∇), hue (H) and saturation (S) channels was used, as it was experimentally found to be the best trade-off between a short run time and localization performance.
4.3.2 ICDAR 2013 Dataset

The proposed method was evaluated on the ICDAR 2013 Robust Reading competition dataset [53] (which is the same dataset as in the 2011 competition), which contains 1189 words and 6393 letters in 255 images. There are some challenging text instances in the dataset (reflections, text written on complicated backgrounds, textures which resemble characters), but on the other hand the text is English only, it is mostly horizontal and the camera is typically focused on the text area.

Using the ICDAR 2013 competition evaluation protocol [139], the method reaches a recall of 71.3%, a precision of 82.1% and an f-measure of 76.3% in text localization (see Figure 4.19 for sample outputs). The average processing time is 1.6s per image. The method achieves significantly better recall (71%) than the winner of the ICDAR 2013 Robust Reading competition (66%) and the recently published method of Yin et al. [144] (68%). The overall f-measure (76%) outperforms all published methods (see Table 4.2), but the precision is slightly worse than that of the winner of the ICDAR competition. The main cause of the lower precision in text localization in some images is missed characters, which result in a word being only partially detected or detected as multiple words; this is heavily penalized by the evaluation protocol, because such a word detection is assigned a precision of 0% (see Figure 4.20).
Channel   R (%)   P (%)
R         83.3     7.7
G         85.7    10.3
B         85.5     8.9
H         62.0     2.0
S         70.5     4.1
I         85.6    10.1
∇         74.6     6.3

Channel        R (%)   P (%)
I∪H            89.9     6.0
I∪S            90.1     7.2
I∪∇            90.8     8.4
I∪H∪S          92.3     5.5
I∪H∪∇          93.1     6.1
I∪R∪G∪B        90.3     9.2
I∪H∪S∪∇        93.7     5.7
all (7 ch.)    94.8     7.1

Table 4.1 Recall (R) and precision (P) of character detection by ER detectors in individual channels and their combinations. The channel combination used in the experiments (I∪H∪S∪∇) is in bold.
method                            recall   precision   f-measure
proposed method                    72.4      81.8        77.1
Yin et al. [144]                   68.3      86.3        76.2
TexStar (ICDAR'13 winner) [53]     66.4      88.5        75.9
Kim (ICDAR'11 winner) [118]        62.5      83.0        71.3

Table 4.2 Text localization results on the ICDAR 2013 dataset
Figure 4.18 Text localization performance (f-measure as a function of dataset rotation) on the rotated ICDAR 2013 dataset, for the single-pass pipeline and for the pipeline with multiple rotated channels
method                  recall   precision   f-measure
proposed method          45.4      44.8        45.2
Weinman et al. [136]     41.1      36.5        33.7

Table 4.3 End-to-end text recognition results on the ICDAR 2013 dataset (case-sensitive).
Figure 4.19 Text localization and recognition results on the ICDAR 2013 dataset. Ground truth marked green, method output marked red
Figure 4.20 The main cause of the lower precision in text localization in some images is missed characters, which result in a word being only partially detected or detected as multiple words. Such partial/multiple detections are heavily penalized by the evaluation protocol (the overall image recall r and precision p are denoted below each image)
Figure 4.21 Samples of missed text in the ICDAR 2013 dataset. Not enough letters to form a text line (a), very low contrast (b), letters connected to the surrounding area which has the same color (c) and multiple characters joined together (cursive) (d).
Figure 4.22 Precision and recall of end-to-end word detection and recognition on the Street View Text dataset. The configuration with the highest f-measure (68.1%) marked red
The proposed method also fails to detect text where there are not enough regions to form a text line (the text formation algorithm needs to form at least one triplet - see Section 4.2), where the word consists of connected letters (even if a line is formed, a region consisting of multiple letters is then rejected by the character classifier) or where there is no threshold in any projection which separates a character from its background - see Figure 4.21.

In end-to-end text recognition, the method correctly localized and recognized 549 words (46%), where a word is considered correctly recognized if all its characters match the ground truth using a case-sensitive comparison, and the method "hallucinated" 60 words in total which do not have any overlap with the ground truth. The proposed method outperforms the method of Weinman et al. [136] (see Table 4.3), mostly benefiting from a superior text localization phase. The dataset was also exploited in the ICDAR 2015 Robust Reading competition (see Section 4.3.4, Table 4.7).

In the second experiment, the dataset was rotated by 5° increments in the [−45°; 45°] interval, thus creating a synthetic dataset of 4845 images of multi-oriented text with complete annotations. The ability of the proposed method to detect text of different orientations was then evaluated by calculating the average text localization f-measure for each dataset rotation (see Figure 4.18). With small rotations (≤ 15°) the f-measure drops only slightly (the recall remains the same, but the precision is slightly worse), but both the recall and the precision drop for rotations over 30°. If the pipeline is altered to rotate the input image 6 times (using 15° increments) and to combine the rotated channels in the sequence selection stage, the precision on the original dataset drops from 82.1% to 68.1% and the recall remains virtually the same. However, for the rotated dataset the recall is maintained across all rotations and the precision is only slightly worse for rotations over 35° (see Figure 4.18 - red).
method                                       f-measure
proposed method                                68.1
proposed method (general language model)       64.7
T. Wang et al. [134]                           46.0
K. Wang et al. [132]                           38.0

Table 4.4 End-to-end word detection and recognition results on the Street View dataset
Figure 4.23 Text localization and recognition results on the SVT dataset. Ground truth marked green, method output marked red
4.3.3 Street View Text Dataset

The Street View Text (SVT) dataset [132] contains 647 words and 3796 letters in 249 images harvested from Google Street View. Images in the dataset are more challenging because text is present in different orientations, the variety of fonts is bigger and the images are noisy; on the other hand, the task formulation is slightly easier, because with each test image the evaluated method is also given a list of words (called a lexicon) and the method only needs to localize (and therefore recognize) words from the lexicon. Let us note that not all lexicon words are present in the image (these are called confuser words) and, vice versa, not all words present in the image are in the lexicon.

In order to make a fair comparison with the previously published work [134, 132] and to make the proposed method compatible with the aforementioned task formulation, the proposed pipeline was slightly modified in order to exploit the presence of a lexicon. Firstly, the character transition probabilities λ(c1, c2) (see Section 4.2.2) were calculated for each image individually from its associated lexicon, which makes the method prefer character sequences present in the lexicon.
Figure 4.24 Samples of missed text in the SVT dataset. Letters connected to the surrounding area which has the same color (a), multiple letters joined together (b) and artistic fonts (c)
Figure 4.25 “False positives” in the SVT dataset are frequently caused by confusing actual text for a confuser word from the lexicon (left column) or by incorrect ground truth where “confuser” words from the lexicon are actually present in the image (right column)
Secondly, the output of the method was further refined using the image lexicon - words whose edit distance from a lexicon word was below a selected threshold were considered a match and included in the final output, whereas words whose edit distance was above the threshold were discarded. The edit distance threshold is a parameter which makes the method accept output words that are more or less similar to lexicon words (see Figure 4.22).

Using the same evaluation protocol as [134], the proposed method achieves an f-measure of 68.1% for the best edit distance threshold, which significantly outperforms the state-of-the-art methods (see Table 4.4). The method is able to cope with low-contrast and noisy text and a high variety of different fonts (see Figure 4.23 for output examples). The average processing time is 3.1s per image, as the dataset images have a higher resolution.

It can also be observed that many of the detections which are considered as false positives are caused by actual text in the image. This is either caused by the fact that the edit distance of the detected text is too close to a confuser word (see Figure 4.25 - left column) or by incorrect ground truth where confuser words are actually present in the image (see Figure 4.25 - right column). The method fails to detect text where letters are connected to the surrounding area which has the same color, where multiple letters are joined together or where the font is artistic and therefore not represented in the training set (see Figure 4.24).
Figure 4.26 Text localization and recognition results on the ICDAR 2015 Incidental scene text dataset. Best viewed zoomed in color.
If the general character transition probabilities λ(c1, c2) for English (i.e. the same ones as in Section 4.3.2) are used instead of lexicon-specific ones for each image, the f-measure drops to 64.7%, which suggests that the method is still competitive even with a general language model.
4.3.4 ICDAR 2015 Competition

The proposed method was used as the baseline method¹ in the ICDAR 2015 Robust Reading Competition [52] (see Figure 4.26). In the 2015 competition, the emphasis was on the end-to-end text recognition evaluation, rather than on the individual subtasks (text localization, text segmentation, cropped word recognition) as in the previous years, mainly because the interpretation of the individual subtasks' results is problematic due to the evaluation methodology (see Figure 4.20 for an illustration of the problems with the text localization protocol).

Three different datasets were exploited for the evaluation: Incidental scene text, Video text and Focused scene text. The new Incidental scene text dataset contains 17,548 annotated text regions in 1670 scene text images captured with Google Glass. The dataset consists of significantly more challenging images due to blur, different text orientations, small text dimensions and many textures similar to text. It was introduced to reduce the possibility of overfitting and to address the aforementioned problems of the ICDAR 2013 dataset (see Section 4.3.2), which is now referred to as the Focused scene text dataset. In order to make a fair comparison between methods and to see the impact of prior knowledge, each dataset comes with three lexicons of a different size: the Strong lexicon contains 100 words specific to each image, the Weak lexicon contains all words in the dataset and the Generic lexicon contains 90K English words.
¹ The proposed method did not participate in the competition directly to avoid any conflict of interest, because the authors helped with the data annotation of the newly introduced dataset.
Method                     Strong (p / r / f)     Weak (p / r / f)       Generic (p / r / f)
Stradvision-2              67.9 / 32.2 / 43.7     -                      -
proposed method            62.2 / 24.4 / 35.0     25.0 / 16.6 / 19.9     18.3 / 13.6 / 15.6
Stradvision-1              28.5 / 39.7 / 33.2     -                      -
NJU                        48.8 / 24.5 / 32.6     -                      -
BeamSearch CUNI            37.8 / 15.7 / 22.1     33.7 / 14.0 / 19.8     29.6 / 12.4 / 17.5
Deep2Text-MO [144, 143]    21.3 / 13.8 / 16.8     21.3 / 13.8 / 16.8     21.3 / 13.8 / 16.8
OpenCV+Tessaract           40.9 /  8.3 / 13.8     32.5 /  7.4 / 12.0     19.3 /  5.0 /  8.0
BeamSearch CUNI+S          81.0 /  7.2 / 13.3     64.7 /  5.9 / 10.9     35.0 /  3.8 /  6.9
Table 4.5 ICDAR 2015 Robust Reading competition [52] - End-to-end Incidental text recognition
Method                           MOTP   MOTA   ATA
proposed method                  69.5   59.8   41.8
Stradvision-1                    69.2   56.5   28.5
USTB-TexVideo [144, 143]         65.1   45.8   19.8
Deep2Text-I [144, 143]           62.1   35.4   18.6
USTB-TexVideo-II-2 [144, 143]    63.5   50.5   17.8
USTB-TexVideo-II-1 [144, 143]    60.5   21.2   13.8

Table 4.6 ICDAR 2015 Robust Reading competition [52] - End-to-end Video text recognition
Method                     Strong (p / r / f)     Weak (p / r / f)       Generic (p / r / f)
VGGMaxBBNet [47]           89.6 / 83.0 / 86.2     -                      -
Stradvision-1              88.7 / 75.0 / 81.3     84.0 / 73.7 / 78.5     69.5 / 65.0 / 67.2
proposed method            85.9 / 69.8 / 77.0     61.5 / 64.8 / 63.1     50.7 / 58.1 / 54.2
Deep2Text-II [144, 143]    81.7 / 69.8 / 75.3     81.7 / 69.8 / 75.3     81.7 / 69.8 / 75.3
NJU                        80.2 / 69.6 / 74.5     -                      -
Deep2Text-I [144, 143]     84.0 / 66.7 / 74.4     84.0 / 66.7 / 74.4     84.0 / 66.7 / 74.4
MSER-MRF                   84.5 / 61.4 / 71.1     -                      -
BeamSearch CUNI            67.9 / 59.0 / 63.1     65.1 / 57.5 / 61.0     59.4 / 52.9 / 56.0
OpenCV+Tessaract           75.7 / 49.0 / 59.5     69.5 / 47.1 / 56.0     51.0 / 37.6 / 43.3
BeamSearch CUNI+S          92.8 / 15.4 / 26.4     89.1 / 13.5 / 23.3     65.5 / 12.0 / 20.3
Table 4.7 ICDAR 2015 Robust Reading competition [52] - End-to-end Focused text recognition
On the Incidental scene text dataset, the proposed method placed second using the Strong lexicon (topped only by the deep-network based StradVision method, which is not published) and placed first using the Weak lexicon (see Table 4.5). The best result with the Generic lexicon is achieved by the BeamSearch method, which is based on the proposed method and differs only in its more sophisticated language model.

For the cropped word recognition subtask, the proposed method recognized 14.2% of the words correctly. The main reason for the lower performance when compared to the end-to-end setup is the requirement to initially detect at least 3 characters on a line, which is less likely to be successful for images of individual cut-out words (and impossible for words containing fewer than 3 characters) - 33.8% of the individual words were missed completely in the cropped word recognition setup because of this limitation.

On the Video text dataset containing 15 test video sequences, the proposed method (processing the video frame by frame and feeding its output to the FoT tracker [131]) outperformed all participants (see Table 4.6) in all three metrics [54]: the Multiple Object Tracking Precision (MOTP), the Multiple Object Tracking Accuracy (MOTA) and the Average Tracking Accuracy (ATA).

On the Focused scene text dataset (i.e. the ICDAR 2013 dataset), the proposed method in the end-to-end setup is outperformed only by the deep network of Jaderberg et al. [47], trained on significantly more data, and by the deep-network based StradVision method, which is not published (see Table 4.7).
5 Efficient Unconstrained Scene Text Detector

5.1 FASText Keypoint Detector

The FASText keypoint detector is, as the name suggests, based on the well-known FAST corner detector by Rosten and Drummond [111, 112]. The standard FAST detects certain letters by firing on character corners (e.g. the letter "L" or "P") or on corners of character stroke endings if the character is sufficiently thick (e.g. the ending of the letter "l"), but it is unable to detect characters whose stroke does not have a corner or an ending (e.g. the letter "O" or the digit "8"). Moreover, in a typical scene image the standard FAST detector produces many false and repeated detections (see Section 5.5.1), which unnecessarily slows down the processing.

Considering we are only interested in detecting character strokes, the FAST detector is modified to introduce two novel keypoint classes: the Stroke Ending Keypoint (SEK) matches a stroke ending, whilst the Stroke Bend Keypoint (SBK) matches a curved segment of a stroke (see Figure 5.2). For each pixel p in an image, the pixel intensities I around a circle of 12 pixels x ∈ {1 ... 12} are examined and each pixel x is assigned one of three labels

    L(p, x) = d (darker),   if I_x ≤ I_p − m
              s (similar),  if I_p − m < I_x < I_p + m
              b (brighter), if I_x ≥ I_p + m    (5.1)
where m is a margin, which is a parameter of the detector.

The pixel p is a Stroke Ending Keypoint (SEK) if there exist two contiguous partitionings Ps and Pd (or Ps and Pb) such that |Ps| ∈ {1, 2, 3} and |Pd| = 12 − |Ps| (or |Pb| = 12 − |Ps|), where Pl denotes a contiguous partitioning of the pixels x with the label l. In other words, the pixel p is a SEK if there exists a contiguous circle segment of at least 9 pixels which are darker (or brighter) than the pixel p, whilst the remaining pixels of the circle have an intensity similar to the pixel p.
Figure 5.1 The FASText detector output. Stroke End Keypoints (SEK) marked red, Stroke Bend Keypoints (SBK) marked blue.
Figure 5.2 The Stroke Ending Keypoint (left) and the Stroke Bend Keypoint (right). Pixels of the Ps partitioning marked red, pixels of the Pb partitioning marked white, inner pixels Pi for the connectivity test in purple.
Figure 5.3 The proposed pipeline. The average processing time and the imprecision (number of false and repeated detections) for a 1MPx image denoted below each stage.
The keypoint can be either positive or negative, depending on whether the intensity of the stroke is higher or lower than that of the background.

Using the same notation, the pixel p is a Stroke Bend Keypoint (SBK) if there exist four contiguous partitionings Ps, Ps', Pd, Pd' (or Ps, Ps', Pb, Pb') such that |Ps|, |Ps'| ∈ {1, 2, 3} and |Pd| > 6, |Pd'| = 12 − |Pd| − |Ps| − |Ps'| (or |Pb| > 6, |Pb'| = 12 − |Pb| − |Ps| − |Ps'|). In other words, the pixel p is a SBK if there is a contiguous circle segment of at least 6 pixels which are darker (or brighter) than the pixel p, two distinct circle segments which have an intensity similar to the pixel p, and the remaining pixels on the circle are darker (or brighter) than the pixel p.

The implementation of the aforementioned tests is very straightforward and the tests can be computed in a single pass around the 12-pixel circle. The computational complexity of the detector is reduced even further (by a factor of 2) by inserting a simple rule, which examines the opposite pixels and tests whether all opposite pixels are brighter than Ip + t or darker than Ip − t. If neither of the conditions is met, the pixel p cannot be a FASText keypoint and is quickly rejected without any further processing.

The final verification step of the FASText detector is a connectivity test, which ensures that the inner circle pixels Pi between the pixel p and the pixels Ps also satisfy the intensity margin, i.e. Ip − m < Ix < Ip + m ∀x ∈ Pi (see Figure 5.2). The purpose of the test is to eliminate false detections, because if the pixel p is placed on a stroke, the pixels in the Ps partitioning(s) must be connected to it. Let us note that this test does not represent any significant overhead, as in the worst case only 3 pixels have to be examined.
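A simplified sketch of the per-pixel SEK/SBK decision is given below; the 12-pixel sampling circle offsets, the run-based implementation and the omission of the fast opposite-pixel rejection and of the connectivity test are assumptions made for brevity, not the thesis implementation.

```python
import numpy as np

def fastext_labels(img, p, circle, m=13):
    """Label the 12 circle pixels around p as darker / similar / brighter (eq. 5.1)."""
    y, x = p
    ip = int(img[y, x])
    labels = []
    for dy, dx in circle:                      # `circle` = 12 sampling offsets (assumed)
        ix = int(img[y + dy, x + dx])
        labels.append('d' if ix <= ip - m else 'b' if ix >= ip + m else 's')
    return labels

def circular_runs(labels):
    """Maximal runs of equal labels around the circle, as (label, length) pairs."""
    n = len(labels)
    start = next((i for i in range(n) if labels[i] != labels[i - 1]), 0)
    runs, i = [], 0
    while i < n:
        j = i
        while j < n and labels[(start + j) % n] == labels[(start + i) % n]:
            j += 1
        runs.append((labels[(start + i) % n], j - i))
        i = j
    return runs

def keypoint_type(labels):
    """Return 'SEK', 'SBK' or None for one candidate pixel."""
    runs = circular_runs(labels)
    s_runs = [length for lab, length in runs if lab == 's']
    for bg in ('d', 'b'):                      # test against darker, then brighter pixels
        bg_runs = [length for lab, length in runs if lab == bg]
        if len(runs) != len(s_runs) + len(bg_runs):
            continue                           # both darker and brighter pixels present
        if len(s_runs) == 1 and 1 <= s_runs[0] <= 3 and len(bg_runs) == 1:
            return 'SEK'
        if (len(s_runs) == 2 and all(1 <= l <= 3 for l in s_runs)
                and len(bg_runs) == 2 and max(bg_runs) > 6):
            return 'SBK'
    return None
```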
Figure 5.4 The margin parameter m controls the trade-off between imprecision (the number of false and repeated detections) and the number of missed characters. The value used in experiments marked by the red cross, the detector with the circle size of 16 pixels denoted FASText-16.
In order to eliminate FASText keypoints which lie close to each other, a simple non-maximum suppression is performed over a 3×3 neighborhood and only the keypoint with the highest contrast (i.e. max(Ix − Ip) : x ∈ Pb, respectively max(Ip − Ix) : x ∈ Pd) is kept.

The optimal values of the detector parameters were found experimentally using the ICDAR 2013 Training dataset [53], which contains 4784 characters in 229 images. A character is considered as detected if there is at least one keypoint whose position intersects with the character ground truth segmentation. The value of each parameter was chosen to obtain the best trade-off between the detector imprecision and the number of missed characters (recall). The detector imprecision is a precision-like metric designed to cater for repeated detections of the same character and is calculated as |D|/|GT|, where |D| is the number of detections and |GT| is the number of characters in the ground truth. For example, an imprecision of 10 implies that a detector produces 10 times more detections than there are characters in the ground truth, but it does not say what proportion of the ground truth characters is actually detected.

The circle size of 12 pixels is the first detector parameter. Its value is lower than the original FAST [111] value of 16 pixels, to allow detection of characters (strokes) which are close to each other (see Figure 5.4). Let us note that a circle size smaller than 12 pixels is not possible because of the connectivity test. The second detector parameter is the margin m, which controls the trade-off between imprecision (the number of false and repeated detections) and the number of missed characters (see Figure 5.4). The optimal margin value was found to be m = 13.

Because the FASText detector is only triggered by strokes whose width is comparable to the pixel circle radius (i.e. two or three pixels wide), the keypoints are detected in an image scale-space to allow detection of wider strokes. Each level of the pyramid is calculated from the previous one by reducing the image size by the scaling factor f (in our implementation, bilinear approximation was used for image resizing). The scaling factor f is the third parameter of the detector and its optimal value was experimentally found to be 1.6 (see Figure 5.5).
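The non-maximum suppression step described above can be sketched as follows (an O(n²) illustration rather than the grid-based implementation one would use in practice):

```python
def nms_3x3(keypoints, contrast):
    """Sketch of the 3x3 non-maximum suppression: a keypoint survives only if no
    other keypoint within its 3x3 neighbourhood has a higher contrast.
    keypoints is a list of (y, x) positions, contrast[i] their contrasts."""
    kept = []
    for i, (y, x) in enumerate(keypoints):
        neighbours = [j for j, (v, u) in enumerate(keypoints)
                      if j != i and abs(v - y) <= 1 and abs(u - x) <= 1]
        if all(contrast[i] >= contrast[j] for j in neighbours):
            kept.append((y, x))
    return kept
```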
Figure 5.5 The scaling parameter f controls the trade-off between imprecision and the number of missed characters. The value used in experiments marked by the red cross.
The last parameter of the detector is the maximum number of keypoints per image. An input image is partitioned into uniformly-sized cells (see Figure 5.3) and the number of detected keypoints in each cell is limited by ordering the keypoints by their contrast and eliminating the keypoints whose position in the ordered set is above the cell limit. The value of 4000 keypoints per image was chosen as a value commonly used by standard keypoint detectors [113].
5.2 Keypoint Segmentation

As successfully demonstrated by the methods based on MSERs [92, 144], individual characters can be segmented from the background using a threshold value unique for each character (in MSERs, the threshold value is found as the center of the region stability interval). In the proposed method, the threshold value is found directly from the FASText keypoint. Given a positive FASText keypoint p and its associated set of darker pixels Pd, the segmentation threshold θp is the intensity value just below the intensity of the darkest pixel in Pd

    θp = min(Ix) − 1,  x ∈ Pd    (5.2)

Similarly, for a negative FASText keypoint, the segmentation threshold θp is the intensity value just above the intensity of the brightest pixel in Pb

    θp = max(Ix) + 1,  x ∈ Pb    (5.3)
The threshold value θp is then effectively exploited by a standard flood-fill algorithm [126] to generate a stroke for each FASText keypoint.
5.3 Segmentation Classification

In order to reduce the still relatively high false detection rate of the FASText detector (the average imprecision is 25 segmentations to one ground truth character) and to make the processing in the subsequent stages faster, an efficient classification stage is inserted to filter the output of the proposed detector.
The classification uses the concept of text fragments, where a text fragment can be a single character, a group of characters, a whole word or a part of a character. This allows the classifier to discriminate between text and clutter, regardless of whether a character is only partially detected or whether a group of characters is detected as one region. As a result, the common assumption of region-based methods (see Section 2.1.2) that one region corresponds to one character is dropped, allowing for a higher recall.
5.3.1 Character Strokes Area

In order to estimate the "strokeness" of a region, we introduce a novel feature based on the observation that one can draw any character by taking a brush with a diameter of the stroke width and drawing through certain points of the character (see Figure 5.7) - we refer to such points as Stroke Support Pixels (SSPs). The SSPs have the property that they lie in the middle of a character stroke (which we refer to as the stroke axis) and their distance to the region boundary is half of the stroke width, but unlike skeletons they do not necessarily form a connected graph. Since the area (i.e. the number of pixels) of an ideal stroke is the product of the stroke width and the length of the stroke, the "strokeness" can be estimated by the Character Strokes Area (CSA) feature ς, which compares the actual area A of a region with the ideal stroke area As calculated from the SSPs

    ς = As / A    (5.4)
The feature estimates the proportion of region pixels which are part of a character stroke and therefore allows text regions (regardless of how many characters they represent - see Figure 5.8) to be efficiently differentiated from background clutter. The feature is efficiently computed from a region distance map, it is invariant to scaling and rotation and it is more robust to noise than methods which aim to estimate a single stroke width value [29], as small pixel changes do not cause disproportionate changes to the estimate.

In order to estimate the character strokes area, a distance transform map is calculated for the region binary mask and only pixels corresponding to local distance maxima are considered (see Figure 5.6). These are the Stroke Support Pixels (SSPs), because they determine the position of a latent character stroke axis. In order to estimate the area of the character strokes Ās, one could simply sum the distances associated with the SSPs

    Ās = 2 Σ_{i∈S} d_i    (5.5)

where S is the set of SSPs and d_i is the distance of the pixel i from the boundary. Such an estimate is correct for a straight stroke of an odd width, however it becomes inaccurate for strokes of an even width (because there are two support pixels per unit of stroke length) or when the support pixels are not connected to each other as a result of stroke curvature, noise at the region boundary or a changing stroke width (see Figure 5.6).
where S are the SSPs and di is the distance of the pixel i from the boundary. Such an estimate is correct for an straight stroke of an odd width, however it becomes inaccurate for strokes of an even width (because there are two support pixels for a unitary stroke length) or when the support pixels are not connected to each other as a result of stroke curvature, noise at the region boundary or changing stroke width (see Figure 5.6). We therefore propose to compensate the estimate by introducing a weight wi for each SSP, which ensures normalization to a unitary stroke length by counting 60
5.3 Segmentation Classification
0.96 1.91 2.87 3.82 2.87 1.91 0.96
0.96 1.91 2.87 3.82 3.82 2.87 1.91 0.96
0.96 1.91 2.87 3.82 2.87 1.91 0.96
0.96 1.91 2.87 3.82 3.82 2.87 1.91 0.96
0.96 1.91 2.87 3.82 2.87 1.91 0.96
0.96 1.91 2.87 3.82 3.82 2.87 1.91 0.96
0.96 1.91 2.87 3.82 2.87 1.91 0.96
0.96 1.91 2.87 3.82 3.82 2.87 1.91 0.96
0.96 1.91 2.87 3.82 2.87 1.91 0.96
0.96 1.91 2.87 3.82 3.82 2.87 1.91 0.96
0.96 1.91 2.87 3.82 2.87 1.91 0.96
0.96 1.91 2.87 3.82 3.82 2.87 1.91 0.96
i
i
Ni
Ni
0.96 1.91 2.87 3.82 2.87 1.91 0.96
0.96 1.91 2.87 3.82 3.82 2.87 1.91 0.96
0.96 1.91 2.87 3.82 2.87 1.91 0.96
0.96 1.91 2.87 3.82 3.82 2.87 1.91 0.96
0.96 1.91 2.87 3.82 2.87 1.91 0.96
0.96 1.91 2.87 3.82 3.82 2.87 1.91 0.96
SW
SW
3 ∗ 3.82 = 68.76 3 As 68.76 ς= = = 0.98 A 70
3 ∗ 3.82 = 68.76 6 As 68.76 ς= = = 0.86 A 80
As = 2 ∗ 9 ∗
As = 2 ∗ 18 ∗
(a)
(b) 0.96 0.96 0.96 0.96 0.96 0.96 1.36 1.91 1.91
0.96 1.36 2.32 2.73 2.87 3.27 3.69 3.82 4.23 3.82 3.82
SW
0.96 0.96 0.96 1.36 1.91 1.91 2.32 2.73 2.87 0.96 1.36 1.91 1.91 2.32 2.73 2.87 3.27 3.69 3.82
3 1
3 4 3 4
3 1
0.96 1.36 2.32 2.73 3.69 3.82 4.23 3.82 3.69 3.27 2.87 2.87
3 2
0.96 1.91 2.73 3.69 4.11 3.69 3.27 2.87 2.73 2.32 1.91 1.91
3 3 3 2
0.96 1.36 2.32 3.28 4.11 3.69 2.73 2.32 1.91 1.91 1.37 0.96 0.96 0.96 1.91 2.73 3.69 4.11 3.27 2.32 1.37 0.96 0.96 0.96 0.96 1.36 2.32 3.27 4.11 3.69 2.73 1.91 0.96
3 1
0.96 1.91 2.73 3.69 4.23 3.27 2.32 1.36 0.96
i
0.96 1.91 2.87 3.82 3.82 2.87 1.91 0.96
Ni
3 2
0.96 1.91 2.87 3.82 3.82 2.87 1.91 0.96
3 2
i Ni
0.96 1.91 2.87 3.82 3.69 2.73 1.91 0.96
3 1
0.96 1.37 2.32 3.28 4.23 3.28 2.32 1.37 0.96 0.96 1.91 2.73 3.69 3.82 2.87 1.91 0.96 0.96 1.91 2.87 3.82 3.82 2.87 1.91 0.96
3 4
3 4
SW
3 3 3 3 ∗ 3.82 + ∗ 3.82 + ∗ 4.23 + · · · + ∗ 3.82) = 180.24 4 4 1 4 As 180.24 ς= = = 0.96 A 187
As = 2 ∗ (
(c)
(d)
Figure 5.6 Character Strokes Area (CSA) ς calulation for a straight stroke of an odd (a) and even (b) width and for a curved stroke - distance map di (c) and Stroke Support Pixel weights wi (d). Stroke Support Pixels (SSPs) denoted red
Figure 5.7 Area A of an ideal stroke is a product of the stroke width sw and the length of the stroke sl . This is approximated by summing double the distances di of Stroke Support Pixels (SSPs) along the stroke axis S
Figure 5.8 Examples of the Character Strokes Area (CSA) ς values for character (top row), multi-character (middle row) and background (bottom row) connected components. Distance map denoted by pixel intensity, Stroke Support Pixels (SSPs) denoted red
We therefore propose to compensate the estimate by introducing a weight wi for each SSP, which ensures normalization to a unitary stroke length by counting the number of Stroke Support Pixels in a 3 × 3 neighborhood

    As = 2 Σ_{i∈S} w_i d_i,    w_i = 3 / |N_i|    (5.6)

where |N_i| denotes the number of SSPs within the 3 × 3 neighborhood of the pixel i (including the pixel i itself). The numerator value is given by the observation that for a straight stroke, there are 3 support pixels in the 3 × 3 neighborhood (see Figure 5.6a).
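A minimal sketch of the full CSA computation, assuming SciPy; the 3×3 maximum-filter test for the local distance maxima and the use of the Euclidean distance transform are assumptions, as the thesis does not prescribe the exact implementation.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt, maximum_filter, convolve

def character_strokes_area(mask):
    """Sketch of the CSA feature (eqs. 5.4-5.6) for a binary region mask."""
    mask = mask.astype(bool)
    dist = distance_transform_edt(mask)                         # distance to the boundary
    ssp = (dist > 0) & (dist == maximum_filter(dist, size=3))   # stroke support pixels
    if not ssp.any():
        return 0.0
    # |N_i|: number of SSPs in the 3x3 neighbourhood of each pixel (itself included).
    n_i = convolve(ssp.astype(np.float32), np.ones((3, 3), np.float32), mode='constant')
    weights = 3.0 / n_i[ssp]                                    # w_i = 3 / |N_i|
    area_s = 2.0 * float(np.sum(weights * dist[ssp]))           # A_s = 2 * sum w_i d_i
    return area_s / float(mask.sum())                           # varsigma = A_s / A
```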
5.3.2 Approximate Character Strokes Area

The main drawback of the CSA feature introduced in Section 5.3.1 is its computational complexity, given by the need to calculate a distance map and to employ an iterative non-maxima suppression over the whole region. We therefore also propose an approximate, but significantly faster, calculation of the Character Strokes Area feature.
Figure 5.9 Character Stroke Area (CSA) approximation by FASText keypoints. Initial Stroke Ending Keypoint (a). First Stroke Straight Keypoint found (b). Next Stroke Straight Keypoint found (c). All Stroke Straight Keypoints found, stroke length illustrated by the red line (d).
                                     t / region [ms]
Full CSA (Section 5.3.1)                 0.895
Approximated CSA (Section 5.3.2)         0.015

Table 5.1 Character Strokes Area calculation time comparison.
Given a segmentation r, a set of Stroke Straight Keypoints (SSK) is found for each Stroke Ending Keypoint (SEK) p which intersects with the segmentation r, using the following iterative algorithm (see Figure 5.9):

1. Take the Stroke Ending Keypoint p as the starting point
2. Move the point p to the darkest (brightest) pixel of the Ps pixels (always in the direction away from the stroke ending)
3. The point p is a Stroke Straight Keypoint (SSK) if there are four contiguous partitionings Ps, Ps', Pd, Pd' (or Ps, Ps', Pb, Pb') such that |Ps|, |Ps'| ∈ {1, 2, 3} and |Pd| > 3, |Pd'| = 12 − |Pd| − |Ps| − |Ps'| (or |Pb| > 3, |Pb'| = 12 − |Pb| − |Ps| − |Ps'|)
4. If the point p is a SSK, repeat from step 2, otherwise terminate

The character strokes area As(r) of the segmentation r is then calculated as

    As(r) = Σ_{p ∈ SSK_r} 3 |Ps|_p + Σ_{p ∈ SBK_r} 3 (|Ps|_p + |Ps'|_p)    (5.7)

where SSK_r and SBK_r are the sets of Stroke Straight, respectively Stroke Bend, Keypoints intersecting with the segmentation r and |Ps|_p (|Ps'|_p) is the size of the partitioning Ps (Ps') associated with the keypoint p.

In the end, four rotation- and scale-invariant features are employed by a Gentle AdaBoost classifier [34] to classify FASText segmentations as either a text fragment (typically a character) or background clutter: compactness, convex hull area ratio, holes area ratio (all calculated as part of the segmentation process) and the Character Strokes Area (CSA). The proposed approximate algorithm is almost 60 times faster (see Table 5.1), yet the impact on the classification accuracy is negligible (see Figure 5.10).
Figure 5.10 The impact of the Character Strokes Area (CSA) feature in the segmentation classification. The full CSA feature (blue), the approximated CSA feature (green), a classifier without the CSA feature (red)
5.4 Text Clustering

In this stage, the unordered set of FASText segmentations classified as text fragments is clustered into ordered sequences, where each cluster (sequence) shares the same text direction in the image. In other words, individual characters (or groups of characters or their parts) are clustered together to form lines of text.

Let us denote the set of text fragment segmentations as R. For each segmentation ri ∈ R, all segmentations rj ∈ R are found such that ri and rj are neighbors. The segmentation ri is a neighbor of rj (denoted N(ri, rj)) if they are sufficiently close to each other (the distance is measured as the distance of their centroids - see Figure 5.11a) and they have a comparable scale

    N(ri, rj) = 1, if ||c(ri) − c(rj)|| < α max(√A(ri), √A(rj)) and max(A(ri)/A(rj), A(rj)/A(ri)) < β
                0, otherwise    (5.8)
where c(r) is the centroid of the segmentation r and A(r) is the convex hull area of the segmentation r. The parameter values α = 4 and β = 10 were chosen experimentally and they provide sufficient tolerance for a vast majority of typographical models. Since the segmentation neighbor search is a simple search in a 2D space of points (centroids), it can be effectively implemented by partitioning the image into smaller cells and always considering segmentations only in the closest cells. Alternatively, one could use a standard (approximate) nearest-neighbor algorithm.

Each pair of neighboring segmentations then casts a vote for their text direction, where the text direction is given by the line which passes through the two centroids of the neighboring pair (see Figure 5.11b). Drawing inspiration from the well-known Hough transform [11], each vote for a text direction is represented in the polar system ρ = x sin(θ) + y cos(θ), so that vertical text lines can also be detected. The two-dimensional parameter space (ρ, θ) is quantized into a fixed-sized matrix, so that small differences in the line parameters are eliminated and the text directions with the highest number of votes (i.e. the directions with the highest number of supporting pairs) can be easily found as local maxima in the matrix (see Figure 5.11c).
Figure 5.11 Segmentations are clustered to form lines of text. A selected segmentation r and the radius of neighbor search (a). All neighboring pairs of the segmentation r and their corresponding text directions (b). Quantized text direction votes in the (ρ, θ) parameter space (c). Final text line clusters (d).
Each local maximum with its parameters (ρ, θ) then unambiguously induces a text cluster, obtained by simply taking all segmentations whose centroid lies on the line (ρ, θ) (or whose distance from it is smaller than the quantization error) and ordering them in the direction of the line. Since one segmentation can lie on multiple lines with different parameters (ρ, θ), the local maxima are processed in decreasing order of their number of votes and each segmentation is allowed to be included in a single text cluster only. This process ensures that longer text lines are preferred over shorter ones and that intra-line text clusters are eliminated (see Figure 5.11d).
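The direction voting can be sketched with a simple quantized accumulator; the bin counts, the maximum expected ρ and the signed-offset form of the line parameter (equivalent, up to sign conventions, to the ρ = x sin θ + y cos θ form above) are illustrative choices rather than values from the thesis.

```python
import numpy as np

def vote_text_directions(centroids, pairs, rho_bins=64, theta_bins=64, rho_max=2000.0):
    """Sketch of the Hough-like voting: every neighbouring pair votes for the line
    through its two centroids in a quantized (rho, theta) accumulator."""
    acc = np.zeros((rho_bins, theta_bins), np.int32)
    supporters = {}                                        # bin -> contributing pairs
    for i, j in pairs:
        (x1, y1), (x2, y2) = centroids[i], centroids[j]
        theta = np.arctan2(y2 - y1, x2 - x1) % np.pi       # text direction of the pair
        rho = y1 * np.cos(theta) - x1 * np.sin(theta)      # signed offset of the line
        r_bin = int(np.clip((rho + rho_max) / (2 * rho_max) * rho_bins, 0, rho_bins - 1))
        t_bin = int(theta / np.pi * theta_bins) % theta_bins
        acc[r_bin, t_bin] += 1
        supporters.setdefault((r_bin, t_bin), []).append((i, j))
    return acc, supporters
```

The clusters are then read off the accumulator as described above: local maxima are visited in decreasing order of votes and each segmentation is assigned to the first (most supported) line on which it appears.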
5.5 Experiments

5.5.1 Character Detection

In the first experiment, the character detection ability of the FASText keypoint detector is compared with existing detectors. The evaluation uses the standard ICDAR 2013 Test dataset (which is commonly used for text localization evaluation - see Section 5.5.2) containing 6393 characters in 255 images annotated at the pixel level, i.e. ground truth character segmentations are provided.

In Table 5.2, the keypoints detected by the FASText and the FAST detector [111] are compared by processing all images in the dataset and calculating keypoint statistics for each detector (see Figure 5.13 for a visual comparison on a sample image). The number of total detections is almost identical for both detectors, but the proposed FASText detector detects 3 times more characters (only 111 characters are missed, against 328 characters missed by the FAST detector).
Figure 5.12 Scene text images with different scripts, fonts and orientations. Source images (top row), detected FASText keypoints (middle row) and resulting text segmentations (bottom row). Best viewed zoomed in color.
Figure 5.13 Keypoints detected by the FAST (left) and by the FASText detector (right). The size of the mark is proportional to the scale where the keypoint was detected.
              |D|       |D|/|GT|   |FN|   t [ms]
FAST [111]    608992    105.1      328    29.62
FASText       574713    99.1       111    24.77

Table 5.2 Keypoints detected on characters in the ICDAR 2013 dataset. The number of detected keypoints |D|, imprecision |D|/|GT|, the number of characters without a keypoint |FN| and the average time per image t.

                             |D|      |D|/|GT|   |FN|    t [ms]
MSER [80]                    401972   69.4       1128    328.09
ER detector (Section 4.1)    636729   109.9      657     1068.41
FASText                      215325   37.2       857     82.01
FASText+AdaBoost             62394    10.8       1240    122.10

Table 5.3 Detected segmentations in the ICDAR 2013 dataset. The number of segmentations |D|, imprecision |D|/|GT|, the number of characters without a valid segmentation |FN| and the average time per image t.
A character is considered as missed by a detector if there is no keypoint whose position coincides with any pixel of the character's ground truth segmentation. The FASText detector is also 20% faster than the FAST detector.
In Table 5.3, segmentations produced by the detectors commonly used in scene text localization are compared with the segmentations produced by the proposed detector on all images in the ICDAR dataset. Both the standalone FASText detector and the FASText detector with the subsequent classification stage (see Section 5.3) are included, whilst the FAST detector is not included as it does not provide segmentations. A character is considered as detected by a detector if the detector produces a segmentation whose bounding-box overlap with the character ground truth bounding-box is above 60%. If no such segmentation exists, a character is considered as missed.
The FASText detector produces 2 times fewer segmentations and misses about 25% fewer characters than the commonly exploited MSER [80] detector, and at the same time it is 4 times faster. The FASText detector with the proposed subsequent classification phase produces 7 times fewer segmentations and is almost 3 times faster than the MSER detector. Its number of missed characters is 10% higher than that of the MSER detector, where the characters incorrectly rejected by the classifier are typically thin letters such as "i" or "l".
5.5.2 Text Localization and Recognition

In order to evaluate the scene text localization ability of the proposed detector, we have adapted the ER-based pipeline from Chapter 4 and replaced the initial stages (ER detection, character classification and the text line formation) with the proposed detector. The resulting experimental pipeline therefore consists of the FASText detector with the proposed segmentation classification and the text clustering, followed by the local iterative segmentation refinement and character recognition adapted from the original pipeline.
In the first experiment, the pipeline was evaluated on the most cited ICDAR 2013 Robust Reading Dataset [53], which consists of 1189 words and 6393 letters in 255 images.
Figure 5.14 Text localization and recognition examples from the ICDAR 2013 dataset.
method                      R      P      F      tl     tr     t
FASText pipeline            69.3   84.0   76.8   0.15   0.4    0.55
ER pipeline (Chapter 4)     72.4   81.8   77.1   ?      ?      0.8
Zamberletti et al. [147]    70.0   85.6   77.0   0.75   N/A    N/A
Yin et al. [144]            68.3   86.3   76.2   0.43   N/A    N/A
ICDAR 2013 winner [53]      66.4   88.5   75.9   ?      ?      ?

Table 5.4 Text localization results and average processing times on the ICDAR 2013 dataset. Recall R, precision P, f-measure F, localization time tl, recognition time tr and the total processing time t (in seconds).
There are many challenging text instances in the dataset (reflections, text written on complicated backgrounds, textures which resemble characters), but on the other hand the text is English only, it is mostly horizontal and the camera is typically focused on the text area (see output examples in Figure 5.14). Using the same evaluation protocol as the latest ICDAR 2013 Robust Reading competition [53], the text localization accuracy compares favorably to the state-of-the-art methods (see Table 5.4), whilst the proposed method is significantly faster - in text localization (tl), the proposed pipeline is 3 times faster than the best competing method.
In the second experiment, the pipeline (without the text recognition stage) was qualitatively evaluated on a dataset with a wide variety of scripts, fonts and text orientations. As demonstrated in Figure 5.12, the FASText keypoint detector is able to detect many different scripts, fonts and orientations and, together with the subsequent steps, can be easily exploited to produce text line segmentations.
6 Single Shot Text Detection and Recognition

In this chapter, we propose a novel end-to-end framework which simultaneously detects and recognizes text in scene images. As the first contribution, we present a model which is trained for both text detection and recognition in a single learning framework, and we show that such a joint model outperforms the combination of a state-of-the-art localization and a state-of-the-art recognition method [41, 38]. As the second contribution, we show how the state-of-the-art object detection methods [109, 110] can be extended for text detection and recognition, taking into account specifics of text such as the exponential number of classes (given an alphabet A, there are up to |A|^L possible classes, where L denotes the maximum text length) and the sensitivity to hidden parameters such as text aspect ratio and rotation. The method achieves state-of-the-art results on the standard ICDAR 2013 [53] and ICDAR 2015 [52] datasets and the pipeline runs end-to-end at 10 frames per second on an NVidia K80 GPU, which is more than 10 times faster than the fastest previous methods.
Figure 6.1 The proposed method detects and recognizes text in scene images at 10 fps on an NVidia K80 GPU. Ground truth in green, model output in red. The image is taken from the ICDAR 2013 dataset [53].
Figure 6.2 Method overview. Text region proposals are generated by a Region Proposal Network [109]. Each region with a sufficient text confidence is then normalized to a variable-width feature tensor by bilinear sampling. Finally, each region is associated with a sequence of characters or rejected as not text.
6.1 Fully Convolutional Network

The proposed model localizes text regions in a given scene image and provides a text transcription as a sequence of characters for all regions with text (see Figure 6.2). The model is jointly optimized for both text localization and recognition in an end-to-end training framework.
We adapt the YOLOv2 architecture [109] for its accuracy and significantly lower complexity than the standard VGG-16 architecture [120, 50], as the full VGG-16 architecture requires 30 billion operations just to process a 224×224 (0.05 Mpx) image [109]. Using the YOLOv2 architecture allows us to process images with higher resolution, which is a crucial ability for text recognition - processing at higher resolution is required because a 1 Mpx scene image may contain text which is only 10 pixels high [52], so scaling down the source image would make the text unreadable.
The proposed method uses the first 18 convolutional and 5 max pooling layers from the YOLOv2 architecture, which is based on 3 × 3 convolutional filters, doubling the number of channels after every pooling step and adding 1 × 1 filters to compress the representations between the 3 × 3 filters [109]. We remove the fully-connected layers to make the network fully convolutional, so the final layer of our model has the dimension W/32 × H/32 × 1024, where W and H denote the source image width and height [109].
6.2 Region Proposals

Similarly to Faster R-CNN [110] and YOLOv2 [109], we use a Region Proposal Network (RPN) to generate region proposals, but we add a rotation rθ, which is crucial for successful text recognition. At each position of the last convolutional layer, the model predicts k rotated bounding boxes, where for each bounding box r we predict 6 features - its position rx, ry, its dimensions rw, rh, its rotation rθ and its score rp, which captures the probability that the region contains text. The bounding box position and dimensions are encoded with respect to predefined anchor boxes using the logistic activation function, so the actual bounding box position (x, y) and dimensions (w, h) in the source image are given as

    x = σ(rx) + cx        (6.1)
    y = σ(ry) + cy        (6.2)
    w = aw exp(rw)        (6.3)
    h = ah exp(rh)        (6.4)
    θ = rθ                (6.5)
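The decoding of a single RPN prediction can be sketched as follows (an illustrative Python example, not the thesis code). Here (cx, cy) is the cell offset and (aw, ah) the anchor dimensions, as explained below; the stride of 32 used to map grid cells back to image pixels and the logistic activation applied to the score rp are assumptions of the sketch.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def decode_region(r, cell_xy, anchor_wh, stride=32):
        """Decode one raw RPN prediction r = (rx, ry, rw, rh, rtheta, rp) into an
        image-space rotated box following equations (6.1)-(6.5)."""
        rx, ry, rw, rh, rtheta, rp = r
        cx, cy = cell_xy            # offset of the cell in the last convolutional layer
        aw, ah = anchor_wh          # predefined anchor box width and height
        x = (sigmoid(rx) + cx) * stride   # eq. (6.1), scaled from grid cells to pixels
        y = (sigmoid(ry) + cy) * stride   # eq. (6.2)
        w = aw * np.exp(rw)               # eq. (6.3)
        h = ah * np.exp(rh)               # eq. (6.4)
        theta = rtheta                    # eq. (6.5), rotation is predicted directly
        score = sigmoid(rp)               # text confidence (logistic activation assumed)
        return x, y, w, h, theta, score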
Figure 6.3 Anchor box widths and heights, or equivalently scales and aspects, were obtained by k-means clustering on the training set. Requiring that each ground truth box had an intersection-over-union of at least 60% with one anchor box led to k = 14 boxes.
where cx and cy denote the offset of the cell in the last convolutional layer and aw and ah denote the predefined width and height of the anchor box a. The rotation θ of the bounding box is predicted directly by rθ.
We followed the approach of Redmon et al. [109] and found suitable anchor box scales and aspects by k-means clustering on the aggregated training set (see Section 6.5). Requiring the anchor boxes to have at least 60% intersection-over-union with the ground truth led to k = 14 different anchor box dimensions (see Figure 6.3).
For every image, the RPN produces W/32 × H/32 × 6k box parameters (there are 6 estimated parameters for every anchor - x, y, w, h, θ and the text score rp), which would make the subsequent computation too complex and it is therefore necessary to select only a subset. In the training stage, we use the YOLOv2 approach [109] of taking all positive and negative samples in the source image, where every 20 batches we randomly change the input dimension to one of {352, 416, 480, 544, 608}. A positive sample is the region with the highest intersection-over-union with the ground truth; the other intersecting regions are negatives. At runtime, we found the best approach is to take all regions with the score rp above a certain threshold pmin and to postpone the non-maxima suppression until after the recognition stage, because regions with very similar rp scores can produce very different transcriptions, and therefore selecting the region with the highest rp at this stage would not always correspond to the correct transcription (for example, in some cases a region containing the letters "TALY" may have a slightly higher score rp than a region containing the full word "ITALY"). We found the value pmin = 0.1 to be a reasonable trade-off between accuracy and speed.
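The anchor shapes can be obtained, for instance, with a k-means procedure like the one sketched below in Python/NumPy. The thesis only states that k-means clustering on the ground-truth box dimensions was used; the 1 − IoU distance is the YOLOv2-style choice and is an assumption here, as are the function names iou_wh and kmeans_anchors.

    import numpy as np

    def iou_wh(wh, anchors):
        """IoU between a box of size wh and each anchor, both centered at the origin."""
        inter = np.minimum(wh[0], anchors[:, 0]) * np.minimum(wh[1], anchors[:, 1])
        union = wh[0] * wh[1] + anchors[:, 0] * anchors[:, 1] - inter
        return inter / union

    def kmeans_anchors(boxes_wh, k=14, iters=100, seed=0):
        """Cluster ground-truth box dimensions (an (N, 2) array of widths and
        heights) into k anchor shapes using k-means with a 1 - IoU distance."""
        boxes_wh = np.asarray(boxes_wh, dtype=float)
        rng = np.random.default_rng(seed)
        anchors = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)]
        for _ in range(iters):
            # assign every box to the anchor with the highest IoU (lowest 1 - IoU)
            assign = np.array([np.argmax(iou_wh(wh, anchors)) for wh in boxes_wh])
            new = np.array([boxes_wh[assign == c].mean(axis=0) if np.any(assign == c)
                            else anchors[c] for c in range(k)])
            if np.allclose(new, anchors):
                break
            anchors = new
        return anchors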
Type             Channels   Size/Stride   Dim/Act
input            C                        W × 32
conv             32         3×3           leaky ReLU
conv             32         3×3           leaky ReLU
maxpool                     2×2/2         W/2 × 16
conv             64         3×3           leaky ReLU
BatchNorm
recurrent conv   64         3×3           leaky ReLU
maxpool                     2×2/2         W/4 × 8
conv             128        3×3           leaky ReLU
BatchNorm
recurrent conv   128        3×3           leaky ReLU
maxpool                     2×2/2×1       W/4 × 4
conv             256        3×3           leaky ReLU
BatchNorm
recurrent conv   256        3×3           leaky ReLU
maxpool                     2×2/2×1       W/4 × 2
conv             512        3×2           leaky ReLU
conv             512        5×1           leaky ReLU
conv             |Â|        7×1
log softmax                               W/4 × 1

Table 6.1 Fully-Convolutional Network for Text Recognition
6.3 Bilinear Sampling

Each region detected in the previous stage has a different size and rotation and it is therefore necessary to map its features into a tensor of canonical dimensions, which can be used in recognition.
Faster R-CNN [110] uses the RoI pooling approach of Girshick [35], where a w × h × C region is mapped onto a fixed-sized W′ × H′ × C grid (7 × 7 × 1024 in their implementation), and each cell takes the maximum activation of the w/W′ × h/H′ cells in the underlying feature layer.
In our model, we instead use bilinear sampling [48, 50] to map a w × h × C region from the source image into a fixed-height (wH′/h) × H′ × C tensor (H′ = 32). This feature representation has a key advantage over the standard RoI approach, as it allows the network to normalize rotation and scale but at the same time to preserve the aspect ratio and positioning of individual characters, which is crucial for text recognition accuracy (see Section 6.4).
Given the detected region features U ∈ R^(w×h×C), they are mapped into a fixed-height tensor V ∈ R^((wH′/h)×H′×C) as

    V^c_{x′,y′} = Σ_{x=1..w} Σ_{y=1..h} U^c_{x,y} κ(x − Tx(x′)) κ(y − Ty(y′))        (6.6)

where κ is the bilinear sampling kernel κ(v) = max(0, 1 − |v|) and T is a point-wise coordinate transformation which projects the coordinates x′ and y′ of the fixed-sized tensor V to the coordinates x and y in the detected region feature tensor U. The transformation allows for shift and scaling in the x- and y-axes and for rotation, and its parameters are taken directly from the region parameters (see Section 6.2).
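A direct, illustrative implementation of equation (6.6) is sketched below (Python/NumPy, 0-based indexing, with the transformation T passed in as a function). It is written with explicit loops for clarity only; in practice the sampling runs on the GPU as part of the network, e.g. as a spatial-transformer-style layer.

    import numpy as np

    def bilinear_sample(U, out_w, out_h, T):
        """Map region features U of shape (w, h, C) to a fixed-size tensor V of
        shape (out_w, out_h, C) using the bilinear kernel of equation (6.6).
        T(xp, yp) -> (x, y) is the point-wise coordinate transformation (shift,
        scale and rotation taken from the region parameters)."""
        w, h, C = U.shape
        V = np.zeros((out_w, out_h, C))
        kappa = lambda v: max(0.0, 1.0 - abs(v))
        for xp in range(out_w):
            for yp in range(out_h):
                tx, ty = T(xp, yp)
                # only the (up to) four cells around (tx, ty) have a non-zero kernel weight
                for x in (int(np.floor(tx)), int(np.floor(tx)) + 1):
                    for y in (int(np.floor(ty)), int(np.floor(ty)) + 1):
                        if 0 <= x < w and 0 <= y < h:
                            V[xp, yp] += U[x, y] * kappa(x - tx) * kappa(y - ty)
        return V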
Figure 6.4 Text recognition using Connectionist Temporal Classification. Input W × 32 region (top), CTC output W/4 × |Â| with the most probable class at a given column (middle) and the resulting sequence (bottom).
6.4 Text Recognition

Given the normalized region from the source image, each region is associated with a sequence of characters or rejected as not text in the following process. The main problem one has to address in this step is the fact that text regions of different sizes have to be mapped to character sequences of different lengths. Traditionally, the issue is solved by resizing the input to a fixed-sized matrix (typically 100 × 32 [47, 119]) and the input is then classified either by making every possible character sequence (i.e. every word) a separate class of its own [47, 41], thus requiring a list of all possible outputs in the training stage, or by having multiple independent classifiers, where each classifier predicts the character at a predefined position [45].
Our model exploits a novel fully-convolutional network (see Table 6.1), which takes a variable-width feature tensor W × H′ × C as an input (W = wH′/h) and outputs a matrix of size W/4 × |Â|, where A is the alphabet (e.g. all English characters). The matrix height is fixed (it is the number of character classes), but its width grows with the width of the source region and therefore with the length of the expected character sequence. As a result, a single classifier is used regardless of the position of the character in the word (in contrast to Jaderberg et al. [45], where there is an independent classifier for the character "A" as the first character in the word, an independent classifier for the character "A" as the second character in the word, etc.). The model also does not require prior knowledge of all words to be detected in the training stage, in contrast to the separate class per character sequence formulation [47].
The model uses Connectionist Temporal Classification (CTC) [40, 119] to transform the variable-width feature tensor into a conditional probability distribution over label sequences. The distribution is then used to select the most probable labelling sequence for the text region (see Figure 6.4).
Let y = y1, y2, ..., yn denote the vector of network outputs of length n over an alphabet A extended with a blank symbol "−".
The probability of a path π is then given as

    p(π|y) = ∏_{i=1..n} y^i_{π_i},    π ∈ Â^n,    Â = A ∪ {−}        (6.7)

where y^i_{π_i} denotes the output probability of the network predicting the label π_i at the position i (i.e. the output of the final softmax layer in Table 6.1).
Let us further define a many-to-one mapping B : Â^n ↦ A^{≤n}, where A^{≤n} is the set of all sequences over A of length at most n. The mapping B removes all blanks and repeated labels, which corresponds to outputting a new label every time the label prediction changes. For example,

    B(−ww−al−k) = B(wwaaa−l−k−) = walk
    B(−f−oo−o−−d) = B(ffoo−ooo−d) = food

The conditional probability of observing the output sequence w is then given as

    p(w|y) = Σ_{π: B(π)=w} p(π|y),    w ∈ A^{≤n}        (6.8)
In training, an objective function that maximizes the log likelihood of the target labellings p(w|y) is used [40]. In every training step, the probability p(w_gt|y) of every text region in the mini-batch is efficiently calculated using a forward-backward algorithm similar to HMM training [107] and the derivatives of the objective function are used to update the network weights using the standard back-propagation algorithm (w_gt denotes the ground truth transcription of the text region).
At test time, the classification output w∗ should be given by the most probable labelling under p(w|y), which unfortunately is not tractable, and we therefore adopt the approximate approach [40] of taking

    w∗ ≈ B(argmax_π p(π|y))        (6.9)
At the end of this process, each text region in the image has an associated content in the form of a character sequence, or it is rejected as not text when all the labels are blank. The model typically produces many different boxes for a single text area in the image; we therefore suppress overlapping boxes by a standard non-maxima suppression algorithm based on the text recognition confidence, which is p(w∗|y) normalized by the text length.
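The greedy decoding of equation (6.9) and a length-normalized confidence can be sketched as follows (illustrative Python, not the thesis code). The sketch assumes the blank symbol is the class with index 0, approximates p(w∗|y) by the probability of the best path, and uses a geometric mean over the sequence length as the normalization; the thesis does not specify these details.

    import numpy as np

    def ctc_greedy_decode(probs, alphabet, blank=0):
        """Approximate CTC decoding: take the most probable label at each output
        column, then apply the mapping B (collapse repeated labels, drop blanks).
        `probs` is a (T, |A|+1) matrix of per-column label probabilities."""
        best = probs.argmax(axis=1)                      # most probable path pi
        path_prob = probs[np.arange(len(best)), best].prod()
        labels, prev = [], blank
        for c in best:
            if c != blank and c != prev:                 # new label whenever the prediction changes
                labels.append(alphabet[c - 1])           # non-blank classes 1..|A| map to the alphabet
            prev = c
        text = "".join(labels)
        # length-normalized recognition confidence used for non-maxima suppression
        confidence = path_prob ** (1.0 / max(len(text), 1))
        return text, confidence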
6.5 Training

We pre-train the detection CNN using the SynthText dataset [41] (800,000 synthetic scene images with multiple words per image) for 3 epochs, with weights initialized from ImageNet [109]. The recognition CNN is pre-trained on the Synthetic Word dataset [45] (9 million synthetic cropped word images) for 3 epochs, with weights randomly initialized from the N(0, 1) distribution.
As the final step, we train both networks simultaneously for 3 epochs on a combined dataset consisting of the SynthText dataset, the Synthetic Word dataset,
Figure 6.5 End-to-end scene text recognition samples from the ICDAR 2013 dataset. Model output in red, ground truth in green. Note that in some cases (e.g. top-right) text is correctly recognized even though the bounding box IoU with the ground truth is less than 80%, which would be required by the text localization protocol [53]. Best viewed zoomed in color.
                          end-to-end                  word spotting               speed
                          strong   weak   generic    strong   weak   generic     fps
Deep2Text [145]           0.81     0.79   0.77       0.85     0.83   0.79        1.0
FASText (Chapter 5)       0.77     0.63   0.54       0.85     0.66   0.57        1.0
StradVision [52]          0.81     0.79   0.67       0.84     0.83   0.70        ?
Jaderberg et al. [47]     0.86     -      -          0.90     -      0.76        *0.3
Gupta et al. [41]         -        -      -          0.85     -      -           *0.4
proposed method           0.89     0.86   0.77       0.92     0.89   0.81        *10.0

Table 6.2 ICDAR 2013 dataset - end-to-end scene text recognition accuracy (f-measure), depending on the lexicon size and on whether digits are excluded from the evaluation (denoted as word spotting). Speeds of methods using a GPU are marked with an asterisk.
                          end-to-end                  word spotting               speed
                          strong   weak   generic    strong   weak   generic     fps
FASText (Chapter 5)       0.35     0.20   0.16       0.37     0.21   0.16        1.0
StradVision [52]          0.44     -      -          0.46     -      -           ?
TextProposals [38, 47]    0.53     0.50   0.47       0.56     0.52   0.50        0.2
proposed method           0.54     0.51   0.47       0.58     0.53   0.51        *9.0

Table 6.3 ICDAR 2015 dataset - end-to-end scene text recognition accuracy (f-measure). Speeds of methods using a GPU are marked with an asterisk.
the ICDAR 2013 Training dataset [53] (229 scene images captured by a professional camera) and the ICDAR 2015 Training dataset [52] (1000 scene images captured by Google Glass). For every image, we randomly crop up to 30% of its width and height.
We use standard Stochastic Gradient Descent with momentum 0.9 and learning rate 10^-3, divided by 10 after each epoch. One mini-batch takes about 500 ms on an NVidia K80 GPU. Note that in the preparation of this work, we unfortunately did not have enough time and available computing resources to let the training process run for more epochs and to experiment with the traditional deep learning tricks; despite this, the model still performs well and compares favorably to the state of the art.
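The optimization schedule described above can be summarized by the following sketch (illustrative Python; the number of batches per epoch, the generator interface and the function name training_schedule are assumptions, and the actual forward/backward pass is only indicated by a comment).

    import random

    def training_schedule(num_epochs=3, batches_per_epoch=1000, base_lr=1e-3):
        """Yield (epoch, batch, learning rate, input size) following the schedule
        above: the learning rate is divided by 10 after each epoch and the input
        resolution is re-sampled from a fixed set every 20 batches."""
        sizes = [352, 416, 480, 544, 608]
        input_size = random.choice(sizes)
        for epoch in range(num_epochs):
            lr = base_lr / (10 ** epoch)               # 1e-3, 1e-4, 1e-5
            for batch in range(batches_per_epoch):
                if batch % 20 == 0:
                    input_size = random.choice(sizes)  # multi-scale training
                # ... build the mini-batch at `input_size`, randomly crop up to 30% of
                # width and height, and take an SGD step with momentum 0.9 at rate `lr` ...
                yield epoch, batch, lr, input_size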
6.6 Experiments

We trained our model once and then evaluated its accuracy on two standard datasets. We follow the standard ICDAR Robust Reading Competition protocol [53, 52] for the End-to-End task, where the objective is to localize and recognize all words in the image in a single step.
In the ICDAR evaluation schema, each image in the test set is associated with a list of words (lexicon), which contains the words that the method should localize and recognize, as well as an increasing number of random "distractor" words. There are three sizes of lists provided with each image, depending on how heavily contextualized their content is to the specific image:
• strongly contextualized - 100 words specific to each image, containing all words in the image, with the remaining words being "distractors"
• weakly contextualized - all words in the testing set, the same list for every image
• generic - all words in the testing set plus 90k English words
Figure 6.6 End-to-end scene text recognition samples from the ICDAR 2015 dataset. Model output in red, ground truth in green. Best viewed zoomed in color
Figure 6.7 All the images of the ICDAR 2013 Testing set where the proposed method fails to correctly recognize any text (i.e. images with 0% recall)
A word is considered correctly recognized when its Intersection-over-Union (IoU) with the ground truth is above 0.5 and the transcription is identical, using case-insensitive comparison [52].
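For reference, the criterion can be expressed directly in code (an illustrative Python sketch with axis-aligned boxes and hypothetical helper names; the official evaluation scripts handle further details such as matching multiple detections).

    def iou(box_a, box_b):
        """Axis-aligned intersection-over-union; boxes are (x1, y1, x2, y2)."""
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / float(area_a + area_b - inter)

    def word_correct(pred_box, pred_text, gt_box, gt_text):
        """A predicted word is correct when its IoU with the ground truth exceeds
        0.5 and the transcription matches case-insensitively."""
        return iou(pred_box, gt_box) > 0.5 and pred_text.lower() == gt_text.lower()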
6.6.1 ICDAR 2013 dataset

The ICDAR 2013 Dataset [53] is the most frequently cited dataset for scene text evaluation. It consists of 255 testing images with 716 annotated words; the images were taken by a professional camera, so the text is typically horizontal and the camera is almost always aimed at it. The dataset is sometimes referred to as the Focused Scene Text dataset.
The proposed model achieves state-of-the-art text recognition accuracy (see Table 6.2) for all 3 lexicon sizes. In the end-to-end setup, where all lexicon words plus all digits in an image should be recognized, the maximal f-measure it achieves is 0.89/0.86/0.77 for strongly, weakly and generically contextualized lexicons respectively. Each image is first resized to 544 × 544 pixels; the average processing time is 100 ms per image on an NVidia K80 GPU for the whole pipeline.
While training on the same training data, our model outperforms the combination of the state-of-the-art localization method of Gupta et al. [41] with the state-of-the-art recognition method of Jaderberg et al. [47] by at least 3 percentage points on every measure, thus demonstrating the advantage of our model's joint training for the end-to-end task. It is also more than 20 times faster than the method of Gupta et al. [41].
Let us further note that our model would not be considered a state-of-the-art text localization method according to the text localization evaluation protocol, because at least an 80% intersection-over-union with bounding boxes created by human annotators is required. Our method, in contrast, does not always achieve the required 80% overlap, but it is still mostly able to recognize the text correctly even when the overlap is lower (see Figure 6.5).
Figure 6.8 Main failure modes on the ICDAR 2015 dataset. Blurred and noisy text (top), vertical text (top) and small text (bottom). Best viewed zoomed in color
We argue that evaluating methods purely on text localization accuracy without subsequent recognition is not very informative, because the text localization "accuracy" only aims to fit the way human annotators create bounding boxes around text, but it does not give any estimate of how well a text recognition phase would read the text after a successful localization, which should be the prime objective of text localization metrics.
The main failure cases of the proposed model are single characters or short snippets of digits and characters (see Figure 6.7), which may be partially caused by the fact that such examples are not very frequent in the training set.
6.6.2 ICDAR 2015 dataset

The ICDAR 2015 dataset was introduced in the ICDAR 2015 Robust Reading Competition [52]. The images were collected by people who were wearing Google Glass devices and walking in Singapore, and all images with text were subsequently selected and annotated. The images in the dataset were taken "not having text in mind", therefore the text is much smaller and the images contain a high variability of text fonts and sizes. They also include many realistic effects, e.g. occlusion, perspective distortion, blur or noise, so as a result the dataset is significantly more challenging than the ICDAR 2013 dataset (Section 6.6.1), which contains typically large horizontal text.
The proposed model achieves state-of-the-art end-to-end text recognition accuracy (see Table 6.3 and Figure 6.6) for all 3 lexicon sizes. In our experiments, the average processing time was 110 ms per image on an NVidia K80 GPU (the image is first resized to 608 × 608 pixels), which makes the proposed model 45 times faster than the currently best published method of Gomez et al. [38].
The main failure mode of the proposed method is blurry or noisy text (see Figure 6.8), which are effects not present in the training set (Section 6.5). The method also often fails to detect small text (less than 15 pixels high), which again is due to the lack of such samples in the training stage.
7 Results

7.1 Applications

7.1.1 TextSpotter

The methods presented in Chapter 4 and Chapter 5 were implemented in a software package called TextSpotter. The package is implemented in C++ using the STL libraries, which makes the code easily portable to Windows, Linux and mobile devices (see Section 7.1.2). The package uses the OpenCV library [16] for low-level image processing and classification tasks.
For demonstration purposes, an online demonstration website was set up at http://www.textspotter.org/, so that anyone can test the methods' capabilities using data of their own choice (see Figure 7.1).
Figure 7.1 Text Spotter on-line demo
On the website, the pipeline based on the ER detection (Chapter 4) is denoted as TextSpotter 2013, and the FASText pipeline (Chapter 5) is denoted as TextSpotter 2015.
7.1.2 Mobile Application for Translation

Text Lens is a mobile application for Android devices which uses the mobile phone camera to capture images and then detects and recognizes text using the scene text recognition method described in Chapter 4. Text Lens runs fully on the mobile device without any need for an Internet connection (the only time an Internet connection is required is when the offline dictionary is not sufficient and the user requests a complex translation by an online service).
Figure 7.2 The mobile application for automated text translation. Initial screen (a), text detection and recognition using our method (b) and its translation from English to Spanish (c)
When launched, the user is presented with a video stream from the mobile phone camera (in the same way a standard photo-taking application works). When the user touches the screen, a photo is taken and it is sent to the scene text recognition algorithm. The algorithm detects all text areas present in the image, highlights them with a green rectangle and outputs their textual content, which the user can then choose to translate (see Figure 7.2). This is significantly faster than typing in the text and it is especially useful in situations when a user is not familiar with the alphabet of the text he or she is trying to translate (for example a European tourist in China trying to translate Chinese text).
7.2 Available Code

7.2.1 FASText Detector

We have released the source code of the keypoint detector and the text clustering algorithm presented in Chapter 5. The code is publicly available on GitHub at https://github.com/MichalBusta/FASText, including the trained classification model described in Section 5.3.
7.2.2 ER Detector

The Extremal Region (ER) detection and classification using incrementally computed descriptors presented in Section 4.1.2 was re-implemented by Lluis Gomez into the OpenCV 3.0 library [9, 16, 36], the most popular computer vision library, solely based
on the method's description [91]. Although the code is completely independent from the one used in this thesis, the reported results [37] are comparable to the results presented in this thesis. Note that the reported results are not completely identical, because different training data and a different classifier were used.
8 Conclusion

In this thesis, the problem of Scene Text Localization and Recognition was studied. Three different methods were proposed in the course of the research, each one advancing the state of the art and improving the accuracy.
The first method (see Chapter 4) detects individual characters as Extremal Regions (ER), where the probability of each ER being a character is estimated using novel features with O(1) complexity and only ERs with locally maximal probability are selected across several image projections (channels) for the second stage, where the classification is improved using more computationally expensive features. Each character is recognized individually by an OCR classifier trained on synthetic fonts and the most probable character sequence is selected by dynamic programming in the very last stage of the processing, when the context of each character in a text line is known. To our knowledge, the method was the first published method [88] to address the complete problem of scene text localization and recognition as a whole - all previous work in the literature focused solely on different subproblems, such as only detecting positions of text or only recognizing text previously found by a human annotator. The method's low complexity also allowed for real-time image processing, as the first method in the literature [91]. The efficiency and the ability to automatically process an image or a video in real time and to output text in a standard digital format allowed the creation of many practical applications, such as a mobile translation tool (see Chapter 7). The ER detector itself became the de-facto standard component of scene text localization methods - in the ICDAR 2015 Robust Reading Competition, 20 out of 22 participating methods including the winner used the ER detector [52].
Observing that characters in virtually any script consist of strokes, a novel FASText detector was proposed (see Chapter 5). The FASText detector finds text fragments (characters, parts of characters or character groups) irrespective of text orientation, scale or script and it is significantly faster and produces significantly fewer false detections than the commonly used ER detector. Additionally, an efficient text clustering algorithm based on text direction voting is proposed, which, like the previous stages, is scale- and rotation-invariant and supports a wide variety of scripts and fonts.
The third method exploits a deep-learning model (Chapter 6), which is trained for both text detection and recognition in a single trainable pipeline. The method localizes and recognizes text in an image in a single feed-forward pass, it is trained purely on synthetic data so it does not require obtaining expensive human annotations for training, and it achieves state-of-the-art accuracy in end-to-end text recognition on two standard datasets, whilst being an order of magnitude faster than the previous methods - the whole pipeline runs at 10 frames per second on an NVidia K80 GPU. We also show the advantage of the joint training for the end-to-end task by outperforming the ad-hoc combination of the state-of-the-art localization and state-of-the-art recognition methods, while exploiting the same training data.
The research has been supported by a prestigious Google PhD Fellowship, and the conference papers describing the methods have been awarded the ICDAR 2013 Best Student Paper Award [93] and the ICDAR 2015 Best Paper Award [94].
The presented work currently has over 1200 citations in Google Scholar and 500 citations in the Web of Science.
Bibliography [1] http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/. 24 [2] http://rrc.cvc.uab.es/. 25, 26 [3] http://vision.ucsd.edu/~kai/svt/. 27 [4] https://vision.cornell.edu/se3/coco-text/. 28 [5] http://www.robots.ox.ac.uk/~vgg/data/text/. 28 [6] http://hunspell.github.io//. 28 [7] http://www.robots.ox.ac.uk/~vgg/data/scenetext/. 29 [8] http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/. 30 [9] http://docs.opencv.org/3.0-beta/modules/text/doc/erfilter.html. 82 [10] J. Almaz´ an, A. Gordo, A. Forn´es, and E. Valveny. Word spotting and recognition with embedded attributes. IEEE transactions on pattern analysis and machine intelligence, 36(12):2552–2566, 2014. 19, 20 [11] D. H. Ballard. Generalizing the hough transform to detect arbitrary shapes. Pattern recognition, 13(2):111–122, 1981. 64 [12] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE transactions on pattern analysis and machine intelligence, 24(4):509–522, 2002. 18 [13] A. C. Berg, T. L. Berg, and J. Malik. Shape matching and object recognition using low distortion correspondences. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 26–33. IEEE, 2005. 18 [14] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven. PhotoOCR: reading text in uncontrolled conditions. ICCV, 2013. 19 [15] A. Bosch, A. Zisserman, and X. Muoz. Image classification using random forests and ferns. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1 –8, oct. 2007. 19, 21, 22 [16] G. Bradski and A. Kaehler. Learning OpenCV: Computer vision with the OpenCV library. ” O’Reilly Media, Inc.”, 2008. 81, 82 [17] M. Buˇsta, L. Neumann, and J. Matas. Fastext: Efficient unconstrained scene text detector. In 2015 IEEE International Conference on Computer Vision (ICCV 2015), pages 1206–1214, California, US, December 2015. IEEE. 6 [18] M. Busta, L. Neumann, and J. Matas. Deep TextSpotter: An end-to-end trainable scene text localization and recognition framework. in review. 6 [19] J. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8:679–698, 1986. 12 [20] X. Chen and A. L. Yuille. Detecting and reading text in natural scenes. CVPR, 2:366–373, 2004. 10, 11 [21] H. Cheng, X. Jiang, Y. Sun, and J. Wang. Color image segmentation: advances and prospects. Pattern Recognition, 34(12):2259 – 2281, 2001. 12, 31 85
Bibliography [22] A. Clavelli, D. Karatzas, and J. Llad´os. A framework for the assessment of text extraction algorithms on complex colour images. In Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, pages 19–26. ACM, 2010. 14, 15, 16 [23] A. Coates, B. Carpenter, C. Case, S. Satheesh, B. Suresh, T. Wang, D. Wu, and A. Ng. Text detection and character recognition in scene images with unsupervised feature learning. In Document Analysis and Recognition (ICDAR), 2011 International Conference on, pages 440 –445, sept. 2011. 11 [24] N. Cristianini and J. Shawe-Taylor. An introduction to Support Vector Machines. Cambridge University Press, March 2000. 12, 21, 35 [25] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 886–893. IEEE, 2005. 18, 19 [26] T. E. de Campos, B. R. Babu, and M. Varma. Character recognition in natural images. VISAPP, 05-08 February 2009, 2009. 18, 24 [27] P. Doll´ ar, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8):1532–1545, 2014. 22 [28] M. Donoser, H. Bischof, and S. Wagner. Using web search engines to improve text recognition. In Pattern Recognition, 2008. ICPR 2008. 19th International Conference on, pages 1 –4, dec. 2008. 13 [29] B. Epshtein, E. Ofek, and Y. Wexler. Detecting text in natural scenes with stroke width transform. In CVPR 2010, pages 2963 –2970, 6 2010. 12, 13, 60 [30] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015. 21 [31] A. Evgeniou and M. Pontil. Multi-task feature learning. Advances in neural information processing systems, 19:41, 2007. 13 [32] J. Fabrizio, B. Marcotegui, and M. Cord. Text segmentation in natural scenes using toggle-mapping. In 2009 16th IEEE International Conference on Image Processing (ICIP), pages 2373–2376. IEEE, 2009. 16, 17 [33] M. Fischler and R. Elschlager. The representation and matching of pictorial structures. Computers, IEEE Transactions on, C-22(1):67 – 92, jan. 1973. 21 [34] J. Friedman, T. Hastie, R. Tibshirani, et al. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The annals of statistics, 28(2):337–407, 2000. 63 [35] R. Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015. 73 [36] L. Gomez and D. Karatzas. Multi-script text extraction from natural scenes. In Document Analysis and Recognition (ICDAR), 2013 12th International Conference on, pages 467–471. IEEE, 2013. 82 [37] L. G´omez and D. Karatzas. Scene text recognition: No country for old men? In Asian Conference on Computer Vision, pages 157–168. Springer International Publishing, 2014. 3, 83 [38] L. Gomez-Bigorda and D. Karatzas. Textproposals: A text-specific selective search algorithm for word spotting in the wild. arXiv preprint arXiv:1604.02619, 2016. 70, 77, 80 86
Bibliography [39] A. Gordo. Supervised mid-level features for word image representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2956–2964, 2015. 20 [40] A. Graves, S. Fern´ andez, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pages 369–376. ACM, 2006. 20, 74, 75 [41] A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. 6, 14, 29, 70, 74, 75, 77, 79 [42] T. He, W. Huang, Y. Qiao, and J. Yao. Text-attentional convolutional neural network for scene text detection. IEEE Transactions on Image Processing, 25(6):2529–2541, 2016. 13 [43] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997. 20 [44] W. Huang, Y. Qiao, and X. Tang. Robust scene text detection with convolution neural network induced mser trees. In European Conference on Computer Vision, pages 497–511. Springer, 2014. 13 [45] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. In NIPS Deep Learning Workshop 2014, 2014. 6, 22, 28, 29, 74, 75 [46] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Deep structured output learning for unconstrained text recognition. In ICLR 2015, 2015. 23 [47] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Reading text in the wild with convolutional neural networks. International Journal of Computer Vision, 116(1):1–20, 2016. 22, 28, 54, 55, 74, 77, 79 [48] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015. 73 [49] M. Jaderberg, A. Vedaldi, and A. Zisserman. Deep features for text spotting. In European conference on computer vision, pages 512–528. Springer, 2014. 3, 22 [50] J. Johnson, A. Karpathy, and L. Fei-Fei. Densecap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4565–4574, 2016. 71, 73 [51] L. Kang, Y. Li, and D. Doermann. Orientation robust text line detection in natural images. In CVPR 2014, pages 4034–4041, 2014. 13 [52] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, B. Andrew, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny. ICDAR 2015 robust reading competition. In ICDAR 2015, pages 1156–1160. IEEE, 2015. 7, 8, 10, 21, 25, 26, 53, 54, 70, 71, 77, 79, 80, 84 [53] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, L. P. de las Heras, et al. ICDAR 2013 robust reading competition. In ICDAR 2013, pages 1484–1493. IEEE, 2013. 8, 10, 13, 14, 17, 19, 25, 47, 48, 58, 67, 69, 70, 76, 77, 79 [54] B. Keni and S. Rainer. Evaluating multiple object tracking performance: the clear mot metrics. EURASIP Journal on Image and Video Processing, 2008, 2008. 55 87
Bibliography [55] E. Kim, S. Lee, and J. Kim. Scene text extraction using focus of mobile camera. In Document Analysis and Recognition, 2009. ICDAR ’09. 10th International Conference on, pages 166 –170, july 2009. 17 [56] K. Kim, H. Byun, Y. Song, Y. Choi, S. Chi, K. Kim, and Y. Chung. Scene text extraction in natural scene images using hierarchical feature combining and verification. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 2, pages 679 – 682 Vol.2, aug. 2004. 12 [57] S. Kim, S. Nowozin, P. Kohli, and C. D. Yoo. Higher-order correlation clustering for image segmentation. In Advances in neural information processing systems, pages 1530–1538, 2011. 13 [58] D. Kumar and A. Ramakrishnan. OTCYMIST: Otsu-canny minimal spanning tree for born-digital images. In Document Analysis Systems (DAS), 2012 10th IAPR International Workshop on, pages 389–393. IEEE, 2012. 17 [59] J. Lafferty. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. pages 282–289. Morgan Kaufmann, 2001. 12, 19 [60] S. Lazebnik, C. Schmid, and J. Ponce. A sparse texture representation using local affine regions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1265–1278, 2005. 18 [61] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. 13, 22 [62] C.-Y. Lee, A. Bhardwaj, W. Di, V. Jagadeesh, and R. Piramuthu. Region-based discriminative feature pooling for scene text recognition. In CVPR 2014, pages 4050–4057, 2014. 18 [63] C.-Y. Lee and S. Osindero. Recursive recurrent nets with attention modeling for ocr in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. 20 [64] J.-J. Lee, P.-H. Lee, S.-W. Lee, A. Yuille, and C. Koch. Adaboost for text detection in natural scene. In Document Analysis and Recognition (ICDAR), 2011 International Conference on, pages 429 –434, sept. 2011. 11 [65] S. Lee, M. S. Cho, K. Jung, and J. H. Kim. Scene text extraction with edge constraint and text collinearity. In Pattern Recognition (ICPR), 2010 20th International Conference on, pages 3983–3986. IEEE, 2010. 30 [66] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. In Soviet physics doklady, volume 10, page 707, 1966. 18 [67] C. Li, X. Ding, and Y. Wu. Automatic text location in natural scene images. In Document Analysis and Recognition, 2001. Proceedings. Sixth International Conference on, pages 1069 –1073, 2001. 12 [68] L. Li and C. L. Tan. Character recognition under severe perspective distortion. In Pattern Recognition, 2008. ICPR 2008. 19th International Conference on, pages 1 –4, dec. 2008. 17 [69] M. Liang and X. Hu. Recurrent convolutional neural network for object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3367–3375, 2015. 20 [70] M. Liao, B. Shi, X. Bai, X. Wang, and W. Liu. Textboxes: A fast text detector with a single deep neural network. arXiv preprint arXiv:1611.06779, 2016. 14 88
Bibliography [71] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll´ar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014. 28 [72] X. Lin. Reliable OCR solution for digital content re-mastering. In Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, Dec. 2001. 8 [73] C.-L. Liu, K. Nakashima, H. Sako, and H. Fujisawa. Handwritten digit recognition: investigation of normalization and feature extraction techniques. Pattern Recognition, 37(2):265 – 279, 2004. 42 [74] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016. 14 [75] D. G. Lowe. Object recognition from local scale-invariant features. In Computer vision, 1999. The proceedings of the seventh IEEE international conference on, volume 2, pages 1150–1157. IEEE, 1999. 18 [76] S. M. Lucas. Text locating competition results. Document Analysis and Recognition, International Conference on, 0:80–85, 2005. 11, 25 [77] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young. ICDAR 2003 robust reading competitions. In ICDAR 2003, page 682, 2003. 24 [78] J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue. Arbitrary-oriented scene text detection via rotation proposals. arXiv preprint arXiv:1703.01086, 2017. 14 [79] C. Mancas-Thillou and B. Gosselin. Color text extraction with selective metricbased clustering. Computer Vision and Image Understanding, 107(12):97 – 107, 2007. ¡ce:title¿Special issue on color image processing¡/ce:title¿. 17 [80] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing, 22:761–767, 2004. 5, 6, 13, 17, 67 [81] J. Matas and K. Zimmermann. A new class of learnable detectors for categorisation. In Image Analysis, volume 3540 of LNCS, pages 541–550. 2005. 13, 33 [82] A. Mishra, K. Alahari, and C. Jawahar. Scene text recognition using higher order language priors. In BMVC 2012-23rd British Machine Vision Conference. BMVA, 2012. 30 [83] A. Mishra, K. Alahari, and C. Jawahar. Image retrieval using textual cues. In Proceedings of the IEEE International Conference on Computer Vision, pages 3040–3047, 2013. 30 [84] A. Mishra, K. Alahari, and C. V. Jawahar. Top-down and bottom-up cues for scene text recognition. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2687 –2694, june 2012. 19 [85] M. Muja and D. G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In VISSAPP09, pages 331–340, 2009. 42 [86] K.-R. Muller, S. Mika, G. Ratsch, K. Tsuda, and B. Scholkopf. An introduction to kernel-based learning algorithms. IEEE Trans. on Neural Networks, 12:181–201, 2001. 35 [87] R. Nagy, A. Dicker, and K. Meyer-Wegener. Neocr: A configurable dataset for natural image text recognition. In International Workshop on Camera-Based Document Analysis and Recognition, pages 150–163. Springer, 2011. 30 89
Bibliography [88] L. Neumann and J. Matas. A method for text localization and recognition in real-world images. In ACCV 2010, volume IV of LNCS 6495, pages 2067–2078, November 2010. 5, 6, 84 [89] L. Neumann and J. Matas. Estimating hidden parameters for text localization and recognition. In Proc. of 16th CVWW, pages 29–26, February 2011. 7 [90] L. Neumann and J. Matas. Text localization in real-world images using efficiently pruned exhaustive search. In ICDAR 2011, pages 687–691, 2011. 7 [91] L. Neumann and J. Matas. Real-time scene text localization and recognition. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3538–3545, 6 2012. 5, 6, 83, 84 [92] L. Neumann and J. Matas. On combining multiple segmentations in scene text recognition. In ICDAR 2013, pages 523–527, California, US, August 2013. IEEE. 6, 59 [93] L. Neumann and J. Matas. Scene text localization and recognition with oriented stroke detection. In 2013 IEEE International Conference on Computer Vision (ICCV 2013), pages 97–104, California, US, December 2013. IEEE. 7, 84 [94] L. Neumann and J. Matas. Efficient scene text localization and recognition with local character refinement. In Document Analysis and Recognition (ICDAR), 2015 13th International Conference on, pages 746–750, California, US, Aug 2015. IEEE. 6, 84 [95] L. Neumann and J. Matas. Real-time lexicon-free scene text localization and recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 38(9):1872–1885, Sept 2016. 6 [96] A. Newell and L. Griffin. Multiscale histogram of oriented gradient descriptors for robust character recognition. In Document Analysis and Recognition (ICDAR), 2011 International Conference on, pages 1085–1089, sept. 2011. 18 [97] W. Niblack. An introduction to digital image processing. Strandberg Publishing Company, Birkeroed, Denmark, Denmark, 1985. 10, 12 [98] A. Niculescu-Mizil and R. Caruana. Obtaining calibrated probabilities from boosting. In UAI, page 413, 2005. 35
[99] T. Novikova, O. Barinova, P. Kohli, and V. Lempitsky. Large-lexicon attributeconsistent text recognition in natural images. In ECCV 2012, pages 752–765. Springer, 2012. 3, 19 [100] J. Ohya, A. Shio, and S. Akamatsu. Recognizing characters in scene images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 16(2):214 – 220, feb 1994. 12 [101] N. Otsu. A threshold selection method from gray-level histograms. Automatica, 11(285-296):23–27, 1975. 17 [102] C. Pal, C. Sutton, and A. McCallum. Sparse forward-backward using minimum divergence beams for fast training of conditional random fields. In Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, volume 5, page V, may 2006. 18 [103] Y.-F. Pan, X. Hou, and C.-L. Liu. A robust system to detect and localize texts in natural scene images. In Document Analysis Systems, 2008. DAS ’08. The Eighth IAPR International Workshop on, pages 35 –42, sept. 2008. 11 90
Bibliography [104] Y.-F. Pan, X. Hou, and C.-L. Liu. Text localization in natural scene images based on conditional random field. In ICDAR 2009, pages 6–10. IEEE Computer Society, 2009. 12 [105] Y.-F. Pan, X. Hou, and C.-L. Liu. A hybrid approach to detect and localize texts in natural scene images. Image Processing, IEEE Transactions on, 20(3):800 –813, march 2011. 12 [106] W. K. Pratt. Digital Image Processing: PIKS Inside. John Wiley & Sons, Inc., New York, NY, USA, 3rd edition, 2001. 34 [107] L. R. Rabiner. A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989. 75 [108] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015. 14 [109] J. Redmon and A. Farhadi. Yolo9000: Better, faster, stronger. arXiv preprint arXiv:1612.08242, 2016. 70, 71, 72, 75 [110] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015. 14, 70, 71, 73 [111] E. Rosten and T. Drummond. Fusing points and lines for high performance tracking. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 2, pages 1508–1515. IEEE, 2005. 56, 58, 65, 67 [112] E. Rosten, R. Porter, and T. Drummond. Faster and better: A machine learning approach to corner detection. IEEE Trans. Pattern Analysis and Machine Intelligence, 32:105–119, 2010. 56 [113] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. Orb: an efficient alternative to sift or surf. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2564–2571. IEEE, 2011. 59 [114] S. J. Russell, P. Norvig, J. F. Canny, J. M. Malik, and D. D. Edwards. Artificial intelligence: a modern approach, volume 2. Prentice hall Upper Saddle River, 2003. 19 [115] M. Sarifuddin. A new perceptually uniform color space with associated color similarity measure for contentbased image and video retrieval. In in Proc, pages 3–7, 2005. 17 [116] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidencerated predictions. Machine Learning, 37:297–336, 1999. 10, 35, 39 [117] J. Serra. Toggle mappings. From pixels to features, pages 61–72, 1989. 17 [118] A. Shahab, F. Shafait, and A. Dengel. ICDAR 2011 robust reading competition challenge 2: Reading text in scene images. In ICDAR 2011, pages 1491–1496, 2011. 8, 10, 25, 48 [119] B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for imagebased sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016. 20, 74 [120] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 14, 71 91
Bibliography [121] S. Singh, A. Gupta, and A. A. Efros. Unsupervised discovery of mid-level discriminative patches. In Computer Vision–ECCV 2012, pages 73–86. Springer, 2012. 19 [122] J. Sochman and J. Matas. Waldboost - learning for time constrained sequential detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pages 150 – 156 vol. 2, june 2005. 12 [123] H. Takahashi and M. Nakajima. Region graph based text extraction from outdoor images. In Information Technology and Applications, 2005. ICITA 2005. Third International Conference on, volume 1, pages 680 – 685 vol.1, july 2005. 12 [124] S. Tian, Y. Pan, C. Huang, S. Lu, K. Yu, and C. Lim Tan. Text flow: A unified text detection system in natural scene images. In Proceedings of the IEEE International Conference on Computer Vision, pages 4651–4659, 2015. 13 [125] Z. Tian, W. Huang, T. He, P. He, and Y. Qiao. Detecting text in natural image with connectionist text proposal network. In European Conference on Computer Vision, pages 56–72. Springer, 2016. 14 [126] A. Treuenfels. An efficient flood visit algorithm. C/C++ Users Journal, 12(8):39– 62, 1994. 59 [127] M. Varma and A. Zisserman. Classifying images of materials: Achieving viewpoint and illumination independence. In European Conference on Computer Vision, pages 255–271. Springer, 2002. 18 [128] M. Varma and A. Zisserman. Texture classification: Are filter banks necessary? In Computer vision and pattern recognition, 2003. Proceedings. 2003 IEEE computer society conference on, volume 2, pages II–691. IEEE, 2003. 18 [129] A. Veit, T. Matera, L. Neumann, J. Matas, and S. Belongie. Coco-text: Dataset and benchmark for text detection and recognition in natural images. arXiv preprint arXiv:1601.07140, 2016. 4, 7, 25, 28 [130] P. Viola and M. Jones. Fast and robust classification using asymmetric adaboost and a detector cascade. Advances in Neural Information Processing System, 14, 2001. 10 [131] T. Voj´ıˇr and J. Matas. The enhanced flock of trackers. In Registration and Recognition in Images and Videos, pages 113–136. Springer, 2014. 6, 55 [132] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1457 –1464, nov. 2011. 21, 51 [133] K. Wang and S. Belongie. Word spotting in the wild. In K. Daniilidis, P. Maragos, and N. Paragios, editors, ECCV 2010, volume 6311 of Lecture Notes in Computer Science, pages 591–604. Springer Berlin / Heidelberg, 2010. 21, 27 [134] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng. End-to-end text recognition with convolutional neural networks. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 3304–3308, 2012. 3, 22, 27, 51, 52 [135] J. Weinman, E. Learned-Miller, and A. Hanson. Scene text recognition using similarity and a lexicon with sparse belief propagation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(10):1733 –1746, oct. 2009. 18 [136] J. J. Weinman, Z. Butler, D. Knoll, and J. Feild. Toward integrated scene text reading. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(2):375–387, 2014. 48, 50 92
Bibliography [137] P. J. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4):339–356, 1988. 20 [138] Wikipedia. Google glass. https://en.wikipedia.org/wiki/Google_Glass, 2016. [Online; accessed 31-December-2016]. 26 [139] C. Wolf and J.-M. Jolion. Object count/area graphs for the evaluation of object detection and segmentation algorithms. Int. J. Doc. Anal. Recognit., 8:280–296, August 2006. 8, 9, 47 [140] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2(3):5, 2015. 20 [141] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu. Detecting texts of arbitrary orientations in natural images. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1083 –1090, june 2012. 13, 30 [142] C. Yao, X. Bai, B. Shi, and W. Liu. Strokelets: A learned multi-scale representation for scene text recognition. In CVPR 2014, pages 4042–4049, 2014. 19 [143] X.-C. Yin, W.-Y. Pei, J. Zhang, and H.-W. Hao. Multi-orientation scene text detection with adaptive clustering. IEEE transactions on pattern analysis and machine intelligence, 37(9):1930–1937, 2015. 13, 54 [144] X.-C. Yin, X. Yin, K. Huang, and H.-W. Hao. Robust text detection in natural scene images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(5):970–983, May 2014. 13, 17, 47, 48, 54, 59, 69 [145] X.-C. Yin, X. Yin, K. Huang, and H.-W. Hao. Robust text detection in natural scene images. IEEE transactions on pattern analysis and machine intelligence, 36(5):970–983, 2014. 77 [146] M. Yokobayashi and T. Wakahara. Segmentation and recognition of characters in scene images using selective binarization in color space and gat correlation. In Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on, pages 167 – 171 Vol. 1, aug.-1 sept. 2005. 17 [147] A. Zamberletti, I. Gallo, and L. Noce. Text localization based on fast feature pyramids and multi-resolution maximally stable extremal regions. In 1st International Workshop on Robust Reading, IWRR 2014 (in conjunction with ACCV 2014), Singapore, November 2, 2014, 2014. 69 [148] J. Zhang and R. Kasturi. Character energy and link energy-based text extraction in scene images. In ACCV 2010, volume II of LNCS 6495, pages 832–844, November 2010. 13 [149] S. Zhu and R. Zanibbi. A text detection system for natural scenes with convolutional feature learning and cascaded classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. 11 [150] C. L. Zitnick and P. Doll´ar. Edge boxes: Locating object proposals from edges. In European Conference on Computer Vision, pages 391–405. Springer, 2014. 22