An Overview of the Tesseract OCR Engine

Ray Smith
Google Inc.

Abstract

The Tesseract OCR engine, as was the HP Research Prototype in the UNLV Fourth Annual Test of OCR Accuracy [1], is described in a comprehensive overview. Emphasis is placed on aspects that are novel or at least unusual in an OCR engine, including in particular the line finding, features/classification methods, and the adaptive classifier.

1. Introduction – Motivation and History

Tesseract is an open-source OCR engine that was developed at HP between 1984 and 1994. Like a supernova, it appeared from nowhere for the 1995 UNLV Annual Test of OCR Accuracy [1], shone brightly with its results, and then vanished back under the same cloak of secrecy under which it had been developed. Now for the first time, details of the architecture and algorithms can be revealed.

Tesseract began as a PhD research project [2] in HP Labs, Bristol, and gained momentum as a possible software and/or hardware add-on for HP’s line of flatbed scanners. Motivation was provided by the fact that the commercial OCR engines of the day were in their infancy, and failed miserably on anything but the best quality print.

After a joint project between HP Labs Bristol, and HP’s scanner division in Colorado, Tesseract had a significant lead in accuracy over the commercial engines, but did not become a product. The next stage of its development was back in HP Labs Bristol as an investigation of OCR for compression. Work concentrated more on improving rejection efficiency than on base-level accuracy. At the end of this project, at the end of 1994, development ceased entirely. The engine was sent to UNLV for the 1995 Annual Test of OCR Accuracy [1], where it proved its worth against the commercial engines of the time. In late 2005, HP released Tesseract for open source. It is now available at http://code.google.com/p/tesseract-ocr.

2. Architecture

Since HP had independently-developed page layout analysis technology that was used in products, (and therefore not released for open-source) Tesseract never needed its own page layout analysis. Tesseract therefore assumes that its input is a binary image with optional polygonal text regions defined.

Processing follows a traditional step-by-step pipeline, but some of the stages were unusual in their day, and possibly remain so even now. The first step is a connected component analysis in which outlines of the components are stored. This was a computationally expensive design decision at the time, but had a significant advantage: by inspection of the nesting of outlines, and the number of child and grandchild outlines, it is simple to detect inverse text and recognize it as easily as black-on-white text. Tesseract was probably the first OCR engine able to handle white-on-black text so trivially. At this stage, outlines are gathered together, purely by nesting, into Blobs.

Blobs are organized into text lines, and the lines and regions are analyzed for fixed pitch or proportional text. Text lines are broken into words differently according to the kind of character spacing. Fixed pitch text is chopped immediately by character cells. Proportional text is broken into words using definite spaces and fuzzy spaces.

Recognition then proceeds as a two-pass process. In the first pass, an attempt is made to recognize each word in turn. Each word that is satisfactory is passed to an adaptive classifier as training data. The adaptive classifier then gets a chance to more accurately recognize text lower down the page.

Since the adaptive classifier may have learned something useful too late to make a contribution near the top of the page, a second pass is run over the page, in which words that were not recognized well enough are recognized again.

A final phase resolves fuzzy spaces, and checks alternative hypotheses for the x-height to locate small-cap text.

3. Line and Word Finding

3.1. Line Finding

The line finding algorithm is one of the few parts of Tesseract that has previously been published [3]. The line finding algorithm is designed so that a skewed page can be recognized without having to de-skew, thus saving loss of image quality. The key parts of the process are blob filtering and line construction.

Assuming that page layout analysis has already provided text regions of a roughly uniform text size, a simple percentile height filter removes drop-caps and vertically touching characters. The median height approximates the text size in the region, so it is safe to filter out blobs that are smaller than some fraction of the median height, being most likely punctuation, diacritical marks and noise.

The filtered blobs are more likely to fit a model of non-overlapping, parallel, but sloping lines. Sorting and processing the blobs by x-coordinate makes it possible to assign blobs to a unique text line, while tracking the slope across the page, with greatly reduced danger of assigning to an incorrect text line in the presence of skew. Once the filtered blobs have been assigned to lines, a least median of squares fit [4] is used to estimate the baselines, and the filtered-out blobs are fitted back into the appropriate lines.

The final step of the line creation process merges blobs that overlap by at least half horizontally, putting diacritical marks together with the correct base and correctly associating parts of some broken characters.

3.2. Baseline Fitting

Once the text lines have been found, the baselines are fitted more precisely using a quadratic spline. This was another first for an OCR system, and enabled Tesseract to handle pages with curved baselines [5], which are a common artifact in scanning, and not just at book bindings.

The baselines are fitted by partitioning the blobs into groups with a reasonably continuous displacement for the original straight baseline. A quadratic spline is fitted to the most populous partition, (assumed to be the baseline) by a least squares fit. The quadratic spline has the advantage that this calculation is reasonably stable, but the disadvantage that discontinuities can arise when multiple spline segments are required. A more traditional cubic spline [6] might work better.

Fig. 1. An example of a curved fitted baseline.

Fig. 1 shows an example of a line of text with a fitted baseline, descender line, meanline and ascender line. All these lines are “parallel” (the y separation is a constant over the entire length) and slightly curved. The ascender line is cyan (prints as light gray) and the black line above it is actually straight. Close inspection shows that the cyan/gray line is curved relative to the straight black line above it.

3.3. Fixed Pitch Detection and Chopping

Tesseract tests the text lines to determine whether they are fixed pitch. Where it finds fixed pitch text, Tesseract chops the words into characters using the pitch, and disables the chopper and associator on these words for the word recognition step. Fig. 2 shows a typical example of a fixed-pitch word.

Fig. 2. A fixed-pitch chopped word.

3.4. Proportional Word Finding

Non-fixed-pitch or proportional text spacing is a highly non-trivial task. Fig. 3 illustrates some typical problems. The gap between the tens and units of ‘11.9%’ is a similar size to the general space, and is certainly larger than the kerned space between ‘erated’ and ‘junk’. There is no horizontal gap at all between the bounding boxes of ‘of’ and ‘financial’. Tesseract solves most of these problems by measuring gaps in a limited vertical range between the baseline and mean line. Spaces that are close to the threshold at this stage are made fuzzy, so that a final decision can be made after word recognition.

Fig. 3. Some difficult word spacing.

4. Word Recognition

Part of the recognition process for any character recognition engine is to identify how a word should be segmented into characters. The initial segmentation output from line finding is classified first. The rest of the word recognition step applies only to non-fixed-pitch text.

4.1 Chopping Joined Characters

While the result from a word (see section 6) is unsatisfactory, Tesseract attempts to improve the result by chopping the blob with worst confidence from the character classifier. Candidate chop points are found from concave vertices of a polygonal approximation [2] of the outline, and may have either another concave vertex opposite, or a line segment. It may take up to 3 pairs of chop points to successfully separate joined characters from the ASCII set.

Fig. 4. Candidate chop points and chop.

Fig. 4 shows a set of candidate chop points with arrows and the selected chop as a line across the outline where the ‘r’ touches the ‘m’.

When the A* segmentation search was first implemented in about 1989, Tesseract’s accuracy on broken characters was well ahead of the commercial engines of the day. Fig. 5 is a typical example. An essential part of that success was the character classifier that could easily recognize broken characters.

5. Static Character Classifier

5.1. Features

An early version of Tesseract used topological features developed from the work of Shillman et al. [7-8]. Though nicely independent of font and size, these features are not robust to the problems found in real-life images, as Bokser [9] describes. An intermediate idea involved the use of segments of the polygonal approximation as features, but this approach is also not robust to damaged characters. For example, in Fig. 6(a), the right side of the shaft is in two main pieces, but in Fig. 6(b) there is just a single piece.
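
The blob filtering step described in Section 3.1 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the name `filter_blobs`, the 95th-percentile cut-off for tall blobs and the 0.5 fraction of the median for small blobs are hypothetical choices, since the text gives neither value, and each blob is reduced to its height.

```python
from statistics import median


def filter_blobs(heights, small_fraction=0.5, tall_percentile=0.95):
    """Split blobs into kept and filtered-out indices by height.

    Assumed parameters (not from the paper): small_fraction and
    tall_percentile are illustrative defaults only.
    """
    if not heights:
        return [], []
    ordered = sorted(heights)
    # Percentile height filter: blobs taller than this are treated as
    # drop-caps or vertically touching characters.
    tall_cutoff = ordered[int(tall_percentile * (len(ordered) - 1))]
    # The median height approximates the text size in the region; blobs
    # below a fraction of it are likely punctuation, diacritics or noise.
    small_cutoff = small_fraction * median(heights)
    kept, removed = [], []
    for index, height in enumerate(heights):
        if small_cutoff <= height <= tall_cutoff:
            kept.append(index)
        else:
            removed.append(index)
    return kept, removed
```

The indices in the second list correspond to the filtered-out blobs that would later be fitted back into the appropriate lines once the baselines have been estimated.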
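
The least median of squares fit [4] that Section 3.1 uses to estimate baselines can be sketched as follows. The sketch assumes that each filtered blob contributes one (x, y) point, for example the bottom centre of its bounding box, and it tries every pair of points exhaustively. Neither detail is stated in the text, and `lmeds_line` is a hypothetical name.

```python
from itertools import combinations
from statistics import median


def lmeds_line(points):
    """Fit y = m * x + c by least median of squares.

    Every pair of points proposes a candidate line; the line whose
    squared residuals have the smallest median wins. Exhaustive search
    over pairs is an assumption made for clarity, not speed.
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # a vertical candidate cannot be written as y = m*x + c
        m = (y2 - y1) / (x2 - x1)
        c = y1 - m * x1
        score = median((y - (m * x + c)) ** 2 for x, y in points)
        if best is None or score < best[0]:
            best = (score, m, c)
    if best is None:
        raise ValueError("need at least two points with distinct x")
    return best[1], best[2]
```

Because the score is a median rather than a sum, up to roughly half of the points can be outliers (a stray descender or a noise blob, say) without pulling the fitted line away from the majority.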
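
The handling of uncertain gaps in Section 3.4 can be pictured with a small sketch. Gaps are assumed to have been measured already in the band between the baseline and the mean line. The name `classify_gaps`, the `fuzz` parameter and the width of the uncertain band are hypothetical, as the text does not say how close to the threshold a gap must be to count as fuzzy.

```python
def classify_gaps(gaps, threshold, fuzz=0.2):
    """Label each gap as 'space', 'no-space' or 'fuzzy'.

    A gap within fuzz * threshold of the threshold is left undecided
    ('fuzzy') so that the decision can be made after word recognition.
    The fuzz value is an assumed illustration, not a documented setting.
    """
    labels = []
    for gap in gaps:
        if abs(gap - threshold) <= fuzz * threshold:
            labels.append("fuzzy")
        elif gap > threshold:
            labels.append("space")
        else:
            labels.append("no-space")
    return labels
```

Deferring the borderline cases keeps the early segmentation cheap while letting the recognizer's output settle gaps such as the one inside ‘11.9%’.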

