OCR of books by word recognition
Abstract
Disclosed embodiments of the invention provide automated global optimization methods and systems of OCR, tailored to each document being digitized. A document-specific database is created from an OCR scan of a document of interest, which contains an exhaustive listing of words in the document. Images of each word, taken from all the fonts encountered, are entered into the database and mapped to a corresponding textual representation. After entry of a first instance of an image of a word written in a particular font, each new occurrence of the word in that font can be quickly recognized by image processing techniques. The disclosed methods and systems may be used in conjunction with adaptive character recognition training and word recognition training of the OCR engines.
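The document-specific database described above can be pictured with a minimal sketch. This is not the patented implementation: the names `Bitmap`, `mismatch_ratio`, `WordDictionary`, and `encode`, the pixel-mismatch comparison, and the `max_mismatch` tolerance are all assumptions made for illustration. Word images are modelled as small binarized bitmaps, and each dictionary entry pairs a reference image with an integer code that could in turn be mapped to a textual representation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# A binarized word image: rows of 0/1 pixels (hypothetical representation).
Bitmap = Tuple[Tuple[int, ...], ...]


def mismatch_ratio(a: Bitmap, b: Bitmap) -> float:
    """Fraction of differing pixels; 1.0 when the shapes differ or are empty."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return 1.0
    total = sum(len(row) for row in a)
    if total == 0:
        return 1.0
    diff = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return diff / total


@dataclass
class WordDictionary:
    """Reference word images paired with codes, built per document."""
    max_mismatch: float = 0.1  # assumed tolerance, not taken from the source
    entries: List[Tuple[Bitmap, int]] = field(default_factory=list)
    next_code: int = 0

    def lookup(self, area: Bitmap) -> Optional[int]:
        """Return the code of the closest reference image, or None."""
        best_code, best_score = None, self.max_mismatch
        for reference, code in self.entries:
            score = mismatch_ratio(area, reference)
            if score <= best_score:
                best_code, best_score = code, score
        return best_code

    def add(self, area: Bitmap) -> int:
        """Store an unidentified word image under a newly generated code."""
        code = self.next_code
        self.entries.append((area, code))
        self.next_code += 1
        return code


def encode(areas: List[Bitmap], dictionary: WordDictionary) -> List[int]:
    """Produce a coded version of the document, growing the dictionary as needed."""
    codes = []
    for area in areas:
        code = dictionary.lookup(area)
        if code is None:  # unidentified word: generate a new code
            code = dictionary.add(area)
        codes.append(code)
    return codes
```

In this sketch, the first occurrence of a word image is added to the dictionary, and later occurrences of the same image resolve to the existing code without further recognition work. A different font would typically produce a different bitmap and therefore a separate entry.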
20 Claims
1. A computer-implemented method of image-to-text processing, comprising the steps of:
acquiring an image of a document having words written thereon;
segmenting said image into areas, each area containing one of said words;
using said areas, defining a dictionary containing reference images of said words, which comprise respective sequences of characters in respective fonts, along with respective codes corresponding to said words;
comparing said areas to said reference images and classifying said words in said document that match said reference images as identified words and classifying said words that do not match any of said reference images as unidentified words;
generating respective new codes for one or more of said unidentified words, and adding said one or more of said unidentified words and said respective new codes to said dictionary for use in comparing other said areas of said document; and
outputting a coded version of said document.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8)
9. A computer software product for image-to-text processing, including a computer storage medium in which computer program instructions are stored, which instructions, when executed by a computer, cause the computer to
acquire an image of a document having words written thereon,
segment said image into areas, each area containing one of said words,
using said areas, define a dictionary containing reference images of said words, which comprise respective sequences of characters in respective fonts, along with respective codes corresponding to said words,
compare said areas to said reference images and classifying said words in said document that match said reference images as identified words and classifying said words that do not match any of said reference images as unidentified words,
generate respective new codes for one or more of said unidentified words, and adding said one or more of said unidentified words and said respective new codes to said dictionary for use in comparing other said areas of said document, and
output a coded version of said document.
15. A data processing system for image-to-text processing, comprising:
a processor connectable to an optical scanner; and
a memory accessible by said processor storing programs and data objects therein, said processor cooperative with said optical scanner to
acquire an image of a document having words written thereon,
segment said image into areas, each area containing one of said words,
and using said areas, to define a dictionary containing reference images of said words, which comprise respective sequences of characters in respective fonts, along with respective codes corresponding to said words,
compare said areas to said reference images and classifying said words in said document that match said reference images as identified words and classifying said words that do not match any of said reference images as unidentified words,
generate respective new codes for one or more of said unidentified words, and adding said one or more of said unidentified words and said respective new codes to said dictionary for use in comparing other said areas of said document,
and to output a coded version of said document.
(Dependent claims: 16, 17, 18, 19, 20)
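The claims recite segmenting the image into areas that each contain one word, but do not say how. One common approach, shown here purely as an illustrative assumption and not as the method of the patent, is to split a binarized text line wherever a run of blank pixel columns is at least a chosen width. The function name `split_words` and the `min_gap` parameter are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_words(line: Sequence[Sequence[int]], min_gap: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) column spans of words in one binarized text line.

    `line` is a list of pixel rows (0 = blank, 1 = ink); `end` is exclusive.
    A word ends when at least `min_gap` consecutive columns contain no ink.
    """
    if not line:
        return []
    width = len(line[0])
    # A column is "inked" if any row has a set pixel in it.
    inked = [any(row[x] for row in line) for x in range(width)]

    spans = []
    start = None  # first column of the word currently being scanned
    gap = 0       # blank columns seen since the last inked column
    for x, has_ink in enumerate(inked):
        if has_ink:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:
        # Close the final word, trimming any trailing blank columns.
        spans.append((start, width - gap))
    return spans
```

Each returned span could then be cropped out of the line and passed to the comparison step as one area. Narrow gaps between characters inside a word stay below `min_gap` and do not split it.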
Specification