Optical character recognition employing deep learning with machine generated training data
First Claim
1. A computer-implemented method for training a computerized deep learning system utilized by an optical character recognition system comprising the computer-implemented operations of:
generating a plurality of synthetic text segments, by programmatically converting each of a plurality of text strings to a corresponding image, each text string and corresponding image forming a synthetic image/text tuple;
generating a plurality of real-life text segments by processing from a corpus of document images, at least a subset of images from the corpus, with a plurality of OCR programs, each of the OCR programs processing each image from the subset to produce a real-life image/text tuple, and at least some of the OCR programs producing a confidence value corresponding to each real-life image/text tuple, and wherein each OCR program is characterized by a conversion accuracy substantially below a desired accuracy for an identified domain;
storing the synthetic image/text tuple and the real-life image/text tuple to data storage as training data in a format accessible by the computerized deep learning system for training; and
training the computerized deep learning system with the training data.
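The synthetic-segment step of the claim can be illustrated with a minimal sketch. It is not the patented implementation: the 3x5 bitmap glyph table, the `ImageTextTuple` type, and the `render` and `make_synthetic_tuples` helpers are all hypothetical names chosen for illustration, and a real system would presumably rasterize with actual fonts and add visual variation. The sketch only shows how a text string and its programmatically produced image can be paired so that the label is known exactly.

```python
from typing import List, NamedTuple

# Hypothetical 3x5 bitmap glyphs (an assumption for illustration only);
# a production generator would rasterize with real fonts instead.
GLYPHS = {
    "0": ["###", "# #", "# #", "# #", "###"],
    "1": [" # ", "## ", " # ", " # ", "###"],
    "7": ["###", "  #", " # ", " # ", " # "],
    " ": ["   "] * 5,
}


class ImageTextTuple(NamedTuple):
    image: List[List[int]]  # rows of 0/1 pixels
    text: str               # the string the image was rendered from
    source: str             # "synthetic" for programmatically generated pairs


def render(text: str) -> List[List[int]]:
    """Convert a text string to a 5-row bitmap, one blank column between glyphs."""
    rows: List[List[int]] = [[] for _ in range(5)]
    for i, ch in enumerate(text):
        glyph = GLYPHS[ch]
        for r in range(5):
            if i > 0:
                rows[r].append(0)  # spacing column
            rows[r].extend(1 if c == "#" else 0 for c in glyph[r])
    return rows


def make_synthetic_tuples(strings: List[str]) -> List[ImageTextTuple]:
    """Pair each string with its rendered image to form a synthetic image/text tuple."""
    return [ImageTextTuple(render(s), s, "synthetic") for s in strings]


tuples = make_synthetic_tuples(["10", "717"])
```

Because the image is derived from the string, the text half of each tuple is correct by construction, which is what distinguishes these pairs from the OCR-labeled real-life tuples.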
Abstract
An optical character recognition system employs a deep learning system that is trained to process a plurality of images within a particular domain to identify images representing text within each image and to convert the images representing text to textually encoded data. The deep learning system is trained with training data generated from a corpus of real-life text segments that are generated by a plurality of OCR modules. Each of the OCR modules produces a real-life image/text tuple, and at least some of the OCR modules produce a confidence value corresponding to each real-life image/text tuple. Each OCR module is characterized by a conversion accuracy substantially below a desired accuracy for an identified domain. Synthetically generated text segments are produced by programmatically converting text strings to a corresponding image where each text string and corresponding image form a synthetic image/text tuple.
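The real-life half of the training data can be sketched in the same spirit. The three engines below are hypothetical stand-ins that return canned strings, not real OCR libraries. The record fields and the JSON Lines layout are likewise assumptions, since the abstract says only that the data is stored in a form the deep learning system can read. The abstract also does not say how the confidence values are used, so this sketch just records them, using `None` for an engine that reports no confidence.

```python
import json
from typing import Callable, Dict, List, Optional, Tuple

# (recognized text, confidence or None if the engine reports none)
OcrResult = Tuple[str, Optional[float]]
OcrEngine = Callable[[str], OcrResult]


# Hypothetical stand-ins for imperfect OCR programs; outputs are canned.
def engine_a(image_id: str) -> OcrResult:
    return ("INV0ICE 1O7", 0.62)


def engine_b(image_id: str) -> OcrResult:
    return ("INVOICE 107", 0.91)


def engine_c(image_id: str) -> OcrResult:
    return ("INVOICE 1O7", None)  # this engine produces no confidence value


def make_real_life_tuples(image_ids: List[str],
                          engines: Dict[str, OcrEngine]) -> List[dict]:
    """Run every engine on every image; each result is one real-life image/text tuple."""
    records = []
    for image_id in image_ids:
        for name, engine in engines.items():
            text, confidence = engine(image_id)
            records.append({
                "image": image_id,
                "text": text,
                "confidence": confidence,
                "engine": name,
                "source": "real-life",
            })
    return records


def to_jsonl(records: List[dict]) -> str:
    """Serialize records one JSON object per line (an assumed storage format)."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


ENGINES = {"engine_a": engine_a, "engine_b": engine_b, "engine_c": engine_c}
records = make_real_life_tuples(["seg-001.png", "seg-002.png"], ENGINES)
jsonl = to_jsonl(records)
```

Each image yields one tuple per engine, so disagreements between engines stay visible in the stored data rather than being resolved at generation time.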
36 Claims
1. A computer-implemented method for training a computerized deep learning system utilized by an optical character recognition system comprising the computer-implemented operations of:
generating a plurality of synthetic text segments, by programmatically converting each of a plurality of text strings to a corresponding image, each text string and corresponding image forming a synthetic image/text tuple;
generating a plurality of real-life text segments by processing from a corpus of document images, at least a subset of images from the corpus, with a plurality of OCR programs, each of the OCR programs processing each image from the subset to produce a real-life image/text tuple, and at least some of the OCR programs producing a confidence value corresponding to each real-life image/text tuple, and wherein each OCR program is characterized by a conversion accuracy substantially below a desired accuracy for an identified domain;
storing the synthetic image/text tuple and the real-life image/text tuple to data storage as training data in a format accessible by the computerized deep learning system for training; and
training the computerized deep learning system with the training data.
Dependent claims: 2 to 17.
18. A computerized optical character recognition system comprising:
a computerized deep learning system trained to process a plurality of encoded images within a particular domain to identify images representing text within each encoded image and to convert the encoded images representing text to textually encoded data;
data storage for storing the encoded images representing text and textually encoded data;
wherein the computerized deep learning system is trained with training data generated from a corpus of:
real-life text segments generated by processing from a corpus of encoded document images, at least a subset of encoded images from the corpus, with a plurality of OCR modules, each of the OCR modules processing each encoded image from the corpus to produce a real-life image/text tuple, and at least some of the OCR modules producing a confidence value corresponding to each real-life image/text tuple, and wherein each OCR module is characterized by a conversion accuracy substantially below a desired accuracy for an identified domain; and
synthetically generated text segments, generated by programmatically converting each of a plurality of text strings to a corresponding encoded image, each text string and corresponding encoded image forming a synthetic image/text tuple.
Dependent claims: 19.
20. A computerized system for training a computerized deep learning system utilized by an optical character recognition system comprising:
a processor configured to execute instructions that when executed cause the processor to:
generate a plurality of synthetic text segments, by programmatically converting each of a plurality of text strings to a corresponding image, each text string and corresponding image forming a synthetic image/text tuple; and
generate a plurality of real-life text segments by processing from a corpus of document images, at least a subset of images from the corpus, with a plurality of OCR modules, each of the OCR modules processing each image from the subset to produce a real-life image/text tuple, and at least some of the OCR modules producing a confidence value corresponding to each real-life image/text tuple, and wherein each OCR module is characterized by a conversion accuracy substantially below a desired accuracy for an identified domain; and
data storage, operatively coupled to the processor, for storing the synthetic image/text tuple and the real-life image/text tuple as training data in a format accessible by the deep learning system for training, wherein the computerized system employs the training data to train the deep learning system.
Dependent claims: 21 to 36.
Specification