Shape clustering in post optical character recognition processing
Abstract
Systems, methods and computer program products for shape clustering and applications in processing various documents, including an output of an optical character recognition (OCR) process. Clip images defined in a received OCR output are classified into a plurality of clusters of clip images. Clip images in each of the plurality of clusters are processed to generate a cluster image for each cluster. Shape differences between the cluster images of a first cluster and a second cluster and between the cluster images of the first cluster and a third cluster are used to determine a level of confidence in one or more first OCR character codes assigned to the first cluster.
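The clustering and comparison steps summarized in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: clip images are modeled as small binary bitmaps (lists of rows), the cluster image is taken to be a pixel-wise mean, and the shape difference is a mean absolute pixel difference. The function names `group_clips`, `cluster_image`, `shape_distance` and `nearest_other` are hypothetical, and the text does not state how shape differences are converted into a level of confidence, so the sketch only reports the nearest neighboring cluster and its distance.

```python
from collections import defaultdict


def group_clips(clips):
    """Group (bitmap, code) pairs into clusters keyed by (height, width, code).

    Assumption: every bitmap is a non-empty list of equally long rows of 0/1.
    """
    clusters = defaultdict(list)
    for bitmap, code in clips:
        key = (len(bitmap), len(bitmap[0]), code)
        clusters[key].append(bitmap)
    return clusters


def cluster_image(bitmaps):
    """Pixel-wise mean of equally sized bitmaps (one possible cluster image)."""
    n = len(bitmaps)
    h, w = len(bitmaps[0]), len(bitmaps[0][0])
    return [[sum(b[y][x] for b in bitmaps) / n for x in range(w)]
            for y in range(h)]


def shape_distance(a, b):
    """Mean absolute pixel difference between two equally sized images.

    Assumption: this stands in for whatever shape metric is actually used.
    """
    h, w = len(a), len(a[0])
    total = sum(abs(a[y][x] - b[y][x]) for y in range(h) for x in range(w))
    return total / (h * w)


def nearest_other(key, images):
    """Return (key, distance) of the closest other cluster of the same size."""
    best_key, best_dist = None, float("inf")
    for other, img in images.items():
        # Skip the cluster itself and clusters with a different image size.
        if other == key or other[:2] != key[:2]:
            continue
        d = shape_distance(images[key], img)
        if d < best_dist:
            best_key, best_dist = other, d
    return best_key, best_dist
```

A caller would build `images = {key: cluster_image(members) for key, members in group_clips(clips).items()}` and then inspect `nearest_other(key, images)`. A small distance to a cluster that carries a different character code is one plausible signal of low confidence in the code assigned to the first cluster, but that decision rule is an assumption and is not taken from the text.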
5 Claims
1. A system for processing an optical character recognition (OCR) output including separated images produced by an OCR process in processing an original image of a document and one or more characters assigned to each separated image by the OCR process, comprising:
one or more computers, the one or more computers implementing:
a cluster generation engine operable to classify the separated images in the OCR output into a plurality of clusters of separated images that are of a particular image size and are assigned the same one or more OCR character codes by the OCR process;
a cluster processing engine operable to determine shape metric distances between a cluster image of a cluster and cluster images of other clusters, wherein each cluster image is representative of separated images in each cluster, and to detect whether an error exists in assignment of one or more OCR character codes assigned to each cluster by the OCR process based on the shape metric distances, the cluster processing engine further operable to correct one or more erroneously assigned OCR character codes for a cluster; and
an output processing engine operable to:
produce a modified OCR output from the received OCR output, wherein the output processing engine, in producing the modified OCR output, assigns one or more characters for each cluster processed by the cluster processing engine to each clip image belonging to the cluster,
select a cluster image of a cluster having a high confidence score as a cluster image template,
align the cluster image template with a plurality of different portions within an image of a word in the received OCR output along a predetermined direction, one portion at a time,
determine shape metric distances between the cluster image template and the plurality of different portions of the word image, respectively,
use the shape metric distances to determine whether a portion of the word image matches the cluster image template,
separate a matching portion of the word image that matches the cluster image template from one or more other portions of the word image, and
assign one or more OCR character codes assigned to the cluster to the separated matching portion of the word image in the modified OCR output.
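The template-alignment steps recited for the output processing engine can be sketched as a sliding-window comparison. This is only an illustrative sketch under stated assumptions: images are binary bitmaps (lists of rows), the template and the word image have the same height, the predetermined direction is left to right in steps of one pixel column, the shape metric is a mean absolute pixel difference, and the match threshold of 0.1 is arbitrary. The names `window_distance`, `find_match` and `split_word` are hypothetical.

```python
def window_distance(template, word, offset):
    """Mean absolute pixel difference between the template and the window of
    the word image that starts at column `offset` (illustrative metric)."""
    h, w = len(template), len(template[0])
    total = sum(abs(template[y][x] - word[y][offset + x])
                for y in range(h) for x in range(w))
    return total / (h * w)


def find_match(template, word, threshold=0.1):
    """Slide the template across the word image one column at a time.

    Returns the offset of the best-matching window, or None when even the
    best window is farther than `threshold` (an assumed cut-off).
    """
    w = len(template[0])
    best_offset, best_dist = None, float("inf")
    for offset in range(len(word[0]) - w + 1):
        d = window_distance(template, word, offset)
        if d < best_dist:
            best_offset, best_dist = offset, d
    return best_offset if best_dist <= threshold else None


def split_word(template, word, code, threshold=0.1):
    """Separate the matching portion from the rest of the word image.

    Returns (left, (match, code), right), where `code` is the character code
    of the template's cluster, or None when no portion matches.
    """
    offset = find_match(template, word, threshold)
    if offset is None:
        return None
    w = len(template[0])
    left = [row[:offset] for row in word]
    match = [row[offset:offset + w] for row in word]
    right = [row[offset + w:] for row in word]
    return left, (match, code), right
```

The remaining `left` and `right` portions could be passed through the same routine with other templates, which is one way a word image that the OCR process failed to segment might be broken into labeled pieces. That iteration is a design assumption of this sketch and is not recited in the claim.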
2. A system for processing an optical character recognition (OCR) output including separated images produced by an OCR process in processing an original image of a document and one or more characters assigned to each separated image by the OCR process, comprising:
one or more computers, the one or more computers implementing:
a cluster generation engine operable to classify the separated images in the OCR output into a plurality of clusters of separated images that are of a particular image size and are assigned the same one or more OCR character codes by the OCR process; and
a cluster processing engine operable to:
determine shape metric distances between a cluster image of a cluster and cluster images of other clusters, wherein each cluster image is representative of separated images in each cluster,
detect whether an error exists in assignment of one or more OCR character codes assigned to each cluster by the OCR process based on the shape metric distances, the cluster processing engine further operable to correct one or more erroneously assigned OCR character codes for a cluster,
select a cluster with a high confidence score to use a corresponding averaged image of the selected cluster as a cluster image template,
align the cluster image template with a plurality of different portions of an averaged image of a second selected cluster of a low confidence score at a plurality of different positions along a predetermined direction,
determine shape metric distances between the cluster image template and each of the plurality of different portions of the averaged image of the second selected cluster,
use the shape metric distances for the cluster image template to determine whether a portion of the averaged image of the second selected cluster matches the cluster image template,
separate a matched portion from the averaged image of the second selected cluster as a new cluster, and
assign to the new cluster one or more OCR character codes that have been assigned to the selected cluster.
3. A system for processing an optical character recognition (OCR) output including separated images produced by an OCR process in processing an original image of a document and one or more characters assigned to each separated image by the OCR process, comprising:
one or more computers, the one or more computers implementing:
a cluster generation engine operable to classify the separated images in the OCR output into a plurality of clusters of separated images that are of a particular image size and are assigned the same one or more OCR character codes by the OCR process,
a cluster processing engine operable to determine shape metric distances between a cluster image of a cluster and cluster images of other clusters, wherein each cluster image is representative of separated images in each cluster, and to detect whether an error exists in assignment of one or more OCR character codes assigned to each cluster by the OCR process based on the shape metric distances, the cluster processing engine further operable to correct one or more erroneously assigned OCR character codes for a cluster, and
an OCR engine that implements the OCR process;
a communication network with which the one or more computers are in communication, the communication network operable to direct the original image of the document from a client computer to the OCR engine and to direct the modified OCR output for the original image of the document from the one or more computers to the client computer; and
a computer server in communication with the communication network and operable to direct a selected cluster image to one or more users for manual identification of the selected cluster image,
wherein the cluster processing engine is operable to:
direct a cluster image of the selected cluster to the computer server, and
use manual identification of the cluster image returned from the computer server to verify or correct one or more OCR character codes assigned by the OCR engine to the selected cluster.
4. A computer-implemented method for processing an optical character recognition (OCR) output including separated images produced by an OCR process in processing an original image of a document and one or more characters assigned to each separated image by the OCR process, comprising:
classifying, by operations of a computer system comprising one or more computers, the separated images in the OCR output into a plurality of clusters of separated images that are of a particular image size and are assigned the same one or more OCR character codes by the OCR process;
determining, by operations of the computer system, shape metric distances between a cluster image of a cluster and cluster images of other clusters, wherein each cluster image is representative of separated images in each cluster, and to detect whether an error exists in assignment of one or more OCR character codes assigned to each cluster by the OCR process based on the shape metric distances including:
selecting a cluster image of a cluster having a high confidence score as a cluster image template,
aligning the cluster image template with a plurality of different portions within an image of a word in the OCR output along a predetermined direction, one portion at a time,
determining shape metric distances between the cluster image template and the plurality of different portions of the word image, respectively,
using the shape metric distances to determine whether a portion of the word image matches the cluster image template, and
separating a matching portion of the word image that matches the cluster image template from one or more other portions of the word image; and
correcting one or more erroneously assigned OCR character codes for a cluster including assigning one or more OCR character codes assigned to the cluster to the separated matching portion of the word image in the modified OCR output.
5. A computer-implemented method for processing an optical character recognition (OCR) output including separated images produced by an OCR process in processing an original image of a document and one or more characters assigned to each separated image by the OCR process, comprising:
classifying, by operations of a computer system comprising one or more computers, the separated images in the OCR output into a plurality of clusters of separated images that are of a particular image size and are assigned the same one or more OCR character codes by the OCR process;
determining, by operations of the computer system, shape metric distances between a cluster image of a cluster and cluster images of other clusters, wherein each cluster image is representative of separated images in each cluster, and to detect whether an error exists in assignment of one or more OCR character codes assigned to each cluster by the OCR process based on the shape metric distances including:
selecting a cluster with a high confidence score to use a corresponding averaged image of the selected cluster as a cluster image template,
aligning the cluster image template with a plurality of different portions of an averaged image of a second selected cluster of a low confidence score at a plurality of different positions along a predetermined direction,
determining shape metric distances between the cluster image template and each of the plurality of different portions of the averaged image of the second selected cluster,
using the shape metric distances for the cluster image template to determine whether a portion of the averaged image of the second selected cluster matches the cluster image template, and
separating a matched portion from the averaged image of the second selected cluster as a new cluster; and
correcting one or more erroneously assigned OCR character codes for a cluster including assigning to the new cluster one or more OCR character codes that have been assigned to the selected cluster.
Specification