Shape clustering in post optical character recognition processing
Abstract
Systems, methods, and computer program products on storage devices for shape clustering and applications in processing various documents, including an output of an optical character recognition (OCR) process. The output of an OCR process is classified into a plurality of clusters of clip images and a representative image for each cluster is generated to identify clusters whose clip images were incorrectly assigned character codes by the OCR process.
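The clustering step described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes clip images are small grids of pixel values in [0, 1], that clusters are keyed by image size and assigned character code, and that the representative image is a pixel-wise mean. The name `build_clusters` and the data layout are hypothetical.

```python
from collections import defaultdict


def build_clusters(clips):
    """Group clip images by (height, width, OCR code) and average each group.

    `clips` is a list of (image, code) pairs, where `image` is a list of
    rows of pixel values. Returns {key: (mean_image, member_count)}.
    Using the pixel-wise mean as the representative image is an assumption.
    """
    groups = defaultdict(list)
    for image, code in clips:
        key = (len(image), len(image[0]), code)
        groups[key].append(image)

    clusters = {}
    for key, images in groups.items():
        height, width, _ = key
        n = len(images)
        mean_image = [
            [sum(img[r][c] for img in images) / n for c in range(width)]
            for r in range(height)
        ]
        clusters[key] = (mean_image, n)
    return clusters


# Three 2x2 clips: two labelled "x" by the OCR process, one labelled "o".
clips = [
    ([[0, 1], [1, 0]], "x"),
    ([[0, 1], [1, 1]], "x"),
    ([[1, 1], [1, 1]], "o"),
]
clusters = build_clusters(clips)
```

Here the two "x" clips fall into one cluster whose representative image averages their pixels, while the "o" clip forms a cluster of its own.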
30 Citations
30 Claims
1. A system for processing an optical character recognition (OCR) output including separated images produced by an OCR process in processing an original image of a document and one or more characters assigned to each separated image by the OCR process, comprising:
one or more computers, the one or more computers implementing;
a cluster generation engine operable to classify the separated images in the OCR output into a plurality of clusters of separated images that are of a particular image size and are assigned the same one or more OCR character codes by the OCR process; and
a cluster processing engine operable to determine shape metric distances between a cluster image of a cluster and cluster images of other clusters, wherein each cluster image is representative of separated images in each cluster, and to detect whether an error exists in assignment of one or more OCR character codes assigned to each cluster by the OCR process based on the shape metric distances, wherein the cluster processing engine is further operable to correct one or more erroneously assigned OCR character codes for a first cluster including replacing one or more first OCR character codes assigned to the first cluster with one or more second OCR character codes assigned to a second cluster as new one or more OCR character codes for the first cluster when the second cluster has a shortest shape metric distance from the first cluster among all other clusters and the second cluster has a higher level of confidence than the first cluster.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
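The correction rule recited in claim 1 can be illustrated with a short sketch. It is offered under several assumptions that the claim does not specify: the shape metric is taken to be the Euclidean distance between same-sized cluster images, each cluster carries a numeric `confidence` score, and a hypothetical `max_distance` threshold stands in for the error-detection test. The function names and dictionary layout are illustrative only.

```python
import math


def shape_distance(a, b):
    """Euclidean distance between two same-sized images (assumed metric)."""
    return math.sqrt(sum((pa - pb) ** 2
                         for row_a, row_b in zip(a, b)
                         for pa, pb in zip(row_a, row_b)))


def correct_codes(clusters, max_distance):
    """Return {cluster_id: code} after applying the correction rule.

    `clusters` maps an id to a dict with keys "image", "code", "confidence".
    A cluster takes the code of its nearest neighbour only if that
    neighbour is within `max_distance` (hypothetical threshold) and has a
    strictly higher confidence.
    """
    corrected = {}
    for cid, cluster in clusters.items():
        corrected[cid] = cluster["code"]
        others = [(shape_distance(cluster["image"], other["image"]), oid)
                  for oid, other in clusters.items() if oid != cid]
        if not others:
            continue
        distance, nearest_id = min(others, key=lambda pair: pair[0])
        nearest = clusters[nearest_id]
        if (distance <= max_distance
                and nearest["confidence"] > cluster["confidence"]):
            corrected[cid] = nearest["code"]
    return corrected


clusters = {
    "A": {"image": [[1, 0], [1, 0]], "code": "l", "confidence": 50},
    "B": {"image": [[1, 0], [1, 0.2]], "code": "1", "confidence": 3},
    "C": {"image": [[0, 1], [0, 1]], "code": "j", "confidence": 40},
}
result = correct_codes(clusters, max_distance=0.5)
```

In this toy input, cluster B lies very close to the higher-confidence cluster A, so B's code "1" is replaced with A's code "l". Clusters A and C keep their codes: A's nearest neighbour has lower confidence, and C's nearest neighbour is farther away than the threshold.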
13. A computer-implemented method for processing an optical character recognition (OCR) output including separated images produced by an OCR process in processing an original image of a document and one or more characters assigned to each separated image by the OCR process, comprising:
classifying, by operations of a computer system comprising one or more computers, the separated images in the OCR output into a plurality of clusters of separated images that are of a particular image size and are assigned the same one or more OCR character codes by the OCR process;
determining, by operations of the computer system, shape metric distances between a cluster image of a cluster and cluster images of other clusters, wherein each cluster image is representative of separated images in each cluster, and to detect whether an error exists in assignment of one or more OCR character codes assigned to each cluster by the OCR process based on the shape metric distances; and
correcting one or more erroneously assigned OCR character codes for a first cluster including replacing one or more first OCR character codes assigned to the first cluster with one or more second OCR character codes assigned to a second cluster as new one or more OCR character codes for the first cluster when the second cluster has a shortest shape metric distance from the first cluster among all other clusters and the second cluster has a higher level of confidence than the first cluster.
Dependent claims: 14, 15, 16, 17, 18, 19, 20, 21
22. A computer program product, encoded on a computer-readable storage device, operable to cause data processing apparatus to perform operations for processing an optical character recognition (OCR) output including separated images produced by an OCR process in processing an original image of a document and one or more characters assigned to each separated image by the OCR process, comprising:
classifying the separated images in the OCR output into a plurality of clusters of separated images that are of a particular image size and are assigned the same one or more OCR character codes by the OCR process;
determining shape metric distances between a cluster image of a cluster and cluster images of other clusters, wherein each cluster image is representative of separated images in each cluster, and to detect whether an error exists in assignment of one or more OCR character codes assigned to each cluster by the OCR process based on the shape metric distances; and
correcting one or more erroneously assigned OCR character codes for a first cluster including replacing one or more first OCR character codes assigned to the first cluster with one or more second OCR character codes assigned to a second cluster as new one or more OCR character codes for the first cluster when the second cluster has a shortest shape metric distance from the first cluster among all other clusters and the second cluster has a higher level of confidence than the first cluster.
Dependent claims: 23, 24, 25, 26, 27, 28, 29, 30
Specification