Shape Clustering in Post Optical Character Recognition Processing
Abstract
Techniques for shape clustering and applications in processing various documents, including an output of an optical character recognition (OCR) process.
58 Claims
1. A method for post optical character recognition (OCR) processing, comprising:
- classifying clip images defined in a received OCR output of a document processed by an optical character recognition (OCR) process into a plurality of clusters of clip images, each cluster including clip images that are identical or similar in size and are assigned the same one or more character codes by the OCR process;
- processing clip images in each of the plurality of clusters to generate a cluster image for each cluster;
- for a first cluster assigned one or more first OCR character codes, identifying (1) a second cluster assigned one or more second OCR character codes different from the one or more first OCR character codes, where the cluster image of the second cluster is closer in shape to a cluster image of the first cluster than to cluster images of other clusters assigned one or more OCR characters different from the one or more first OCR character codes, and (2) a third cluster assigned the same one or more first OCR character codes as the first cluster, where the cluster image of the third cluster is closer in shape to the cluster image of the first cluster than to the cluster images of other clusters assigned the one or more first OCR character codes; and
- using at least shape differences between the cluster images of the first cluster and the second cluster and between the cluster images of the first cluster and the third cluster to determine a level of confidence in the one or more first OCR character codes assigned to the first cluster.
Dependent claims: 2-25.
26. A computer program product, encoded on a computer-readable medium, operable to cause data processing apparatus to perform operations comprising:
- classifying clip images defined in a received OCR output of a document processed by an optical character recognition (OCR) process into a plurality of clusters of clip images, each cluster including clip images that are identical or similar in size and are assigned the same one or more character codes by the OCR process;
- processing clip images in each of the plurality of clusters to generate a cluster image for each cluster;
- for a first cluster assigned one or more first OCR character codes, identifying (1) a second cluster assigned one or more second OCR character codes different from the one or more first OCR character codes, where the cluster image of the second cluster is closer in shape to a cluster image of the first cluster than to cluster images of other clusters assigned one or more OCR characters different from the one or more first OCR character codes, and (2) a third cluster assigned the same one or more first OCR character codes as the first cluster, where the cluster image of the third cluster is closer in shape to the cluster image of the first cluster than to the cluster images of other clusters assigned the one or more first OCR character codes; and
- using at least shape differences between the cluster images of the first cluster and the second cluster and between the cluster images of the first cluster and the third cluster to determine a level of confidence in the one or more first OCR character codes assigned to the first cluster.
27. A system for optical character recognition (OCR), comprising:
- an OCR engine operable to process an original image of a document to produce an OCR output including clip images extracted from the original image and to assign one or more characters to each clip image; and
- a post-OCR engine operable to classify clip images in the OCR output into a plurality of clusters of clip images, each cluster including clip images that are identical or similar in size and are assigned the same one or more character codes by the OCR engine;
- wherein the post-OCR engine is operable to process clip images in each of the plurality of clusters to generate a cluster image for each cluster,
- wherein the post-OCR engine is operable to identify, for a first cluster assigned one or more first OCR character codes, (1) a second cluster assigned one or more second OCR character codes different from the one or more first OCR character codes, where the cluster image of the second cluster is closer in shape to a cluster image of the first cluster than to cluster images of other clusters assigned one or more OCR characters different from the one or more first OCR character codes, and (2) a third cluster assigned the same one or more first OCR character codes as the first cluster, where the cluster image of the third cluster is closer in shape to the cluster image of the first cluster than to the cluster images of other clusters assigned the one or more first OCR character codes; and
- wherein the post-OCR engine is operable to use at least shape differences between the cluster images of the first cluster and the second cluster and between the cluster images of the first cluster and the third cluster to determine a level of confidence in the one or more first OCR character codes assigned to the first cluster.
Dependent claims: 28-30.
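The steps shared by claims 1, 26, and 27 (clustering clip images, forming a cluster image, finding the nearest different-code and same-code clusters, and deriving a confidence level) might be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation. The claims do not fix a shape metric, a clustering rule, or a confidence formula, so everything concrete here is an assumption: the mean absolute pixel difference used as `shape_distance`, the greedy `threshold` rule in `cluster_clips`, pixel averaging in `cluster_image`, and the ratio returned by `confidence` are all hypothetical choices.

```python
INF = float("inf")

def shape_distance(a, b):
    """Assumed shape metric: mean absolute pixel difference.
    Images of different sizes are treated as infinitely far apart."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        return INF
    diffs = [abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)

def cluster_clips(clips, threshold=0.2):
    """Greedy clustering (assumption): a clip joins the first cluster that has
    the same OCR code, the same size, and a first member within `threshold`."""
    clusters = []
    for code, image in clips:
        for cluster in clusters:
            same_code = cluster["code"] == code
            if same_code and shape_distance(cluster["images"][0], image) <= threshold:
                cluster["images"].append(image)
                break
        else:
            clusters.append({"code": code, "images": [image]})
    return clusters

def cluster_image(images):
    """Cluster image taken as the pixel-wise mean of the member clips (assumption)."""
    n = len(images)
    return [[sum(pixels) / n for pixels in zip(*rows)] for rows in zip(*images)]

def confidence(index, codes, images):
    """Hypothetical score in [0, 1]: higher when the nearest same-code cluster
    is much closer than the nearest different-code cluster."""
    me, code = images[index], codes[index]
    same = [shape_distance(me, img) for i, img in enumerate(images)
            if i != index and codes[i] == code]
    diff = [shape_distance(me, img) for i, img in enumerate(images)
            if codes[i] != code]
    d_same = min(same, default=INF)   # distance to the "third cluster"
    d_diff = min(diff, default=INF)   # distance to the "second cluster"
    if d_same == INF or d_diff == INF or d_same + d_diff == 0:
        return None                   # not enough evidence to score
    return d_diff / (d_same + d_diff)

# Toy 2x2 binary clips paired with the codes an OCR engine assigned to them.
clips = [
    ("e", ((1, 0), (0, 1))),
    ("e", ((1, 0), (0, 1))),
    ("e", ((1, 1), (0, 1))),
    ("c", ((0, 1), (1, 0))),
]
clusters = cluster_clips(clips)
codes = [c["code"] for c in clusters]
images = [cluster_image(c["images"]) for c in clusters]
scores = [confidence(i, codes, images) for i in range(len(clusters))]
print(scores)
```

In this toy input the two identical "e" clips form one cluster, the third "e" clip forms a second cluster with the same code, and the "c" clip forms a cluster with a different code. The lone "c" cluster gets no score because it has no same-code neighbor to compare against.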
31. A system for optical character recognition (OCR), comprising:
- a cluster generation engine operable to receive an OCR output including separated images produced by an OCR engine in processing an original image of a document and one or more characters assigned to each separated image by the OCR engine, the cluster generation engine operable to classify the separated images in the OCR output into a plurality of clusters of separated images that are of a particular image size and are assigned the same one or more OCR character codes by the OCR engine; and
- a cluster processing engine operable to obtain shape metric distances between a cluster image of a cluster and cluster images of other clusters and to detect whether an error exists in assignment of one or more OCR character codes assigned to each cluster by the OCR engine based on the obtained shape metric distances, the cluster processing engine further operable to correct one or more erroneously assigned OCR character codes for a cluster.
Dependent claims: 32-46.
47. A method for optical character recognition (OCR), comprising:
- receiving an OCR output including separated images produced by an OCR engine in processing an original image of a document and one or more characters assigned to each separated image by the OCR engine;
- classifying the separated images in the OCR output into a plurality of clusters of separated images that are of a particular image size and are assigned the same one or more OCR character codes by the OCR engine;
- obtaining shape metric distances between a cluster image of a cluster and cluster images of other clusters and detecting whether an error exists in assignment of one or more OCR character codes assigned to each cluster by the OCR engine based on the obtained shape metric distances; and
- correcting one or more erroneously assigned OCR character codes for a cluster.
Dependent claims: 48-57.
58. A computer program product, encoded on a computer-readable medium, operable to cause data processing apparatus to perform operations comprising:
- receiving an OCR output including separated images produced by an optical character recognition (OCR) process in processing an original image of a document and one or more characters assigned to each separated image by the OCR process;
- classifying the separated images in the OCR output into a plurality of clusters of separated images that are of a particular image size and are assigned the same one or more OCR character codes by the OCR process;
- obtaining shape metric distances between a cluster image of a cluster and cluster images of other clusters and detecting whether an error exists in assignment of one or more OCR character codes assigned to each cluster by the OCR process based on the obtained shape metric distances; and
- correcting one or more erroneously assigned OCR character codes for a cluster.
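The error detection and correction recited in claims 31, 47, and 58 could be illustrated along the following lines. The claims only require that errors be detected from shape metric distances and then corrected; they do not say how. The decision rule below is therefore a hypothetical one chosen for illustration: a cluster is treated as mislabeled when its nearest neighbor carries a different code, lies within an assumed `max_distance`, and has more member images (the assumed `count` field). The `shape_distance` metric is likewise an assumption.

```python
def shape_distance(a, b):
    """Assumed shape metric: mean absolute pixel difference between two
    cluster images; different sizes count as infinitely far apart."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        return float("inf")
    diffs = [abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)

def detect_and_correct(clusters, max_distance=0.1):
    """Return one code per cluster, replacing codes judged erroneous.

    Hypothetical rule: adopt the nearest cluster's code when that cluster has
    a different code, is within `max_distance`, and holds more clip images.
    Decisions use the original codes, so the result is order-independent.
    """
    corrected = []
    for i, cluster in enumerate(clusters):
        nearest, nearest_d = None, float("inf")
        for j, other in enumerate(clusters):
            if j == i:
                continue
            d = shape_distance(cluster["image"], other["image"])
            if d < nearest_d:
                nearest, nearest_d = other, d
        code = cluster["code"]
        if (nearest is not None
                and nearest["code"] != code
                and nearest_d <= max_distance
                and nearest["count"] > cluster["count"]):
            code = nearest["code"]  # treat the small look-alike cluster as an OCR error
        corrected.append(code)
    return corrected

# Toy cluster images: a large "m" cluster, a tiny near-identical cluster that
# the OCR engine labeled "rn", and an unrelated "o" cluster.
clusters = [
    {"code": "m",  "image": [[1.0, 0.0], [1.0, 1.0]], "count": 40},
    {"code": "rn", "image": [[1.0, 0.0], [1.0, 0.9]], "count": 2},
    {"code": "o",  "image": [[0.0, 1.0], [0.0, 0.0]], "count": 30},
]
result = detect_and_correct(clusters)
print(result)  # ['m', 'm', 'o']
```

Here the two-image "rn" cluster is nearly identical in shape to the forty-image "m" cluster, so its code is replaced with "m". The "o" cluster is far from both and is left unchanged.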
Specification