Shape clustering in post optical character recognition processing
Abstract
Techniques for shape clustering and their applications in processing various documents, including the output of an optical character recognition (OCR) process.
120 Citations
113 Claims
1. A method, comprising:
- classifying clip images defined in a received OCR output of a document processed by an optical character recognition (OCR) process into a plurality of clusters of clip images, each cluster including clip images that are assigned the same one or more character codes by the OCR process;
- processing clip images in each of the plurality of clusters to generate a cluster image for each cluster;
- comparing the cluster images to detect clusters to which one or more OCR character codes were erroneously assigned by the OCR process;
- assigning one or more new OCR character codes to a first cluster that is detected to have an erroneously assigned one or more OCR character codes in the OCR output; and
- using the one or more new OCR character codes to replace the erroneously assigned OCR character code at each occurrence of one of the clip images of the first cluster in the OCR output to produce a modified OCR output.
Dependent claims: 2-41, 50
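The steps of claim 1 can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the claims: the `Clip` and `Cluster` types, the mean-absolute-difference shape metric, the two thresholds, the use of a pixel-wise mean as the cluster image, and the heuristic that a smaller cluster closely matching a larger cluster with a different code is treated as erroneously coded. All clip images are assumed to have the same size.

```python
from dataclasses import dataclass

Image = tuple[tuple[float, ...], ...]  # rows of pixel intensities in [0, 1]

@dataclass
class Clip:
    code: str      # character code assigned by the OCR process
    pixels: Image  # clip image (hypothetical fixed-size representation)

@dataclass
class Cluster:
    code: str
    members: list[int]  # indices of member clips in the OCR output
    image: Image        # cluster image (here: pixel-wise mean of the members)

def distance(a: Image, b: Image) -> float:
    """Mean absolute pixel difference, a stand-in for a real shape metric."""
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))

def mean_image(images: list[Image]) -> Image:
    """Pixel-wise mean of equally sized images."""
    n, rows, cols = len(images), len(images[0]), len(images[0][0])
    return tuple(
        tuple(sum(img[r][c] for img in images) / n for c in range(cols))
        for r in range(rows)
    )

def build_clusters(clips: list[Clip], join_threshold: float = 0.2) -> list[Cluster]:
    """Group clips that share an OCR code and have a similar shape."""
    clusters: list[Cluster] = []
    for i, clip in enumerate(clips):
        for cl in clusters:
            if cl.code == clip.code and distance(cl.image, clip.pixels) < join_threshold:
                cl.members.append(i)
                cl.image = mean_image([clips[j].pixels for j in cl.members])
                break
        else:  # no existing cluster matched: start a new one
            clusters.append(Cluster(clip.code, [i], clip.pixels))
    return clusters

def correct_codes(clips: list[Clip], join_threshold: float = 0.2,
                  match_threshold: float = 0.2) -> list[str]:
    """Return the modified sequence of codes after cluster-level correction."""
    clusters = build_clusters(clips, join_threshold)
    codes = [clip.code for clip in clips]
    for small in clusters:
        # Assumed heuristic: only larger clusters with a different code can overrule.
        rivals = [c for c in clusters
                  if c.code != small.code and len(c.members) > len(small.members)]
        if not rivals:
            continue
        best = min(rivals, key=lambda c: distance(c.image, small.image))
        if distance(best.image, small.image) < match_threshold:
            for i in small.members:  # replace the code at every occurrence
                codes[i] = best.code
    return codes
```

Calling `correct_codes` on a list of `Clip` objects returns one code per clip, in the original order, with the codes of any overruled cluster replaced at every occurrence.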
42. A computer program product, encoded on a computer-readable medium, operable to cause data processing apparatus to perform operations comprising:
- classifying clip images defined in a received OCR output of a document processed by an optical character recognition (OCR) process into a plurality of clusters of clip images, each cluster including clip images that are assigned the same one or more character codes by the OCR process;
- processing clip images in each of the plurality of clusters to generate a cluster image for each cluster;
- comparing the cluster images to detect clusters to which one or more OCR character codes were erroneously assigned by the OCR process;
- assigning one or more new OCR character codes to a first cluster that is detected to have an erroneously assigned one or more OCR character codes in the OCR output; and
- using the one or more new OCR character codes to replace the erroneously assigned OCR character code at each occurrence of one of the clip images of the first cluster in the OCR output to produce a modified OCR output.
43. A system for optical character recognition (OCR), comprising:
- an OCR engine operable to process an original image of a document to produce an OCR output including clip images extracted from the original image and to assign one or more characters to each clip image; and
- a post-OCR engine operable to classify clip images in the OCR output into a plurality of clusters of clip images, each cluster including clip images that are assigned the same one or more character codes by the OCR engine;
- wherein the post-OCR engine is operable to process clip images in each of the plurality of clusters to generate a cluster image for each cluster, compare the cluster images to detect clusters to which one or more OCR character codes were erroneously assigned by the OCR engine, assign one or more new OCR character codes to a first cluster that is detected to have an erroneously assigned one or more OCR character codes in the OCR output, and use the one or more new OCR character codes to replace the erroneously assigned OCR character code at each occurrence of one of the clip images of the first cluster in the OCR output to produce a modified OCR output.
Dependent claims: 44-46
47. A method for post optical character recognition (OCR) processing, comprising:
- classifying clip images defined in a received OCR output of a document processed by an optical character recognition (OCR) process into a plurality of clusters of clip images, each cluster including clip images that are identical or similar in size and are assigned the same one or more character codes by the OCR process;
- processing clip images in each of the plurality of clusters to generate a cluster image for each cluster;
- for a first cluster assigned one or more first OCR character codes, identifying (1) a second cluster assigned one or more second OCR character codes different from the one or more first OCR character codes, where the cluster image of the second cluster is closer in shape to a cluster image of the first cluster than to cluster images of other clusters assigned one or more OCR characters different from the one or more first OCR character codes, and (2) a third cluster assigned the same one or more first OCR character codes as the first cluster, where the cluster image of the third cluster is closer in shape to the cluster image of the first cluster than to the cluster images of other clusters assigned the one or more first OCR character codes; and
- using at least shape differences between the cluster images of the first cluster and the second cluster and between the cluster images of the first cluster and the third cluster to determine a level of confidence in the one or more first OCR character codes assigned to the first cluster.
Dependent claims: 48, 49, 51-71
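The confidence determination in claim 47 can be sketched as follows. The claim only says that the shape differences to the nearest differently coded cluster and to the nearest same-coded cluster are used; the ratio formula below, the flattened image representation, and the names `shape_distance` and `code_confidence` are assumptions chosen for illustration, not the patented method.

```python
from typing import Optional

Image = tuple[float, ...]        # flattened cluster image, intensities in [0, 1]
ClusterEntry = tuple[str, Image]  # (OCR character code, cluster image)

def shape_distance(a: Image, b: Image) -> float:
    """Mean absolute pixel difference, a stand-in for a real shape metric."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def code_confidence(index: int, clusters: list[ClusterEntry]) -> Optional[float]:
    """Confidence in the code of clusters[index], in [0, 1].

    Assumed formula: d_diff / (d_same + d_diff), where d_diff is the distance
    to the nearest cluster with a different code (the "second cluster") and
    d_same is the distance to the nearest other cluster with the same code
    (the "third cluster"). Returns None if either neighbour does not exist.
    """
    code, image = clusters[index]
    same = [shape_distance(image, img)
            for i, (c, img) in enumerate(clusters) if i != index and c == code]
    diff = [shape_distance(image, img)
            for i, (c, img) in enumerate(clusters) if i != index and c != code]
    if not same or not diff:
        return None
    d_same, d_diff = min(same), min(diff)
    if d_same + d_diff == 0:
        return 0.5  # identical to both neighbours: no evidence either way
    return d_diff / (d_same + d_diff)
```

Under this assumed formula, a cluster that looks like its same-coded neighbour and unlike any differently coded cluster scores close to 1, while a cluster that looks more like a differently coded cluster scores close to 0 and is a candidate for review.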
72. A computer program product, encoded on a computer-readable medium, operable to cause data processing apparatus to perform operations comprising:
- classifying clip images defined in a received OCR output of a document processed by an optical character recognition (OCR) process into a plurality of clusters of clip images, each cluster including clip images that are identical or similar in size and are assigned the same one or more character codes by the OCR process;
- processing clip images in each of the plurality of clusters to generate a cluster image for each cluster;
- for a first cluster assigned one or more first OCR character codes, identifying (1) a second cluster assigned one or more second OCR character codes different from the one or more first OCR character codes, where the cluster image of the second cluster is closer in shape to a cluster image of the first cluster than to cluster images of other clusters assigned one or more OCR characters different from the one or more first OCR character codes, and (2) a third cluster assigned the same one or more first OCR character codes as the first cluster, where the cluster image of the third cluster is closer in shape to the cluster image of the first cluster than to the cluster images of other clusters assigned the one or more first OCR character codes; and
- using at least shape differences between the cluster images of the first cluster and the second cluster and between the cluster images of the first cluster and the third cluster to determine a level of confidence in the one or more first OCR character codes assigned to the first cluster.
73. A system for optical character recognition (OCR), comprising:
- an OCR engine operable to process an original image of a document to produce an OCR output including clip images extracted from the original image and to assign one or more characters to each clip image; and
- a post-OCR engine operable to classify clip images in the OCR output into a plurality of clusters of clip images, each cluster including clip images that are identical or similar in size and are assigned the same one or more character codes by the OCR engine;
- wherein the post-OCR engine is operable to process clip images in each of the plurality of clusters to generate a cluster image for each cluster,
- wherein the post-OCR engine is operable to identify, for a first cluster assigned one or more first OCR character codes, (1) a second cluster assigned one or more second OCR character codes different from the one or more first OCR character codes, where the cluster image of the second cluster is closer in shape to a cluster image of the first cluster than to cluster images of other clusters assigned one or more OCR characters different from the one or more first OCR character codes, and (2) a third cluster assigned the same one or more first OCR character codes as the first cluster, where the cluster image of the third cluster is closer in shape to the cluster image of the first cluster than to the cluster images of other clusters assigned the one or more first OCR character codes; and
- wherein the post-OCR engine is operable to use at least shape differences between the cluster images of the first cluster and the second cluster and between the cluster images of the first cluster and the third cluster to determine a level of confidence in the one or more first OCR character codes assigned to the first cluster.
Dependent claims: 74-76
77. A system for optical character recognition (OCR), comprising:
- a cluster generation engine operable to receive an OCR output including separated images produced by an OCR engine in processing an original image of a document and one or more characters assigned to each separated image by the OCR engine, the cluster generation engine operable to classify the separated images in the OCR output into a plurality of clusters of separated images that are of a particular image size and are assigned the same one or more OCR character codes by the OCR engine; and
- a cluster processing engine operable to obtain shape metric distances between a cluster image of a cluster and cluster images of other clusters and to detect whether an error exists in assignment of one or more OCR character codes assigned to each cluster by the OCR engine based on the obtained shape metric distances, the cluster processing engine further operable to correct one or more erroneously assigned OCR character codes for a cluster.
Dependent claims: 78-92
93. A method for optical character recognition (OCR), comprising:
- receiving an OCR output including separated images produced by an OCR engine in processing an original image of a document and one or more characters assigned to each separated image by the OCR engine,
- classifying the separated images in the OCR output into a plurality of clusters of separated images that are of a particular image size and are assigned the same one or more OCR character codes by the OCR engine;
- obtaining shape metric distances between a cluster image of a cluster and cluster images of other clusters and to detect whether an error exists in assignment of one or more OCR character codes assigned to each cluster by the OCR engine based on the obtained shape metric distances; and
- correcting one or more erroneously assigned OCR character codes for a cluster.
Dependent claims: 94-103
104. A computer program product, encoded on a computer-readable medium, operable to cause data processing apparatus to perform operations comprising:
- receiving an OCR output including separated images produced by an optical character recognition (OCR) process in processing an original image of a document and one or more characters assigned to each separated image by the OCR process,
- classifying the separated images in the OCR output into a plurality of clusters of separated images that are of a particular image size and are assigned the same one or more OCR character codes by the OCR process;
- obtaining shape metric distances between a cluster image of a cluster and cluster images of other clusters and to detect whether an error exists in assignment of one or more OCR character codes assigned to each cluster by the OCR process based on the obtained shape metric distances; and
- correcting one or more erroneously assigned OCR character codes for a cluster.
105. A method, comprising:
- classifying clip images defined in a received OCR output, from an optical character recognition (OCR) process that processes an original document image, into a plurality of clusters of clip images, each cluster comprising clip images of identical or similar image sizes and shapes that are assigned the same one or more particular characters by the OCR process; and
- applying gray scale or color information from the original document image in averaging clip images in each cluster to generate an averaged image for each cluster.
Dependent claims: 106-109
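The averaging step of claim 105 can be sketched as below. The sketch assumes, purely for illustration, that the original document image is a gray-scale grid of integers, that each clip in a cluster is located by a hypothetical `(top, left, height, width)` bounding box, and that all boxes in a cluster have identical size; the claim itself also allows similar sizes and color information.

```python
Page = list[list[int]]          # original gray-scale page, rows of 0..255 values
Box = tuple[int, int, int, int]  # hypothetical (top, left, height, width)

def crop(page: Page, box: Box) -> list[list[int]]:
    """Cut the gray-scale pixels of one clip out of the original page."""
    top, left, height, width = box
    return [row[left:left + width] for row in page[top:top + height]]

def averaged_image(page: Page, boxes: list[Box]) -> list[list[float]]:
    """Pixel-wise mean of the gray-scale crops for all clips in one cluster."""
    if not boxes:
        raise ValueError("a cluster needs at least one clip")
    height, width = boxes[0][2], boxes[0][3]
    if any((b[2], b[3]) != (height, width) for b in boxes):
        # Simplifying assumption of this sketch: identical sizes only.
        raise ValueError("all clips in a cluster must have the same size here")
    crops = [crop(page, b) for b in boxes]
    return [
        [sum(c[r][k] for c in crops) / len(crops) for k in range(width)]
        for r in range(height)
    ]
```

Averaging the gray-scale crops, instead of binarized clip images, retains intensity detail from the original page in the averaged image.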
110. A system for optical character recognition (OCR), comprising:
- an OCR engine operable to process an original image of a document to produce an OCR output including clip images extracted from the original image and to assign one or more characters to each clip image; and
- a post-OCR engine operable to classify clip images in the OCR output into a plurality of clusters of clip images, each cluster comprising clip images of identical or similar image sizes and shapes that are assigned the same one or more particular characters by the OCR engine, and wherein the post-OCR engine is operable to apply gray scale or color information from the original document image in averaging clip images in each cluster to generate an averaged image for each cluster.
Dependent claims: 111, 112
113. A computer program product, encoded on a computer-readable medium, operable to cause data processing apparatus to perform operations comprising:
- classifying clip images defined in a received OCR output, from an optical character recognition (OCR) process that processes an original document image, into a plurality of clusters of clip images, each cluster comprising clip images of identical or similar image sizes and shapes that are assigned the same one or more particular characters by the OCR process; and
- applying gray scale or color information from the original document image in averaging clip images in each cluster to generate an averaged image for each cluster.
Specification