Shape clustering in post optical character recognition processing
Abstract
Systems, methods and computer program products on storage devices for shape clustering and applications in processing various documents, including an output of an optical character recognition (OCR) process. The output of an OCR process is classified into a plurality of clusters of clip images and a representative image for each cluster is generated to identify clusters whose clip images were incorrectly assigned character codes by the OCR process.
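The grouping and representative-image steps summarized in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions: clip images are taken to be equally sized grids of pixel intensities, and the representative image is computed as a pixel-wise mean (the claims elsewhere mention an "averaged image", but the exact procedure is not given here). The names `cluster_by_code` and `cluster_image` are hypothetical.

```python
from collections import defaultdict

def cluster_by_code(clips):
    """Group (code, image) pairs so each cluster holds clips sharing one OCR code."""
    clusters = defaultdict(list)
    for code, image in clips:
        clusters[code].append(image)
    return dict(clusters)

def cluster_image(images):
    """Pixel-wise mean of equally sized images (assumed way to build one image per cluster)."""
    height, width = len(images[0]), len(images[0][0])
    return [
        [sum(img[r][c] for img in images) / len(images) for c in range(width)]
        for r in range(height)
    ]

# Toy clip images: two clips the OCR engine labeled "o", one labeled "l".
clips = [
    ("o", [[0, 1, 0], [1, 0, 1], [0, 1, 0]]),
    ("o", [[0, 1, 0], [1, 0, 1], [0, 1, 1]]),
    ("l", [[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
]
clusters = cluster_by_code(clips)
representatives = {code: cluster_image(imgs) for code, imgs in clusters.items()}
```

Averaging many clips tends to suppress noise in any single clip, which is one plausible reason to compare cluster images rather than individual clips.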
27 Citations
14 Claims
1. A computer-implemented method, comprising:
- classifying clip images defined in a received OCR output of a document processed by an optical character recognition (OCR) process into a plurality of clusters of clip images, each cluster including clip images that are assigned the same one or more character codes by the OCR process, wherein the OCR process performed on one or more computers generates the received OCR output;
- processing clip images in each of the plurality of clusters to generate exactly one cluster image for each cluster;
- classifying a first cluster in the plurality of clusters as a suspect cluster;
- identifying a nearest cluster to the suspect cluster, the nearest cluster being nearest based on a shape distance between the cluster image for the nearest cluster and the cluster image for the suspect cluster;
- replacing the one or more character codes assigned to the suspect cluster with character codes assigned to the nearest cluster at each occurrence of one of the clip images of the suspect cluster in the OCR output to produce a modified OCR output;
- directing a cluster image of a selected cluster to an on-line server which is operable to direct the cluster image to one or more users for manual identification of the cluster image, including using an on-line game provided by the on-line server to supply the cluster image of the selected cluster to the one or more users for a user response as part of the on-line game; and
- using manual identification of the cluster image returned from the on-line server to verify one or more OCR character codes assigned to the selected cluster or assign new one or more OCR character codes to the selected cluster.
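The nearest-cluster and code-replacement steps of claim 1 can be illustrated as follows. The claim does not define the shape distance, so the sketch assumes a sum of squared pixel differences between equally sized cluster images; the OCR output is simplified to a flat list of character codes, one per clip. The function names and the toy data are hypothetical.

```python
def shape_distance(a, b):
    """Sum of squared pixel differences between two equally sized images (assumed metric)."""
    return sum((pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def nearest_cluster(suspect, cluster_images):
    """Return the code of the cluster whose image is closest to the suspect's image."""
    target = cluster_images[suspect]
    others = (code for code in cluster_images if code != suspect)
    return min(others, key=lambda code: shape_distance(cluster_images[code], target))

def replace_codes(ocr_output, suspect, replacement):
    """Swap the suspect cluster's code at every occurrence in the OCR output."""
    return [replacement if code == suspect else code for code in ocr_output]

# Toy cluster images; "@" stands for a suspect cluster that looks almost like "e".
cluster_images = {
    "e": [[1, 1, 1], [1, 1, 0], [1, 1, 1]],
    "c": [[1, 1, 1], [1, 0, 0], [1, 1, 1]],
    "@": [[1, 1, 1], [1, 1, 0], [1, 1, 0.8]],
}
ocr_output = ["c", "@", "e", "@"]
nearest = nearest_cluster("@", cluster_images)
modified = replace_codes(ocr_output, "@", nearest)
```

Here the suspect image differs from the "e" image in a single faint pixel, so "e" is selected and both occurrences of "@" are rewritten.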
2. A computer-implemented method, comprising:
- classifying clip images defined in a received OCR output of a document processed by an optical character recognition (OCR) process into a plurality of clusters of clip images, each cluster including clip images that are assigned the same one or more character codes by the OCR process, wherein the OCR process performed on one or more computers generates the received OCR output;
- processing clip images in each of the plurality of clusters to generate exactly one cluster image for each cluster;
- classifying a first cluster in the plurality of clusters as a suspect cluster;
- identifying a nearest cluster to the suspect cluster, the nearest cluster being nearest based on a shape distance between the cluster image for the nearest cluster and the cluster image for the suspect cluster;
- replacing the one or more character codes assigned to the suspect cluster with character codes assigned to the nearest cluster at each occurrence of one of the clip images of the suspect cluster in the OCR output to produce a modified OCR output;
- directing a cluster image of a selected cluster to an on-line server which is operable to direct the cluster image to one or more users for manual identification of the cluster image, including using an on-line service provided by the on-line server to supply the cluster image of the selected cluster as part of a challenge-response test; and
- using manual identification of the cluster image returned from the on-line server to verify one or more OCR character codes assigned to the selected cluster or assign new one or more OCR character codes to the selected cluster.
Dependent claims: 3
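The final step of claim 2, using manual identification to verify or reassign a cluster's codes, might look like the sketch below. The claim does not say how multiple user responses are combined; the majority-vote rule, the `min_agreement` threshold, and the function name are all assumptions made for illustration.

```python
from collections import Counter

def resolve_manual_identification(assigned_code, user_responses, min_agreement=0.5):
    """Return (code, status) after comparing user answers with the OCR-assigned code.

    Assumed rule: the most common answer wins only if more than `min_agreement`
    of the users gave it; otherwise the original code is kept as unverified.
    """
    if not user_responses:
        return assigned_code, "unverified"
    answer, votes = Counter(user_responses).most_common(1)[0]
    if votes / len(user_responses) <= min_agreement:
        return assigned_code, "unverified"
    if answer == assigned_code:
        return assigned_code, "verified"
    return answer, "reassigned"

# Users looking at the cluster image mostly read "m" where OCR had assigned "rn".
result = resolve_manual_identification("rn", ["m", "m", "rn"])
```

Requiring agreement among several users is one way to guard against a single careless answer in a game or challenge-response setting.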
4. A computer-implemented method, comprising:
- classifying clip images defined in a received OCR output of a document processed by an optical character recognition (OCR) process into a plurality of clusters of clip images, each cluster including clip images that are assigned the same one or more character codes by the OCR process, wherein the OCR process performed on one or more computers generates the received OCR output;
- processing clip images in each of the plurality of clusters to generate exactly one cluster image for each cluster;
- classifying a first cluster in the plurality of clusters as a suspect cluster;
- identifying a nearest cluster to the suspect cluster, the nearest cluster being nearest based on a shape distance between the cluster image for the nearest cluster and the cluster image for the suspect cluster;
- replacing the one or more character codes assigned to the suspect cluster with character codes assigned to the nearest cluster at each occurrence of one of the clip images of the suspect cluster in the OCR output to produce a modified OCR output;
- selecting a cluster image of a cluster including a high confidence score as a cluster image template;
- aligning the cluster image template with a plurality of different portions within an image of a word in the received OCR output along a predetermined direction, one portion at a time;
- obtaining shape metric distances between the cluster image template and the plurality of different portions of the word image, respectively;
- using the obtained shape metric distances to determine whether a portion of the word image matches the cluster image template;
- separating a matching portion of the word image that matches the cluster image template from one or more other portions of the word image; and
- assigning one or more OCR character codes assigned to the cluster to the separated matching portion of the word image in the modified OCR output.
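The template-sliding steps of claim 4 can be sketched as a one-dimensional search. The sketch assumes the predetermined direction is left to right, that the template and word image share the same height, and that a match is declared when the smallest squared-difference distance falls at or below a chosen threshold. None of these specifics come from the claim, and the function names are hypothetical.

```python
def column_slice(image, start, stop):
    """Cut columns [start, stop) out of an image stored as a list of rows."""
    return [row[start:stop] for row in image]

def shape_distance(a, b):
    """Sum of squared pixel differences between two equally sized images (assumed metric)."""
    return sum((pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def find_template(word_image, template, threshold):
    """Slide the template across the word image; return the best offset or None."""
    t_width = len(template[0])
    distances = [
        shape_distance(column_slice(word_image, x, x + t_width), template)
        for x in range(len(word_image[0]) - t_width + 1)
    ]
    if not distances:  # word image is narrower than the template
        return None
    best = min(range(len(distances)), key=distances.__getitem__)
    return best if distances[best] <= threshold else None

def split_at_match(word_image, offset, t_width):
    """Separate the matching portion from what lies to its left and right."""
    width = len(word_image[0])
    return (column_slice(word_image, 0, offset),
            column_slice(word_image, offset, offset + t_width),
            column_slice(word_image, offset + t_width, width))

template = [[1, 0], [1, 0], [1, 1]]  # cluster image of a trusted cluster
word_image = [[0, 0, 1, 0, 0],
              [0, 1, 1, 0, 1],
              [1, 0, 1, 1, 0]]
offset = find_template(word_image, template, threshold=0.5)
left, match, right = split_at_match(word_image, offset, len(template[0]))
```

The separated `match` portion would then receive the character codes of the template's cluster, while `left` and `right` remain to be recognized.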
5. A computer-implemented method, comprising:
- classifying clip images defined in a received OCR output of a document processed by an optical character recognition (OCR) process into a plurality of clusters of clip images, each cluster including clip images that are assigned the same one or more character codes by the OCR process, wherein the OCR process performed on one or more computers generates the received OCR output;
- processing clip images in each of the plurality of clusters to generate exactly one cluster image for each cluster;
- classifying a first cluster in the plurality of clusters as a suspect cluster;
- identifying a nearest cluster to the suspect cluster, the nearest cluster being nearest based on a shape distance between the cluster image for the nearest cluster and the cluster image for the suspect cluster;
- replacing the one or more character codes assigned to the suspect cluster with character codes assigned to the nearest cluster at each occurrence of one of the clip images of the suspect cluster in the OCR output to produce a modified OCR output;
- selecting a first cluster image of a first cluster as a cluster image template;
- aligning the cluster image template with a plurality of different portions within a second cluster image of a second selected cluster along a predetermined direction, one portion at a time;
- obtaining shape metric distances between the cluster image template and the plurality of different portions of the second cluster image, respectively;
- using the obtained shape metric distances to determine whether a portion of the second cluster image matches the cluster image template;
- using a matching portion of the second cluster image that matches the cluster image template as a new third cluster image for a third cluster to be formed;
- separating a corresponding portion in each clip image in the second cluster that corresponds to the cluster image template to form a new clip image;
- using new clip images respectively separated from the clip images in the second cluster to form the third cluster;
- assigning one or more OCR character codes assigned to the first cluster to the third cluster;
- using remainders of clip images of the second cluster after separation of the new clip images for the third cluster to form at least one fourth cluster; and
- using the third cluster and the at least one fourth cluster to replace the second cluster in producing the modified OCR output.
Dependent claims: 6
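Once a template has been matched at some offset within the second cluster image, the cluster-splitting steps of claim 5 amount to cutting every clip at the same columns. The sketch below assumes the offset and template width are already known, that all clips in the second cluster are aligned and non-empty, and that left and right remainders each become their own fourth cluster. The dictionary layout and the name `split_cluster` are hypothetical.

```python
def split_cluster(clips, offset, width):
    """Cut each clip at the matched columns; return (matched clips, remainder groups)."""
    matched = [[row[offset:offset + width] for row in clip] for clip in clips]
    left = [[row[:offset] for row in clip] for clip in clips]
    right = [[row[offset + width:] for row in clip] for clip in clips]
    # Keep only remainder groups that actually contain pixels.
    remainders = [part for part in (left, right) if part[0][0]]
    return matched, remainders

# A second cluster the OCR engine coded as "rn"; the "r" template matched at offset 0.
second_cluster = {
    "code": "rn",
    "clips": [
        [[1, 1, 1, 0], [1, 0, 1, 1]],
        [[1, 1, 1, 1], [1, 0, 1, 1]],
    ],
}
third_clips, remainders = split_cluster(second_cluster["clips"], offset=0, width=2)
third_cluster = {"code": "r", "clips": third_clips}  # inherits the first cluster's code
fourth_clusters = [{"code": None, "clips": part} for part in remainders]
```

The fourth clusters start without a code here; later claims describe assigning one from the nearest cluster or from a person.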
7. A computer-implemented method, comprising:
- classifying clip images defined in a received OCR output of a document processed by an optical character recognition (OCR) process into a plurality of clusters of clip images, each cluster including clip images that are assigned the same one or more character codes by the OCR process, wherein the OCR process performed on one or more computers generates the received OCR output;
- processing clip images in each of the plurality of clusters to generate exactly one cluster image for each cluster;
- classifying a first cluster in the plurality of clusters as a suspect cluster;
- identifying a nearest cluster to the suspect cluster, the nearest cluster being nearest based on a shape distance between the cluster image for the nearest cluster and the cluster image for the suspect cluster;
- replacing the one or more character codes assigned to the suspect cluster with character codes assigned to the nearest cluster at each occurrence of one of the clip images of the suspect cluster in the OCR output to produce a modified OCR output;
- selecting a first cluster image of a first cluster as a cluster image template;
- aligning the cluster image template with a plurality of different portions within a second cluster image of a second selected cluster along a predetermined direction, one portion at a time;
- obtaining shape metric distances between the cluster image template and the plurality of different portions of the second cluster image, respectively;
- using the obtained shape metric distances to determine whether a portion of the second cluster image matches the cluster image template;
- using a matching portion of the second cluster image that matches the cluster image template as a new third cluster image for a third cluster to be formed;
- separating a corresponding portion in each clip image in the second cluster that corresponds to the cluster image template to form a new clip image;
- using new clip images respectively separated from the clip images in the second cluster to form the third cluster;
- assigning one or more OCR character codes assigned to the first cluster to the third cluster;
- using remainders of clip images of the second cluster after separation of the new clip images for the third cluster to form at least one fourth cluster;
- using the third cluster and the at least one fourth cluster to replace the second cluster in producing the modified OCR output;
- obtaining shape metric distances between a cluster image of the fourth cluster and cluster images of other clusters; and
- assigning to the fourth cluster one or more OCR character codes assigned to a fifth cluster that has a shortest shape metric distance from the fourth cluster among all other clusters.
8. A computer-implemented method, comprising:
- classifying clip images defined in a received OCR output of a document processed by an optical character recognition (OCR) process into a plurality of clusters of clip images, each cluster including clip images that are assigned the same one or more character codes by the OCR process, wherein the OCR process performed on one or more computers generates the received OCR output;
- processing clip images in each of the plurality of clusters to generate exactly one cluster image for each cluster;
- classifying a first cluster in the plurality of clusters as a suspect cluster;
- identifying a nearest cluster to the suspect cluster, the nearest cluster being nearest based on a shape distance between the cluster image for the nearest cluster and the cluster image for the suspect cluster;
- replacing the one or more character codes assigned to the suspect cluster with character codes assigned to the nearest cluster at each occurrence of one of the clip images of the suspect cluster in the OCR output to produce a modified OCR output;
- selecting a first cluster image of a first cluster as a cluster image template;
- aligning the cluster image template with a plurality of different portions within a second cluster image of a second selected cluster along a predetermined direction, one portion at a time;
- obtaining shape metric distances between the cluster image template and the plurality of different portions of the second cluster image, respectively;
- using the obtained shape metric distances to determine whether a portion of the second cluster image matches the cluster image template;
- using a matching portion of the second cluster image that matches the cluster image template as a new third cluster image for a third cluster to be formed;
- separating a corresponding portion in each clip image in the second cluster that corresponds to the cluster image template to form a new clip image;
- using new clip images respectively separated from the clip images in the second cluster to form the third cluster;
- assigning one or more OCR character codes assigned to the first cluster to the third cluster;
- using remainders of clip images of the second cluster after separation of the new clip images for the third cluster to form at least one fourth cluster;
- using the third cluster and the at least one fourth cluster to replace the second cluster in producing the modified OCR output;
- soliciting manual identification of the cluster image of the fourth cluster from a person; and
- using the manual identification by the person to generate new one or more OCR character codes for the fourth cluster.
9. A computer-implemented method, comprising:
- classifying clip images defined in a received OCR output of a document processed by an optical character recognition (OCR) process into a plurality of clusters of clip images, each cluster including clip images that are assigned the same one or more character codes by the OCR process, wherein the OCR process performed on one or more computers generates the received OCR output;
- processing clip images in each of the plurality of clusters to generate exactly one cluster image for each cluster;
- classifying a first cluster in the plurality of clusters as a suspect cluster, including using a number of clip images in each cluster and shape metric distances of each cluster relative to other clusters to classify the plurality of clusters into acceptable clusters and suspect clusters, wherein assigned one or more particular character codes for each acceptable cluster are used in a modified OCR output without further processing and assigned one or more particular character codes for each suspect cluster are marked to be further processed;
- identifying a nearest acceptable cluster to the suspect cluster, the nearest acceptable cluster being nearest based on a shape distance between the cluster image for the nearest acceptable cluster and the cluster image for the suspect cluster; and
- replacing the one or more character codes assigned to the suspect cluster with character codes assigned to the nearest acceptable cluster at each occurrence of one of the clip images of the suspect cluster in the OCR output to produce the modified OCR output when the cluster image for the nearest acceptable cluster has a shortest shape metric distance from the cluster image for the suspect cluster and the nearest acceptable cluster has a larger number of clip images than the suspect cluster.
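Claim 9 classifies clusters using both clip counts and shape distances, and replaces a suspect cluster's codes only when the nearest acceptable cluster is also the larger one. The claim does not state the classification rule itself, so the sketch below assumes one: a cluster is suspect when it has fewer than `min_clips` clips and lies within `close_distance` of some other cluster. The thresholds, function names, and squared-difference metric are illustrative assumptions.

```python
def shape_distance(a, b):
    """Sum of squared pixel differences between two equally sized images (assumed metric)."""
    return sum((pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def classify_clusters(counts, images, min_clips, close_distance):
    """Assumed rule: small clusters that closely resemble another cluster are suspect."""
    suspects = set()
    for code in counts:
        nearest = min(
            (shape_distance(images[code], images[other])
             for other in counts if other != code),
            default=float("inf"),
        )
        if counts[code] < min_clips and nearest <= close_distance:
            suspects.add(code)
    return set(counts) - suspects, suspects

def replacement_for(suspect, acceptable, counts, images):
    """Code of the nearest acceptable cluster, but only if it has more clips."""
    if not acceptable:
        return None
    nearest = min(acceptable,
                  key=lambda code: shape_distance(images[code], images[suspect]))
    return nearest if counts[nearest] > counts[suspect] else None

images = {
    "e": [[1, 1, 1], [1, 1, 0], [1, 1, 1]],
    "c": [[1, 1, 1], [1, 0, 0], [1, 1, 1]],
    "@": [[1, 1, 1], [1, 1, 0], [1, 1, 0.8]],
}
counts = {"e": 120, "c": 95, "@": 2}  # number of clip images per cluster
acceptable, suspects = classify_clusters(counts, images, min_clips=5, close_distance=0.5)
new_code = replacement_for("@", acceptable, counts, images)
```

The size comparison reflects the claim's condition that the nearest acceptable cluster has a larger number of clip images than the suspect cluster; when that condition fails, the sketch leaves the suspect's codes untouched.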
10. A computer-implemented method, comprising:
- classifying clip images defined in a received OCR output of a document processed by an optical character recognition (OCR) process into a plurality of clusters of clip images, each cluster including clip images that are assigned the same one or more character codes by the OCR process, wherein the OCR process performed on one or more computers generates the received OCR output;
- processing clip images in each of the plurality of clusters to generate exactly one cluster image for each cluster;
- classifying a first cluster in the plurality of clusters as a suspect cluster, including using a number of clip images in each cluster and shape metric distances of each cluster relative to other clusters to classify the plurality of clusters into acceptable clusters and suspect clusters, wherein assigned one or more particular character codes for each acceptable cluster are used in a modified OCR output without further processing and assigned one or more particular character codes for each suspect cluster are marked to be further processed;
- identifying a nearest cluster to the suspect cluster, the nearest cluster being nearest based on a shape distance between the cluster image for the nearest cluster and the cluster image for the suspect cluster;
- replacing the one or more character codes assigned to the suspect cluster with character codes assigned to the nearest cluster at each occurrence of one of the clip images of the suspect cluster in the OCR output to produce a modified OCR output;
- selecting a cluster image of the suspect cluster to be manually identified by a person;
- using one or more manually generated character codes returned by the person for the selected suspect cluster to verify or replace previously assigned one or more particular character codes for the selected suspect cluster; and
- reclassifying the selected suspect cluster assigned the one or more manually generated character codes as an acceptable cluster.
11. A computer-implemented method, comprising:
- classifying clip images defined in a received OCR output of a document processed by an optical character recognition (OCR) process into a plurality of clusters of clip images, each cluster including clip images that are assigned the same one or more character codes by the OCR process, wherein the OCR process performed on one or more computers generates the received OCR output;
- processing clip images in each of the plurality of clusters to generate exactly one cluster image for each cluster;
- classifying a first cluster in the plurality of clusters as a suspect cluster, including using a number of clip images in each cluster and shape metric distances of each cluster relative to other clusters to classify the plurality of clusters into acceptable clusters and suspect clusters, wherein assigned one or more particular character codes for each acceptable cluster are used in a modified OCR output without further processing and assigned one or more particular character codes for each suspect cluster are marked to be further processed;
- identifying a nearest cluster to the suspect cluster, the nearest cluster being nearest based on a shape distance between the cluster image for the nearest cluster and the cluster image for the suspect cluster;
- replacing the one or more character codes assigned to the suspect cluster with character codes assigned to the nearest cluster at each occurrence of one of the clip images of the suspect cluster in the OCR output to produce a modified OCR output;
- directing a cluster image of a selected suspect cluster to an on-line server which is operable to direct the averaged image to one or more users that interact with the on-line server and to solicit one or more manually generated character codes for the averaged image from the one or more users;
- using one or more manually generated character codes returned from the on-line server to verify or replace previously assigned one or more particular character codes for the selected suspect cluster; and
- reclassifying the selected suspect cluster assigned character codes corresponding to the one or more manually generated characters as an acceptable cluster.
Dependent claims: 12, 13
14. A computer-implemented method, comprising:
- classifying clip images defined in a received OCR output of a document processed by an optical character recognition (OCR) process into a plurality of clusters of clip images, each cluster including clip images that are assigned the same one or more character codes by the OCR process, wherein the OCR process performed on one or more computers generates the received OCR output;
- processing clip images in each of the plurality of clusters to generate exactly one cluster image for each cluster;
- classifying a first cluster in the plurality of clusters as a suspect cluster, including using a number of clip images in each cluster and shape metric distances of each cluster relative to other clusters to classify the plurality of clusters into acceptable clusters and suspect clusters, wherein assigned one or more particular character codes for each acceptable cluster are used in a modified OCR output without further processing and assigned one or more particular character codes for each suspect cluster are marked to be further processed;
- identifying a nearest cluster to the suspect cluster, the nearest cluster being nearest based on a shape distance between the cluster image for the nearest cluster and the cluster image for the suspect cluster;
- replacing the one or more character codes assigned to the suspect cluster with character codes assigned to the nearest cluster at each occurrence of one of the clip images of the suspect cluster in the OCR output to produce a modified OCR output;
- selecting at least one acceptable cluster to use a corresponding cluster image of the selected acceptable cluster as a cluster image template;
- aligning the cluster image template with a plurality of different portions of a cluster image of a selected suspect cluster at a plurality of different positions along a predetermined direction;
- obtaining shape metric distances between the cluster image template and each of the plurality of different portions of the cluster image of the selected suspect cluster;
- using the obtained shape metric distances for the cluster image template to determine whether a portion of the cluster image of the selected suspect cluster matches the cluster image template; and
- separating a matched portion from the cluster image of the selected suspect cluster as a new acceptable cluster which is assigned the one or more characters for the selected acceptable cluster.
Specification