Shape clustering in post optical character recognition processing

US 8,175,394 B2
Filed: 09/08/2006
Issued: 05/08/2012
Est. Priority Date: 09/08/2006
Status: Expired due to Fees

Abstract

Systems, methods and computer program products on storage devices for shape clustering and applications in processing various documents, including an output of an optical character recognition (OCR) process. The output of an OCR process is classified into a plurality of clusters of clip images and a representative image for each cluster is generated to identify clusters whose clip images were incorrectly assigned character codes by the OCR process.
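
As a rough illustration of the two steps the abstract describes, the sketch below groups clip images by the character code the OCR process assigned and then averages each group pixel by pixel into one representative image. The `Clip` record, the fixed-size bitmaps, and the plain mean are assumptions made for this example; the abstract does not prescribe a data layout or an averaging rule, and grouping by code alone is a simplification.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple

Bitmap = List[List[float]]  # rows of pixel intensities in [0, 1]

class Clip(NamedTuple):
    code: str       # character code assigned by the OCR process
    pixels: Bitmap  # assumed: every clip was already scaled to one size

def cluster_by_code(clips: List[Clip]) -> Dict[str, List[Clip]]:
    """Group clip images so each cluster holds clips with the same code."""
    clusters: Dict[str, List[Clip]] = defaultdict(list)
    for clip in clips:
        clusters[clip.code].append(clip)
    return dict(clusters)

def cluster_image(cluster: List[Clip]) -> Bitmap:
    """Pixel-wise mean of all clip images in one cluster (assumed rule)."""
    rows, cols = len(cluster[0].pixels), len(cluster[0].pixels[0])
    n = len(cluster)
    return [[sum(c.pixels[r][col] for c in cluster) / n for col in range(cols)]
            for r in range(rows)]

# Hypothetical 2x3 clips: two read as "l", one read as "o".
clips = [
    Clip("l", [[0, 1, 0], [0, 1, 0]]),
    Clip("l", [[0, 1, 0], [0, 1, 1]]),
    Clip("o", [[1, 1, 1], [1, 0, 1]]),
]
clusters = cluster_by_code(clips)
images = {code: cluster_image(members) for code, members in clusters.items()}
```

Averaging keeps pixels that most clips agree on near 0 or 1, while a pixel set in only some clips lands in between, which is one simple way to get exactly one image per cluster.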

14 Claims

1. A computer-implemented method, comprising:
- classifying clip images defined in a received OCR output of a document processed by an optical character recognition (OCR) process into a plurality of clusters of clip images, each cluster including clip images that are assigned the same one or more character codes by the OCR process, wherein the OCR process performed on one or more computers generates the received OCR output;
  
  processing clip images in each of the plurality of clusters to generate exactly one cluster image for each cluster;
  
  classifying a first cluster in the plurality of clusters as a suspect cluster;
  
  identifying a nearest cluster to the suspect cluster, the nearest cluster being nearest based on a shape distance between the cluster image for the nearest cluster and the cluster image for the suspect cluster;
  
  replacing the one or more character codes assigned to the suspect cluster with character codes assigned to the nearest cluster at each occurrence of one of the clip images of the suspect cluster in the OCR output to produce a modified OCR output;
  
  directing a cluster image of a selected cluster to an on-line server which is operable to direct the cluster image to one or more users for manual identification of the cluster image, including using an on-line game provided by the on-line server to supply the cluster image of the selected cluster to the one or more users for a user response as part of the on-line game; and
  
  using manual identification of the cluster image returned from the on-line server to verify one or more OCR character codes assigned to the selected cluster or assign new one or more OCR character codes to the selected cluster.
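
The nearest-cluster and replacement steps of claim 1 can be sketched as follows. The claim does not define the shape distance, so the mean absolute pixel difference used here is an assumption, as are the cluster ids, the dictionary layout of the OCR output, and the sample bitmaps.

```python
from typing import Dict, List

Bitmap = List[List[float]]

def shape_distance(a: Bitmap, b: Bitmap) -> float:
    """Assumed metric: mean absolute pixel difference of same-size images."""
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))

def nearest_cluster(suspect: str, images: Dict[str, Bitmap]) -> str:
    """Return the id of the cluster whose image is closest to the suspect's."""
    others = (cid for cid in images if cid != suspect)
    return min(others, key=lambda cid: shape_distance(images[cid], images[suspect]))

def replace_codes(ocr_output: List[dict], suspect: str, new_code: str) -> List[dict]:
    """Rewrite the code at every occurrence of a clip from the suspect cluster."""
    return [dict(item, code=new_code) if item["cluster"] == suspect else dict(item)
            for item in ocr_output]

# Hypothetical cluster images keyed by cluster id, with their assigned codes.
images = {
    "c0": [[0, 1, 0], [0, 1, 0]],  # read as "l"
    "c1": [[1, 1, 1], [1, 0, 1]],  # read as "n"
    "c2": [[0, 1, 0], [0, 1, 1]],  # suspect cluster, read as "1"
}
codes = {"c0": "l", "c1": "n", "c2": "1"}
ocr_output = [{"cluster": "c0", "code": "l"}, {"cluster": "c2", "code": "1"},
              {"cluster": "c1", "code": "n"}, {"cluster": "c2", "code": "1"}]

nearest = nearest_cluster("c2", images)
modified = replace_codes(ocr_output, "c2", codes[nearest])
```

The function returns a new list rather than editing in place, so the received OCR output stays available next to the modified one.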

2. A computer-implemented method, comprising:
- classifying clip images defined in a received OCR output of a document processed by an optical character recognition (OCR) process into a plurality of clusters of clip images, each cluster including clip images that are assigned the same one or more character codes by the OCR process, wherein the OCR process performed on one or more computers generates the received OCR output;
  
  processing clip images in each of the plurality of clusters to generate exactly one cluster image for each cluster;
  
  classifying a first cluster in the plurality of clusters as a suspect cluster;
  
  identifying a nearest cluster to the suspect cluster, the nearest cluster being nearest based on a shape distance between the cluster image for the nearest cluster and the cluster image for the suspect cluster;
  
  replacing the one or more character codes assigned to the suspect cluster with character codes assigned to the nearest cluster at each occurrence of one of the clip images of the suspect cluster in the OCR output to produce a modified OCR output;
  
  directing a cluster image of a selected cluster to an on-line server which is operable to direct the cluster image to one or more users for manual identification of the cluster image, including using an on-line service provided by the on-line server to supply the cluster image of the selected cluster as part of a challenge-response test; and
  
  using manual identification of the cluster image returned from the on-line server to verify one or more OCR character codes assigned to the selected cluster or assign new one or more OCR character codes to the selected cluster.

3. The method of claim 2, wherein:
- the challenge-response test is for determining whether or not a user of the on-line service is a human.

4. A computer-implemented method, comprising:
- classifying clip images defined in a received OCR output of a document processed by an optical character recognition (OCR) process into a plurality of clusters of clip images, each cluster including clip images that are assigned the same one or more character codes by the OCR process, wherein the OCR process performed on one or more computers generates the received OCR output;
  
  processing clip images in each of the plurality of clusters to generate exactly one cluster image for each cluster;
  
  classifying a first cluster in the plurality of clusters as a suspect cluster;
  
  identifying a nearest cluster to the suspect cluster, the nearest cluster being nearest based on a shape distance between the cluster image for the nearest cluster and the cluster image for the suspect cluster;
  
  replacing the one or more character codes assigned to the suspect cluster with character codes assigned to the nearest cluster at each occurrence of one of the clip images of the suspect cluster in the OCR output to produce a modified OCR output;
  
  selecting a cluster image of a cluster including a high confidence score as a cluster image template;
  
  aligning the cluster image template with a plurality of different portions within an image of a word in the received OCR output along a predetermined direction, one portion at a time;
  
  obtaining shape metric distances between the cluster image template and the plurality of different portions of the word image, respectively;
  
  using the obtained shape metric distances to determine whether a portion of the word image matches the cluster image template;
  
  separating a matching portion of the word image that matches the cluster image template from one or more other portions of the word image; and
  
  assigning one or more OCR character codes assigned to the cluster to the separated matching portion of the word image in the modified OCR output.
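
The template-alignment steps of claim 4 amount to sliding a cluster image across a word image and scoring each position. A minimal sketch follows, assuming binary bitmaps of equal height, a left-to-right scan as the predetermined direction, a fraction-of-differing-pixels metric, and a hypothetical `threshold` parameter; none of these specifics come from the claim.

```python
from typing import List, Optional, Tuple

Bitmap = List[List[int]]

def columns(img: Bitmap, start: int, width: int) -> Bitmap:
    """Cut a vertical strip of `width` columns starting at column `start`."""
    return [row[start:start + width] for row in img]

def shape_distance(a: Bitmap, b: Bitmap) -> float:
    """Assumed metric: fraction of pixels that differ."""
    diff = sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return diff / (len(a) * len(a[0]))

def find_match(word: Bitmap, template: Bitmap,
               threshold: float) -> Optional[Tuple[int, float]]:
    """Slide the template left to right, one column offset at a time."""
    width = len(template[0])
    scored = [(shape_distance(columns(word, off, width), template), off)
              for off in range(len(word[0]) - width + 1)]
    best_dist, best_off = min(scored)
    return (best_off, best_dist) if best_dist <= threshold else None

def separate(word: Bitmap, offset: int, width: int) -> Tuple[Bitmap, Bitmap, Bitmap]:
    """Split the word image into (left remainder, matched portion, right remainder)."""
    total = len(word[0])
    return (columns(word, 0, offset), columns(word, offset, width),
            columns(word, offset + width, total - offset - width))

template = [[1, 1], [1, 0]]  # cluster image of a high-confidence cluster
word = [[0, 1, 1, 0, 1],
        [0, 1, 0, 0, 1]]
match = find_match(word, template, threshold=0.25)
left, matched, right = separate(word, match[0], len(template[0]))
```

Once a portion matches, the character codes of the template's cluster would be assigned to `matched`, and the remainders stay available for further processing.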

5. A computer-implemented method, comprising:
- classifying clip images defined in a received OCR output of a document processed by an optical character recognition (OCR) process into a plurality of clusters of clip images, each cluster including clip images that are assigned the same one or more character codes by the OCR process, wherein the OCR process performed on one or more computers generates the received OCR output;
  
  processing clip images in each of the plurality of clusters to generate exactly one cluster image for each cluster;
  
  classifying a first cluster in the plurality of clusters as a suspect cluster;
  
  identifying a nearest cluster to the suspect cluster, the nearest cluster being nearest based on a shape distance between the cluster image for the nearest cluster and the cluster image for the suspect cluster;
  
  replacing the one or more character codes assigned to the suspect cluster with character codes assigned to the nearest cluster at each occurrence of one of the clip images of the suspect cluster in the OCR output to produce a modified OCR output;
  
  selecting a first cluster image of a first cluster as a cluster image template;
  
  aligning the cluster image template with a plurality of different portions within a second cluster image of a second selected cluster along a predetermined direction, one portion at a time;
  
  obtaining shape metric distances between the cluster image template and the plurality of different portions of the second cluster image, respectively;
  
  using the obtained shape metric distances to determine whether a portion of the second cluster image matches the cluster image template;
  
  using a matching portion of the second cluster image that matches the cluster image template as a new third cluster image for a third cluster to be formed;
  
  separating a corresponding portion in each clip image in the second cluster that corresponds to the cluster image template to form a new clip image;
  
  using new clip images respectively separated from the clip images in the second cluster to form the third cluster;
  
  assigning one or more OCR character codes assigned to the first cluster to the third cluster;
  
  using remainders of clip images of the second cluster after separation of the new clip images for the third cluster to form at least one fourth cluster; and
  
  using the third cluster and the at least one fourth cluster to replace the second cluster in producing the modified OCR output.

6. The method of claim 5, wherein:
- the second cluster image is an image of a ligature representing at least two language tokens for one or more languages used in the document.
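
The cluster-splitting steps of claim 5 can be pictured with a ligature such as "fi". The sketch below assumes the match offset and width were already found by aligning the template against the second cluster image, and it cuts the same columns out of every clip in the second cluster. The dictionary layout, the cluster names, and the tiny bitmaps are hypothetical.

```python
from typing import Dict, List, Tuple

Bitmap = List[List[int]]
Cluster = Dict[str, object]

def split_cluster(clips: List[Bitmap], offset: int, width: int,
                  template_codes: str) -> Tuple[Cluster, List[Cluster]]:
    """Form a third cluster from the matched columns and fourth clusters from the rest."""
    third_clips, left, right = [], [], []
    for clip in clips:
        third_clips.append([row[offset:offset + width] for row in clip])
        left.append([row[:offset] for row in clip])
        right.append([row[offset + width:] for row in clip])
    # The third cluster takes the codes of the first (template) cluster.
    third = {"codes": template_codes, "clips": third_clips}
    # Each non-empty remainder side forms a fourth cluster; its codes are still unset.
    fourth = [{"codes": None, "clips": side}
              for side in (left, right) if side and side[0][0]]
    return third, fourth

clusters = {
    "first": {"codes": "f", "clips": [[[1, 1], [1, 0]]]},
    "second": {"codes": "fi", "clips": [[[1, 1, 0, 1], [1, 0, 0, 1]],
                                         [[1, 1, 0, 1], [1, 0, 1, 1]]]},
}
third, fourth = split_cluster(clusters["second"]["clips"], offset=0, width=2,
                              template_codes=clusters["first"]["codes"])

# Replace the second cluster with the third and fourth clusters.
del clusters["second"]
clusters["third"] = third
for i, remainder in enumerate(fourth):
    clusters[f"fourth_{i}"] = remainder
```

The fourth cluster is left without codes here because claims 7 and 8 describe two different ways of assigning them: from the nearest other cluster, or from a person.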

7. A computer-implemented method, comprising:
- classifying clip images defined in a received OCR output of a document processed by an optical character recognition (OCR) process into a plurality of clusters of clip images, each cluster including clip images that are assigned the same one or more character codes by the OCR process, wherein the OCR process performed on one or more computers generates the received OCR output;
  
  processing clip images in each of the plurality of clusters to generate exactly one cluster image for each cluster;
  
  classifying a first cluster in the plurality of clusters as a suspect cluster;
  
  identifying a nearest cluster to the suspect cluster, the nearest cluster being nearest based on a shape distance between the cluster image for the nearest cluster and the cluster image for the suspect cluster;
  
  replacing the one or more character codes assigned to the suspect cluster with character codes assigned to the nearest cluster at each occurrence of one of the clip images of the suspect cluster in the OCR output to produce a modified OCR output;
  
  selecting a first cluster image of a first cluster as a cluster image template;
  
  aligning the cluster image template with a plurality of different portions within a second cluster image of a second selected cluster along a predetermined direction, one portion at a time;
  
  obtaining shape metric distances between the cluster image template and the plurality of different portions of the second cluster image, respectively;
  
  using the obtained shape metric distances to determine whether a portion of the second cluster image matches the cluster image template;
  
  using a matching portion of the second cluster image that matches the cluster image template as a new third cluster image for a third cluster to be formed;
  
  separating a corresponding portion in each clip image in the second cluster that corresponds to the cluster image template to form a new clip image;
  
  using new clip images respectively separated from the clip images in the second cluster to form the third cluster;
  
  assigning one or more OCR character codes assigned to the first cluster to the third cluster;
  
  using remainders of clip images of the second cluster after separation of the new clip images for the third cluster to form at least one fourth cluster;
  
  using the third cluster and the at least one fourth cluster to replace the second cluster in producing the modified OCR output;
  
  obtaining shape metric distances between a cluster image of the fourth cluster and cluster images of other clusters; and
  
  assigning to the fourth cluster one or more OCR character codes assigned to a fifth cluster that has a shortest shape metric distance from the fourth cluster among all other clusters.

8. A computer-implemented method, comprising:
- classifying clip images defined in a received OCR output of a document processed by an optical character recognition (OCR) process into a plurality of clusters of clip images, each cluster including clip images that are assigned the same one or more character codes by the OCR process, wherein the OCR process performed on one or more computers generates the received OCR output;
  
  processing clip images in each of the plurality of clusters to generate exactly one cluster image for each cluster;
  
  classifying a first cluster in the plurality of clusters as a suspect cluster;
  
  identifying a nearest cluster to the suspect cluster, the nearest cluster being nearest based on a shape distance between the cluster image for the nearest cluster and the cluster image for the suspect cluster;
  
  replacing the one or more character codes assigned to the suspect cluster with character codes assigned to the nearest cluster at each occurrence of one of the clip images of the suspect cluster in the OCR output to produce a modified OCR output;
  
  selecting a first cluster image of a first cluster as a cluster image template;
  
  aligning the cluster image template with a plurality of different portions within a second cluster image of a second selected cluster along a predetermined direction, one portion at a time;
  
  obtaining shape metric distances between the cluster image template and the plurality of different portions of the second cluster image, respectively;
  
  using the obtained shape metric distances to determine whether a portion of the second cluster image matches the cluster image template;
  
  using a matching portion of the second cluster image that matches the cluster image template as a new third cluster image for a third cluster to be formed;
  
  separating a corresponding portion in each clip image in the second cluster that corresponds to the cluster image template to form a new clip image;
  
  using new clip images respectively separated from the clip images in the second cluster to form the third cluster;
  
  assigning one or more OCR character codes assigned to the first cluster to the third cluster;
  
  using remainders of clip images of the second cluster after separation of the new clip images for the third cluster to form at least one fourth cluster;
  
  using the third cluster and the at least one fourth cluster to replace the second cluster in producing the modified OCR output;
  
  soliciting manual identification of the cluster image of the fourth cluster from a person; and
  
  using the manual identification by the person to generate new one or more OCR character codes for the fourth cluster.

9. A computer-implemented method, comprising:
- classifying clip images defined in a received OCR output of a document processed by an optical character recognition (OCR) process into a plurality of clusters of clip images, each cluster including clip images that are assigned the same one or more character codes by the OCR process, wherein the OCR process performed on one or more computers generates the received OCR output;
  
  processing clip images in each of the plurality of clusters to generate exactly one cluster image for each cluster;
  
  classifying a first cluster in the plurality of clusters as a suspect cluster, including using a number of clip images in each cluster and shape metric distances of each cluster relative to other clusters to classify the plurality of clusters into acceptable clusters and suspect clusters, wherein assigned one or more particular character codes for each acceptable cluster are used in a modified OCR output without further processing and assigned one or more particular character codes for each suspect cluster are marked to be further processed;
  
  identifying a nearest acceptable cluster to the suspect cluster, the nearest acceptable cluster being nearest based on a shape distance between the cluster image for the nearest acceptable cluster and the cluster image for the suspect cluster; and
  
  replacing the one or more character codes assigned to the suspect cluster with character codes assigned to the nearest acceptable cluster at each occurrence of one of the clip images of the suspect cluster in the OCR output to produce the modified OCR output when the cluster image for the nearest acceptable cluster has a shortest shape metric distance from the cluster image for the suspect cluster and the nearest acceptable cluster has a larger number of clip images than the suspect cluster.
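
Claim 9 classifies clusters using clip counts and shape metric distances, and replaces a suspect cluster's codes only when the nearest acceptable cluster is also larger. The claim does not state the classification rule, so the sketch below assumes one: a cluster is suspect when it is small and lies close to some other cluster. The `min_size` and `near` parameters, the `Cluster` record, and the distance metric are all assumptions.

```python
from typing import Dict, List, NamedTuple, Optional

Bitmap = List[List[int]]

class Cluster(NamedTuple):
    code: str
    image: Bitmap
    size: int  # number of clip images in the cluster

def shape_distance(a: Bitmap, b: Bitmap) -> float:
    """Assumed metric: fraction of pixels that differ."""
    diff = sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return diff / (len(a) * len(a[0]))

def is_suspect(cid: str, clusters: Dict[str, Cluster],
               min_size: int, near: float) -> bool:
    """Assumed rule: few clips and a close neighbour make a cluster suspect."""
    c = clusters[cid]
    if c.size >= min_size:
        return False
    return any(shape_distance(c.image, o.image) <= near
               for oid, o in clusters.items() if oid != cid)

def corrected_code(suspect_id: str, clusters: Dict[str, Cluster],
                   acceptable_ids: List[str]) -> Optional[str]:
    """Code of the nearest acceptable cluster, but only if it has more clips."""
    if not acceptable_ids:
        return None
    s = clusters[suspect_id]
    nearest = clusters[min(
        acceptable_ids,
        key=lambda cid: shape_distance(clusters[cid].image, s.image))]
    return nearest.code if nearest.size > s.size else None

clusters = {
    "a": Cluster("e", [[1, 1], [1, 0]], 120),
    "b": Cluster("o", [[1, 1], [1, 1]], 95),
    "c": Cluster("c", [[1, 0], [1, 0]], 3),
}
suspects = [cid for cid in clusters if is_suspect(cid, clusters, min_size=10, near=0.3)]
acceptable = [cid for cid in clusters if cid not in suspects]
new_code = corrected_code("c", clusters, acceptable)
```

The size check reflects the claim's condition that the nearest acceptable cluster must have a larger number of clip images than the suspect cluster before its codes are used.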

10. A computer-implemented method, comprising:
- classifying clip images defined in a received OCR output of a document processed by an optical character recognition (OCR) process into a plurality of clusters of clip images, each cluster including clip images that are assigned the same one or more character codes by the OCR process, wherein the OCR process performed on one or more computers generates the received OCR output;
  
  processing clip images in each of the plurality of clusters to generate exactly one cluster image for each cluster;
  
  classifying a first cluster in the plurality of clusters as a suspect cluster, including using a number of clip images in each cluster and shape metric distances of each cluster relative to other clusters to classify the plurality of clusters into acceptable clusters and suspect clusters, wherein assigned one or more particular character codes for each acceptable cluster are used in a modified OCR output without further processing and assigned one or more particular character codes for each suspect cluster are marked to be further processed;
  
  identifying a nearest cluster to the suspect cluster, the nearest cluster being nearest based on a shape distance between the cluster image for the nearest cluster and the cluster image for the suspect cluster;
  
  replacing the one or more character codes assigned to the suspect cluster with character codes assigned to the nearest cluster at each occurrence of one of the clip images of the suspect cluster in the OCR output to produce a modified OCR output;
  
  selecting a cluster image of the suspect cluster to be manually identified by a person;
  
  using one or more manually generated character codes returned by the person for the selected suspect cluster to verify or replace previously assigned one or more particular character codes for the selected suspect cluster; and
  
  reclassifying the selected suspect cluster assigned the one or more manually generated character codes as an acceptable cluster.

11. A computer-implemented method, comprising:
- classifying clip images defined in a received OCR output of a document processed by an optical character recognition (OCR) process into a plurality of clusters of clip images, each cluster including clip images that are assigned the same one or more character codes by the OCR process, wherein the OCR process performed on one or more computers generates the received OCR output;
  
  processing clip images in each of the plurality of clusters to generate exactly one cluster image for each cluster;
  
  classifying a first cluster in the plurality of clusters as a suspect cluster, including using a number of clip images in each cluster and shape metric distances of each cluster relative to other clusters to classify the plurality of clusters into acceptable clusters and suspect clusters, wherein assigned one or more particular character codes for each acceptable cluster are used in a modified OCR output without further processing and assigned one or more particular character codes for each suspect cluster are marked to be further processed;
  
  identifying a nearest cluster to the suspect cluster, the nearest cluster being nearest based on a shape distance between the cluster image for the nearest cluster and the cluster image for the suspect cluster;
  
  replacing the one or more character codes assigned to the suspect cluster with character codes assigned to the nearest cluster at each occurrence of one of the clip images of the suspect cluster in the OCR output to produce a modified OCR output;
  
  directing a cluster image of a selected suspect cluster to an on-line server which is operable to direct the averaged image to one or more users that interact with the on-line server and to solicit one or more manually generated character codes for the averaged image from the one or more users;
  
  using one or more manually generated character codes returned from the on-line server to verify or replace previously assigned one or more particular character codes for the selected suspect cluster; and
  
  reclassifying the selected suspect cluster assigned character codes corresponding to the one or more manually generated characters as an acceptable cluster.

12. The method of claim 11, wherein:
- the on-line server is operable to provide an on-line game which supplies the averaged image of the selected suspect cluster to the one or more users for user responses as part of the on-line game.

13. The method of claim 11, wherein:
- the on-line server is operable to provide an on-line service and to supply the averaged image of the selected suspect cluster as part of a challenge-response test to determine whether or not a user of the on-line service is a human.
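
Claims 11 to 13 leave open how responses from several users are combined. One possible approach, shown here purely as an assumption, is a majority vote with a minimum number of agreeing answers, followed by the verify-or-replace and reclassification steps. The `min_votes` parameter and the dictionary layout are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional

def consensus(responses: List[str], min_votes: int = 2) -> Optional[str]:
    """Assumed aggregation: the most common answer, if enough users agree."""
    if not responses:
        return None
    answer, votes = Counter(responses).most_common(1)[0]
    return answer if votes >= min_votes else None

def apply_manual_identification(cluster: Dict[str, str],
                                responses: List[str]) -> Dict[str, str]:
    """Verify or replace the cluster's code, then mark the cluster acceptable."""
    answer = consensus(responses)
    if answer is None:
        return dict(cluster)  # not enough agreement, so the cluster stays suspect
    return {**cluster, "code": answer, "status": "acceptable"}

suspect = {"code": "rn", "status": "suspect"}
replaced = apply_manual_identification(suspect, ["m", "m", "rn"])
verified = apply_manual_identification(suspect, ["rn", "rn"])
unresolved = apply_manual_identification(suspect, ["m"])
```

When the agreed answer equals the existing code, the code is verified; when it differs, the code is replaced. In both cases the cluster is reclassified as acceptable.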

14. A computer-implemented method, comprising:
- classifying clip images defined in a received OCR output of a document processed by an optical character recognition (OCR) process into a plurality of clusters of clip images, each cluster including clip images that are assigned the same one or more character codes by the OCR process, wherein the OCR process performed on one or more computers generates the received OCR output;
  
  processing clip images in each of the plurality of clusters to generate exactly one cluster image for each cluster;
  
  classifying a first cluster in the plurality of clusters as a suspect cluster, including using a number of clip images in each cluster and shape metric distances of each cluster relative to other clusters to classify the plurality of clusters into acceptable clusters and suspect clusters, wherein assigned one or more particular character codes for each acceptable cluster are used in a modified OCR output without further processing and assigned one or more particular character codes for each suspect cluster are marked to be further processed;
  
  identifying a nearest cluster to the suspect cluster, the nearest cluster being nearest based on a shape distance between the cluster image for the nearest cluster and the cluster image for the suspect cluster;
  
  replacing the one or more character codes assigned to the suspect cluster with character codes assigned to the nearest cluster at each occurrence of one of the clip images of the suspect cluster in the OCR output to produce a modified OCR output;
  
  selecting at least one acceptable cluster to use a corresponding cluster image of the selected acceptable cluster as a cluster image template;
  
  aligning the cluster image template with a plurality of different portions of a cluster image of a selected suspect cluster at a plurality of different positions along a predetermined direction;
  
  obtaining shape metric distances between the cluster image template and each of the plurality of different portions of the cluster image of the selected suspect cluster;
  
  using the obtained shape metric distances for the cluster image template to determine whether a portion of the cluster image of the selected suspect cluster matches the cluster image template; and
  
  separating a matched portion from the cluster image of the selected suspect cluster as a new acceptable cluster which is assigned the one or more characters for the selected acceptable cluster.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Vincent, Luc; Smith, Raymond W.
Primary Examiner(s): ZARKA, DAVID PETER

Application Number: US11/517,818
Publication Number: US 20080063276A1
Time in Patent Office: 2,069 Days
Field of Search: 382/225
US Class Current: 382/225

CPC Class Codes:
G06F 18/23   Clustering techniques
G06F 18/254   of classification results, ...
G06V 10/762   using clustering, e.g. of s...
G06V 10/809   of classification results, ...
G06V 30/10   Character recognition
G06V 30/12   Detection or correction of ...
