
Shape clustering in post optical character recognition processing

  • US 8,111,927 B2
  • Filed: 05/20/2010
  • Issued: 02/07/2012
  • Est. Priority Date: 09/08/2006
  • Status: Expired due to Fees
First Claim

1. A system for processing an optical character recognition (OCR) output including separated images produced by an OCR process in processing an original image of a document and one or more characters assigned to each separated image by the OCR process, comprising:

  • one or more computers, the one or more computers implementing:

    a cluster generation engine operable to classify the separated images in the OCR output into a plurality of clusters of separated images that are of a particular image size and are assigned the same one or more OCR character codes by the OCR process;

    a cluster processing engine operable to determine shape metric distances between a cluster image of a cluster and cluster images of other clusters, wherein each cluster image is representative of separated images in each cluster, and to detect whether an error exists in assignment of one or more OCR character codes assigned to each cluster by the OCR process based on the shape metric distances, the cluster processing engine further operable to correct one or more erroneously assigned OCR character codes for a cluster; and

    an output processing engine operable to:

      produce a modified OCR output from the received OCR output, wherein the output processing engine, in producing the modified OCR output, assigns one or more characters for each cluster processed by the cluster processing engine to each clip image belonging to the cluster,

      select a cluster image of a cluster having a high confidence score as a cluster image template,

      align the cluster image template with a plurality of different portions within an image of a word in the received OCR output along a predetermined direction, one portion at a time,

      determine shape metric distances between the cluster image template and the plurality of different portions of the word image, respectively,

      use the shape metric distances to determine whether a portion of the word image matches the cluster image template,

      separate a matching portion of the word image that matches the cluster image template from one or more other portions of the word image, and

      assign one or more OCR character codes assigned to the cluster to the separated matching portion of the word image in the modified OCR output.
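
The steps recited in claim 1 can be illustrated with a minimal sketch. This is not the patented implementation. The claim does not say which shape metric is used, so a normalized Hamming distance over binary images stands in for it here. The function names, the majority-vote cluster image, the `threshold` value, and the requirement that the word image has the same height as the template are all assumptions made for illustration. Confidence scoring and error correction between clusters are not modeled.

```python
from collections import defaultdict


def hamming_distance(a, b):
    """Fraction of differing pixels between two equal-sized binary images.

    Assumption: stands in for the unspecified "shape metric distance".
    """
    total = len(a) * len(a[0])
    diff = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return diff / total


def build_clusters(clips):
    """Group (image, code) clips by image size and assigned OCR code."""
    clusters = defaultdict(list)
    for image, code in clips:
        key = (len(image), len(image[0]), code)
        clusters[key].append(image)
    return clusters


def cluster_image(images):
    """Representative image for a cluster: pixel-wise majority vote."""
    h, w, n = len(images[0]), len(images[0][0]), len(images)
    return [
        [1 if 2 * sum(img[r][c] for img in images) >= n else 0 for c in range(w)]
        for r in range(h)
    ]


def slide_template(template, word_image, threshold=0.1):
    """Slide the template left to right across the word image.

    Returns (offset, distance) for the best-matching window, or None when
    no window is within the (hypothetical) threshold.
    """
    tw = len(template[0])
    ww = len(word_image[0])
    best = None
    for x in range(ww - tw + 1):
        window = [row[x:x + tw] for row in word_image]
        d = hamming_distance(template, window)
        if best is None or d < best[1]:
            best = (x, d)
    if best is not None and best[1] <= threshold:
        return best
    return None


# Toy data: three clips labeled "l" (one noisy) and one labeled "o".
L_CLEAN = [[1, 0], [1, 0], [1, 1]]
L_NOISY = [[1, 0], [1, 1], [1, 1]]
O_CLIP = [[1, 1], [1, 1], [1, 1]]
clips = [(L_CLEAN, "l"), (L_CLEAN, "l"), (L_NOISY, "l"), (O_CLIP, "o")]

clusters = build_clusters(clips)
template = cluster_image(clusters[(3, 2, "l")])

# A word image whose last two columns contain the "l" shape.
word = [
    [0, 0, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 1, 1],
]
match = slide_template(template, word)
if match is not None:
    x, _ = match
    tw = len(template[0])
    matched = [row[x:x + tw] for row in word]  # separated matching portion
    rest = [row[:x] for row in word]           # remaining portion to the left
    assigned_code = "l"                        # code carried over from the cluster
```

In this toy run the majority vote removes the single noisy pixel, so the template equals the clean "l" shape and the best window is found at horizontal offset 3 with distance 0. A real system would need a distance that tolerates small shifts and scale changes, which a plain Hamming distance does not.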
