
Methods for automatic structured extraction of data in OCR documents having tabular data

  • US 9,251,413 B2
  • Filed: 06/13/2014
  • Issued: 02/02/2016
  • Est. Priority Date: 06/14/2013
  • Status: Active Grant
First Claim

1. A method for automatically extracting tabular data from an OCR scan document, comprising:

  • receiving the OCR scan document for processing;

    identifying, for each OCR scan line, any string and its position within the OCR scan line and forming a content cell line by associating a content cell with each identified string;

    localizing table column content cells by locating repetitive string patterns within each OCR scan line;

    forming a cluster group of OCR scan lines when the repetitive string patterns exceed a predetermined confidence level;

    determining for any ungrouped line that such ungrouped line is a header line when a comparison of one or more strings of such ungrouped line with a predefined set of corresponding header strings associated with respective column identifiers used in an output table for extracted data exceeds a predetermined confidence threshold;

    associating a header cell for each of the one or more strings in each determined header line;

    for each header line and its associated header cell, associating the group of cluster OCR scan lines and the corresponding content cell line and the corresponding strings of each content cell line with a corresponding header string in the header line when a format of the string of the corresponding content cell line is consistent with a format of the corresponding header string;

    for each header line, determining if a repetitive pattern of strings is present and if a repetitive pattern is determined, localizing the location of a breakpoint between each repetitive pattern and processing the header line and the corresponding content cell line as a multi-stack table;

    determining, for each header line, whether one or more header strings are missing patterns and repairing the header line missing patterns by providing a dummy data string for each missing pattern in the header strings;

    determining, for each content cell line, whether one or more content cell line strings are missing patterns and repairing the content cell line missing patterns by providing a dummy data string for each missing pattern in the content cell line strings; and

    for each header string having a corresponding column identifier in the output table, mapping the corresponding content cells into a corresponding number of cells in the output table aligned with the column identifier.
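
The claimed steps can be illustrated with a minimal Python sketch. It is not the patented implementation, only one plausible reading of a subset of the steps: identifying strings and their positions, detecting a header line by fuzzy comparison against predefined header strings, repairing missing cells with a dummy string, and mapping cells to output columns. The clustering of lines by repetitive patterns, the format-consistency check, and multi-stack handling are omitted. All names (`HEADER_STRINGS`, `HEADER_THRESHOLD`, `DUMMY`, `content_cells`, `match_header`, `is_header_line`, `map_row`, `extract`), the 0.8 threshold, and the assumption that cells are separated by two or more spaces are hypothetical choices made for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical header vocabulary: output column identifier -> accepted header strings.
HEADER_STRINGS = {
    "date": ["date", "trade date"],
    "description": ["description", "details"],
    "amount": ["amount", "total"],
}
HEADER_THRESHOLD = 0.8  # assumed confidence threshold for a header match
DUMMY = ""              # assumed dummy data string used to repair missing cells


def content_cells(line):
    """Return (start position, string) for each string in an OCR scan line.

    Assumes cells are separated by runs of two or more spaces.
    """
    return [(m.start(), m.group()) for m in re.finditer(r"\S+(?: \S+)*", line)]


def match_header(text):
    """Return the column identifier whose header strings best match text, or None."""
    best_id, best_score = None, 0.0
    for col_id, names in HEADER_STRINGS.items():
        for name in names:
            score = SequenceMatcher(None, text.lower(), name).ratio()
            if score > best_score:
                best_id, best_score = col_id, score
    return best_id if best_score >= HEADER_THRESHOLD else None


def is_header_line(cells):
    """Treat a line as a header when every string matches a known header string."""
    return bool(cells) and all(match_header(s) is not None for _, s in cells)


def map_row(header_cells, row_cells):
    """Assign each content cell to the nearest header column; gaps keep DUMMY."""
    row = {match_header(s): DUMMY for _, s in header_cells}
    for pos, text in row_cells:
        nearest = min(header_cells, key=lambda h: abs(h[0] - pos))
        row[match_header(nearest[1])] = text
    return row


def extract(lines):
    """Find the first header line, then map every following line to output columns."""
    header, rows = None, []
    for line in lines:
        cells = content_cells(line)
        if not cells:
            continue
        if header is None and is_header_line(cells):
            header = cells
        elif header is not None:
            rows.append(map_row(header, cells))
    return rows
```

Calling `extract` on a list of scan lines returns one dictionary per content line, keyed by column identifier. Aligning cells by nearest header start position is a simplification that suits left-aligned columns; right-aligned numeric columns would need a different alignment rule.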
