Methods for automatic structured extraction of data in OCR documents having tabular data
First Claim
1. A method for automatically extracting tabular data from an OCR scan document, comprising:
receiving the OCR scan document for processing;
identifying, for each OCR scan line, any string and its position within the OCR scan line and forming a content cell line by associating a content cell with each identified string;
localizing table column content cells by locating repetitive string patterns within each OCR scan line;
forming a cluster group of OCR scan lines when the repetitive string patterns exceed a predetermined confidence level;
determining for any ungrouped line that such ungrouped line is a header line when a comparison of one or more strings of such ungrouped line with a predefined set of corresponding header strings associated with respective column identifiers used in an output table for extracted data exceeds a predetermined confidence threshold;
associating a header cell for each of the one or more strings in each determined header line;
for each header line and its associated header cell, associating the group of cluster OCR scan lines and the corresponding content cell line and the corresponding strings of each content cell line with a corresponding header string in the header line when a format of the string of the corresponding content cell line is consistent with a format of the corresponding header string;
for each header line, determining if a repetitive pattern of strings is present and if a repetitive pattern is determined, localizing the location of a breakpoint between each repetitive pattern and processing the header line and the corresponding content cell line as a multi-stack table;
determining, for each header line, whether one or more header strings are missing patterns and repairing the header line missing patterns by providing a dummy data string for each missing pattern in the header strings;
determining, for each content cell line, whether one or more content cell line strings are missing patterns and repairing the content cell line missing patterns by providing a dummy data string for each missing pattern in the content cell line strings; and
for each header string having a corresponding column identifier in the output table, mapping the corresponding content cells into a corresponding number of cells in the output table aligned with the column identifier.
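The first steps of the claim (locating strings and their positions, deriving a string pattern, and grouping lines whose patterns repeat) can be sketched in Python. This is only an illustration under stated assumptions, not the patented implementation: cells are assumed to be separated by two or more spaces, the "pattern" is a coarse letter/digit shape, and the similarity measure (`difflib.SequenceMatcher`) and the 0.7 threshold stand in for the claim's unspecified "predetermined confidence level". The names `cells`, `pattern`, `cluster_lines`, and `SAMPLE` are hypothetical.

```python
import re
from difflib import SequenceMatcher

def cells(line):
    """Return (start_column, string) pairs; assumes cells are split by 2+ spaces."""
    return [(m.start(), m.group()) for m in re.finditer(r"\S+(?: \S+)*", line)]

def pattern(string):
    """Collapse a string to a coarse shape: letter runs -> 'A', digit runs -> '9'."""
    shape = re.sub(r"[A-Za-z]+", "A", string)
    return re.sub(r"\d+", "9", shape)

def cluster_lines(lines, threshold=0.7):
    """Greedily group lines whose cell-pattern sequences are similar enough.

    Returns (clusters, ungrouped): clusters are lists of line indices with at
    least two members; ungrouped holds the indices of lines left on their own.
    """
    shapes = [tuple(pattern(s) for _, s in cells(line)) for line in lines]
    groups = []
    for i, shape in enumerate(shapes):
        for group in groups:
            # Compare against the first line of the group (an assumed heuristic).
            if SequenceMatcher(None, shapes[group[0]], shape).ratio() >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    clusters = [g for g in groups if len(g) > 1]
    ungrouped = [g[0] for g in groups if len(g) == 1]
    return clusters, ungrouped

# Hypothetical OCR output for a small transcript table.
SAMPLE = [
    "Course    Title         Credits  Grade",
    "MATH 101  Calculus I    4        A",
    "PHYS 201  Mechanics     3        B",
    "CHEM 110  Chemistry     4        A",
]
clusters, ungrouped = cluster_lines(SAMPLE)
```

On this sample the three course rows form one cluster group, and the first line is left ungrouped, which makes it a candidate for the header test described in the claim.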
Abstract
Methods for selecting and extracting tabular data from the strings returned by optical character recognition (OCR), in order to process documents automatically, including documents containing academic transcripts.
20 Claims
1. A method for automatically extracting tabular data from an OCR scan document, comprising the steps set out in full under First Claim above.
Dependent claims: 2, 3, 4, 5
6. A method for automatically extracting tabular data from an OCR scan file, comprising:
receiving the OCR scan file having a plurality of content cells containing the tabular data to be extracted;
determining, based on a general learn set having a plurality of document types and a document specific learn set for each document type having information describing a tabular data expected to be contained therein, a document type contained in the received OCR scan file and selecting, based on the determined document type, the corresponding document specific learn set;
extracting the tabular data from the plurality of content cells by:
identifying, for each OCR scan line in the OCR scan file, one or more strings and the position thereof in the OCR scan line and associating the one or more identified strings with a corresponding content cell, said one or more content cells forming a content cell line;
determining a string pattern for each of the identified one or more strings;
localizing one or more table column content cells within the content cell line by locating a repetitive string pattern within the determined one or more string patterns in one or more content cells of the content cell line;
forming one or more cluster groups of content cell lines found to have repetitive string patterns which exceed a predetermined confidence level when said repetitive string patterns are compared to one another;
comparing, for each ungrouped content cell line, the one or more strings of such ungrouped content cell line with a predefined set of corresponding header strings associated with respective column identifiers used in an output table for extracted data;
identifying such ungrouped content cell line as a header line and associating a header cell for each string of the one or more strings in the header line to form one or more header strings when the comparison exceeds a predetermined confidence threshold;
associating each found header line and its corresponding one or more header strings with a cluster group in the one or more cluster groups of content cell lines and the corresponding strings of each corresponding content cell therein when a format of the corresponding string of the corresponding content cell is substantially the same as a format of corresponding header string;
determining, for each found header line, whether or not a repetitive pattern in the header strings is present, and, when a repetitive pattern is determined, localizing a location of a breakpoint between each repeating pattern and identifying such determined header line and any corresponding cluster group of content cell lines as a multi-stack table;
repairing each header line and each content cell line in the corresponding cluster group of content cell lines that is missing a pattern by providing a dummy data string corresponding to each missing pattern; and
mapping, for each header string in each found header line having a corresponding column identifier in the output table, the content cells of the corresponding group of content cell lines into a corresponding number of cells in the output table aligned with the column identifier.
Dependent claims: 7, 8, 9, 10, 11, 12, 13, 14, 19, 20
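The header test and the multi-stack check recited in claim 6 can be illustrated with a short Python sketch. It is a hedged example, not the claimed method itself: the header dictionary `HEADER_STRINGS`, the column identifiers, the 0.8 threshold, the rule that at least half of a line's strings must match, and the function names are all assumptions made for illustration. Fuzzy comparison uses `difflib.SequenceMatcher` so that OCR misreads such as "Cours3" still match.

```python
from difflib import SequenceMatcher

# Hypothetical predefined header strings mapped to output-table column identifiers.
HEADER_STRINGS = {
    "course": "course_id",
    "title": "course_title",
    "credits": "credits",
    "grade": "grade",
}

def match_header(string, threshold=0.8):
    """Return the column identifier of the closest known header, or None."""
    best_id, best_score = None, 0.0
    for known, column_id in HEADER_STRINGS.items():
        score = SequenceMatcher(None, string.lower(), known).ratio()
        if score > best_score:
            best_id, best_score = column_id, score
    return best_id if best_score >= threshold else None

def header_columns(strings, threshold=0.8, min_fraction=0.5):
    """Treat a line as a header if enough of its strings match known headers.

    Returns the per-string column identifiers (None where nothing matched),
    or None when the line is not considered a header line.
    """
    ids = [match_header(s, threshold) for s in strings]
    hits = sum(column_id is not None for column_id in ids)
    if strings and hits / len(strings) >= min_fraction:
        return ids
    return None

def breakpoints(ids):
    """Find where a repeated block of column identifiers starts again.

    Returns the indices that separate repetitions (empty if no repetition),
    which marks the header line as belonging to a multi-stack table.
    """
    for period in range(1, len(ids) // 2 + 1):
        repeats = all(ids[i] == ids[i % period] for i in range(len(ids)))
        if len(ids) % period == 0 and repeats:
            return list(range(period, len(ids), period))
    return []
```

For example, a header line read as `["Course", "Grade", "Course", "Grade"]` yields a single breakpoint at index 2, so the line and its clustered content lines would be split into two side-by-side stacks before mapping.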
15. A method for automatically extracting tabular data from an OCR document, the method comprising:
creating an output table for receiving the tabular data to be extracted from the OCR document, and a mapping between the tabular data to be extracted and the output table;
receiving the OCR document having a plurality of content cells containing the tabular data to be extracted;
determining, based on a general learn set, a document type contained in the received OCR document and selecting, based on the determined document type, a corresponding document specific learn set;
extracting the tabular data from the plurality of content cells by:
for each OCR scan line in the OCR document, identifying one or more strings and the position thereof and associating a content cell with each of the one or more identified strings, said one or more content cells forming a content cell line;
determining a string pattern for the identified one or more strings;
localizing one or more table column content cells by locating one or more repetitive string patterns within the one or more content cells of the content cell line;
performing line clustering to form one or more cluster groups of content cell lines found to have repetitive string patterns which exceed a predetermined confidence level when said repetitive string patterns are compared to one another;
comparing, for each ungrouped content cell line, the one or more strings of such ungrouped content cell line with a predefined set of corresponding header strings associated with respective column identifiers used in the output table and when the comparison exceeds a predetermined confidence threshold, setting such ungrouped content cell line as a header line and associating a header cell for each string of the one or more strings in the header line to form one or more header strings;
associating each header line and its corresponding header strings with a cluster group in the one or more cluster groups of content cell lines and the corresponding content cells therein and the corresponding strings of each content cell therein when a format of the corresponding string of the corresponding content cell is substantially the same as a format of corresponding header string;
determining, for each header line, whether or not a repetitive pattern in the header strings is present and, when a repetitive pattern is identified, localizing a location of a breakpoint to be between each repetitive pattern and identifying such header line and any corresponding cluster group of content cell lines as a multi-stack table;
determining for each header line and each content cell line in the corresponding cluster group of content cell lines whether one or more strings in each of such lines are missing patterns and repairing each said content cell line by providing a dummy data string corresponding to each missing pattern;
mapping, for each header string in each header line having a corresponding column identifier in the output table, the content cells of the corresponding group of content cell lines into a corresponding number of cells in the output table that are aligned with the column identifier.
Dependent claims: 16, 17, 18
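The repair and mapping steps that close claim 15 can be sketched as follows. This is an assumption-laden illustration rather than the patent's procedure: it supposes that each content cell carries its start column, that a cell belongs to the header column whose start position is nearest, and that the dummy data string is the placeholder `"<missing>"`. The names `DUMMY`, `repair_line`, and `map_to_table` are hypothetical.

```python
DUMMY = "<missing>"  # hypothetical dummy data string for a missing pattern

def repair_line(cells, column_starts, dummy=DUMMY):
    """Align (start, string) cells to the nearest column; fill gaps with dummy.

    Nearest-start alignment is an assumed heuristic, not taken from the claims.
    """
    row = [None] * len(column_starts)
    for start, text in cells:
        nearest = min(range(len(column_starts)),
                      key=lambda k: abs(column_starts[k] - start))
        # Two strings landing in the same column are joined with a space.
        row[nearest] = text if row[nearest] is None else row[nearest] + " " + text
    return [dummy if value is None else value for value in row]

def map_to_table(header_ids, column_starts, content_lines):
    """Build output rows keyed by column identifier from repaired content lines.

    Header strings without a column identifier (None) are skipped.
    """
    table = []
    for cells in content_lines:
        repaired = repair_line(cells, column_starts)
        table.append({column_id: value
                      for column_id, value in zip(header_ids, repaired)
                      if column_id is not None})
    return table

# Hypothetical header layout and two content cell lines; the second lacks credits.
header_ids = ["course_id", "course_title", "credits", "grade"]
column_starts = [0, 10, 24, 33]
content_lines = [
    [(0, "MATH 101"), (10, "Calculus I"), (24, "4"), (33, "A")],
    [(0, "PHYS 201"), (10, "Mechanics"), (33, "B")],
]
table = map_to_table(header_ids, column_starts, content_lines)
```

Filling the gap before mapping keeps every row the same width, so the grade "B" in the second line stays under the grade column instead of shifting left into the credits column.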
Specification