Methods for automatic structured extraction of data in OCR documents having tabular data

US 9,251,413 B2
Filed: 06/13/2014
Issued: 02/02/2016
Est. Priority Date: 06/14/2013
Status: Active Grant

9 Assignments

Abstract

Methods for selecting and extracting tabular data from the strings returned by optical character recognition (OCR), so that documents, including documents containing academic transcripts, can be processed automatically.

34 Citations

20 Claims

1. A method for automatically extracting tabular data from an OCR scan document, comprising:
- receiving the OCR scan document for processing;
  
  identifying, for each OCR scan line, any string and its position within the OCR scan line and forming a content cell line by associating a content cell with each identified string;
  
  localizing table column content cells by locating repetitive string patterns within each OCR scan line;
  
  forming a cluster group of OCR scan lines when the repetitive string patterns exceed a predetermined confidence level;
  
  determining for any ungrouped line that such ungrouped line is a header line when a comparison of one or more strings of such ungrouped line with a predefined set of corresponding header strings associated with respective column identifiers used in an output table for extracted data exceeds a predetermined confidence threshold;
  
  associating a header cell for each of the one or more strings in each determined header line;
  
  for each header line and its associated header cell, associating the group of cluster OCR scan lines and the corresponding content cell line and the corresponding strings of each content cell line with a corresponding header string in the header line when a format of the string of the corresponding content cell line is consistent with a format of the corresponding header string;
  
  for each header line, determining if a repetitive pattern of strings is present and if a repetitive pattern is determined, localizing the location of a breakpoint between each repetitive pattern and processing the header line and the corresponding content cell line as a multi-stack table;
  
  determining, for each header line, whether one or more header strings are missing patterns and repairing the header line missing patterns by providing a dummy data string for each missing pattern in the header strings;
  
  determining, for each content cell line, whether one or more content cell line strings are missing patterns and repairing the content cell line missing patterns by providing a dummy data string for each missing pattern in the content cell line strings; and
  
  for each header string having a corresponding column identifier in the output table, mapping the corresponding content cells into a corresponding number of cells in the output table aligned with the column identifier.
- Dependent claims (2, 3, 4, 5):
  - 2. The method of claim 1, wherein the identifying, the localizing, the forming, the determining the header line, the associating the header cell, the associating the group of the cluster OCR scan lines, the determining the repetitive pattern, the determining the missing patterns in the header strings, the determining the missing patterns in the content cell line strings, and the mapping are performed concurrently.
  - 3. The method of claim 1, further comprising receiving user input for at least one of a general learn set, a specific learn set, an arrangement of the output table, and a mapping of table columns of the OCR scan document to the output table.
  - 4. The method of claim 1, further comprising determining a document type contained in the OCR scan document and selecting an appropriate document specific learn set.
  - 5. The method of claim 4, wherein the determining the document type is determined based on a general learn set having a plurality of document types and on a document specific learn set for the document type.
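
The first steps of claim 1 (identifying strings and their positions, deriving string patterns, and clustering lines whose patterns repeat) can be illustrated with a minimal Python sketch. It is not the patented implementation: the two-space cell separator, the `A`/`9` pattern alphabet, the greedy clustering, and every function name below are assumptions made for illustration only.

```python
import re

def content_cells(line):
    """Return (column_offset, string) pairs for one OCR text line.

    Assumption: cells are separated by two or more spaces, so a single
    space stays inside a cell ("Calculus I" is one string).
    """
    return [(m.start(), m.group()) for m in re.finditer(r"\S+(?: \S+)*", line)]

def string_pattern(s):
    """Reduce a string to a coarse format: digits -> 9, letters -> A, runs collapsed."""
    p = re.sub(r"[0-9]", "9", s)
    p = re.sub(r"[A-Za-z]", "A", p)
    return re.sub(r"(.)\1+", r"\1", p)

def line_signature(line):
    """Tuple of the string patterns of all cells in a line."""
    return tuple(string_pattern(s) for _, s in content_cells(line))

def similarity(sig_a, sig_b):
    """Share of cell positions whose patterns agree (0.0 to 1.0)."""
    if not sig_a or not sig_b:
        return 0.0
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / max(len(sig_a), len(sig_b))

def cluster_lines(lines, confidence=0.5):
    """Greedily group line indices whose signature matches a cluster's first line.

    Returns (grouped, ungrouped): clusters with more than one line, and the
    indices of lines left on their own (header candidates).
    """
    clusters = []
    for i, line in enumerate(lines):
        sig = line_signature(line)
        for cluster in clusters:
            if similarity(sig, line_signature(lines[cluster[0]])) >= confidence:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    grouped = [c for c in clusters if len(c) > 1]
    ungrouped = [c[0] for c in clusters if len(c) == 1]
    return grouped, ungrouped

lines = [
    "Course    Title          Credits  Grade",
    "MATH101   Calculus I     3.0      A",
    "PHYS102   Mechanics      4.0      B+",
    "CHEM110   Chemistry Lab  1.5      A-",
]
grouped, ungrouped = cluster_lines(lines, confidence=0.5)
print(grouped, ungrouped)  # [[1, 2, 3]] [0]
```

In this toy transcript the three course rows share enough cell patterns to form one cluster, while the first line stays ungrouped and would be passed on to header detection.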

6. A method for automatically extracting tabular data from an OCR scan file, comprising:
- receiving the OCR scan file having a plurality of content cells containing the tabular data to be extracted;
  
  determining, based on a general learn set having a plurality of document types and a document specific learn set for each document type having information describing a tabular data expected to be contained therein, a document type contained in the received OCR scan file and selecting, based on the determined document type, the corresponding document specific learn set;
  
  extracting the tabular data from the plurality of content cells by:
  
  identifying, for each OCR scan line in the OCR scan file, one or more strings and the position thereof in the OCR scan line and associating the one or more identified strings with a corresponding content cell, said one or more content cells forming a content cell line;
  
  determining a string pattern for each of the identified one or more strings;
  
  localizing one or more table column content cells within the content cell line by locating a repetitive string pattern within the determined one or more string patterns in one or more content cells of the content cell line;
  
  forming one or more cluster groups of content cell lines found to have repetitive string patterns which exceed a predetermined confidence level when said repetitive string patterns are compared to one another;
  
  comparing, for each ungrouped content cell line, the one or more strings of such ungrouped content cell line with a predefined set of corresponding header strings associated with respective column identifiers used in an output table for extracted data;
  
  identifying such ungrouped content cell line as a header line and associating a header cell for each string of the one or more strings in the header line to form one or more header strings when the comparison exceeds a predetermined confidence threshold;
  
  associating each found header line and its corresponding one or more header strings with a cluster group in the one or more cluster groups of content cell lines and the corresponding strings of each corresponding content cell therein when a format of the corresponding string of the corresponding content cell is substantially the same as a format of corresponding header string;
  
  determining, for each found header line, whether or not a repetitive pattern in the header strings is present, and, when a repetitive pattern is determined, localizing a location of a breakpoint between each repeating pattern and identifying such determined header line and any corresponding cluster group of content cell lines as a multi-stack table;
  
  repairing each header line and each content cell line in the corresponding cluster group of content cell lines that is missing a pattern by providing a dummy data string corresponding to each missing pattern; and
  
  mapping, for each header string in each found header line having a corresponding column identifier in the output table, the content cells of the corresponding group of content cell lines into a corresponding number of cells in the output table aligned with the column identifier.
- Dependent claims (7, 8, 9, 10, 11, 12, 13, 14, 19, 20):
  - 7. The method of claim 6, wherein, for each document type specific learn set, the information describing the tabular data includes a user provided mapping of input column identifiers to the output table and a lexicon of header strings and respective column identifiers.
  - 8. The method of claim 6, wherein, for each document type specific learn set, the information describing the tabular data includes a user provided mapping of input column identifiers to the output table and a user provided format description of the tabular data to be extracted.
  - 9. The method of claim 6, wherein, for each document type specific learn set, the information describing the tabular data includes a user provided mapping of input column identifiers to the output table and a pre-defined set of words and strings.
  - 10. The method of claim 6, wherein, for each document type specific learn set, the information describing the tabular data includes a user provided mapping of input column identifiers to the output table and at least one page split line.
  - 11. The method of claim 6, wherein, for each document type specific learn set, the information describing the tabular data includes a lexicon of specific strings, the lexicon of specific strings including a specific header string associated with the corresponding document type, a plurality of specific format descriptions, a character set, a plurality of specific table column identifications for use in the output table, and a user provided mapping of headers found in the OCR scan file to be used in the output table.
  - 12. The method of claim 6, wherein upon determining that at least one of the header cells in a header line is empty, inserting a predetermined string having dummy data in a format corresponding to a string expected to be found in such empty header cell.
  - 13. The method of claim 6, further comprising determining whether multiple tables of tabular data are present in the OCR scan file and, if it is determined that multiple tables are present, determining a breakpoint therebetween.
  - 14. The method of claim 6, further comprising determining whether multiple tables of tabular data are present in the OCR scan file and, if it is determined that multiple tables are present, determining whether the multiple tables are vertically separated and rotating the OCR scan file so that the multiple tables are horizontal if it is determined that the multiple tables are vertically separated.
  - 19. The method of claim 12 further comprising determining whether at least two tables of tabular data are present in the OCR scan file and, when it is determined that the at least two tables are present, determining breakpoints therebetween.
  - 20. The method of claim 12 further comprising determining whether at least two tables of tabular data are present in the OCR scan file and, when it is determined that the at least two tables are present, determining whether the at least two tables are vertically separated and, on determining that the at least two tables are vertically separated, rotating the OCR document so that the two tables are horizontal.
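
Header detection against a predefined set of header strings, and the search for a repeating header pattern that marks a multi-stack table, can be sketched as follows. This is a hedged illustration, not the method as patented: the lexicon contents, the use of `difflib.SequenceMatcher` as the comparison, the mean-score confidence, and the 0.8 threshold are all assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical lexicon: known header spellings -> output-table column identifier.
HEADER_LEXICON = {
    "course": "course_id",
    "course no": "course_id",
    "title": "title",
    "credits": "credits",
    "cr": "credits",
    "grade": "grade",
}

def best_header_match(s):
    """Return (column_id, score) for the closest lexicon entry to string s."""
    s = s.strip().lower()
    best_id, best_score = None, 0.0
    for known, column_id in HEADER_LEXICON.items():
        score = SequenceMatcher(None, s, known).ratio()
        if score > best_score:
            best_id, best_score = column_id, score
    return best_id, best_score

def header_columns(strings, threshold=0.8):
    """Return column ids if the line looks like a header, else None.

    The line's confidence is taken to be the mean of the best match
    scores of its strings (an assumption, not the patent's measure).
    """
    if not strings:
        return None
    matches = [best_header_match(s) for s in strings]
    confidence = sum(score for _, score in matches) / len(matches)
    return [column_id for column_id, _ in matches] if confidence > threshold else None

def stack_breakpoints(column_ids):
    """Return the cell indices at which a repeating header pattern restarts.

    An empty list means no repetition was found (a single-stack table).
    """
    n = len(column_ids)
    for period in range(1, n // 2 + 1):
        if n % period == 0 and all(
            column_ids[i] == column_ids[i % period] for i in range(n)
        ):
            return list(range(period, n, period))
    return []

# "Titel" simulates an OCR misread of "Title".
print(header_columns(["Course", "Titel", "Credits", "Grade"]))
print(header_columns(["MATH101", "Calculus I", "3.0", "A"]))
print(stack_breakpoints(["course_id", "grade", "course_id", "grade"]))
```

A fuzzy comparison is used here so that a single OCR misread does not stop a line from being recognised as a header; a real system would draw the lexicon and thresholds from the document specific learn set.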

15. A method for automatically extracting tabular data from an OCR document, the method comprising:
- creating an output table for receiving the tabular data to be extracted from the OCR document, and a mapping between the tabular data to be extracted and the output table;
  
  receiving the OCR document having a plurality of content cells containing the tabular data to be extracted;
  
  determining, based on a general learn set, a document type contained in the received OCR document and selecting, based on the determined document type, a corresponding document specific learn set;
  
  extracting the tabular data from the plurality of content cells by:
  
  for each OCR scan line in the OCR document, identifying one or more strings and the position thereof and associating a content cell with each of the one or more identified strings, said one or more content cells forming a content cell line;
  
  determining a string pattern for the identified one or more strings;
  
  localizing one or more table column content cells by locating one or more repetitive string patterns within the one or more content cells of the content cell line;
  
  performing line clustering to form one or more cluster groups of content cell lines found to have repetitive string patterns which exceed a predetermined confidence level when said repetitive string patterns are compared to one another;
  
  comparing, for each ungrouped content cell line, the one or more strings of such ungrouped content cell line with a predefined set of corresponding header strings associated with respective column identifiers used in the output table and when the comparison exceeds a predetermined confidence threshold, setting such ungrouped content cell line as a header line and associating a header cell for each string of the one or more strings in the header line to form one or more header strings;
  
  associating each header line and its corresponding header strings with a cluster group in the one or more cluster groups of content cell lines and the corresponding content cells therein and the corresponding strings of each content cell therein when a format of the corresponding string of the corresponding content cell is substantially the same as a format of corresponding header string;
  
  determining, for each header line, whether or not a repetitive pattern in the header strings is present and, when a repetitive pattern is identified, localizing a location of a breakpoint to be between each repetitive pattern and identifying such header line and any corresponding cluster group of content cell lines as a multi-stack table;
  
  determining for each header line and each content cell line in the corresponding cluster group of content cell lines whether one or more strings in each of such lines are missing patterns and repairing each said content cell line by providing a dummy data string corresponding to each missing pattern;
  
  mapping, for each header string in each header line having a corresponding column identifier in the output table, the content cells of the corresponding group of content cell lines into a corresponding number of cells in the output table that are aligned with the column identifier.
- Dependent claims (16, 17, 18):
  - 16. The method of claim 15 wherein, for each document type learn set, the information describing the tabular data includes a user provided mapping of input column identifiers to the output table and at least one page split line.
  - 17. The method of claim 15 wherein, for each document type learn set, the information describing the tabular data includes a lexicon of specific strings having a specific header string associated with the corresponding document type, a plurality of specific format descriptions, a character set, a plurality of specific table column identifications for use in the output table, and a user-provided mapping of headers found in the OCR document to be used in the output table containing the extracted data.
  - 18. The method of claim 15 wherein upon determining that at least one of the header cells in a header line is empty, inserting a predetermined string having dummy data in a format corresponding to a string expected to be found in such empty header cell.
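
The final steps of the independent claims, repairing lines that are missing a pattern with dummy data strings and mapping content cells into the output table, can be sketched in a few lines. The sketch assumes that cells are aligned to the nearest header offset and that one dummy string per column is supplied; those choices, the sample offsets, and the function names are hypothetical and not taken from the patent.

```python
def repair_line(cells, header_offsets, dummies):
    """Align (offset, string) cells to header columns and fill the gaps.

    Each cell goes to the header column with the nearest offset; columns
    that receive no cell keep their dummy string. If two cells land on
    the same column, the later one wins (a limitation of this sketch).
    """
    if not header_offsets:
        return []
    row = list(dummies)
    for offset, text in cells:
        column = min(
            range(len(header_offsets)),
            key=lambda i: abs(header_offsets[i] - offset),
        )
        row[column] = text
    return row

def map_to_output(header_ids, rows, output_columns):
    """Keep only the columns whose identifier exists in the output table."""
    keep = [(i, cid) for i, cid in enumerate(header_ids) if cid in output_columns]
    return [{cid: row[i] for i, cid in keep} for row in rows]

header_ids = ["course_id", "title", "credits", "grade"]
header_offsets = [0, 10, 25, 34]
dummies = ["XXXX000", "UNKNOWN", "0.0", "?"]  # one placeholder per column

content_lines = [
    [(0, "MATH101"), (10, "Calculus I"), (25, "3.0"), (34, "A")],
    [(0, "PHYS102"), (10, "Mechanics"), (34, "B+")],  # credits cell missing
]
rows = [repair_line(cells, header_offsets, dummies) for cells in content_lines]
table = map_to_output(header_ids, rows, {"course_id", "credits", "grade"})
print(rows[1])   # ['PHYS102', 'Mechanics', '0.0', 'B+']
print(table[1])  # {'course_id': 'PHYS102', 'credits': '0.0', 'grade': 'B+'}
```

Filling the gap before mapping keeps every row the same length as the header line, so each output column stays aligned with its column identifier even when the OCR dropped a cell.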

Current Assignee: Hyland Switzerland, Sarl
Original Assignee: Lexmark International Technology, S.A. (Ninestar Corporation)
Inventors: Meier, Ralph; Urbschat, Harry; Wanschura, Thorsten; Hausmann, Johannes
Primary Examiner(s): Werner, Brian P
Application Number: US14/304,181
Publication Number: US 20140369602A1
Time in Patent Office: 599 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
  G06F 18/23   Clustering techniques
  G06V 30/412   Layout analysis of document...
  G06V 30/414   Extracting the geometrical ...
  G06V 30/416   Extracting the logical stru...
