Systems and methods for extracting table information from documents

US 9,495,347 B2
Filed: 07/16/2013
Issued: 11/15/2016
Est. Priority Date: 07/16/2013
Status: Active Grant

Abstract

Systems and methods for extracting table information from documents are provided herein. Exemplary methods may include annotating a document with annotations that identify table cell data included therein, generating a candidate table for each of a plurality of table models using the annotated table cell data, scoring each of the candidate tables, selecting a highest scoring candidate table, and annotating the highest scoring table to produce a final table.
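
The flow described in the abstract (one candidate table per table model, a score per candidate, and selection of the highest scoring candidate) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the patent: the `(label, text)` cell representation, the `build_candidate` and `fill_score` helpers, the simple row-wise placement rule, and the use of the fraction of filled cells as a stand-in score.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Cell = Tuple[str, str]               # hypothetical: (annotation label, cell text)
Table = List[List[Optional[str]]]    # rows of cells; None marks an empty cell


def build_candidate(model: Sequence[str], cells: Sequence[Cell]) -> Table:
    """Place annotated cells into the model's columns, row by row (assumed rule)."""
    rows: Table = []
    current: List[Optional[str]] = [None] * len(model)
    for label, text in cells:
        if label not in model:
            continue                       # annotation not used by this model
        col = model.index(label)
        if current[col] is not None:       # column already filled: start a new row
            rows.append(current)
            current = [None] * len(model)
        current[col] = text
    if any(value is not None for value in current):
        rows.append(current)
    return rows


def fill_score(table: Table) -> float:
    """Stand-in score: fraction of cells that are filled."""
    total = sum(len(row) for row in table)
    filled = sum(value is not None for row in table for value in row)
    return filled / total if total else 0.0


def best_candidate(models: Dict[str, Sequence[str]],
                   cells: Sequence[Cell]) -> Tuple[str, Table]:
    """Build one candidate per model and return the highest scoring one."""
    candidates = {name: build_candidate(cols, cells) for name, cols in models.items()}
    best = max(candidates, key=lambda name: fill_score(candidates[name]))
    return best, candidates[best]


cells = [("issuer", "Acme"), ("rating", "AA+"), ("issuer", "Bolt"), ("rating", "BBB")]
models = {
    "issuer_rating": ["issuer", "rating"],
    "issuer_rating_date": ["issuer", "rating", "date"],
}
name, table = best_candidate(models, cells)
print(name, table)
```

In this toy input the two-column model wins because the three-column model leaves its `date` column empty in every row.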

24 Claims

1. A method, for extracting table information from an unstructured document using a table extraction system that comprises a processor and table extraction logic stored in memory, wherein the processor executes the table extraction logic to perform operations comprising:
- annotating text of the unstructured document with annotations using domain knowledge of the unstructured document to produce annotated table cell data;
  
  generating a candidate table for each of a plurality of table models using the annotated table cell data, wherein the plurality of table models for the unstructured document are selected by determining a domain for the unstructured document; and
  
  selecting table models that are suitable for use with the determined domain;
  
  scoring each of the candidate tables;
  
  selecting a highest scoring candidate table; and
  
  providing the highest scoring candidate table.
- Dependent claims (2 to 14):
  - 2. The method according to claim 1, wherein the document includes an unstructured document and at least some of the text of the document is not in a table format.
  - 3. The method according to claim 1, wherein generating a candidate table for each of a plurality of table models further comprises:
    - parsing the annotated table cell data in the unstructured document;
      
      collecting the annotated table cell data for a table model; and
      
      populating columns of a candidate table using the annotated table cell data, using a filling strategy.
  - 4. The method according to claim 3, wherein the filling strategy comprises any of shortest column filling, row-wise filling, row-wise filling with line breaks, and column-wise filling.
  - 5. The method according to claim 3, further comprising:
    - calculating a fraction of filled cells and a filling strategy specific score;
      
      comparing a product of the fraction of filled cells and the filling strategy specific score to a threshold; and
      
      adding the candidate table to an index if the product meets or exceeds the threshold.
  - 6. The method according to claim 5, further comprising:
    - generating a set of scoring features for each candidate table; and
      
      generating a table score for each candidate table using a linear combination of the set of scoring features.
  - 7. The method according to claim 1, further comprising terminating collection of annotated table cell data upon encountering a stop token annotation.
  - 8. The method according to claim 1, further comprising normalizing missing or ambiguous annotations in a candidate table.
  - 9. The method according to claim 1, wherein scoring each of the candidate tables comprises calculating any of:
    - a number of empty cells in a candidate table; and
      
      a number of empty rows in a candidate table.
  - 10. The method according to claim 1, further comprising generating a table model from a user defined set of column types and column parameters.
  - 11. The method according to claim 10, further comprising establishing a stop token for the table model, wherein presence of the stop token in the table model causes the termination of collection of annotated table cell data for the candidate table.
  - 12. The method according to claim 10, further comprising establishing a maximum distance for a table model, wherein when two table cells are spaced apart from one another within the unstructured document at a distance less than or equal to the maximum distance, the two table cells are added to the table model.
  - 13. The method according to claim 1, wherein an annotation associated with the annotated table cell data is at least one of:
    - an optional annotation, such that when a column is annotated as optional two candidate tables are created, one with the optional column and one without;
      
      a group annotation, such that when a column is annotated as a group at least some annotated table cell data is filled into the group;
      
      a required annotation, such that when a column is annotated as required, any row missing an annotation for that column is removed;
      
      a non-null annotation, such that when a column is annotated as non-null, if a candidate table lacks this column, the candidate table will not be annotated;
      
      a unique annotation, such that when a column is annotated as unique, all table cells within the unique column include a unique annotated table cell data value;
      
      a non-overlapping column annotation, such that when a column is annotated as non-overlapping, only annotations which do not overlap with previously collected annotations are allowed in the column;
      
      an end line column annotation, such that when a column is annotated as an end line, a current line will end after an annotation is inserted into the column; and
      
      a new line column annotation, such that when a column is annotated as new line column, a new line will begin when annotated table cell data is inserted into this column.
  - 14. The method according to claim 1, wherein the unstructured document includes a document that is streamed to the table extraction system, the method of claim 1 being executed during streaming of the document to the table extraction system.
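
The threshold test of claim 5 and the scoring of claims 6 and 9 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the table representation (`None` for an empty cell), the function names, the example strategy score, and the weights are all hypothetical. The claims state only that a product is compared to a threshold and that the table score is a linear combination of scoring features.

```python
from typing import Dict, List, Optional

Table = List[List[Optional[str]]]    # hypothetical: None marks an empty cell


def fraction_filled(table: Table) -> float:
    """Fraction of cells in the candidate table that hold a value."""
    total = sum(len(row) for row in table)
    filled = sum(cell is not None for row in table for cell in row)
    return filled / total if total else 0.0


def passes_threshold(table: Table, strategy_score: float, threshold: float) -> bool:
    """Claim 5: keep the candidate if fill fraction x strategy score meets the threshold."""
    return fraction_filled(table) * strategy_score >= threshold


def scoring_features(table: Table) -> Dict[str, float]:
    """Claim 9: count empty cells and empty rows as scoring features."""
    return {
        "empty_cells": float(sum(cell is None for row in table for cell in row)),
        "empty_rows": float(sum(all(cell is None for cell in row) for row in table)),
    }


def table_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Claim 6: linear combination of the scoring features."""
    return sum(weights[name] * value for name, value in features.items())


candidate = [["Acme", "AA+"], ["Bolt", None], [None, None]]
weights = {"empty_cells": -1.0, "empty_rows": -2.0}   # assumed weights, for illustration

index = []
if passes_threshold(candidate, strategy_score=0.9, threshold=0.4):
    index.append(candidate)                            # candidate is added to the index

print(table_score(scoring_features(candidate), weights))  # prints -5.0
```

Negative weights are used here so that candidates with more empty cells or rows score lower; the source does not state the sign or magnitude of any weight.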

15. A table extraction system for extracting table information from a document, the table extraction system comprising:
- a processor; and
  
  a memory for storing table extraction logic that when executed by the processor causes the table extraction system to:
  
  annotate text of a document with basic annotations to produce annotated table cell data, the basic annotations being determined from domain knowledge regarding the document;
  
  extract the annotated table cell data from the document using a table model;
  
  verify the extractions using the domain knowledge; and
  
  generate an extracted table from the annotated table cell data using a table filling strategy.
- Dependent claims (16, 17):
  - 16. The system according to claim 15, wherein the document includes a document that has been processed using optical character recognition (OCR), wherein the document comprises OCR distorted table elements, wherein the table extraction system corrects the OCR distorted table elements by verifying extractions using domain knowledge.
  - 17. The system according to claim 15, wherein the processor further executes the table extraction logic to normalize the extracted table to format the extracted table in accordance with a desired output format.

18. A method for automatic generation of table models using a table extraction system that comprises a processor and table extraction logic stored in memory, wherein the processor executes the table extraction logic to perform operations comprising:
- receiving a column set comprising column types and column parameters, wherein each column in the column set is annotated as optional or non-optional; and
  
  generating a plurality of table models from permutations of combinations of columns in the column set.
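
One way to read "permutations of combinations of columns" in claim 18 is sketched below. The interpretation is an assumption: non-optional columns appear in every model, each subset of the optional columns may be added, and every ordering of the resulting columns yields one table model. The function name and the `(name, is_optional)` input format are hypothetical, and column parameters are omitted for brevity.

```python
from itertools import combinations, permutations
from typing import List, Sequence, Tuple


def generate_table_models(columns: Sequence[Tuple[str, bool]]) -> List[Tuple[str, ...]]:
    """Return column orderings for every allowed combination of columns.

    `columns` is a sequence of (column name, is_optional) pairs.
    """
    required = [name for name, is_optional in columns if not is_optional]
    optional = [name for name, is_optional in columns if is_optional]

    models: List[Tuple[str, ...]] = []
    for k in range(len(optional) + 1):
        for extra in combinations(optional, k):          # which optional columns to include
            for order in permutations(required + list(extra)):   # every column ordering
                models.append(order)
    return models


column_set = [("issuer", False), ("rating", False), ("outlook", True)]
models = generate_table_models(column_set)
print(len(models))  # prints 8: 2 orderings without "outlook" plus 6 with it
```

The number of models grows factorially with the number of columns, so a practical system would presumably restrict the orderings it considers; the claim itself does not say how.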

19. A method for extracting table information from an unstructured document using a table extraction system that comprises a processor and table extraction logic stored in memory, wherein the processor executes the table extraction logic to perform operations comprising:
- receiving an unstructured document;
  
  processing the unstructured document using optical character recognition (OCR) to identify table cell data;
  
  annotating the table cell data to produce annotated table cell data, the annotating using domain knowledge of the unstructured document to establish one or more rows and columns for placement of the annotated table cell data into candidate tables;
  
  generating a candidate table for each of a plurality of table models using the annotated table cell data and the established one or more rows and columns;
  
  scoring each of the candidate tables based on alignment of the established one or more rows and columns of the candidate tables;
  
  selecting a highest scoring candidate table, the highest scoring candidate table having a greatest alignment among the candidate tables; and
  
  displaying the highest scoring candidate table as a final table via a user interface (UI).
- Dependent claims (20 to 24):
  - 20. The method according to claim 19, wherein the table cell data is related to bond ratings.
  - 21. The method according to claim 19, wherein the candidate tables are generated using at least one of a:
    - shortest column filling strategy, row-wise filling strategy, and column-wise filling strategy.
  - 22. The method according to claim 19, wherein the candidate tables are normalized to resolve OCR distortions.
  - 23. The method according to claim 19, wherein the final table is displayed in a classic row/column layout using at least one of:
    - whitespace, supporting lines, and coloring.
  - 24. The method according to claim 19, wherein annotations for the annotated table cell data are at least one of:
    - optional annotations, such that when a column is annotated as optional two candidate tables are created, one with the optional column and one without;
      
      group annotations, such that when a column is annotated as a group at least some table cell data is filled into the group;
      
      required annotations, such that when a column is annotated as required, any row missing an annotation for that column is removed;
      
      non-null annotations, such that when a column is annotated as non-null, if a candidate table lacks this column, the candidate table will not be annotated;
      
      unique annotations, such that when a column is annotated as unique, all table cells within the unique column include a unique table cell data value;
      
      non-overlapping column annotations, such that when a column is annotated as non-overlapping, only annotations which do not overlap with previously collected annotations are allowed in the column;
      
      end line column annotations, such that when a column is annotated as an end line, a current line will end after an annotation is inserted into the column; and
      
      new line column annotations, such that when a column is annotated as a new line column, a new line will begin when table cell data is inserted into the column.

Current Assignee: Open Text Holdings, Inc. (Open Text Corporation)
Original Assignee: Recommind Incorporated (Open Text Corporation)
Inventors: Stadermann, Jan; Symons, Stephan; Thon, Ingo
Primary Examiner(s): HUYNH, CONG LAC T
Application Number: US13/943,668
Publication Number: US 20150026556A1
Time in Patent Office: 1,218 Days
Field of Search: 715/227, 715/228, 715/212, 715/230, 715/232, 715/200
US Class Current: 1/1
CPC Class Codes:
G06F 40/177 of tables; using ruled lines
G06F 40/183 Tabulation, i.e. one-dimens...
