Systems and Methods for Extracting Table Information from Documents
Abstract
Systems and methods for extracting table information from documents are provided herein. Exemplary methods may include annotating a document with annotations that identify table cell data included therein, generating a candidate table for each of a plurality of table models using the annotated table cell data, scoring each of the candidate tables, selecting a highest scoring candidate table, and annotating the highest scoring table to produce a final table.
26 Claims
1. A method, for extracting table information from an unstructured document using a table extraction system that comprises a processor and table extraction logic stored in memory, wherein the processor executes the table extraction logic to perform operations comprising:
annotating text of a document with annotations using domain knowledge of the unstructured document to produce annotated table cell data;
generating a candidate table for each of a plurality of table models using the annotated table cell data;
scoring each of the candidate tables;
selecting a highest scoring candidate table; and
providing the highest scoring candidate table.
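As a rough illustration of the generate, score, and select operations recited in claim 1, the following sketch may help. The names `TableModel`, `build_candidate`, `score`, and `extract_best_table` are hypothetical, and the scoring rule (fraction of filled cells) is an assumption chosen for brevity; the claim itself does not prescribe either.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableModel:
    """Hypothetical table model: an ordered tuple of column types."""
    columns: tuple


def build_candidate(annotated_lines, model):
    """Fill one candidate table. Each input line becomes a row, and each
    column takes the first annotation of the matching type, else None."""
    table = []
    for line in annotated_lines:  # line: list of (column_type, value) pairs
        row = [next((v for t, v in line if t == col), None) for col in model.columns]
        table.append(row)
    return table


def score(table):
    """Assumed scoring rule: fraction of non-empty cells."""
    cells = [cell for row in table for cell in row]
    return sum(cell is not None for cell in cells) / len(cells) if cells else 0.0


def extract_best_table(annotated_lines, models):
    """Generate one candidate per model, score each, and return the best pair."""
    candidates = [(m, build_candidate(annotated_lines, m)) for m in models]
    return max(candidates, key=lambda pair: score(pair[1]))


annotated_lines = [
    [("date", "2013-01-02"), ("amount", "10.00")],
    [("date", "2013-01-03"), ("amount", "12.50")],
]
models = [
    TableModel(("date", "amount")),
    TableModel(("date", "amount", "currency")),
]
best_model, best_table = extract_best_table(annotated_lines, models)
```

In this example the two-column model wins because the three-column model leaves every `currency` cell empty.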
comparing a product of the fraction of filled cells and the filling strategy specific score to a threshold; and adding the candidate table to an index if the product meets or exceeds the threshold.
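The threshold test described above can be sketched as follows. The function name `should_index`, the candidate dictionary, and the numeric values are illustrative assumptions, not values taken from the patent.

```python
def should_index(filled_cells, total_cells, strategy_score, threshold):
    """Return True when (fraction of filled cells) * (filling strategy
    specific score) meets or exceeds the threshold."""
    if total_cells == 0:
        return False  # an empty table cannot meet any threshold (assumption)
    fill_fraction = filled_cells / total_cells
    return fill_fraction * strategy_score >= threshold


index = []
candidate = {"name": "candidate_a", "filled": 9, "total": 12, "strategy_score": 0.8}
if should_index(candidate["filled"], candidate["total"],
                candidate["strategy_score"], threshold=0.5):
    index.append(candidate["name"])
```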
8. The method according to claim 7, further comprising:
generating a set of scoring features for each candidate table; and
generating a table score for each candidate table using a linear combination of the set of scoring features.
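A linear combination of scoring features, as recited in claim 8, might look like the sketch below. The feature names (`fill_fraction`, `empty_rows`) and the weights are assumptions made for illustration; the claim does not name specific features or weights.

```python
def table_score(features, weights, bias=0.0):
    """Weighted sum of feature values; every feature must have a weight."""
    return bias + sum(weights[name] * value for name, value in features.items())


# Hypothetical features and weights for one candidate table.
features = {"fill_fraction": 0.5, "empty_rows": 2.0}
weights = {"fill_fraction": 2.0, "empty_rows": -0.25}
result = table_score(features, weights)
```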
9. The method according to claim 1, further comprising normalizing missing or ambiguous annotations in a candidate table.
10. The method according to claim 1, wherein scoring each of the candidate tables comprises calculating any of:
a number of empty cells in a candidate table; and
a number of empty rows in a candidate table.
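The two counts in claim 10 can be computed in a few lines. This sketch assumes a candidate table is a list of rows and that an empty cell is represented by `None` or an empty string; both representations are assumptions.

```python
def is_empty(cell):
    """Assumed convention: None or an empty string marks an empty cell."""
    return cell is None or cell == ""


def count_empty(table):
    """Return (number of empty cells, number of rows that are entirely empty)."""
    empty_cells = sum(is_empty(cell) for row in table for cell in row)
    empty_rows = sum(all(is_empty(cell) for cell in row) for row in table)
    return empty_cells, empty_rows


table = [
    ["a", None],
    [None, None],
    ["b", "c"],
]
counts = count_empty(table)
```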
11. The method according to claim 1, further comprising generating a table model from a user defined set of column types and column parameters.
12. The method according to claim 1, wherein an annotation associated with annotated table cell data is at least one of:
an optional annotation, such that when a column is annotated as optional two candidate tables are created, one with the optional column and one without;
a group annotation, such that when a column is annotated as a group at least some annotated table cell data is filled into the group;
a required annotation, such that when a column is annotated as required, any row missing an annotation for that column is removed;
a non-null annotation, such that when a column is annotated as non-null, if a candidate table lacks this column, the candidate table will not be annotated;
a unique annotation, such that when a column is annotated as unique, all table cells within the unique column include a unique annotated table cell data value;
a non-overlapping column annotation, such that when a column is annotated as non-overlapping, only annotations which do not overlap with previously collected annotations are allowed in the column;
an end line column annotation, such that when a column is annotated as an end line, a current line will end after an annotation is inserted into the column; and
a new line column annotation, such that when a column is annotated as new line column, a new line will begin when annotated table cell data is inserted into this column.
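Three of the column annotations listed in claim 12 (optional, required, and unique) lend themselves to a short sketch. The helper names and the row representation (a dictionary keyed by column name) are hypothetical; the remaining annotations are not modeled here.

```python
def expand_optional(columns):
    """columns: list of (name, is_optional). Each optional column yields two
    variants, one with the column and one without."""
    variants = [[]]
    for name, optional in columns:
        with_col = [v + [name] for v in variants]
        variants = with_col + variants if optional else with_col
    return variants


def drop_rows_missing_required(rows, required):
    """Remove any row that lacks a value for a required column."""
    return [row for row in rows if all(row.get(col) is not None for col in required)]


def is_unique(rows, column):
    """True when no two rows share a value in the given column."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return len(values) == len(set(values))


variants = expand_optional([("date", False), ("memo", True)])
rows = [{"date": "d1", "amount": "1"}, {"date": None, "amount": "2"}]
kept = drop_rows_missing_required(rows, required=["date"])
```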
13. The method according to claim 11, further comprising establishing a stop token for the table model, wherein presence of the stop token in the table model causes the termination of collection of annotated table cell data for the candidate table.
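One possible reading of the stop token in claim 13 is that collection halts as soon as an annotation of the stop type is encountered. The sketch below follows that reading; the function name and the `"total"` stop type are assumptions.

```python
def collect_until_stop(annotations, stop_token):
    """Collect (column_type, value) pairs until the stop token type appears."""
    collected = []
    for column_type, value in annotations:
        if column_type == stop_token:
            break  # terminate collection for this candidate table
        collected.append((column_type, value))
    return collected


annotations = [
    ("date", "2013-01-02"),
    ("amount", "10.00"),
    ("total", "Total:"),      # hypothetical stop token
    ("amount", "22.50"),      # ignored: appears after the stop token
]
collected = collect_until_stop(annotations, stop_token="total")
```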
14. The method according to claim 11, further comprising establishing a maximum distance for a table model, wherein when two table cells are spaced apart from one another within the document at a distance less than or equal to the maximum distance, the two table cells are added to the table model.
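The maximum distance rule of claim 14 can be sketched under the assumption that each table cell carries a character offset within the document and that collection stops at the first gap larger than the maximum. Both the offset representation and the stop-at-first-gap behavior are assumptions.

```python
def collect_within_distance(cells, max_distance):
    """cells: list of (offset, value) sorted by offset. Keep adding cells while
    the gap to the previous cell is at most max_distance."""
    if not cells:
        return []
    collected = [cells[0]]
    for previous, current in zip(cells, cells[1:]):
        if current[0] - previous[0] > max_distance:
            break  # too far apart: treat as outside this table
        collected.append(current)
    return collected


cells = [(0, "a"), (5, "b"), (40, "c")]
nearby = collect_within_distance(cells, max_distance=10)
```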
15. The method according to claim 1, wherein the document includes a document that is streamed to the table extraction system, the method of claim 1 being executed during streaming of the document to the table extraction system.
16. A table extraction system for extracting table information from a document, the table extraction system comprising:
a processor; and
a memory for storing table extraction logic that when executed by the processor causes the table extraction system to:
annotate text of a document with basic annotations to produce annotated table cell data, the basic annotations being determined from domain knowledge regarding the document;
extract the annotated table cell data from the document using a table model;
verify the extractions using the domain knowledge; and
generate an extracted table from the annotated table cell data using a table filling strategy.
19. A method for automatic generation of table models using a table extraction system that comprises a processor and table extraction logic stored in memory, wherein the processor executes the table extraction logic to perform operations comprising:
receiving a column set comprising column types and column parameters, wherein each column in the column set is annotated as optional or non-optional; and
generating a plurality of table models from permutations of combinations of columns in the column set.
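One way to read "permutations of combinations" in claim 19 is: take every subset that contains all non-optional columns plus any choice of optional columns, then enumerate every ordering of each subset. The sketch below follows that interpretation, which is an assumption; `generate_table_models` is a hypothetical name.

```python
from itertools import combinations, permutations


def generate_table_models(column_set):
    """column_set: list of (name, is_optional). Returns every ordering of every
    column subset that includes all non-optional columns."""
    required = [name for name, optional in column_set if not optional]
    optional = [name for name, optional in column_set if optional]
    models = []
    for k in range(len(optional) + 1):
        for extra in combinations(optional, k):
            for ordering in permutations(required + list(extra)):
                if ordering:  # skip the empty model
                    models.append(ordering)
    return models


column_set = [("date", False), ("amount", False), ("memo", True)]
models = generate_table_models(column_set)
```

With two non-optional columns and one optional column this yields 2 orderings without `memo` and 6 with it. The count grows factorially with the number of columns, so a practical system would likely need to prune.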
21. A method for extracting table information from an unstructured document using a table extraction system that comprises a processor and table extraction logic stored in memory, wherein the processor executes the table extraction logic to perform operations comprising:
receiving an unstructured document;
processing the unstructured document using optical character recognition (OCR) to identify table cell data;
annotating the table cell data to produce annotated table cell data, the annotating using domain knowledge of the unstructured document to establish one or more rows and columns for placement of the annotated table cell data into candidate tables;
generating a candidate table for each of a plurality of table models using the annotated table cell data and the established one or more rows and columns;
scoring each of the candidate tables based on alignment of the established one or more rows and columns of the candidate tables;
selecting a highest scoring candidate table, the highest scoring candidate table having a greatest alignment among the candidate tables; and
displaying the highest scoring candidate table as a final table via a user interface (UI).
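Claim 21 scores candidates by alignment but does not define an alignment measure. As one assumed possibility, the sketch below treats each cell as the horizontal offset reported by OCR and rewards columns whose cells share similar offsets. The formula and the name `alignment_score` are illustrative only.

```python
from statistics import pstdev


def alignment_score(table):
    """table: list of rows; each cell is an x-offset (number) or None.
    Returns a value in (0, 1]; 1.0 means every column is perfectly aligned."""
    n_cols = max((len(row) for row in table), default=0)
    spreads = []
    for c in range(n_cols):
        xs = [row[c] for row in table if c < len(row) and row[c] is not None]
        if len(xs) >= 2:
            spreads.append(pstdev(xs))  # spread of offsets within one column
    if not spreads:
        return 0.0  # nothing to compare (assumed convention)
    return 1.0 / (1.0 + sum(spreads) / len(spreads))


aligned = [[10, 50], [10, 50]]
ragged = [[10, 50], [30, 90]]
best = max([aligned, ragged], key=alignment_score)
```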
Specification