Systems and methods for extracting table information from documents
Abstract
Systems and methods for extracting table information from documents are provided herein. Exemplary methods may include annotating a document with annotations that identify table cell data included therein, generating a candidate table for each of a plurality of table models using the annotated table cell data, scoring each of the candidate tables, selecting a highest scoring candidate table, and annotating the highest scoring table to produce a final table.
24 Claims
1. A method for extracting table information from an unstructured document using a table extraction system that comprises a processor and table extraction logic stored in memory, wherein the processor executes the table extraction logic to perform operations comprising:
annotating text of the unstructured document with annotations using domain knowledge of the unstructured document to produce annotated table cell data;
generating a candidate table for each of a plurality of table models using the annotated table cell data, wherein the plurality of table models for the unstructured document are selected by determining a domain for the unstructured document; and
selecting table models that are suitable for use with the determined domain;
scoring each of the candidate tables;
selecting a highest scoring candidate table; and
providing the highest scoring candidate table.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
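The flow recited in claim 1 (annotate, generate one candidate table per table model, score, select the highest scoring candidate) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `invoice` domain, the regular expressions standing in for domain knowledge, the three table models, and the scoring rule (count of cells placed). The domain is taken as an input, on the assumption that it was determined upstream.

```python
import re

# Hypothetical domain knowledge: patterns that label a token with a column type.
DOMAIN_PATTERNS = {
    "invoice": {
        "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
        "amount": re.compile(r"\$\d+(\.\d{2})?"),
        "quantity": re.compile(r"\d+"),
    },
}

# Hypothetical table models per domain: each model is an ordered tuple of column types.
TABLE_MODELS = {
    "invoice": [
        ("date", "amount"),
        ("date", "quantity", "amount"),
        ("quantity", "amount"),
    ],
}


def annotate(lines, domain):
    """Label every whitespace-separated token with the first matching type, or None."""
    patterns = DOMAIN_PATTERNS[domain]
    annotated = []
    for line in lines:
        row = []
        for token in line.split():
            label = next(
                (name for name, rx in patterns.items() if rx.fullmatch(token)), None
            )
            row.append((token, label))
        annotated.append(row)
    return annotated


def generate_candidate(annotated, model):
    """Keep the lines whose sequence of labelled tokens matches the model exactly."""
    rows = []
    for row in annotated:
        labelled = [(tok, lab) for tok, lab in row if lab is not None]
        if tuple(lab for _, lab in labelled) == model:
            rows.append([tok for tok, _ in labelled])
    return rows


def score(candidate):
    """Assumed scoring rule: the number of cells the candidate table holds."""
    return sum(len(row) for row in candidate)


def extract_table(lines, domain):
    """Build one candidate per model for the domain and return the best (model, table)."""
    annotated = annotate(lines, domain)
    candidates = [(m, generate_candidate(annotated, m)) for m in TABLE_MODELS[domain]]
    return max(candidates, key=lambda pair: score(pair[1]))


lines = [
    "2020-01-05 3 $12.50",
    "2020-01-06 1 $4.00",
    "Total due $16.50",
]
best_model, best_table = extract_table(lines, "invoice")
print(best_model)
print(best_table)
```

In this sketch the three-column model wins because it is the only one whose column sequence matches the two data lines; the "Total due" line matches no model and is left out of every candidate.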
15. A table extraction system for extracting table information from a document, the table extraction system comprising:
a processor; and
a memory for storing table extraction logic that when executed by the processor causes the table extraction system to:
annotate text of a document with basic annotations to produce annotated table cell data, the basic annotations being determined from domain knowledge regarding the document;
extract the annotated table cell data from the document using a table model;
verify the extractions using the domain knowledge; and
generate an extracted table from the annotated table cell data using a table filling strategy.
Dependent claims: 16, 17
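The verification and table filling steps of claim 15 are not detailed in the claim itself, so the following Python sketch is only one possible reading. It assumes that verification means checking each extracted cell against a per-column rule, and that the table filling strategy is a forward fill that copies a missing cell from the nearest row above. The names `DOMAIN_CHECKS`, `verify`, and `fill_down` are hypothetical.

```python
from datetime import datetime


def is_valid_date(text):
    """Domain check: the text must be a real calendar date in YYYY-MM-DD form."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def is_valid_amount(text):
    """Domain check: a dollar sign followed by digits with at most one decimal point."""
    return text.startswith("$") and text[1:].replace(".", "", 1).isdigit()


# Hypothetical domain knowledge: one validity check per column type.
DOMAIN_CHECKS = {"date": is_valid_date, "amount": is_valid_amount}


def verify(rows, model):
    """Replace every cell that is empty or fails its column's check with None."""
    return [
        [cell if cell and DOMAIN_CHECKS[col](cell) else None
         for col, cell in zip(model, row)]
        for row in rows
    ]


def fill_down(rows):
    """Assumed filling strategy: copy a missing cell from the nearest row above."""
    filled, previous = [], None
    for row in rows:
        if previous is not None:
            row = [cell if cell is not None else prev
                   for cell, prev in zip(row, previous)]
        filled.append(row)
        previous = row
    return filled


model = ("date", "amount")
extracted = [
    ["2020-01-05", "$12.50"],
    ["2020-13-40", "$4.00"],   # month 13 fails the date check
    [None, "$7.25"],           # date cell was not extracted
]
table = fill_down(verify(extracted, model))
print(table)
```

Forward filling suits layouts where a value such as a date is printed once for a group of rows; other strategies, such as leaving the cell empty, would fit the same interface.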
18. A method for automatic generation of table models using a table extraction system that comprises a processor and table extraction logic stored in memory, wherein the processor executes the table extraction logic to perform operations comprising:
receiving a column set comprising column types and column parameters, wherein each column in the column set is annotated as optional or non-optional; and
generating a plurality of table models from permutations of combinations of columns in the column set.
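One way to read "permutations of combinations" in claim 18 is sketched below in Python. The sketch assumes that every table model must contain all non-optional columns, may contain any subset of the optional ones (the combinations), and may arrange its columns in any order (the permutations). The `Column` class and its fields are hypothetical stand-ins for the claimed column types and column parameters.

```python
from dataclasses import dataclass
from itertools import combinations, permutations


@dataclass(frozen=True)
class Column:
    """Hypothetical column description: a type, free-form parameters, and an optional flag."""
    type: str
    params: tuple = ()
    optional: bool = False


def generate_table_models(column_set):
    """Return every ordering of every allowed column subset as a tuple of Columns."""
    required = [c for c in column_set if not c.optional]
    optional = [c for c in column_set if c.optional]
    models = []
    for k in range(len(optional) + 1):
        # Choose which optional columns join the required ones.
        for extras in combinations(optional, k):
            # Emit every left-to-right ordering of the chosen columns.
            models.extend(permutations(required + list(extras)))
    return models


column_set = [
    Column("date", params=("YYYY-MM-DD",)),
    Column("amount", params=("USD",)),
    Column("memo", optional=True),
]
models = generate_table_models(column_set)
for model in models:
    print([column.type for column in model])
```

The number of models grows factorially with the number of columns, so a practical system would likely prune orderings that the domain rules out.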
19. A method for extracting table information from an unstructured document using a table extraction system that comprises a processor and table extraction logic stored in memory, wherein the processor executes the table extraction logic to perform operations comprising:
receiving an unstructured document;
processing the unstructured document using optical character recognition (OCR) to identify table cell data;
annotating the table cell data to produce annotated table cell data, the annotating using domain knowledge of the unstructured document to establish one or more rows and columns for placement of the annotated table cell data into candidate tables;
generating a candidate table for each of a plurality of table models using the annotated table cell data and the established one or more rows and columns;
scoring each of the candidate tables based on alignment of the established one or more rows and columns of the candidate tables;
selecting a highest scoring candidate table, the highest scoring candidate table having a greatest alignment among the candidate tables; and
displaying the highest scoring candidate table as a final table via a user interface (UI).
Dependent claims: 20, 21, 22, 23, 24
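Claim 19 scores candidate tables by alignment but does not give a formula. The Python sketch below assumes one plausible measure: each cell carries the left x-coordinate reported by OCR, and a candidate scores higher when the cells of each column share nearly the same x-coordinate. The score is the negated mean of the per-column standard deviations, so perfect alignment scores 0 and worse alignment scores lower. The cell layout and coordinates are invented for illustration.

```python
from statistics import pstdev


def alignment_score(candidate):
    """Score a candidate table; rows are lists of (text, x_left) cells.

    Rows are assumed to have equal length; zip() silently truncates otherwise.
    """
    if not candidate:
        return float("-inf")
    columns = list(zip(*candidate))
    # Spread of the left edges within each column; 0 means perfectly aligned.
    spreads = [pstdev([x for _, x in column]) for column in columns]
    return -sum(spreads) / len(spreads)


def select_best(candidates):
    """Return the candidate with the greatest alignment."""
    return max(candidates, key=alignment_score)


# Candidate A places cells so that each column's left edges nearly coincide.
candidate_a = [
    [("2020-01-05", 10), ("3", 120), ("$12.50", 200)],
    [("2020-01-06", 10), ("1", 121), ("$4.00", 203)],
]
# Candidate B puts a quantity under the date column, so x-positions disagree.
candidate_b = [
    [("2020-01-05", 10), ("$12.50", 200)],
    [("1", 121), ("$4.00", 203)],
]
best = select_best([candidate_a, candidate_b])
print(alignment_score(candidate_a), alignment_score(candidate_b))
```

A fuller measure might also weigh row baselines (y-coordinates) and right edges for right-aligned numeric columns; this sketch looks only at left edges.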
Specification