Method for extracting, interpreting and standardizing tabular data from unstructured documents
Abstract
A system, method, and computer program are provided for automatically identifying, parsing, and interpreting tabular data from unstructured documents stored in formats such as ASCII text, Unicode text, HTML, PDF text, and PDF image. A set of table identification, parsing/tokenizing, and interpreting/mapping rules is developed with grammar descriptors. These rules are then applied to a set of documents to identify a table, parse its content, and, if required, interpret the parsed content, thereby standardizing the tabular data.
17 Claims
1. A method for processing unstructured documents containing tabular data, the method comprising the steps of:
a. identifying a table in the unstructured document using a set of identification rules;
b. tokenizing the content of the identified table using a set of parsing rules;
c. interpreting the tokenized content of the table using a set of mapping rules; and
d. standardizing the content of the table using a set of standardization rules.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A method of processing unstructured documents containing tabular data, the method comprising the steps of:
a. identifying a table in the unstructured document using a set of identification rules;
b. tokenizing the content of the identified table using a set of parsing rules;
c. interpreting the tokenized content of the table using a set of mapping rules; and
d. standardizing the content of the table using a set of standardization rules;
e. identifying the links to the content of the table in the unstructured document that is identified, tokenized, interpreted and standardized;
f. storing the links to the content; and
g. presenting the links while presenting the standardized content of the table to enable a user to navigate back to the document.
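Steps e through g of claim 11 amount to keeping provenance for each extracted cell. A minimal sketch of one way this could look is given below, assuming a plain-text document and a link made of a document identifier, a line number, and a character span. The `SourceLink` and `LinkedCell` types, the function names, and the cell-splitting regular expression are hypothetical and are not defined by the claim.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SourceLink:
    """Hypothetical link back to a cell's position in the source document."""
    document_id: str
    line: int   # 1-based line number
    start: int  # character offset of the cell within the line
    end: int    # exclusive end offset


@dataclass(frozen=True)
class LinkedCell:
    value: str
    link: SourceLink


# Assumed parsing rule: words joined by single spaces form one cell.
CELL = re.compile(r"\S+(?: \S+)*")


def extract_linked_cells(document_id: str, text: str) -> List[LinkedCell]:
    """Identify and store a link for every cell found in the text."""
    cells: List[LinkedCell] = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in CELL.finditer(line):
            link = SourceLink(document_id, line_no, match.start(), match.end())
            cells.append(LinkedCell(match.group(), link))
    return cells


def resolve(link: SourceLink, text: str) -> str:
    """Navigate back: return the source text that the link points to."""
    return text.splitlines()[link.line - 1][link.start:link.end]
```

A presentation layer could show each `LinkedCell.value` next to its `link`, and call `resolve` when the user asks to see the original passage.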
12. A system for processing tabular data from unstructured documents, the system comprising:
a. an engine, the engine executing rules for extracting and standardizing tabular data from the unstructured documents;
b. a plurality of clients, the clients interacting with the engine;
c. a rules development user interface, the rules development user interface enabling the application designer to model the structuring rules in a visual manner, the rules development user interface being one of the plurality of clients; and
d. a database, the database storing meta data related to the rules modeled using the rules development user interface and the data extracted using the engine.
Dependent claims: 13, 14
15. A computer program product for use with a computer, the computer program product comprising a computer usable medium having a computer readable program code embodied therein for processing documents containing tabular data, the computer program product comprising:
a. Program instruction means for identifying a table in the document using a set of identification rules;
b. Program instruction means for tokenizing the content of the identified table using a set of parsing rules;
c. Program instruction means for interpreting the tokenized content of the table using a set of mapping rules; and
d. Program instruction means for standardizing the content of the table using a set of standardization rules.
Dependent claims: 16, 17
Specification