Method for extracting, interpreting and standardizing tabular data from unstructured documents
Abstract
A system, method, and computer program for automatically identifying, parsing, and interpreting tabular data from unstructured documents stored in various formats such as ASCII text, Unicode text, HTML, PDF text, and PDF image format is provided. A set of table identification, parsing/tokenizing, and interpreting/mapping rules are developed with grammar descriptors. These rules are then applied to a set of documents to identify a table, parse the content of the table, and interpret the parsed content, if required, thereby standardizing the tabular data.
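The identification step described in the abstract can be illustrated with a minimal sketch in which rules are plain data records rather than code. The rule fields (`name`, `start`, `end`), the regular expressions, and the function `identify_table` are hypothetical names chosen for illustration; the patent does not specify this representation.

```python
import re

# Hypothetical rule records. In the described system such rules would be
# stored as meta-data in a database; a list of dicts stands in for that here.
TABLE_RULES = [
    {
        "name": "balance_sheet",
        "start": r"(?i)^\s*consolidated balance sheets?\b",
        "end": r"(?i)^\s*total liabilities and .*equity\b",
    },
]

def identify_table(lines, rules):
    """Return (rule name, first line index, last line index) or None."""
    for rule in rules:
        start_re = re.compile(rule["start"])
        end_re = re.compile(rule["end"])
        # First line matching the start pattern, if any.
        start = next((i for i, ln in enumerate(lines) if start_re.search(ln)), None)
        if start is None:
            continue
        # First line after the start that matches the end pattern.
        end = next(
            (i for i in range(start + 1, len(lines)) if end_re.search(lines[i])),
            None,
        )
        if end is not None:
            return rule["name"], start, end
    return None

doc = [
    "Notes to shareholders",
    "Consolidated Balance Sheet",
    "Cash            1,200",
    "Total liabilities and shareholders' equity   9,800",
    "See accompanying notes.",
]
result = identify_table(doc, TABLE_RULES)
```

Adding a new table type in this sketch means adding a record to `TABLE_RULES`, which mirrors the stated goal of changing rules without changing program code.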
14 Claims
1. A computer program product for use with a computer, the computer program product comprising a computer usable medium having a computer readable program code embodied therein for automating a process of structuring the tabular data from unstructured documents, the unstructured documents comprising tabular and non-tabular data, the structuring of the tabular data from the unstructured documents being performed by accessing and processing a plurality of rules without programming of the plurality of rules, the rules being modeled as meta-data and stored as the meta-data in a database to provide flexibility of adding and modifying the rules, the computer program product comprising:
a. program instruction means for identifying a table of interest from the tabular and non-tabular data in an unstructured document by processing a set of table identification rules without programming the set of table identification rules, the set of table identification rules being based on semantic descriptions, format, structure, grammar and content of the tabular and non-tabular data in the unstructured document;
b. program instruction means for confirming the identified table by processing a set of table confirmation rules without programming the set of table confirmation rules, the set of table confirmation rules being used to verify that the identified table is a table of interest;
c. program instruction means for tokenizing the content of the identified table into tokens by processing a set of parsing rules without programming the set of parsing rules;
d. program instruction means for interpreting the tokenized content of the identified table, the tokenized content being interpreted with reference to a standardized template including a standardized set of data fields, the program instruction means for interpreting the tokenized content comprises program instruction means for;
i. arranging the tokenized content of the identified table into one or more sections by processing a set of section identification rules without programming the set of section identification rules; and
ii. mapping the tokenized content of each of the one or more sections onto the standardized set of data fields, by processing a set of interpretation rules without programming the set of interpretation rules;
e. program instruction means for standardizing the interpreted content of the identified table by processing a set of standardization rules without programming the set of standardization rules, wherein the program instruction means for standardizing of the interpreted content comprises program instruction means for;
i. aggregating the mapped tokenized content of the interpreted content of the identified table; and
ii. standardizing sign representation for numeric data of the interpreted content by processing a set of sign standardization rules without programming the set of sign standardization rules, the sign representation being standardized when different instances of the same token of the interpreted content of the identified table being represented with opposite signs;
f. program instruction means for providing hyperlinks between the unstructured document and one or more steps of the process of structuring the tabular data, the one or more steps of the process of structuring of the tabular data comprise identifying the table, confirming the identified table, tokenizing the content of the identified table, interpreting the tokenized content, and standardizing the interpreted content, the hyperlinks being provided to enable a user to navigate back to the unstructured document from the one or more steps; and
g. program instruction means for outputting the standardized table through a user interface.
Dependent claims: 2, 3, 4, 5, 6
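The sign standardization recited in element e.ii can be illustrated with a short sketch. It assumes, purely for illustration, that each sign rule maps a standardized field name to a canonical sign (+1 or -1); the names `SIGN_RULES` and `standardize_signs` and the field names are hypothetical and do not come from the patent.

```python
# Hypothetical sign rules keyed by standardized field name.
# -1 means the field is always reported as a negative number.
SIGN_RULES = {"treasury_stock": -1, "accumulated_depreciation": -1}

def standardize_signs(rows, sign_rules):
    """Force each value to the canonical sign of its field, if a rule exists."""
    out = []
    for field, value in rows:
        sign = sign_rules.get(field)
        # Fields without a rule are passed through unchanged.
        out.append((field, value if sign is None else sign * abs(value)))
    return out

# The same token appears with opposite signs in two source documents.
rows = [
    ("treasury_stock", 500.0),
    ("treasury_stock", -320.0),
    ("cash", 1200.0),
]
standardized = standardize_signs(rows, SIGN_RULES)
```

After the call, both `treasury_stock` values are negative, so instances of the same token drawn from different documents can be compared or aggregated consistently.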
7. A computer program product for use with a computer, the computer program product comprising a computer usable medium having a computer readable program code embodied therein for automating a process of structuring tabular data from unstructured documents, the unstructured documents comprising tabular and non-tabular data, the structuring of tabular data from the unstructured documents being performed by accessing and processing a plurality of rules without programming of the plurality of rules, the rules being modeled as meta-data and stored as the meta-data in a database to provide flexibility of adding and modifying the rules, the computer program product comprising:
a. program instruction means for identifying a table of interest from the tabular and non-tabular data in an unstructured document by processing a set of table identification rules without programming, the set of table identification rules being based on semantic descriptions, format, structure, grammar and content of the tabular and non-tabular data in the unstructured document, the program instruction means for identifying the table comprising;
i. program instruction means for merging multiple valid instances of the table into a single table;
b. program instruction means for confirming the identified table by processing a set of table confirmation rules without programming the set of table confirmation rules, the set of table confirmation rules being used to verify that the identified table is a table of interest, the set of table confirmation rules being applied to eliminate erroneous identification of the identified table, the set of table confirmation rules detect and validate the content and end of the identified table, and the neighborhood around the beginning of the identified table;
c. program instruction means for tokenizing the content of the identified table by processing a set of parsing rules without programming the set of parsing rules, the identified table being tokenized into tokens on a line by line basis, the program instruction means for tokenizing of the content of the identified table comprises program instruction means for;
i. filtering the content of the identified table to remove invalid data by processing a set of invalid data rules without programming the set of invalid data rules;
ii. parsing the filtered content line by line by processing the set of parsing rules; and
iii. validating the parsed content by processing a set of validation rules without programming the set of validation rules, the set of validation rules being used to verify accuracy of the parsed content, the validating of the parsed content being performed to eliminate erroneous parsing of the content of the identified table;
d. program instruction means for interpreting the tokenized content of the identified table, the tokenized content being interpreted with reference to a standardized template including a standardized set of data fields, the program instruction means for interpreting of the tokenized content comprises program instruction means for;
i. arranging the tokenized content of the identified table into one or more sections by processing a set of section identification rules without programming the set of section identification rules; and
ii. mapping the tokenized content of each of the one or more sections onto the standardized set of data fields, by processing a set of interpretation rules without programming the set of interpretation rules;
e. program instruction means for standardizing the interpreted content by processing a set of standardization rules without programming the set of standardization rules, wherein the program instruction means for standardizing the interpreted content comprises program instruction means for;
i. aggregating the mapped tokenized content of the interpreted content of the identified table; and
ii. standardizing the sign representation for numeric data of the interpreted content by processing a set of sign standardization rules without programming the set of sign standardization rules, the sign representation being standardized when different instances of same token of the interpreted content of the identified table are represented with opposite signs;
f. program instruction means for providing hyperlinks between the unstructured document and one or more steps of the process of structuring the tabular data, the one or more steps of the process of structuring of the tabular data comprise identifying the table, confirming the identified table, tokenizing the content of the identified table, interpreting the tokenized content, and standardizing the interpreted content, the hyperlinks being provided to enable a user to navigate back to the unstructured document from the one or more steps; and
g. program instruction means for outputting the standardized table through a user interface.
Dependent claims: 8, 9
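The three tokenizing sub-steps of claim 7 (filtering invalid data, parsing line by line, validating the parsed content) can be sketched as follows. The regular expressions and the names `INVALID_LINE_RULES`, `PARSE_RULE`, `NUMBER_RULE`, and `tokenize_table` are assumptions made for this example, not rules taken from the patent.

```python
import re

# Hypothetical rule sets, expressed as data (regular expressions).
INVALID_LINE_RULES = [
    r"^\s*$",              # blank line
    r"^\s*[-=_]{3,}\s*$",  # ruler made of dashes, equals signs or underscores
    r"(?i)^\s*page \d+",   # page footer
]
# Parsing rule: a label, at least two spaces, then one or more value columns.
PARSE_RULE = r"^(?P<label>.*?\S)\s{2,}(?P<values>.+)$"
# Validation rule: every value must look like a number, e.g. 1,200 or (45).
NUMBER_RULE = r"^\(?-?\$?[\d,]+(?:\.\d+)?\)?$"

def tokenize_table(lines):
    """Filter, parse and validate table lines; return (label, values) tokens."""
    tokens = []
    for line in lines:
        # Step i: drop lines matched by any invalid-data rule.
        if any(re.search(p, line) for p in INVALID_LINE_RULES):
            continue
        # Step ii: parse the line into a label and value columns.
        m = re.match(PARSE_RULE, line)
        if m is None:
            continue
        values = m.group("values").split()
        # Step iii: keep the line only if all parsed values validate.
        if all(re.match(NUMBER_RULE, v) for v in values):
            tokens.append((m.group("label").strip(), values))
    return tokens

table = [
    "Cash and equivalents    1,200    1,050",
    "------------",
    "",
    "Net loss    (45)",
    "See note 4    for details",
    "Page 3",
]
tokens = tokenize_table(table)
```

In this sketch the line "See note 4    for details" parses successfully but fails validation, which is the kind of erroneous parse the validation rules are meant to eliminate.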
10. A computer program product for use with a computer, the computer program product comprising a computer usable medium having a computer readable program code embodied therein for automating a process of structuring tabular data from unstructured documents, the unstructured documents comprising tabular and non-tabular data, the structuring of tabular data from the unstructured documents being performed by accessing and processing a plurality of rules without programming of the plurality of rules, the rules being modeled as meta-data and stored as the meta-data in a database to provide flexibility of adding and modifying the rules, the computer program product comprising:
a. program instruction means for identifying a table of interest from the tabular and non-tabular data in an unstructured document by processing a set of table identification rules without programming, the set of table identification rules being based on different semantic descriptions, format, structure, grammar and content of the tabular and non-tabular data in the unstructured document, the program instruction means for identifying the table comprises program instruction means for;
i. merging multiple valid instances of the table into a single table;
b. program instruction means for confirming the identified table by processing a set of table confirmation rules without programming the set of table confirmation rules, the set of table confirmation rules being used to verify that the identified table is a table of interest, the set of table confirmation rules being applied to eliminate erroneous identification of the identified table, the set of table confirmation rules detect and validate the content and end of the identified table, and neighborhood around beginning of the identified table;
c. program instruction means for tokenizing the content of the identified table by processing a set of parsing rules without programming the set of parsing rules, the identified table being tokenized into tokens on a line by line basis, the program instruction means for tokenizing the content of the identified table comprising;
i. program instruction means for filtering the content of the identified table to remove invalid data by processing a set of invalid data rules without programming the set of invalid data rules;
ii. program instruction means for parsing the filtered content line by line by processing the set of parsing rules without programming the set of parsing rules; and
iii. program instruction means for validating the parsed content by processing a set of validation rules without programming the set of validation rules, the set of validation rules being used to verify the accuracy of the parsed content, the validating of the parsed content being performed to eliminate erroneous parsing of the content of the identified table;
d. program instruction means for interpreting the tokenized content of the identified table, the tokenized content being interpreted with reference to a standardized template including a standardized set of data fields, the program instruction means for interpreting the tokenized content comprising;
i. program instruction means for arranging the tokenized content of the identified table into one or more sections by processing a set of section identification rules without programming the set of section identification rules; and
ii. program instruction means for mapping the tokenized content of each of the one or more sections onto the standardized set of data fields, by processing a set of interpretation rules without programming the set of interpretation rules;
e. program instruction means for standardizing the interpreted content by processing a set of standardization rules without programming the set of standardization rules, wherein the program instruction means for standardizing the interpreted content comprising;
i. program instruction means for aggregating the mapped tokenized content of the interpreted content of the identified table; and
ii. program instruction means for standardizing the sign representation for numeric data of the interpreted content by processing a set of sign standardization rules without programming the set of sign standardization rules, the sign representation being standardized when different instances of same token of the interpreted content of the identified table being represented with opposite signs;
f. program instruction means for providing hyperlinks between the unstructured document and one or more steps of the process of structuring the tabular data, the one or more steps of the process of structuring of the tabular data include identifying the table, confirming the identified table, tokenizing the content of the identified table, interpreting the tokenized content, and standardizing the interpreted content, the hyperlinks being provided to enable a user to navigate back to the unstructured document from the one or more steps;
g. program instruction means for storing the hyperlinks in a relational database management system (RDBMS), the RDBMS being used to store the structured tabular data;
h. program instruction means for creating a new version of the unstructured document, the new version of the unstructured document comprising embedded hyperlinks for each element of data being extracted and structured in the one or more steps; and
i. program instruction means for outputting the standardized table through a user interface.
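Elements f and g of claim 10 describe hyperlinks from each processing step back to the source document, stored in an RDBMS. A minimal sketch of that idea using Python's built-in `sqlite3` module is given below. The table name `provenance`, its columns, the `#L<line>` anchor format, and the example URI are all hypothetical choices for illustration; the claim does not prescribe a schema or link format.

```python
import sqlite3

# In-memory database standing in for the RDBMS named in the claim.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE provenance (
           step    TEXT,     -- e.g. 'tokenize', 'interpret', 'standardize'
           field   TEXT,     -- standardized data field
           value   REAL,     -- extracted value at that step
           doc_uri TEXT,     -- location of the unstructured document
           line_no INTEGER   -- line the value was extracted from
       )"""
)

def record_link(conn, step, field, value, doc_uri, line_no):
    """Store one link between a processing step and a source line."""
    conn.execute(
        "INSERT INTO provenance VALUES (?, ?, ?, ?, ?)",
        (step, field, value, doc_uri, line_no),
    )

def links_for(conn, field):
    """Return (step, hyperlink) pairs for a field, in insertion order."""
    rows = conn.execute(
        "SELECT step, doc_uri, line_no FROM provenance "
        "WHERE field = ? ORDER BY rowid",
        (field,),
    )
    return [(step, f"{uri}#L{line}") for step, uri, line in rows]

record_link(conn, "tokenize", "cash", 1200.0, "filings/10k.txt", 42)
record_link(conn, "standardize", "cash", 1200.0, "filings/10k.txt", 42)
cash_links = links_for(conn, "cash")
```

A user interface could render each returned pair as a clickable link, letting a reviewer jump from any step's output back to the originating line of the document.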
11. A system for automating a process of structuring the tabular data from unstructured documents, the unstructured documents comprising tabular and non-tabular data, the structuring of tabular data from the unstructured documents being performed by accessing and processing a plurality of rules without programming of the plurality of rules, the rules being modeled as meta-data and stored as the meta-data in a database to provide flexibility of adding and modifying the rules, the system comprising:
a. a data layer, the data layer being used to identify a table of interest by processing a set of table identification rules without programming the set of table identification rules, tokenize the content by processing a set of parsing rules without programming the set of parsing rules, map the tokenized content onto a standardized set of data fields by processing a set of mapping rules without programming the set of mapping rules, the data layer comprising a database, the database being used to store the set of table identification, parsing and mapping rules; and
b. a service layer, the service layer comprising a web server and an application server, the web server being used to access the unstructured documents, the application server comprises an engine to standardize a sign representation for numeric data of content of the tabular data by processing a set of sign standardization rules, the sign representation being standardized when different instances of the same token of the content of the identified table being represented with opposite signs;
c. a presentation layer, the presentation layer comprising;
i. a user interface, the user interface being used to provide access to extracted, interpreted, and standardized data to a user; and
ii. a relational database management system (RDBMS), the RDBMS being used to store the structured tabular data and hyperlinks between the unstructured documents and one or more steps of structuring the tabular data, the one or more steps include the identification of a table of interest, tokenizing the content and mapping the content, the hyperlinks being provided to enable a user to navigate back to the unstructured documents from the one or more steps.
Dependent claims: 12
13. A method for automating a process of structuring the tabular data from unstructured documents, the unstructured documents comprising tabular and non-tabular data, the structuring of tabular data from the unstructured documents being performed by accessing and processing a plurality of rules without programming of the plurality of rules, the rules being modeled as meta-data and stored as the meta-data in a database to provide flexibility of adding and modifying the rules, the method comprising the steps of:
a. identifying a table of interest in an unstructured document by processing a set of table identification rules without programming the set of table identification rules, the set of table identification rules being based on semantic descriptions, format, structure, grammar and content of the tabular and non-tabular data in the unstructured document;
b. confirming the identified table by processing a set of table confirmation rules without programming the set of table confirmation rules, the set of table confirmation rules being used to verify that the identified table is a table of interest;
c. tokenizing the content of the identified table into tokens by processing a set of parsing rules without programming the set of parsing rules;
d. interpreting the tokenized content of the identified table, the tokenized content being interpreted with reference to a standardized template including a standardized set of data fields, the interpreting the tokenized content comprises the steps of;
i. arranging the tokenized content of the identified table into one or more sections by processing a set of section identification rules without programming the set of section identification rules; and
ii. mapping the tokenized content of each of the one or more sections onto the standardized set of data fields, by processing a set of interpretation rules without programming the set of interpretation rules;
e. standardizing the interpreted content of the table by processing a set of standardization rules without programming the set of standardization rules, wherein the standardizing of the interpreted content comprises the steps of;
i. aggregating the mapped tokenized content of the interpreted content of the identified table; and
ii. standardizing the sign representation for numeric data of the interpreted content by processing a set of sign standardization rules without programming the set of sign standardization rules, the sign representation being standardized when different instances of same token of the interpreted content of the identified table being represented with opposite signs;
f. providing hyperlinks between the unstructured document and one or more steps of the process of structuring the tabular data, the one or more steps of the process of structuring of the tabular data comprise identifying the table, confirming the identified table, tokenizing the content of the identified table, interpreting the tokenized content, and standardizing the interpreted content, the hyperlinks being provided to enable a user to navigate back to the unstructured document from the one or more steps; and
g. outputting the standardized table through a user interface.
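Steps d and e.i of the method (arranging tokens into sections, mapping them onto standardized data fields, and aggregating the mapped content) can be sketched as below. The rule dictionaries `SECTION_RULES` and `INTERPRETATION_RULES`, the field names, and the convention that a section heading is a token with no value are assumptions for this example only.

```python
# Hypothetical section identification rules: heading text -> section name.
SECTION_RULES = {"assets": "Assets", "liabilities": "Liabilities"}

# Hypothetical interpretation rules: (section, source label) -> standard field.
INTERPRETATION_RULES = {
    ("Assets", "cash"): "cash_and_equivalents",
    ("Assets", "cash equivalents"): "cash_and_equivalents",
    ("Liabilities", "accounts payable"): "accounts_payable",
}

def interpret(tokens):
    """Map (label, value) tokens to standardized fields and sum per field."""
    section = None
    totals = {}
    for label, value in tokens:
        key = label.strip().lower()
        if value is None:
            # A token without a value is treated as a possible section heading.
            section = SECTION_RULES.get(key, section)
            continue
        field = INTERPRETATION_RULES.get((section, key))
        if field is not None:
            # Aggregation: several source lines may map to one standard field.
            totals[field] = totals.get(field, 0.0) + value
    return totals

tokens = [
    ("ASSETS", None),
    ("Cash", 1000.0),
    ("Cash equivalents", 200.0),
    ("LIABILITIES", None),
    ("Accounts payable", 450.0),
    ("Unmapped item", 5.0),
]
standard = interpret(tokens)
```

Keying the interpretation rules on the section as well as the label lets the same source label map to different standardized fields depending on where in the table it appears.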
14. A method for automating a process of structuring tabular data from unstructured documents, the unstructured documents comprising tabular and non-tabular data, the structuring of tabular data from the unstructured documents being performed by accessing and processing a plurality of rules without programming of the plurality of rules, the rules being modeled as meta-data and stored as the meta-data in a database to provide flexibility of adding and modifying the rules, the method comprising the steps of:
a. identifying a table of interest in an unstructured document by processing a set of table identification rules without programming the set of table identification rules, the set of table identification rules being based on semantic descriptions, format, structure, grammar and content of the tabular and non-tabular data in the unstructured document, the identifying of the table comprising;
i. merging multiple valid instances of the table into a single table;
b. confirming the identified table by processing a set of table confirmation rules without programming the set of table confirmation rules, the set of table confirmation rules being used to verify that the identified table is a table of interest, the set of table confirmation rules being applied to eliminate erroneous identification of the identified table, the set of table confirmation rules detect and validate the content and end of the identified table, and neighborhood around beginning of the identified table;
c. tokenizing the content of the identified table by processing a set of parsing rules without programming the set of parsing rules, the identified table being tokenized into tokens on a line by line basis, the tokenizing of the content of the identified table comprising;
i. filtering the content of the identified table to remove invalid data by processing a set of invalid data rules without programming the set of invalid data rules;
ii. parsing the filtered content line by line by processing the set of parsing rules; and
iii. validating the parsed content by processing a set of validation rules without programming the set of validation rules, the set of validation rules being used to verify the accuracy of the parsed content, the validating of the parsed content being performed to eliminate erroneous parsing of the content of the identified table;
d. interpreting the tokenized content of the identified table, the tokenized content being interpreted with reference to a standardized template including a standardized set of data fields, the interpreting of the tokenized content comprising;
i. arranging the tokenized content of the identified table into one or more sections by processing a set of section identification rules; and
ii. mapping the tokenized content of each of the one or more sections onto the standardized set of data fields, by processing a set of interpretation rules without programming the set of interpretation rules;
e. standardizing the interpreted content by processing a set of standardization rules without programming the set of standardization rules, wherein the standardizing the interpreted content comprising;
i. aggregating the mapped tokenized content of the interpreted content of the identified table; and
ii. standardizing the sign representation for numeric data of the interpreted content by processing a set of sign standardization rules, the sign representation being standardized when different instances of same token of interpreted content of the identified table being represented with opposite signs; and
f. outputting the standardized table through a user interface.
Specification