
Method for extracting, interpreting and standardizing tabular data from unstructured documents

  • US 7,590,647 B2
  • Filed: 05/27/2005
  • Issued: 09/15/2009
  • Est. Priority Date: 05/27/2005
  • Status: Active Grant
First Claim

1. A computer program product for use with a computer, the computer program product comprising a computer usable medium having a computer readable program code embodied therein for automating a process of structuring the tabular data from unstructured documents, the unstructured documents comprising tabular and non-tabular data, the structuring of the tabular data from the unstructured documents being performed by accessing and processing a plurality of rules without programming of the plurality of rules, the rules being modeled as meta-data and stored as the meta-data in a database to provide flexibility of adding and modifying the rules, the computer program product comprising:

    a. program instruction means for identifying a table of interest from the tabular and non-tabular data in an unstructured document by processing a set of table identification rules without programming the set of table identification rules, the set of table identification rules being based on semantic descriptions, format, structure, grammar and content of the tabular and non-tabular data in the unstructured document;

    b. program instruction means for confirming the identified table by processing a set of table confirmation rules without programming the set of table confirmation rules, the set of table confirmation rules being used to verify that the identified table is a table of interest;

    c. program instruction means for tokenizing the content of the identified table into tokens by processing a set of parsing rules without programming the set of parsing rules;

    d. program instruction means for interpreting the tokenized content of the identified table, the tokenized content being interpreted with reference to a standardized template including a standardized set of data fields, the program instruction means for interpreting the tokenized content comprises program instruction means for;

        i. arranging the tokenized content of the identified table into one or more sections by processing a set of section identification rules without programming the set of section identification rules; and

        ii. mapping the tokenized content of each of the one or more sections onto the standardized set of data fields, by processing a set of interpretation rules without programming the set of interpretation rules;

    e. program instruction means for standardizing the interpreted content of the identified table by processing a set of standardization rules without programming the set of standardization rules, wherein the program instruction means for standardizing of the interpreted content comprises program instruction means for;

        i. aggregating the mapped tokenized content of the interpreted content of the identified table; and

        ii. standardizing sign representation for numeric data of the interpreted content by processing a set of sign standardization rules without programming the set of sign standardization rules, the sign representation being standardized when different instances of the same token of the interpreted content of the identified table being represented with opposite signs;

    f. program instruction means for providing hyperlinks between the unstructured document and one or more steps of the process of structuring the tabular data, the one or more steps of the process of structuring of the tabular data comprise identifying the table, confirming the identified table, tokenizing the content of the identified table, interpreting the tokenized content, and standardizing the interpreted content, the hyperlinks being provided to enable a user to navigate back to the unstructured document from the one or more steps; and

    g. program instruction means for outputting the standardized table through a user interface.
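
The central idea of the claim, rules held as meta-data and processed by generic code rather than programmed individually, can be illustrated with a minimal sketch. It is not the patented implementation. The rule layout, the field names (`Revenue`, `CostOfRevenue`, `NetIncome`) and every function name below are hypothetical, and a plain dictionary stands in for the rules database. The sketch covers only identification (a), tokenizing (c), field mapping (d.ii), aggregation (e.i) and sign standardization (e.ii); it omits confirmation, sectioning, hyperlinks and the user interface, and it applies the sign rule before aggregating so that opposite-signed instances do not cancel.

```python
import re

# Hypothetical rule records standing in for rows of a rules database.
# Changing behavior means editing this data, not the functions below.
RULES = {
    "table_identification": [{"keywords": ["revenue", "net income"], "min_hits": 2}],
    "parsing": [{"pattern": r"^(?P<label>.*?)\s+(?P<value>\(?-?[\d,]+(?:\.\d+)?\)?)$"}],
    "interpretation": [
        {"match": "revenue", "field": "Revenue"},
        {"match": "sales", "field": "Revenue"},
        {"match": "cost of", "field": "CostOfRevenue"},
        {"match": "net income", "field": "NetIncome"},
    ],
    "sign_standardization": [{"field": "CostOfRevenue", "sign": -1}],
}


def is_table_of_interest(lines, rules):
    """Step a: a keyword-count rule decides whether the text looks like a target table."""
    text = " ".join(lines).lower()
    return any(
        sum(kw in text for kw in rule["keywords"]) >= rule["min_hits"]
        for rule in rules["table_identification"]
    )


def parse_number(text):
    """Read '1,000', '-400' or '(400)'; parentheses are treated as a negative sign."""
    negative = text.startswith("(") or text.startswith("-")
    value = float(text.strip("()-").replace(",", ""))
    return -value if negative else value


def tokenize(lines, rules):
    """Step c: split each row into (label, number) using the stored regex rules."""
    tokens = []
    for line in lines:
        for rule in rules["parsing"]:
            m = re.match(rule["pattern"], line.strip())
            if m:
                tokens.append((m.group("label"), parse_number(m.group("value"))))
                break
    return tokens


def map_to_field(label, rules):
    """Step d.ii: the first interpretation rule whose text occurs in the label wins."""
    lowered = label.lower()
    for rule in rules["interpretation"]:
        if rule["match"] in lowered:
            return rule["field"]
    return None


def structure(lines, rules=RULES):
    """Run the sketched steps and return {standard field: aggregated value}, or None."""
    if not is_table_of_interest(lines, rules):
        return None
    signs = {r["field"]: r["sign"] for r in rules["sign_standardization"]}
    result = {}
    for label, value in tokenize(lines, rules):
        field = map_to_field(label, rules)
        if field is None:
            continue
        if field in signs:
            # Step e.ii: force one sign convention per field.
            value = signs[field] * abs(value)
        # Step e.i: aggregate tokens mapped to the same field.
        result[field] = result.get(field, 0.0) + value
    return result


if __name__ == "__main__":
    sample = [
        "Income Statement",
        "Revenue 1,000",
        "Cost of goods sold (400)",
        "Cost of services 100",
        "Net income 500",
    ]
    print(structure(sample))
```

In the sample, the two cost rows carry opposite signs in the source text; the sign rule brings both to the same convention before they are summed. Adding a new label synonym or a new sign convention would mean appending a record to `RULES`, which mirrors the claim's "without programming" language only loosely.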
