Method for extracting, interpreting and standardizing tabular data from unstructured documents

US 20060288268A1
Filed: 05/27/2005
Published: 12/21/2006
Est. Priority Date: 05/27/2005
Status: Active Grant

Abstract

A system, method, and computer program are provided for automatically identifying, parsing, and interpreting tabular data in unstructured documents stored in formats such as ASCII text, Unicode text, HTML, PDF text, and PDF image. Sets of table identification, parsing/tokenizing, and interpreting/mapping rules are developed with grammar descriptors. These rules are then applied to a set of documents to identify a table, parse its content, and interpret the parsed content where required, thereby standardizing the tabular data.
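The four stages named in the abstract can be sketched as a small pipeline over plain-text lines. This is only an illustration under stated assumptions: the regular expression, the `MAPPING_RULES` dictionary, the template item names, and all function names are hypothetical and are not taken from the patent, which does not publish concrete rules here.

```python
import re

# Hypothetical rules (assumptions for illustration, not the patent's rules).
# A table row is a label, at least two spaces, then a number at end of line.
NUMERIC_ROW = re.compile(
    r"^(?P<label>.*?\S)\s{2,}(?P<value>\(?-?[\d,]+(?:\.\d+)?\)?)\s*$"
)
MAPPING_RULES = {
    "net sales": "Revenue",
    "revenues": "Revenue",
    "cost of goods sold": "CostOfRevenue",
}


def identify_table(lines, min_rows=2):
    """Identification: first run of at least min_rows consecutive row-like lines."""
    run = []
    for line in lines:
        if NUMERIC_ROW.match(line):
            run.append(line)
        elif len(run) >= min_rows:
            break
        else:
            run = []
    return run if len(run) >= min_rows else []


def tokenize(table_lines):
    """Parsing: split each row into a (label, raw value) pair."""
    rows = []
    for line in table_lines:
        match = NUMERIC_ROW.match(line)
        rows.append((match.group("label").strip(), match.group("value")))
    return rows


def interpret(rows):
    """Mapping: translate each label to a template item, dropping unmapped rows."""
    items = []
    for label, raw in rows:
        item = MAPPING_RULES.get(label.lower())
        if item is not None:
            items.append((item, raw))
    return items


def standardize(items):
    """Standardization: parse numbers, treating parentheses as a negative sign."""
    result = {}
    for item, raw in items:
        negative = raw.startswith("(") and raw.endswith(")")
        value = float(raw.strip("()").replace(",", ""))
        result[item] = -value if negative else value
    return result


def process(text):
    """Run all four stages on one document."""
    return standardize(interpret(tokenize(identify_table(text.splitlines()))))
```

Calling `process` on a short excerpt containing the rows `Net sales            1,200` and `Cost of goods sold   (800)` yields one standardized value per mapped template item. Keeping each stage behind its own rule set mirrors the separation the abstract describes, so a rule can change without touching the other stages.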

17 Claims

1. A method for processing unstructured documents containing tabular data, the method comprising the steps of:
- a. identifying a table in the unstructured document using a set of identification rules;
  
  b. tokenizing the content of the identified table using a set of parsing rules;
  
  c. interpreting the tokenized content of the table using a set of mapping rules; and
  
  d. standardizing the content of the table using a set of standardization rules.
  - 2. The method of claim 1 further comprising the steps of:
    - a. identifying the links to the content of the table in the unstructured document that is identified and standardized;
      
      b. storing the links; and
      
      c. presenting the links while presenting the standardized content of the table to enable a user to navigate back to the unstructured document.
  - 3. The method according to claim 1 wherein the step of identifying the table comprises the step of confirming the identified table using a set of table confirmation rules.
  - 4. The method according to claim 1, wherein the step of identifying the table includes the step of merging multiple occurrences of the table in the document.
  - 5. The method according to claim 1, wherein the step of identifying the table includes the step of merging related tables.
  - 6. The method according to claim 1, wherein the step of tokenizing the content of the identified table comprises the steps of:
    - a. filtering the content of the identified table to remove invalid data using invalid data rules;
      
      b. parsing the filtered content line by line using the set of parsing rules; and
      
      c. validating the parsed content using a set of validation rules.
  - 7. The method according to claim 6 wherein the step of validating the parsed content comprises the step of discovering the hierarchical mathematical structure underlying the table.
  - 8. The method according to claim 1 wherein the step of interpreting the tokenized content of the table comprises the steps of:
    - a. identifying sections in the parsed content of the table using a set of section identification rules; and
      
      b. interpreting the parsed content by using a set of mapping rules to identify the corresponding item in a standardized template.
  - 9. The method according to claim 1, wherein the step of standardizing the interpreted content comprises the steps of:
    - a. aggregating the mappings including intermediate calculations; and
      
      b. normalizing the signs of numeric values by comparing the implicit signs for the standardized item in the normalization process with the sign associated with the numeric value and the implicit sign used in the document.
  - 10. The method of claim 1 wherein the rules required for identifying, extracting, interpreting and standardizing tabular data are stored as meta-data.
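Two of the dependent claims above lend themselves to a short worked sketch: discovering the arithmetic structure of a table during validation (claim 7) and normalizing signs (claim 9, step b). The code below is one possible reading, not the patented implementation. It assumes that a subtotal row equals the sum of a contiguous run of rows directly above it, and that both the document's convention and the template's convention for an item can be expressed as +1 or -1. The function names and the tolerance parameter are hypothetical.

```python
def discover_subtotals(values, tol=0.01):
    """Map each subtotal row index to the indices of the rows it sums.

    Assumption: a subtotal equals the sum of at least two contiguous rows
    immediately above it. The first (longest) matching run is kept.
    """
    structure = {}
    for i in range(len(values)):
        # j is the first component row; i - 2 is the last start that still
        # leaves two components (rows j .. i-1).
        for j in range(0, i - 1):
            if abs(sum(values[j:i]) - values[i]) <= tol:
                structure[i] = list(range(j, i))
                break
    return structure


def normalize_sign(value, document_sign, template_sign):
    """Convert a reported number to the template's sign convention.

    value          the number as printed, with its explicit sign
    document_sign  +1 or -1: sign the document implies for this line item
    template_sign  +1 or -1: sign the standardized item implies
    """
    if document_sign not in (1, -1) or template_sign not in (1, -1):
        raise ValueError("signs must be +1 or -1")
    # value * document_sign is the economic value; multiplying by
    # template_sign re-expresses it in the template's convention.
    return value * document_sign * template_sign
```

For the column `[100, 50, 150, 30, 180]`, the third row is found to be the sum of the first two and the last row the sum of the third and fourth, which is the kind of hierarchy a validator can check parsed numbers against. For signs, an expense printed as `800` in a document that implicitly subtracts it becomes `-800` in a template that stores signed values, and stays `800` in a template that also treats the item as implicitly negative.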

11. A method of processing unstructured documents containing tabular data, the method comprising the steps of:
- a. identifying a table in the unstructured document using a set of identification rules;
  
  b. tokenizing the content of the identified table using a set of parsing rules;
  
  c. interpreting the tokenized content of the table using a set of mapping rules; and
  
  d. standardizing the content of the table using a set of standardization rules;
  
  e. identifying the links to the content of the table in the unstructured document that is identified, tokenized, interpreted and standardized;
  
  f. storing the links to the content; and
  
  g. presenting the links while presenting the standardized content of the table to enable a user to navigate back to the document.
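Steps e through g of claim 11 amount to keeping provenance for every extracted value. A minimal sketch follows, assuming a link is simply a document name plus a line number and character offsets. The `SourceLink` class, the `VALUE` pattern, and the function names are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern: a number, optionally in parentheses, at end of line.
VALUE = re.compile(r"\(?[\d,]*\d(?:\.\d+)?\)?(?=\s*$)")


@dataclass(frozen=True)
class SourceLink:
    document: str  # name or path of the source document
    line: int      # 1-based line number
    start: int     # 0-based column where the value begins
    end: int       # column just past the value


def extract_with_links(document_name, lines):
    """Return (standardized value, link) pairs for lines ending in a number."""
    records = []
    for number, line in enumerate(lines, start=1):
        match = VALUE.search(line)
        if match is None:
            continue
        raw = match.group()
        value = float(raw.strip("()").replace(",", ""))
        if raw.startswith("("):
            value = -value  # parentheses read as a negative number
        link = SourceLink(document_name, number, match.start(), match.end())
        records.append((value, link))
    return records


def follow(link, lines):
    """Navigate back: return the original text the link points at."""
    return lines[link.line - 1][link.start:link.end]
```

Storing the link next to each value means a viewer can show the standardized figure and, on request, call `follow` to display the exact characters it came from in the source document.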

12. A system for processing tabular data from unstructured documents, the system comprising:
- a. an engine, the engine executing rules for extracting and standardizing tabular data from the unstructured documents;
  
  b. a plurality of clients, the clients interacting with the engine;
  
  c. a rules development user interface, the rules development user interface enabling the application designer to model the structuring rules in a visual manner, the rules development user interface being one of the plurality of clients; and
  
  d. a database, the database storing meta data related to the rules modeled using the rules development user interface and the data extracted using the engine.
  - 13. The system according to claim 12 further comprising a plurality of pre-built rules for extracting and standardizing tabular data from the unstructured documents wherein the rules are stored as meta data in the database, the rules comprising:
    - a. a plurality of identification rules for identifying a table in the unstructured document;
      
      b. a plurality of tokenizing rules for tokenizing the content of the identified table;
      
      c. a plurality of interpreting rules for interpreting the tokenized content; and
      
      d. a plurality of standardizing rules for standardizing the interpreted content.
  - 14. The system according to claim 12 further comprising means for identifying the links to the content of the table in the unstructured document.
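Claims 12 and 13 describe rules held as metadata in a database rather than hard-coded in the engine. The sketch below shows one way that could look, using SQLite and JSON purely as stand-ins. The table layout, the stage names, and the rule definitions are assumptions made for illustration and do not come from the patent.

```python
import json
import sqlite3

# Hypothetical stage names, mirroring the four rule families in claim 13.
STAGES = ("identification", "tokenizing", "interpreting", "standardizing")


def create_rule_store(conn):
    """Create the metadata table that holds rule definitions."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rules ("
        "id INTEGER PRIMARY KEY, "
        "stage TEXT NOT NULL, "
        "name TEXT NOT NULL, "
        "definition TEXT NOT NULL)"
    )


def save_rule(conn, stage, name, definition):
    """Store one rule; the definition is any JSON-serializable object."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    conn.execute(
        "INSERT INTO rules (stage, name, definition) VALUES (?, ?, ?)",
        (stage, name, json.dumps(definition)),
    )


def load_rules(conn, stage):
    """Return (name, definition) pairs for one stage, in insertion order."""
    rows = conn.execute(
        "SELECT name, definition FROM rules WHERE stage = ? ORDER BY id",
        (stage,),
    )
    return [(name, json.loads(definition)) for name, definition in rows]
```

With an in-memory connection from `sqlite3.connect(":memory:")`, a rules development client could call `save_rule` and the engine could call `load_rules` at run time, so rules change without redeploying the engine.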

15. A computer program product for use with a computer, the computer program product comprising a computer usable medium having a computer readable program code embodied therein processing documents containing tabular data, the computer program product comprising:
- a. program instruction means for identifying a table in the document using a set of identification rules;
  
  b. program instruction means for tokenizing the content of the identified table using a set of parsing rules;
  
  c. program instruction means for interpreting the tokenized content of the table using a set of mapping rules; and
  
  d. program instruction means for standardizing the content of the table using a set of standardization rules.
  - 16. The computer program product according to claim 15 further comprising program instruction means for extracting and standardizing tabular data from the unstructured documents based on predefined rules, wherein the rules are stored as meta data in the database, the rules comprising:
    - e. a plurality of identification rules for identifying a table in the unstructured document;
      
      f. a plurality of tokenizing rules for tokenizing the content of the identified table;
      
      g. a plurality of interpreting rules for interpreting the tokenized content; and
      
      h. a plurality of standardizing rules for standardizing the interpreted content.
  - 17. The computer program product according to claim 15 further comprising program instruction means for identifying the links to the content of the table in the unstructured document.

Current Assignee
Genpact USA, Inc. (Genpact Limited)
Original Assignee
RAGE Frameworks, Inc. (Genpact Limited)
Inventors
Bharadwaj, Srinivasan; Srinivasan, Venkatesan; Alam, Rummana; Kothiwale, Mahantesh

Granted Patent

US 7,590,647 B2
US Class Current

715/210
CPC Class Codes

G06F 16/86   Mapping to a database

G06F 40/177   of tables; using ruled lines

G06F 40/205   Parsing

G06F 40/226   Validation

Y10S 707/99943   Generating database or data...
