Automated understanding and decomposition of table-structured electronic documents

US 20040193520A1
Filed: 03/27/2003
Published: 09/30/2004
Est. Priority Date: 03/27/2003
Status: Abandoned Application

Abstract

Systems and methods for automatically understanding and decomposing unstructured tabular information are described. No constraints are placed on the origin or format of these documents when originally submitted; the documents may be in an unstructured and/or nonstandard format, and they may be electronic or flat files. The systems and methods of this invention generally comprise obtaining an electronic ASCII-formatted document, analyzing and understanding the contents of the document, and decomposing the information contained in the document, utilizing a variety of algorithms and heuristics to do this. Embodiments of this invention automatically process a multitude of financial documents, thereby eliminating the need for human interaction with such documents in many cases and lowering the costs associated with processing such documents.

91 Citations

33 Claims

1. A method for understanding and decomposing a document, the method comprising:
- utilizing at least one of the following algorithms to understand and decompose the document:
  
  one or more pre-processing algorithms;
  
  one or more token identification algorithms;
  
  one or more token type identification algorithms;
  
  one or more column count identification algorithms;
  
  one or more column boundary identification algorithms;
  
  one or more column type identification algorithms;
  
  one or more token-to-column assignment algorithms; and
  
  one or more line merging algorithms, wherein no prior identification of a document type is required, no prior identification of an expected format for the document type is required, and no pre-created scripts are required to map contents of the document.
  - 2. The method of claim 1, wherein the method is performed automatically by a computer system.
  - 3. The method of claim 1, wherein the document comprises tabular information.
  - 4. The method of claim 1, wherein the document comprises at least one of:
    - an ASCII text document, an EBCDIC text document, a spreadsheet, a PDF file, a Postscript file, and an HTML document.
  - 5. The method of claim 1, wherein the document comprises a financial statement.
  - 6. The method of claim 5, wherein the financial statement comprises at least one of:
    - a balance sheet, an income statement, and a cash flow statement.
  - 7. The method of claim 1, wherein the document comprises an electronic document.
  - 8. The method of claim 7, wherein the electronic document is obtained electronically via at least one of:
    - the Internet, an electronic mail message, an intranet, an extranet, and a scanner.
  - 9. The method of claim 1, wherein the one or more pre-processing algorithms comprise at least one of:
    - removing anomalous characters from the file and replacing at least some of the anomalous characters with other characters that will not change the meaning of the document;
      
      removing dollar signs;
      
      replacing tab characters with a predetermined number of spaces;
      
      removing sequences of multiple underscores;
      
      removing sequences of multiple periods;
      
      removing characters having non-ASCII values; and
      
      replacing runs of one or two dashes with a zero.
  - 10. The method of claim 1, wherein the one or more token identification algorithms comprise at least one of:
    - identifying, as tokens, strings of non-space characters having no more than two consecutive internal space characters;
      
      identifying textual elements for each row of text that are a predetermined number of spaces from a left or right non-space neighbor;
      
      skipping single tokens that comprise only a “$” character; and
      
      establishing a predetermined white space threshold via statistical evaluation distribution of white space markers throughout the document.
  - 11. The method of claim 1, wherein the one or more token type identification algorithms comprise:
    - identifying the token type as at least one of: numeric, text, and date.
  - 12. The method of claim 1, wherein the one or more column count identification algorithms comprise:
    - determining a statistical average of the population of tokens in each row.
  - 13. The method of claim 1, wherein the one or more column boundary identification algorithms comprise at least one of:
    - sequentially positioning the tokens within the columns identified by the one or more column count identification algorithms;
      
      establishing a start point of each column;
      
      establishing an end point of each column; and
      
      extending the start point and the end point of each column proportionately to the size of the columns to accommodate gaps between columns.
  - 14. The method of claim 1, wherein the one or more column type identification algorithms comprise:
    - assigning default column types to columns in the document.
  - 15. The method of claim 1, wherein the one or more token-to-column assignment algorithms comprise:
    - assigning each token to one or more columns based on the boundaries of the columns within which the token falls and adjusting the token assignments as necessary to accommodate tokens that span multiple cells.
  - 16. The method of claim 1, wherein the one or more line merging algorithms comprise:
    - utilizing natural language processing to combine multiple tokens in consecutive rows that should actually be a single token.
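
The pre-processing, token identification, and token type identification steps recited in claims 9 through 11 can be illustrated with a minimal Python sketch. It is not the applicants' implementation. The function names, the tab width of 8, the list of date formats, the numeric pattern, and the choice to overwrite removed characters with spaces (so that character offsets, and therefore column positions, are preserved) are all assumptions made for illustration.

```python
import re
from datetime import datetime

TAB_WIDTH = 8  # assumed "predetermined number of spaces" per tab
DATE_FORMATS = ("%m/%d/%Y", "%m/%d/%y", "%Y-%m-%d", "%B %d, %Y")  # assumed formats


def preprocess_line(line: str) -> str:
    """Apply a subset of the claim 9 clean-up steps to one line of text."""
    line = line.replace("\t", " " * TAB_WIDTH)
    # Blank out dollar signs and characters with non-ASCII values.
    line = "".join(" " if ch == "$" or ord(ch) > 127 else ch for ch in line)
    # Blank out runs of two or more underscores or periods.
    line = re.sub(r"_{2,}|\.{2,}", lambda m: " " * len(m.group()), line)
    # A standalone run of one or two dashes becomes a right-aligned zero.
    return re.sub(r"(?<!\S)-{1,2}(?!\S)", lambda m: "0".rjust(len(m.group())), line)


def tokenize(line: str) -> list[tuple[int, str]]:
    """Return (offset, text) pairs; a token may contain gaps of at most two spaces."""
    return [(m.start(), m.group()) for m in re.finditer(r"\S+(?: {1,2}\S+)*", line)]


def token_type(text: str) -> str:
    """Classify a token as 'date', 'numeric', or 'text'."""
    cleaned = text.strip()
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(cleaned, fmt)
            return "date"
        except ValueError:
            pass
    # Assumed numeric shape: optional parentheses or sign, digits with commas,
    # optional decimal part, optional percent sign.
    if re.fullmatch(r"\(?-?\d[\d,]*(?:\.\d+)?\)?%?", cleaned):
        return "numeric"
    return "text"


raw = "Cash and equivalents\t$ 1,200   --"
for offset, text in tokenize(preprocess_line(raw)):
    print(offset, repr(text), token_type(text))
```

Splitting on runs of three or more spaces follows the literal wording of claim 10; the claim also recites deriving the white space threshold statistically from the document, which this sketch does not attempt.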

17. A system for understanding and decomposing a document, the system comprising:
- a means for utilizing at least one of the following algorithms to understand and decompose the document:
  
  one or more pre-processing algorithms;
  
  one or more token identification algorithms;
  
  one or more token type identification algorithms;
  
  one or more column count identification algorithms;
  
  one or more column boundary identification algorithms;
  
  one or more column type identification algorithms;
  
  one or more token-to-column assignment algorithms; and
  
  one or more line merging algorithms, wherein no prior identification of a document type is required, no prior identification of an expected format for the document type is required, and no pre-created scripts are required to map contents of the document.
  - 18. The system of claim 17, wherein a computer system is used to automatically understand and decompose the document.
  - 19. The system of claim 17, wherein the document comprises tabular information.
  - 20. The system of claim 17, wherein the document comprises at least one of:
    - an ASCII text document, an EBCDIC text document, a spreadsheet, a PDF file, a Postscript file, and an HTML document.
  - 21. The system of claim 17, wherein the document comprises a financial statement.
  - 22. The system of claim 21, wherein the financial statement comprises at least one of:
    - a balance sheet, an income statement, and a cash flow statement.
  - 23. The system of claim 17, wherein the document comprises an electronic document.
  - 24. The system of claim 23, wherein the electronic document is obtained electronically via at least one of:
    - the Internet, an electronic mail message, an intranet, an extranet, and a scanner.
  - 25. The system of claim 17, wherein the one or more pre-processing algorithms comprise at least one of:
    - removing anomalous characters from the file and replacing at least some of the anomalous characters with other characters that will not change the meaning of the document;
      
      removing dollar signs;
      
      replacing tab characters with a predetermined number of spaces;
      
      removing sequences of multiple underscores;
      
      removing sequences of multiple periods;
      
      removing characters having non-ASCII values; and
      
      replacing runs of one or two dashes with a zero.
  - 26. The system of claim 17, wherein the one or more token identification algorithms comprise at least one of:
    - identifying, as tokens, strings of non-space characters having no more than two consecutive internal space characters;
      
      identifying textual elements for each row of text that are a predetermined number of spaces from a left or right non-space neighbor;
      
      skipping single tokens that comprise only a “$” character; and
      
      establishing a predetermined white space threshold via statistical evaluation distribution of white space markers throughout the document.
  - 27. The system of claim 17, wherein the one or more token type identification algorithms comprise:
    - identifying the token type as at least one of: numeric, text, and date.
  - 28. The system of claim 17, wherein the one or more column count identification algorithms comprise:
    - determining a statistical average of the population of tokens in each row.
  - 29. The system of claim 17, wherein the one or more column boundary identification algorithms comprise at least one of:
    - sequentially positioning the tokens within the columns identified by the one or more column count identification algorithms;
      
      establishing a start point of each column;
      
      establishing an end point of each column; and
      
      extending the start point and the end point of each column proportionately to the size of the columns to accommodate gaps between columns.
  - 30. The system of claim 17, wherein the one or more column type identification algorithms comprise:
    - assigning default column types to columns in the document.
  - 31. The system of claim 17, wherein the one or more token-to-column assignment algorithms comprise:
    - assigning each token to one or more columns based on the boundaries of the columns within which the token falls and adjusting the token assignments as necessary to accommodate tokens that span multiple cells.
  - 32. The system of claim 17, wherein the one or more line merging algorithms comprise:
    - utilizing natural language processing to combine multiple tokens in consecutive rows that should actually be a single token.
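
The column count, column boundary, and token-to-column assignment steps of claims 28, 29, and 31 (and the parallel method claims 12, 13, and 15) might be sketched as follows. This is one possible reading of the claim language, not the applicants' code. The `(offset, text)` token representation, the function names, the use of a rounded mean for the column count, and the rule that splits each inter-column gap in proportion to the widths of its two neighbours are assumptions.

```python
from statistics import mean

Token = tuple[int, str]  # (start offset, text); hypothetical representation


def column_count(rows: list[list[Token]]) -> int:
    """Estimate the column count as the rounded mean token count of non-empty rows."""
    return round(mean(len(r) for r in rows if r))


def column_boundaries(rows: list[list[Token]], n: int) -> list[tuple[int, int]]:
    """Derive half-open [start, end) spans from rows that have exactly n tokens."""
    full = [r for r in rows if len(r) == n]
    if not full:
        raise ValueError("no row has exactly n tokens")
    # The k-th token of each full row is taken to sit in column k.
    spans = [
        [min(r[k][0] for r in full), max(r[k][0] + len(r[k][1]) for r in full)]
        for k in range(n)
    ]
    widths = [end - start for start, end in spans]
    # Split each gap between neighbours in proportion to their original widths.
    for k in range(n - 1):
        gap = spans[k + 1][0] - spans[k][1]
        if gap > 0:
            share = round(gap * widths[k] / (widths[k] + widths[k + 1]))
            spans[k][1] += share
            spans[k + 1][0] = spans[k][1]
    return [(start, end) for start, end in spans]


def assign(tokens: list[Token], bounds: list[tuple[int, int]]) -> list[tuple[str, list[int]]]:
    """Map each token to every column whose span it overlaps (spanning tokens get several)."""
    result = []
    for start, text in tokens:
        end = start + len(text)
        cols = [i for i, (s, e) in enumerate(bounds) if start < e and end > s]
        result.append((text, cols))
    return result


rows = [
    [(0, "Consolidated balance sheet data")],
    [(0, "Cash"), (20, "1,200"), (30, "1,100")],
    [(0, "Receivables"), (22, "300"), (32, "250")],
    [(0, "Inventory"), (22, "450"), (32, "500")],
    [(0, "Total assets"), (20, "1,950"), (30, "1,850")],
]
n = column_count(rows)
bounds = column_boundaries(rows, n)
for row in rows:
    print(assign(row, bounds))
```

A plain mean is sensitive to title and footnote rows, so a real table with many single-token lines could be undercounted; the sketch keeps the mean only because that is the statistic the claims name.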

33. A method for understanding and decomposing a document, the method comprising:
- preprocessing text in the document;
  
  identifying a physical layout of the document by establishing tokens;
  
  characterizing the tokens in the document as at least one of: numeric, text and date;
  
  establishing a column count of the number of columns in the document;
  
  establishing column boundaries for each column;
  
  establishing a column type for each column;
  
  assigning tokens to a column;
  
  identifying spanning tokens;
  
  identifying wrapping lines;
  
  identifying a table construct and a relationship between the tokens and table cells;
  
  identifying special rows and special cells in the document;
  
  identifying logical layout of the document;
  
  interpreting text in the document; and
  
  applying validation rules to verify totals and subtotals are correct.
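
The final step of claim 33, applying validation rules to verify totals and subtotals, can be illustrated with a short Python sketch. The claim does not specify how the rules are expressed, so the helper names `parse_amount` and `check_total`, the treatment of parenthesized figures as negative, and the optional rounding tolerance are assumptions made for illustration.

```python
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Parse figures such as '1,200', '-350' or '(350)'; parentheses mean negative."""
    cleaned = text.strip().replace(",", "")
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    value = Decimal(cleaned)
    return -value if negative else value


def check_total(items: list[str], reported_total: str,
                tolerance: Decimal = Decimal("0")) -> bool:
    """Return True when the line items sum to the reported total within the tolerance."""
    expected = sum((parse_amount(item) for item in items), Decimal("0"))
    return abs(expected - parse_amount(reported_total)) <= tolerance


print(check_total(["1,200", "300", "450"], "1,950"))
print(check_total(["100", "200"], "301", tolerance=Decimal("1")))
```

`Decimal` is used instead of `float` so that cent-level sums compare exactly; a nonzero tolerance is only needed when the source statement rounds its line items.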

Current Assignee: General Electric Company
Original Assignee: General Electric Company
Inventors: Klein, Eric; Laymon, Marc; LaComb, Christina
Application Number: US10/400,982
Publication Number: US 20040193520A1

Field of Search
US Class Current: 705/35
CPC Class Codes:
G06Q 10/10 Office automation; Time man...
G06Q 40/00 Finance; Insurance; Tax str...
