Automated understanding and decomposition of table-structured electronic documents
Abstract
Systems and methods for automatically understanding and decomposing unstructured tabular information are described. No constraints are placed on the origin or format of these documents when originally submitted; the documents may be in an unstructured and/or nonstandard format, and they may be electronic or flat files. The systems and methods of this invention generally comprise obtaining an electronic ASCII-formatted document, analyzing and understanding the contents of the document, and decomposing the information contained in the document, utilizing a variety of algorithms and heuristics to do this. Embodiments of this invention automatically process a multitude of financial documents, thereby eliminating the need for human interaction with such documents in many cases and lowering the costs associated with processing such documents.
91 Citations
33 Claims
1. A method for understanding and decomposing a document, the method comprising:
utilizing at least one of the following algorithms to understand and decompose the document:
one or more pre-processing algorithms;
one or more token identification algorithms;
one or more token type identification algorithms;
one or more column count identification algorithms;
one or more column boundary identification algorithms;
one or more column type identification algorithms;
one or more token-to-column assignment algorithms; and
one or more line merging algorithms, wherein no prior identification of a document type is required, no prior identification of an expected format for the document type is required, and no pre-created scripts are required to map contents of the document.
Dependent claims: 2-16
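The claim names token identification and token type identification without fixing how either is done. The following is a minimal sketch only, not the patented method: it assumes tokens are separated by two or more spaces (so that a label such as "Cash and equivalents" stays one token) and that the types of interest are numeric, date and text. The names `Token`, `tokenize`, `classify` and the listed formats are hypothetical, chosen for illustration.

```python
import re
from datetime import datetime
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    start: int  # offset of the first character in the line
    end: int    # offset one past the last character
    kind: str   # "numeric", "date" or "text"


# Assumed formats; a real document set would need many more.
_DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d")
_NUMERIC = re.compile(r"^\(?[-+]?\$?(\d{1,3}(,\d{3})+|\d+)(\.\d+)?\)?%?$")
# A token is a run of non-space chunks joined by single spaces.
_TOKEN = re.compile(r"\S+(?: \S+)*")


def classify(text: str) -> str:
    """Guess a token type: date first, then numeric, else text."""
    for fmt in _DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return "date"
        except ValueError:
            pass
    if _NUMERIC.match(text):
        return "numeric"
    return "text"


def tokenize(line: str) -> list[Token]:
    """Split one line into typed tokens, keeping character offsets."""
    return [
        Token(m.group(), m.start(), m.end(), classify(m.group()))
        for m in _TOKEN.finditer(line)
    ]
```

Keeping the character offsets on each token is a design choice made here so that later steps (column boundaries, token-to-column assignment) can work from positions rather than from token order alone.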
17. A system for understanding and decomposing a document, the system comprising:
a means for utilizing at least one of the following algorithms to understand and decompose the document:
one or more pre-processing algorithms;
one or more token identification algorithms;
one or more token type identification algorithms;
one or more column count identification algorithms;
one or more column boundary identification algorithms;
one or more column type identification algorithms;
one or more token-to-column assignment algorithms; and
one or more line merging algorithms, wherein no prior identification of a document type is required, no prior identification of an expected format for the document type is required, and no pre-created scripts are required to map contents of the document.
Dependent claims: 18-32
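Column count, column boundary and token-to-column assignment are likewise named but not specified in the claim. One common heuristic for fixed-width ASCII tables, shown here purely as an assumed illustration, is a vertical projection: character positions that are blank on every line separate columns. The functions `find_columns` and `assign_column` and the `min_gap` parameter are hypothetical names, not taken from the patent.

```python
def find_columns(lines: list[str], min_gap: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) spans of columns; end is exclusive.

    A position is "occupied" if any line has a non-space character
    there. Runs of at least `min_gap` unoccupied positions are treated
    as column separators (an assumed threshold).
    """
    width = max((len(line) for line in lines), default=0)
    occupied = [False] * width
    for line in lines:
        for i, ch in enumerate(line):
            if not ch.isspace():
                occupied[i] = True

    columns = []
    start = None
    blank_run = 0
    for i, used in enumerate(occupied):
        if used:
            if start is None:
                start = i
            blank_run = 0
        elif start is not None:
            blank_run += 1
            if blank_run >= min_gap:
                # The column ended just after the last occupied position.
                columns.append((start, i - blank_run + 1))
                start = None
                blank_run = 0
    if start is not None:
        columns.append((start, width - blank_run))
    return columns


def assign_column(start: int, end: int, columns: list[tuple[int, int]]):
    """Return the index of the column overlapping [start, end) the most,
    or None if the token overlaps no column."""
    best, best_overlap = None, 0
    for idx, (c_start, c_end) in enumerate(columns):
        overlap = min(end, c_end) - max(start, c_start)
        if overlap > best_overlap:
            best, best_overlap = idx, overlap
    return best
```

Under this sketch the column count is simply `len(find_columns(lines))`. A token that overlaps more than one span would be a candidate spanning token, which this sketch does not handle.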
33. A method for understanding and decomposing a document, the method comprising:
preprocessing text in the document;
identifying a physical layout of the document by establishing tokens;
characterizing the tokens in the document as at least one of: numeric, text and date;
establishing a column count of the number of columns in the document;
establishing column boundaries for each column;
establishing a column type for each column;
assigning tokens to a column;
identifying spanning tokens;
identifying wrapping lines;
identifying a table construct and a relationship between the tokens and table cells;
identifying special rows and special cells in the document;
identifying logical layout of the document;
interpreting text in the document; and
applying validation rules to verify totals and subtotals are correct.
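The final step of claim 33, verifying totals and subtotals, can be illustrated with a small sketch. It assumes rows have already been reduced to (label, amount) pairs, that a total row is recognised by a label starting with "total", that each total covers the rows since the previous total, and that parentheses denote negative amounts. All of these conventions, and the names `parse_amount` and `check_totals`, are assumptions for illustration rather than rules stated in the claim.

```python
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Parse amounts such as '1,234.50', '$10' or '(500.00)'.

    Parentheses are read as a negative sign (assumed accounting style).
    """
    text = text.strip()
    negative = text.startswith("(") and text.endswith(")")
    cleaned = text.strip("()").replace(",", "").replace("$", "")
    value = Decimal(cleaned)
    return -value if negative else value


def check_totals(rows, total_prefix: str = "total"):
    """Return a list of (label, expected, found) for each total row
    whose amount differs from the sum of the rows above it.

    Only flat subtotals are handled; nested subtotals that roll up
    into a grand total would need extra bookkeeping.
    """
    mismatches = []
    running = Decimal("0")
    for label, amount_text in rows:
        amount = parse_amount(amount_text)
        if label.strip().lower().startswith(total_prefix):
            if amount != running:
                mismatches.append((label, running, amount))
            running = Decimal("0")  # start a new group after each total
        else:
            running += amount
    return mismatches
```

`Decimal` is used instead of `float` so that currency amounts compare exactly; an empty result means every total row matched the rows above it.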
Specification