System and method for document section segmentation
First Claim
1. A system and method for document heading categorization, comprising the steps of:
constructing a first data set consisting of exemplars having at least one pair of expressions and corresponding codes;
constructing a second data set having a structural hierarchy, where the second data set contains at least one corresponding code mapped to at least one expression;
transforming at least one of the expressions into a first representation, where the first representation includes sequential word features;
constructing a target data set consisting of at least one first representation and at least one corresponding code;
comparing a candidate string to the target data set;
identifying a least dissimilar target representation in the target data set having a dissimilarity score exceeding a first pre-determined value;
providing the corresponding code of the least dissimilar target in the target data set;
selectively saving a candidate string having a dissimilarity score not exceeding a second pre-determined value; and
selectively reviewing the saved candidate string and assigning its representation and corresponding code to the target data set.
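The steps of claim 1 can be sketched in Python as follows. This is an illustrative sketch only, not the patented implementation. It assumes that "sequential word features" can be approximated by word unigrams and bigrams, that the score is oriented so that a larger value means a closer match (Jaccard overlap is a stand-in, since this section does not name the measure), and that both pre-determined values are 0.5. The headings, codes, and every function and class name are hypothetical.

```python
import re
from dataclasses import dataclass, field

# Hypothetical second data set: a hierarchy mapping codes to canonical expressions.
HIERARCHY = {
    "BODY": {"expression": "Body", "children": {
        "INTRO": {"expression": "Introduction"},
        "RELATED": {"expression": "Related Work"},
    }},
    "REFS": {"expression": "References"},
}
# Hypothetical first data set: exemplar (expression, code) pairs.
EXEMPLARS = [("Intro", "INTRO"), ("Bibliography", "REFS")]


def word_features(expression, max_n=2):
    """First representation: the set of word n-grams up to max_n (an assumption)."""
    words = re.findall(r"[a-z0-9]+", expression.lower())
    return frozenset(
        tuple(words[i:i + k])
        for k in range(1, max_n + 1)
        for i in range(len(words) - k + 1)
    )


def match_score(a, b):
    """Jaccard overlap of two feature sets; larger means closer (assumed orientation)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def flatten_hierarchy(node):
    """Yield (expression, code) pairs from the nested hierarchy."""
    for code, entry in node.items():
        yield entry["expression"], code
        yield from flatten_hierarchy(entry.get("children", {}))


@dataclass
class HeadingCategorizer:
    first_value: float = 0.5    # assumed first pre-determined value
    second_value: float = 0.5   # assumed second pre-determined value
    targets: list = field(default_factory=list)       # (features, code) pairs
    review_queue: list = field(default_factory=list)  # saved candidate strings

    def add_target(self, expression, code):
        self.targets.append((word_features(expression), code))

    def categorize(self, candidate):
        """Return the code of the closest target, or None and save the candidate."""
        features = word_features(candidate)
        best_score, best_code = 0.0, None
        for target_features, code in self.targets:
            score = match_score(features, target_features)
            if score > best_score:
                best_score, best_code = score, code
        if best_score > self.first_value:
            return best_code
        if best_score <= self.second_value:
            self.review_queue.append(candidate)
        return None

    def review(self, candidate, code):
        """A reviewer assigns a code; the candidate joins the target data set."""
        self.review_queue.remove(candidate)
        self.add_target(candidate, code)


def build_categorizer(exemplars, hierarchy):
    """Build the target data set from both source data sets."""
    categorizer = HeadingCategorizer()
    for expression, code in list(exemplars) + list(flatten_hierarchy(hierarchy)):
        categorizer.add_target(expression, code)
    return categorizer
```

In this sketch a heading that is matched closely returns its code immediately, while a weak match is queued; once a reviewer calls `review` on the queued string, the same heading is recognized on later passes without further supervision.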
Abstract
A system and method for facilitating the processing and use of documents by categorizing document section headings under a set of canonical section headings. The method for categorizing section headings may include training a database and matching methods to categorize different but equivalent document section headings under canonical headings and categories. Once trained, the system may match and categorize the document sections of large sets of documents with little to no supervision.
9 Claims
Specification