SYSTEM AND METHOD FOR DOCUMENT SECTION SEGMENTATION
First Claim
1. An automated computer-implemented method for categorizing document section headings in a plurality of documents comprising the steps of:
- determining a set of canonical section headings from a subset of said plurality of documents;
- establishing a data base containing said canonical section headings;
- extracting each section heading from the remainder of the documents, transforming said section headings into a plurality of n-grams; and
- associating a particular section heading with a particular canonical section heading in the data base if said n-grams associated with a section heading reach a predetermined level of similarity to said canonical section headings.
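The extraction and association steps of the claim can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the claim does not say which kind of n-gram or which similarity measure is used, so the sketch assumes character trigrams and Jaccard similarity. The names `char_ngrams`, `jaccard`, and `match_heading`, the threshold of 0.5, and the sample headings are all hypothetical.

```python
from typing import Iterable, Optional


def char_ngrams(heading: str, n: int = 3) -> set[str]:
    """Lowercase the heading, collapse whitespace, and return its character n-grams.

    Character trigrams are an assumption; the claim only says "n-grams".
    """
    text = " ".join(heading.lower().split())
    if len(text) <= n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two n-gram sets (an assumed similarity measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_heading(
    heading: str, canonical: Iterable[str], threshold: float = 0.5
) -> Optional[str]:
    """Return the most similar canonical heading, or None if no candidate
    reaches the predetermined similarity threshold."""
    grams = char_ngrams(heading)
    best, best_score = None, 0.0
    for candidate in canonical:
        score = jaccard(grams, char_ngrams(candidate))
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None


# Hypothetical canonical headings, standing in for the contents of the data base.
CANONICAL = ["Introduction", "Methods", "Results", "References"]

for raw in ["METHODS", "Introductions", "Appendix"]:
    print(raw, "->", match_heading(raw, CANONICAL))
```

A heading that reaches the threshold for no canonical entry is left unassociated (`None`), which corresponds to the conditional "if ... reach a predetermined level of similarity" in the final step.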
Abstract
A system and method facilitate the processing and use of documents by categorizing document section headings under a set of canonical section headings. The method for categorizing section headings may include training a database and matching methods so that different but equivalent document section headings are categorized under the same canonical headings and categories. Once trained, the system may match and categorize document sections across large sets of documents with little to no supervision.
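The training step mentioned in the abstract can be sketched as follows. The source does not say how canonical headings are chosen or how the database is stored, so everything here is an assumption made for illustration: canonical headings are taken to be the normalized headings that occur at least `min_count` times in a training subset, and they are stored in an in-memory SQLite table. The names `normalize`, `build_canonical_db`, and the `canonical` table are hypothetical.

```python
import sqlite3
from collections import Counter
from typing import Iterable


def normalize(heading: str) -> str:
    """Lowercase, collapse whitespace, and drop a trailing colon."""
    return " ".join(heading.lower().split()).rstrip(": ")


def build_canonical_db(
    training_headings: Iterable[str], min_count: int = 2
) -> sqlite3.Connection:
    """Store headings seen at least min_count times in the training subset.

    Frequency-based selection is an assumption, not the method of the source.
    """
    counts = Counter(normalize(h) for h in training_headings)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE canonical (heading TEXT PRIMARY KEY, seen INTEGER)")
    conn.executemany(
        "INSERT INTO canonical VALUES (?, ?)",
        [(h, c) for h, c in counts.items() if c >= min_count],
    )
    conn.commit()
    return conn


# Hypothetical headings extracted from a small training subset of documents.
training = ["Methods", "METHODS:", "Results", "results", "Acknowledgements"]
conn = build_canonical_db(training)
canonical = sorted(row[0] for row in conn.execute("SELECT heading FROM canonical"))
print(canonical)
```

Headings that occur only once in the training subset are left out of the table in this sketch; in a supervised setup, a reviewer could add or merge such entries by hand before the database is used for matching.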
5 Claims
Specification