System and Method for Identifying Document Structure and Associated Metainformation and Facilitating Appropriate Processing
Abstract
A system and method for processing documents by utilizing the textual content and layout of the documents, including visual indicators, to more efficiently and reliably process the documents across various document types. The system and method identifies visually distinguishable elements within the document, such as section and sub-section boundary indicators, to mark, divide and label the boundaries and content type such that the sections are more clearly identifiable and easily processed. The system and method uses known elements, including section heading types, keywords, section type classifiers, sub-section heading constructs, stop words, and the like to adaptively identify and process a broad range of document types. The system and method continually refines and updates these known elements and allows users to discover and define new elements for further refinement and updating.
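As a rough illustration of the boundary-marking step described in the abstract, the following Python sketch divides plain text into labeled sections using a few visual indicators. The specific indicators (a preceding blank line, an all-caps line, a trailing colon, a short line) and the names `looks_like_heading` and `split_sections` are assumptions made for this example; the abstract does not enumerate the indicators the system actually uses.

```python
from typing import List, Tuple


def looks_like_heading(line: str, prev_blank: bool) -> bool:
    """Guess whether a line is a section heading from visual cues alone.

    Assumed cues (not taken from the source): the line follows a blank line,
    is short, and is either all caps or ends with a colon.
    """
    text = line.strip()
    if not text or len(text.split()) > 6:
        return False
    return prev_blank and (text.isupper() or text.endswith(":"))


def split_sections(document: str) -> List[Tuple[str, List[str]]]:
    """Return (heading, body_lines) pairs; the preamble gets an empty heading."""
    sections: List[Tuple[str, List[str]]] = [("", [])]
    prev_blank = True  # treat the start of the document like a blank line
    for line in document.splitlines():
        if looks_like_heading(line, prev_blank):
            # Start a new section and store the heading without its colon.
            sections.append((line.strip().rstrip(":"), []))
        elif line.strip():
            sections[-1][1].append(line.strip())
        prev_blank = not line.strip()
    # Drop an empty preamble if the document starts with a heading.
    return [s for s in sections if s[0] or s[1]]


sample = "Jane Doe\n\nEXPERIENCE\nAcme Corp, engineer\n\nEducation:\nState University\n"
for heading, body in split_sections(sample):
    print(repr(heading), body)
```

A real document would carry richer layout signals (font size, bold runs, indentation) than plain text exposes, so these heuristics stand in for whatever indicators the system actually relies on.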
25 Claims
1. A method, comprising:
receiving at least one document;
identifying sections and associated section types within said at least one document;
identifying sub-sections within said at least one document;
defining new section types and new sub-section heading constructs when sections having known section types are identified; and
learning new section heading keywords when sections having known section types are identified.
Dependent claims: 2, 3, 4, 5, 6, 7, 8

9. A system, comprising:
a document input unit;
a processing unit coupled to said document input unit, said processing unit includes;
  means for identifying document section heading candidates based on known visual indicators;
  means for identifying document section types based on known section type keywords;
  means for establishing whether section types can be determined and performing the following;
    if section types can be determined, processing the section content based on the section type, and outputting the processed document;
    if section types cannot be determined, identifying section types based on known section type classifiers;
  means for establishing whether section types can be determined and performing the following;
    if section types can be determined, outputting the section headings and types to a database, processing the section content based on the section type, and outputting the processed document; and
    if section types cannot be determined, outputting the sections having undetermined section types to a database;
a storage unit coupled to said processing unit; and
a document output unit coupled to said output unit.
Dependent claims: 10, 11, 12, 13, 14

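The two-stage decision flow recited in claim 9 (keywords first, classifier as fallback, undetermined sections set aside) might be sketched as follows. This is only an illustration under stated assumptions: `KNOWN_KEYWORDS`, `resolve_section`, `toy_classifier`, and the two list-based "databases" are hypothetical names invented here, and the classifier is a trivial stand-in rather than the one the patent contemplates.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical keyword table mapping heading words to section types.
KNOWN_KEYWORDS: Dict[str, str] = {
    "experience": "work_history",
    "education": "education",
}


def type_from_keywords(heading: str) -> Optional[str]:
    """Stage 1: look for a known section type keyword in the heading."""
    for word in heading.lower().split():
        if word in KNOWN_KEYWORDS:
            return KNOWN_KEYWORDS[word]
    return None


def resolve_section(
    heading: str,
    body: str,
    classifier: Callable[[str], Optional[str]],
    heading_db: List[Tuple[str, str]],
    undetermined_db: List[Tuple[str, str]],
) -> Optional[str]:
    """Return the section type, recording outcomes as claim 9 describes."""
    section_type = type_from_keywords(heading)
    if section_type is not None:
        return section_type  # determined by keyword; content can be processed
    # Stage 2: fall back to a section type classifier on the content.
    section_type = classifier(body)
    if section_type is not None:
        heading_db.append((heading, section_type))  # store heading and type
        return section_type
    undetermined_db.append((heading, body))  # keep for later review
    return None


def toy_classifier(body: str) -> Optional[str]:
    """Stand-in classifier (assumption): flags bodies that mention Python."""
    return "skills" if "python" in body.lower() else None


headings: List[Tuple[str, str]] = []
unknown: List[Tuple[str, str]] = []
print(resolve_section("Professional Experience", "Acme Corp", toy_classifier, headings, unknown))
print(resolve_section("Toolbox", "Python, SQL", toy_classifier, headings, unknown))
print(resolve_section("Misc", "n/a", toy_classifier, headings, unknown))
```

Keeping the keyword lookup ahead of the classifier mirrors the order in the claim: the cheaper lookup runs first, and only headings it cannot type reach the classifier or the undetermined store.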
15. A computer program product comprising a computer useable medium including a computer readable program, wherein the computer readable program when executed on a computer causes the computer to:
receive at least one document;
identify sections and associated section types within said at least one document;
identify sub-sections within said at least one document;
define new section types and new sub-section heading constructs; and
learn new section heading keywords.
Dependent claims: 16, 17, 18, 19, 20

21. A method, comprising:
receiving at least one document;
identifying sections and associated section types within said at least one document based on known keywords and section type classifiers;
identifying sub-sections within said at least one document based on sub-section heading constructs;
defining new section types and new sub-section heading constructs when sections having unknown section types are identified; and
learning new section heading keywords when known section types are identified by a section type classifier, instead of the existence of known section type keywords.
Dependent claims: 22, 23, 24, 25

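The keyword-learning step in claim 21 can be read as: when a classifier, rather than a known keyword, identifies a section's type, the words of that section's heading become candidate keywords for the type. The sketch below shows one possible reading. The function `learn_keywords`, the stop word list, and the rule of skipping headings that already contain a known keyword are all assumptions for illustration, not details taken from the claims.

```python
from typing import Dict, Set

# Hypothetical stop words; the abstract mentions stop words but lists none.
STOP_WORDS: Set[str] = {"and", "of", "the", "my"}


def learn_keywords(
    heading: str,
    classified_type: str,
    known_keywords: Dict[str, str],
) -> Set[str]:
    """Add heading words as new keywords for a classifier-identified type.

    Nothing is learned if the heading already contains a known keyword,
    since the type would then have been found without the classifier.
    """
    words = [
        w for w in heading.lower().split()
        if w.isalpha() and w not in STOP_WORDS
    ]
    if any(w in known_keywords for w in words):
        return set()
    learned: Set[str] = set()
    for w in words:
        known_keywords[w] = classified_type  # update the keyword table in place
        learned.add(w)
    return learned


keywords = {"experience": "work_history"}
print(sorted(learn_keywords("Career Highlights", "work_history", keywords)))
print(sorted(learn_keywords("Work Experience", "work_history", keywords)))
```

In practice a system would likely require a word to recur across several documents before trusting it as a keyword; this sketch accepts every non-stop word immediately to keep the example short.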
Specification