ELECTRONIC TABLE OF CONTENTS ENTRY CLASSIFICATION AND LABELING SCHEME
Abstract
Computer-storage media, computerized methods, and systems for classifying character strings within electronic documents are provided. Initially, textual data, which includes one or more character strings, is extracted from an electronic version of a document, typically one scanned from a physical document using optical character recognition. The textual data is received at a table-of-contents (TOC) engine that extracts semantic information from the textual data. Sub-engines within the TOC engine analyze the semantic information to determine at least one appropriate classification for character strings within the textual data. Labels selected from a predetermined set of TOC-architecture labels are appended to the character strings according to the appropriate classification. The character strings and the labels appended thereto are stored in association with each other, generating an electronic document file that includes enriched textual data.
66 Citations
20 Claims
1. One or more computer-storage media having computer-executable instructions embodied thereon that, when executed, perform a method for classifying character strings of a table-of-contents (TOC) portion of an electronic document, the method comprising:
receiving textual data extracted from the electronic document, the textual data comprising one or more character strings of the TOC portion of the electronic document;
extracting semantic information from the textual data of the identified TOC portion;
executing a classification procedure to determine at least one appropriate classification for the one or more character strings of the TOC portion by analyzing the semantic information;
appending one or more labels, selected from a predetermined set of TOC-architecture labels, to the one or more character strings according to the at least one appropriate classification; and
storing the one or more labels in association with the one or more character strings.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
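The steps recited in claim 1 (receive, extract semantic information, classify, append labels, store) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the three label names, the rule-based classifier standing in for the "classification procedure", the three boolean features, and the use of JSON as the storage form.

```python
import json
import re

# Hypothetical label set; the claim only requires "a predetermined set
# of TOC-architecture labels" and does not name them here.
TOC_LABELS = ("section_number", "title", "page_number")

# Matches strings such as "3", "1.2", or "4.10.1".
NUMBER = re.compile(r"\d+(\.\d+)*")


def extract_semantic_info(strings):
    """Derive simple features for each character string of one TOC entry."""
    last = len(strings) - 1
    return [
        {
            "numeric": NUMBER.fullmatch(s) is not None,
            "first": i == 0,
            "last": i == last,
        }
        for i, s in enumerate(strings)
    ]


def classify(info):
    """Rule-based stand-in for the claimed classification procedure."""
    if info["numeric"] and info["first"]:
        return "section_number"
    if info["numeric"] and info["last"]:
        return "page_number"
    return "title"


def label_toc_entry(strings):
    """Append one label to each character string of the entry."""
    infos = extract_semantic_info(strings)
    return [{"text": s, "labels": [classify(i)]} for s, i in zip(strings, infos)]


def store(labeled):
    """Store labels in association with their strings (here, as JSON)."""
    return json.dumps(labeled)


entry = label_toc_entry("1.2 Introduction 15".split())
stored = store(entry)
```

A learned classifier could replace `classify` without changing the surrounding steps; the sketch only shows how the five recited steps chain together.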
13. A computer system for determining a structure of a table-of-contents (TOC) portion of an electronic document, the system comprising:
a converter component for receiving textual data extracted from the TOC portion of the electronic document, the textual data comprising one or more TOC entries;
a TOC engine for classifying one or more elements within the one or more TOC entries of the electronic document, the TOC engine comprising:
  a featurizer tool for extracting semantic information from the textual data; and
  a word-label sub-engine for determining at least one appropriate classification for the one or more elements by analyzing the semantic information, and for appending one or more labels, selected from a predetermined set of architecture labels, to the one or more elements according to the at least one appropriate classification; and
a merge engine for storing the one or more labels in association with the one or more elements.
Dependent claims: 14, 15, 16, 17, 18, 19
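The component structure of claim 13 can be pictured as a small set of cooperating classes. The following Python sketch is only one possible arrangement under stated assumptions: the class and method names, the two features, and the two labels (`title_word`, `page_number`) are hypothetical and chosen for illustration; the claim names the components but not their interfaces.

```python
class Converter:
    """Receives TOC text and splits it into TOC entries (lists of words)."""

    def receive(self, toc_text):
        return [line.split() for line in toc_text.splitlines() if line.strip()]


class Featurizer:
    """Extracts simple semantic information for each element of an entry."""

    def featurize(self, entry):
        last = len(entry) - 1
        return [
            {"digits": word.isdigit(), "last": i == last}
            for i, word in enumerate(entry)
        ]


class WordLabelSubEngine:
    """Chooses a label for each element from a predetermined label set."""

    LABELS = ("title_word", "page_number")  # hypothetical label set

    def label(self, entry, features):
        return [
            (word, "page_number" if f["digits"] and f["last"] else "title_word")
            for word, f in zip(entry, features)
        ]


class TocEngine:
    """Combines the featurizer tool and the word-label sub-engine."""

    def __init__(self):
        self.featurizer = Featurizer()
        self.word_labeler = WordLabelSubEngine()

    def classify(self, entry):
        return self.word_labeler.label(entry, self.featurizer.featurize(entry))


class MergeEngine:
    """Stores labels in association with the elements they describe."""

    def __init__(self):
        self.records = []

    def store(self, labeled_entry):
        self.records.append(labeled_entry)


def process(toc_text):
    """Run converter -> TOC engine -> merge engine over a block of TOC text."""
    converter, engine, merger = Converter(), TocEngine(), MergeEngine()
    for entry in converter.receive(toc_text):
        merger.store(engine.classify(entry))
    return merger.records
```

Calling `process("Preface 3\nChapter One 10")` walks two entries through the three components and returns the stored word-label pairs.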
20. A computerized method for classifying character strings within electronic documents, the method comprising:
receiving textual data extracted from an electronic document, the textual data comprising one or more character strings, wherein the textual data comprises position values, layout characteristics, and shape characteristics, associated with the one or more character strings;
deriving semantic information from the textual data, wherein deriving semantic information comprises organizing the one or more character strings into groups upon recognizing the shape characteristics and the layout characteristic of the one or more character strings;
performing one or more categorization tests that utilize the derived semantic information, wherein each of the one or more categorization tests relates to a respective label in a predetermined set of architecture labels, wherein performing the one or more categorization tests comprises:
  executing one or more evaluation passes on the one or more character strings, wherein the one or more evaluation passes comprise matching the semantic information associated with the one or more character strings against predefined layout characteristics and predefined shape characteristics; and
  incrementally adjusting a temporary score, associated with the one or more character strings, based upon results of each of the one or more evaluation passes, wherein adjusting the temporary score is facilitated by a scoring function that receives results determined by the one or more evaluation passes;
calculating at least one character-string score based on results determined by each of the one or more categorization tests and the temporary score;
appending one or more labels to the one or more character strings according to the at least one character-string score; and
serializing the one or more labels in association with the one or more character strings; and
training the scoring function according to a correlation between the one or more labels and an actual classification of the one or more character strings.
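The scoring loop of claim 20 (evaluation passes, an incrementally adjusted temporary score, a final character-string score, and training of the scoring function) can be sketched as follows. This is a simplified illustration under several assumptions that do not come from the patent: a single categorization test for a hypothetical `page_number` label, two made-up evaluation passes, a fixed threshold of 0.5, a final score equal to the temporary score, and a perceptron-style weight update standing in for the unspecified training procedure.

```python
# Hypothetical evaluation passes: each matches one predefined shape or
# layout characteristic. "x" is an assumed horizontal position in [0, 1].
PASSES = {
    "all_digits": lambda s: s["text"].isdigit(),     # shape characteristic
    "right_aligned": lambda s: s["x"] > 0.8,         # layout characteristic
}
THRESHOLD = 0.5  # assumed cut-off for appending the label


class ScoringFunction:
    """Turns evaluation-pass results into score adjustments."""

    def __init__(self, pass_names, step=0.5):
        self.weights = {name: 0.0 for name in pass_names}
        self.step = step

    def adjust(self, temp_score, pass_name, matched):
        """Incrementally adjust the temporary score after one pass."""
        return temp_score + (self.weights[pass_name] if matched else 0.0)

    def train(self, results, predicted, actual):
        """Perceptron-style update when the label disagrees with the truth."""
        if predicted == actual:
            return
        direction = 1.0 if actual else -1.0
        for name, matched in results.items():
            if matched:
                self.weights[name] += direction * self.step


def page_number_test(string, scorer):
    """One categorization test: run every pass and accumulate the score."""
    temp_score, results = 0.0, {}
    for name, check in PASSES.items():
        results[name] = check(string)
        temp_score = scorer.adjust(temp_score, name, results[name])
    return temp_score, results


def append_labels(string, scorer):
    """Append the label when the character-string score passes the threshold."""
    score, _ = page_number_test(string, scorer)
    return ["page_number"] if score > THRESHOLD else []


def train(scorer, examples, epochs=5):
    """Train against known classifications (True means page number)."""
    for _ in range(epochs):
        for string, actual in examples:
            score, results = page_number_test(string, scorer)
            scorer.train(results, score > THRESHOLD, actual)


scorer = ScoringFunction(PASSES)
train(scorer, [
    ({"text": "15", "x": 0.9}, True),
    ({"text": "Introduction", "x": 0.1}, False),
    ({"text": "2", "x": 0.05}, False),
])
```

After training on these three toy examples, a right-aligned numeral scores above the threshold while a left-aligned numeral does not, which is the kind of distinction the combined shape and layout passes are meant to capture.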
Specification