PAGE CLASSIFIER ENGINE
First Claim
1. One or more computer-storage media having computer-executable instructions embodied thereon that, when executed, perform a method for classifying a page type of a portion of an electronic document, the method comprising:
receiving an OCR file associated with the portion of the electronic document, wherein the OCR file includes semantic information about text in the portion of the electronic document;
applying one or more features to the semantic information;
based on the application of the one or more features to the semantic information, determining the page type of the portion of the electronic document; and
storing an indication of the page type.
Abstract
Embodiments of the present invention relate to classifying pages of an electronic document, such as a scanned book page. OCR software is applied to the contents of the electronic document, revealing semantic information about that content. Software-based features are applied to the semantic information to determine the page type of the electronic document. Page types may include table of contents (TOC), table of figures (TOF), bibliography, index, or other types of pages commonly found in a book, magazine, or other publication. Once determined, the page type is stored and used by other software engines.
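The flow described in the abstract (semantic information in, features applied, page type determined and stored) can be illustrated with a minimal sketch. This is not the patented implementation: the OcrWord and OcrPage types, the is_heading hint, the two example features, the score threshold, and the in-memory page_type_store are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class OcrWord:
    """Hypothetical stand-in for one word of OCR semantic information."""
    text: str
    is_heading: bool = False  # assumed layout hint supplied by the OCR step


@dataclass
class OcrPage:
    words: List[OcrWord]


# A "feature" here is any function that scores a page (assumption).
Feature = Callable[[OcrPage], float]


def heading_keyword(keywords: Set[str]) -> Feature:
    """Return a feature that fires when a heading word matches a keyword."""
    def feature(page: OcrPage) -> float:
        hit = any(w.is_heading and w.text.lower() in keywords for w in page.words)
        return 1.0 if hit else 0.0
    return feature


def numeric_ratio(page: OcrPage) -> float:
    """Fraction of words that are purely numeric (e.g. page references)."""
    if not page.words:
        return 0.0
    return sum(w.text.isdigit() for w in page.words) / len(page.words)


# Illustrative feature sets per page type; not taken from the patent.
FEATURES: Dict[str, List[Feature]] = {
    "toc": [heading_keyword({"contents"}), numeric_ratio],
    "index": [heading_keyword({"index"})],
    "bibliography": [heading_keyword({"bibliography", "references"})],
}


def classify_page(page: OcrPage, threshold: float = 1.0) -> str:
    """Apply every feature and pick the best-scoring type, else 'other'."""
    scores = {ptype: sum(f(page) for f in feats) for ptype, feats in FEATURES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "other"


# Hypothetical store that other engines could read from.
page_type_store: Dict[int, str] = {}


def classify_and_store(page_number: int, page: OcrPage) -> str:
    """Determine the page type and store an indication of it."""
    page_type = classify_page(page)
    page_type_store[page_number] = page_type
    return page_type
```

In this sketch a page whose heading reads "Contents" and whose body contains page numbers scores highest for "toc", while a page with no matching heading falls through to "other". The additive scoring rule is only one possible way to combine features.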
20 Claims
1. One or more computer-storage media having computer-executable instructions embodied thereon that, when executed, perform a method for classifying a page type of a portion of an electronic document, the method comprising:
receiving an OCR file associated with the portion of the electronic document, wherein the OCR file includes semantic information about text in the portion of the electronic document;
applying one or more features to the semantic information;
based on the application of the one or more features to the semantic information, determining the page type of the portion of the electronic document; and
storing an indication of the page type.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
14. A method for determining a page type of a portion of an electronic document, comprising:
receiving an OCR file associated with the portion of the electronic document, wherein the OCR file includes semantic information about text in the portion of the electronic document;
analyzing a portion of the semantic information by applying one or more features to the semantic information;
based on the application of the one or more features, determining the page type of the portion of the electronic document; and
storing an indication of the page type.
Dependent claims: 15, 16, 17, 18
19. A computerized method for storing an indication that an electronic document is a table of contents (TOC) page type, comprising:
receiving semantic information extracted from the electronic document, the semantic information including an indication of the underlying meaning of at least one word;
applying one or more features to the semantic information;
determining that the electronic document is a TOC page type based on the underlying meaning of the at least one word; and
storing an indication indicating that the electronic document is a TOC page type.
Dependent claims: 20
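Claim 19 turns on semantic information that indicates the underlying meaning of at least one word. As a rough sketch of how such word-level meaning could drive a TOC determination, the example below assumes each word arrives tagged with a meaning label. The labels "title_word" and "page_number", the 50% line threshold, and the indications dictionary are hypothetical and are not drawn from the claims or the specification.

```python
from typing import Dict, List, Tuple

# Each line is a list of (word, meaning) pairs; the meaning labels are
# hypothetical examples of "underlying meaning" attached to a word.
Line = List[Tuple[str, str]]


def looks_like_toc_entry(line: Line) -> bool:
    """True if the line ends in a page number and contains a title word."""
    if len(line) < 2:
        return False
    *title, last = line
    ends_with_page = last[1] == "page_number"
    has_title = any(meaning == "title_word" for _, meaning in title)
    return ends_with_page and has_title


def is_toc_page(lines: List[Line], min_fraction: float = 0.5) -> bool:
    """Treat the page as TOC if enough lines look like TOC entries."""
    if not lines:
        return False
    hits = sum(looks_like_toc_entry(line) for line in lines)
    return hits / len(lines) >= min_fraction


# Hypothetical store for the stored indication.
indications: Dict[str, str] = {}


def store_if_toc(doc_id: str, lines: List[Line]) -> bool:
    """Store a TOC indication for the document when the check passes."""
    if is_toc_page(lines):
        indications[doc_id] = "TOC"
        return True
    return False
```

Here the decision rests on what the words mean (a title followed by a page reference) rather than on their literal text, which is one way to read the "underlying meaning" language of the claim.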
Specification