Page classifier engine
Abstract
Embodiments of the present invention relate to classifying pages of an electronic document, such as a scanned book page. OCR software is applied to the electronic document to reveal semantic information about its content. Software-based features are then applied to that semantic information to determine the page type. Page types may include table of contents (TOC), table of figures (TOF), bibliography, index, or other types of pages commonly found in a book, magazine, or other publication. Once determined, the page type is stored and used by other software engines.
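As a rough illustration of how software-based features might be applied to OCR output, the sketch below scores a page against a few page types using key phrases weighted by font size and location. The `OcrWord` record, the phrase lists, the weights, and the threshold are all hypothetical; the source does not specify any of them.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    """One OCR token with layout hints (hypothetical record layout)."""
    text: str
    font_size: float  # in points
    y: float          # vertical position: 0.0 = top of page, 1.0 = bottom


# Assumed key phrases per page type; illustrative values only.
KEY_PHRASES = {
    "TOC": ("contents",),
    "TOF": ("figures", "illustrations"),
    "bibliography": ("bibliography", "references"),
    "index": ("index",),
}


def score_page(words, body_font_size=10.0):
    """Return a score per page type; larger, higher-placed key phrases weigh more."""
    scores = {page_type: 0.0 for page_type in KEY_PHRASES}
    for word in words:
        token = word.text.lower().strip(".,:;")
        for page_type, phrases in KEY_PHRASES.items():
            if token in phrases:
                size_weight = word.font_size / body_font_size
                location_weight = 1.0 - word.y  # near the top counts more
                scores[page_type] += size_weight * (1.0 + location_weight)
    return scores


def classify_page(words, threshold=1.5):
    """Pick the best-scoring page type, or 'other' when no score is convincing."""
    scores = score_page(words)
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "other"
```

With these assumed weights, a large "Contents" heading near the top of a page clears the threshold, while the word "index" in body-sized text near the bottom of a page does not.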
15 Claims
1. A method for determining a page type of a first portion of an electronic document utilizing one or more second portions of the electronic document, comprising:
- receiving an OCR file associated with the electronic document, wherein the OCR file includes semantic information about text in the electronic document;
- analyzing the semantic information about the text in a first portion of the electronic document by applying one or more features to the semantic information, the one or more features comprising assigning a weight to the following: key phrases, a size of a word or phrase, font, a location of a word or phrase, one or more page numbers, and a repetition of a word or phrase in the first portion of the electronic document;
- using the semantic information about the text in the first portion of the electronic document to automatically reference a second portion of the electronic document;
- extracting semantic information about the text in the second portion of the electronic document;
- analyzing the semantic information about the text in the second portion of the electronic document by applying one or more features to the semantic information, the one or more features comprising determining that the second portion contains a page-type title, figure, or text similar to a page-type title, figure, or text in the first portion of the electronic document;
- determining the page type of the first portion of the electronic document based at least upon the application of the one or more features to the semantic information in the first portion of the electronic document and the application of the one or more features to the semantic information in the second portion of the electronic document; and
- storing an indication of the page type.
Dependent claims: 2, 3, 4, 5.
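The two-portion idea in claim 1 can be sketched for one page type. In the minimal Python example below, a candidate table-of-contents page (the first portion) is parsed for entries, each entry's page number is used to reference another page (the second portion), and the candidate is accepted only if enough referenced pages contain text similar to the entry title. The entry pattern, the similarity cutoff, the thresholds, and all function names are assumptions made for illustration, and the weighting of features in the first portion is reduced here to a simple entry count.

```python
import re
from difflib import SequenceMatcher

# Hypothetical entry shape: a title, spaces or dot leaders, then a page number.
ENTRY = re.compile(r"^(?P<title>.+?)[\s.]+(?P<page>\d+)$")


def parse_entries(lines):
    """Extract (title, page_number) pairs from lines that look like TOC entries."""
    entries = []
    for line in lines:
        match = ENTRY.match(line.strip())
        if match:
            entries.append((match.group("title"), int(match.group("page"))))
    return entries


def similar(a, b, cutoff=0.8):
    """Fuzzy comparison so that small OCR errors do not block a match."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= cutoff


def classify_with_references(document, first_page, min_entries=2, min_confirmed=0.5):
    """Classify one page using the pages it refers to.

    `document` maps a page number to that page's list of text lines.
    """
    entries = parse_entries(document[first_page])
    if len(entries) < min_entries:
        return "other"
    confirmed = 0
    for title, page in entries:
        second_portion = document.get(page, [])  # the referenced page, if present
        if any(similar(title, line.strip()) for line in second_portion):
            confirmed += 1
    return "TOC" if confirmed / len(entries) >= min_confirmed else "other"


# Storing an indication of the page type, here in a plain dictionary.
document = {
    1: ["Contents", "Introduction ..... 3", "Methods ..... 5"],
    3: ["Introduction", "This book describes ..."],
    5: ["Methods", "We begin with ..."],
}
page_types = {page: classify_with_references(document, page) for page in document}
```

Requiring confirmation from the referenced pages is what separates this from a single-page classifier: a page that merely contains lines ending in numbers is not labeled a table of contents unless the pages it points to carry matching titles.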
6. One or more computer-storage memory having computer-executable instructions embodied thereon that, when executed by a computing device, perform a method for classifying a page type of a first portion of an electronic document, the method comprising:
- receiving an OCR file associated with the electronic document, wherein the OCR file includes semantic information about text in the electronic document;
- analyzing the semantic information about the text in a first portion of the electronic document by applying one or more features to the semantic information, the one or more features comprising assigning a weight to the following: key phrases, a size of a word or phrase, font, a location of a word or phrase, one or more page numbers, and a repetition of a word or phrase in the first portion of the electronic document;
- using the semantic information about the text in the first portion of the electronic document to automatically reference a second portion of the electronic document;
- extracting semantic information about the text in the second portion of the electronic document;
- analyzing the semantic information about the text in the second portion of the electronic document by applying one or more features to the semantic information, the one or more features comprising determining that the second portion contains a page-type title, figure, or text similar to a page-type title, figure, or text in the first portion of the electronic document;
- determining the page type of the first portion of the electronic document based at least upon the application of the one or more features to the semantic information in the first portion of the electronic document and the application of the one or more features to the semantic information in the second portion of the electronic document; and
- storing an indication of the page type.
Dependent claims: 7, 8, 9, 10.
11. A system for classifying a page type of a first portion of an electronic document, the system comprising:
- a computing device associated with a page classifier engine having one or more processors and one or more computer-storage media; and
- a data store coupled with the page classifier engine, wherein the page classifier engine:
  - receives an OCR file associated with the electronic document, wherein the OCR file includes semantic information about text in the electronic document;
  - analyzes the semantic information about the text in a first portion of the electronic document by applying one or more features to the semantic information, the one or more features comprising assigning a weight to the following: key phrases, a size of a word or phrase, font, a location of a word or phrase, one or more page numbers, and a repetition of a word or phrase in the first portion of the electronic document;
  - uses the semantic information about the text in the first portion of the electronic document to automatically reference a second portion of the electronic document;
  - extracts semantic information about the text in the second portion of the electronic document;
  - analyzes the semantic information about the text in the second portion of the electronic document by applying one or more features to the semantic information, the one or more features comprising determining that the second portion contains a page-type title, figure, or text similar to a page-type title, figure, or text in the first portion of the electronic document;
  - determines the page type of the first portion of the electronic document based at least upon the application of the one or more features to the semantic information in the first portion of the electronic document and the application of the one or more features to the semantic information in the second portion of the electronic document; and
  - stores an indication of the page type.
Dependent claims: 12, 13, 14, 15.
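The system arrangement in claim 11, an engine coupled with a data store, might look like the following minimal sketch. An in-memory SQLite database stands in for the data store, and the classification logic is passed in as a plain function. The class name, table layout, and `toy_classifier` rule are hypothetical and chosen only to keep the example short.

```python
import sqlite3


class PageClassifierEngine:
    """Toy engine: classifies a page with a supplied function and persists the result."""

    def __init__(self, classify, db_path=":memory:"):
        self._classify = classify            # callable: list of text lines -> page type
        self._db = sqlite3.connect(db_path)  # stands in for the claimed data store
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS page_types ("
            "doc_id TEXT, page INTEGER, page_type TEXT, "
            "PRIMARY KEY (doc_id, page))"
        )

    def process(self, doc_id, page, lines):
        """Classify one page and store an indication of its type."""
        page_type = self._classify(lines)
        self._db.execute(
            "INSERT OR REPLACE INTO page_types VALUES (?, ?, ?)",
            (doc_id, page, page_type),
        )
        self._db.commit()
        return page_type

    def lookup(self, doc_id, page):
        """Let other components read a stored page type; None if not classified."""
        row = self._db.execute(
            "SELECT page_type FROM page_types WHERE doc_id = ? AND page = ?",
            (doc_id, page),
        ).fetchone()
        return row[0] if row else None


def toy_classifier(lines):
    """Placeholder rule: a page whose first line is 'Index' is an index page."""
    return "index" if lines and lines[0].strip().lower() == "index" else "other"
```

Keeping the stored page type behind a `lookup` call reflects the abstract's point that other software engines consume the result after classification.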
Specification