Determining section information of a digital volume
Abstract
A system, method, and computer program determine section information of a digital volume. Digital volumes include digital representations of human-readable content, such as digitized books. Phrases are extracted from a table of contents of a digital volume. Matching phrases that at least approximately match the extracted phrases are identified in the body of the digital volume. A best matching phrase is determined for each extracted phrase based on the ordering of the extracted phrases and the matching phrases, and based on match scores indicating the quality of the matches. Section information is generated, including section headings and section start locations based on the best matching phrases. The digital volume is presented to users with links from the table of contents to the section headings on the section start pages. The section information is also used to enhance searching of the digital volume by users.
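The approximate matching step described in the abstract can be illustrated with a minimal sketch. It assumes the body is available as a list of text lines and uses the standard-library `difflib.SequenceMatcher` ratio as a stand-in match score; the function names, the normalization, and the 0.8 threshold are hypothetical choices for illustration, not details taken from the patent.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so spacing noise matters less."""
    return " ".join(text.lower().split())


def find_candidates(toc_phrases, body_lines, threshold=0.8):
    """For each TOC phrase, list (line_index, score) pairs for body lines
    whose similarity meets the (assumed) threshold."""
    candidates = []
    for phrase in toc_phrases:
        target = normalize(phrase)
        hits = []
        for index, line in enumerate(body_lines):
            score = SequenceMatcher(None, target, normalize(line)).ratio()
            if score >= threshold:
                hits.append((index, score))
        candidates.append(hits)
    return candidates


toc = ["The Voyage Begins", "Storm at Sea"]
body = [
    "Contents",
    "THE V0YAGE BEGINS",  # simulated OCR error: letter O read as digit 0
    "It was a bright morning when we left port.",
    "STORM AT SEA",
    "The wind rose before dawn.",
]
print(find_candidates(toc, body))
```

Each phrase may yield several candidates (running headers, for example, often repeat a chapter title), which is why a separate selection step is needed to pick one match per phrase.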
27 Claims
1. A computer-implemented method for determining section information of a digital volume, the method comprising:
- determining pages of the digital volume containing a table of contents by applying a classifier to the digital volume, the classifier adapted to use machine learning to recognize references to sections in a body of the digital volume and identify pages of the digital volume containing the section references as the pages containing the table of contents; and generating a score estimating an accuracy of a classification of a page as containing the table of contents;
- extracting phrases from the table of contents of the digital volume;
- identifying matching phrases in the body of the digital volume, the matching phrases at least approximately matching the extracted phrases;
- determining best matching phrases from the identified matching phrases, the best matching phrases comprising a matching phrase corresponding to each extracted phrase, the determining based at least in part on the ordering of the extracted phrases and the identified matching phrases;
- generating section information, the section information comprising section headings and section start locations, the section headings comprising the best matching phrases, and the section start locations indicating starting locations of the sections in the digital volume, the section start locations comprising the locations of the best matching phrases in the digital volume; and
- storing the section information.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 26, 27
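The classifier element of claim 1 (a machine-learned classifier that flags table-of-contents pages and produces a score) might look roughly like the sketch below. The claim does not name a model or features, so everything here is an assumption: logistic regression is swapped in as a simple learned classifier, the two page features are hypothetical proxies for "references to sections", and the toy training pages are invented.

```python
import math
import re


def page_features(lines):
    """Hypothetical features: share of lines ending in a number, and share
    of lines containing dot leaders ("..")."""
    lines = [ln for ln in lines if ln.strip()]
    if not lines:
        return [0.0, 0.0]
    ends_in_number = sum(bool(re.search(r"\d+\s*$", ln)) for ln in lines)
    has_leaders = sum(".." in ln for ln in lines)
    return [ends_in_number / len(lines), has_leaders / len(lines)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(pages, labels, epochs=2000, rate=0.5):
    """Fit logistic regression weights by stochastic gradient descent."""
    weights, bias = [0.0, 0.0], 0.0
    samples = [page_features(p) for p in pages]
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            weights = [w - rate * error * v for w, v in zip(weights, x)]
            bias -= rate * error
    return weights, bias


def toc_score(model, lines):
    """Score in (0, 1): the model's estimate that the page is a TOC page."""
    weights, bias = model
    x = page_features(lines)
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


# Invented training pages: label 1 = table of contents, 0 = body text.
training_pages = [
    ["Chapter I .... 1", "Chapter II .... 15", "Chapter III .... 32"],
    ["It was a bright morning.", "We left port at nine."],
    ["The year was 1850", "and nothing stirred."],
    ["Preface 3", "Index 210"],
]
model = train(training_pages, [1, 0, 0, 1])
print(toc_score(model, ["Introduction .. 1", "Methods .. 9"]))
print(toc_score(model, ["No numbers here.", "None at all."]))
```

The returned probability plays the role of the score estimating classification accuracy; a real system would presumably train on many labeled volumes and richer features.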
9. A computer system for determining section information of a digital volume, the system comprising:
- a non-transitory computer-readable storage medium storing executable computer program modules comprising:
  - a classifier module for determining pages of the digital volume containing a table of contents by applying a classifier to the digital volume, the classifier adapted to use machine learning to recognize references to sections in a body of the digital volume and identify pages of the digital volume containing the section references as the pages containing the table of contents; and generating a score estimating an accuracy of a classification of a page as containing the table of contents;
  - a phrase extraction module for extracting phrases from the table of contents of the digital volume, the digital volume comprising a plurality of sections;
  - a phrase matching module for identifying matching phrases in the body of the digital volume, the matching phrases at least approximately matching the extracted phrases; and
  - a match selection module for:
    - determining best matching phrases from the identified matching phrases, the best matching phrases comprising a matching phrase corresponding to each extracted phrase, the determining based at least in part on the ordering of the extracted phrases and the identified matching phrases;
    - generating section information, the section information comprising section headings and section start locations, the section headings comprising the best matching phrases, and the section start locations indicating starting locations of the sections in the digital volume, the section start locations comprising the locations of the best matching phrases in the digital volume; and
    - storing the section information; and
- a processor for executing the computer program modules stored in the computer-readable storage medium.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
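The section information that the match selection module generates and stores pairs each heading with its start location. A minimal sketch of one possible representation follows; the `Section` record, its field names, the page-plus-offset location scheme, and the JSON serialization are all assumptions made for illustration, since the claim does not prescribe a storage format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Section:
    heading: str       # best matching phrase as it appears in the body
    start_page: int    # page on which that phrase was found (assumed scheme)
    start_offset: int  # character offset of the phrase within the page


def build_section_info(best_matches):
    """best_matches: (heading, page, offset) tuples in reading order."""
    return [Section(h, p, o) for h, p, o in best_matches]


def section_info_to_json(sections):
    """Serialize the section records so they can be stored or indexed."""
    return json.dumps([asdict(s) for s in sections], indent=2)


sections = build_section_info(
    [("THE VOYAGE BEGINS", 7, 0), ("STORM AT SEA", 31, 112)]
)
print(section_info_to_json(sections))
```

Stored records of this kind are what would let a viewer link table-of-contents entries to section start pages, as the abstract describes.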
17. A computer program product having a non-transitory computer-readable storage medium having executable computer program instructions recorded thereon, the computer program instructions when executed by a processor causing the computer to determine section information of a digital volume, comprising:
- a classifier module for determining pages of the digital volume containing a table of contents by applying a classifier to the digital volume, the classifier adapted to use machine learning to recognize references to sections in a body of the digital volume and identify pages of the digital volume containing the section references as the pages containing the table of contents; and generating a score estimating an accuracy of a classification of a page as containing the table of contents;
- a phrase extraction module for extracting phrases from the table of contents of the digital volume, the digital volume comprising a plurality of sections;
- a phrase matching module for identifying matching phrases in the body of the digital volume, the matching phrases at least approximately matching the extracted phrases; and
- a match selection module for:
  - determining best matching phrases from the identified matching phrases, the best matching phrases comprising a matching phrase corresponding to each extracted phrase, the determining based at least in part on the ordering of the extracted phrases and the identified matching phrases;
  - generating section information, the section information comprising section headings and section start locations, the section headings comprising the best matching phrases, and the section start locations indicating starting locations of the sections in the digital volume, the section start locations comprising the locations of the best matching phrases in the digital volume; and
  - storing the section information.
Dependent claims: 18, 19, 20, 21, 22, 23, 24
25. A computer-implemented method of determining a location of a section within a digital volume, the method comprising:
- determining a portion of the digital volume containing a table of contents by applying a classifier to the digital volume, the classifier adapted to use machine learning to recognize references to sections in a body of the digital volume and identify portions of the digital volume containing the section references as the portion containing the table of contents; and generating a score estimating an accuracy of a classification of a page as containing the table of contents;
- extracting ordered text phrases from the identified portion of the digital volume determined to contain the table of contents, the ordered text phrases referencing a plurality of sections in the body of the digital volume;
- identifying ordered text phrases in the body of the digital volume that at least approximately match the extracted text phrases;
- determining an identified text phrase that best matches an extracted text phrase responsive at least in part to a match score indicating a quality of a match between the identified text phrase and the extracted text phrase and an ordering constraint determined based on an order of the extracted text phrases and an order of the identified text phrases;
- generating section information indicating that a location of the identified best-matching text phrase is a starting location of a section in the body of the digital volume referenced by the extracted text phrase; and
- storing the section information.
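Claim 25 combines a match score with an ordering constraint when choosing the best match. One way to realize that combination, offered only as an assumed sketch, is a small dynamic program that picks at most one candidate per table-of-contents phrase so that the chosen body positions strictly increase and the summed score is maximal. The function name, the option to leave a phrase unmatched, and the sample scores are hypothetical.

```python
def select_best_matches(candidates):
    """candidates[i] is a list of (position, score) for the i-th TOC phrase.

    Returns one position per phrase (None if the phrase is left unmatched)
    such that chosen positions strictly increase and total score is maximal.
    """
    # Each state is (total_score, last_position, choices_so_far).
    states = [(0.0, -1, [])]
    for options in candidates:
        next_states = []
        for total, last, chosen in states:
            # Option 1: leave this phrase unmatched.
            next_states.append((total, last, chosen + [None]))
            # Option 2: take a candidate located after the previous match.
            for position, score in options:
                if position > last:
                    next_states.append(
                        (total + score, position, chosen + [position])
                    )
        # Only the best state per last position can lead to the optimum.
        best_by_last = {}
        for state in next_states:
            total, last, _ = state
            if last not in best_by_last or total > best_by_last[last][0]:
                best_by_last[last] = state
        states = list(best_by_last.values())
    return max(states, key=lambda s: s[0])[2]


candidates = [
    [(40, 0.95), (3, 0.90)],   # phrase 1: later repeat scores slightly higher
    [(25, 0.92)],              # phrase 2
    [(60, 0.88), (10, 0.97)],  # phrase 3: best raw score is out of order
]
print(select_best_matches(candidates))
```

Picking the highest raw score per phrase would choose positions 40, 25, and 10, which violates the table-of-contents order; the ordering constraint instead selects the slightly lower-scoring but consistent positions.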
Specification