Identifying topics in a digital work
First Claim
1. One or more non-transitory computer-readable media maintaining instructions executable by one or more processors to perform operations comprising:
extracting text from a digital work;
identifying a plurality of noun phrases from the text extracted from the digital work;
searching a network accessible resource having a plurality of entries to identify a set of one or more entries in the network accessible resource that contain information related to at least one noun phrase of the plurality of noun phrases, wherein each noun phrase corresponding to an entry in the set of one or more entries is a candidate topic in a set of candidate topics;
ranking the candidate topics based, at least in part, on at least one of a number of incoming links or a number of outgoing links between each of the entries corresponding to the candidate topics;
excluding, from the set of candidate topics, one or more candidate topics ranked below a first threshold;
comparing a first term frequency-inverse document frequency (tf-idf) value with a second tf-idf value, wherein the first tf-idf value is determined with respect to the digital work for each candidate topic remaining in the set of candidate topics, and wherein the second tf-idf value is determined for the candidate topics with respect to a corpus of works;
excluding, from the set of candidate topics, one or more candidate topics for which a difference between the first tf-idf value and the second tf-idf value is less than a second threshold;
generating a digital supplemental information file comprising at least one reference to supplemental information relating to at least one candidate topic remaining in the set of candidate topics;
receiving a request for the digital supplemental information file from an electronic device; and
transmitting the digital supplemental information file to the electronic device, the digital supplemental information file to cause the digital work to include at least one selectable portion that enables display of the at least one reference to supplemental information and a visual representation of at least a location in the digital work of each occurrence of the at least one candidate topic remaining in the set of candidate topics, wherein the visual representation comprises an object with markings corresponding to each occurrence.
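The ranking and first-threshold steps recited above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each candidate topic's entry has a known count of incoming and outgoing links, that the rank score is their sum, and that the threshold is a minimum score. The names `rank_by_links` and `exclude_below` and the link counts are hypothetical; the claim does not prescribe a scoring formula.

```python
from typing import Dict, List, Tuple

# Hypothetical link counts per entry: (incoming, outgoing). Illustrative values only.
LinkCounts = Dict[str, Tuple[int, int]]


def rank_by_links(link_counts: LinkCounts) -> List[Tuple[str, int]]:
    """Score each candidate topic by incoming + outgoing links, highest first.

    Summing the two counts is an assumption; the claim only requires that the
    ranking be based at least in part on one of them.
    """
    scored = [(topic, inc + out) for topic, (inc, out) in link_counts.items()]
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


def exclude_below(ranked: List[Tuple[str, int]], min_score: int) -> List[str]:
    """Keep only topics whose score meets the (assumed) minimum-score threshold."""
    return [topic for topic, score in ranked if score >= min_score]


links = {"Paris": (120, 45), "red door": (1, 2), "Napoleon": (300, 80)}
ranked = rank_by_links(links)
kept = exclude_below(ranked, min_score=10)
print(kept)
```

Here "red door" has an entry but very few links, so it falls below the assumed threshold and is excluded before the tf-idf comparison.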
Abstract
In some implementations, text is extracted from a digital work and a plurality of noun phrases are identified. The noun phrases are checked against a network accessible resource, such as an online encyclopedia, that includes a plurality of interlinked article entries. The noun phrases that have corresponding entries in the network accessible resource are included in a set of candidate topics. The candidate topics are ranked based, at least in part, on the links to and from each of the entries corresponding to the candidate topics. Candidate topics below a ranking threshold are removed from the set of candidate topics. Further, term frequency information for each candidate topic in relation to the digital work is compared against term frequency information for the candidate topic in a large corpus of textual works to remove candidate topics within a frequency difference threshold.
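The frequency comparison described in the abstract might look like the following sketch. It is not the patented implementation: it assumes single-word topics, a relative term frequency, a smoothed idf variant, and a corpus-side value computed from the mean term frequency across the corpus. The function names, the sample texts, and the threshold of 0.1 are all hypothetical.

```python
import math
from typing import List


def tf(term: str, tokens: List[str]) -> float:
    """Relative frequency of a term within one tokenized work."""
    return tokens.count(term) / len(tokens) if tokens else 0.0


def idf(term: str, corpus: List[List[str]]) -> float:
    """Smoothed inverse document frequency (an assumed variant)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log((1 + len(corpus)) / (1 + containing)) + 1.0


def filter_by_tfidf_difference(
    candidates: List[str],
    work: List[str],
    corpus: List[List[str]],
    threshold: float,
) -> List[str]:
    """Drop candidates whose work-level tf-idf is not sufficiently above the
    corpus-level tf-idf. Assumes a non-empty corpus."""
    kept = []
    for term in candidates:
        weight = idf(term, corpus)
        first = tf(term, work) * weight  # with respect to the digital work
        mean_tf = sum(tf(term, doc) for doc in corpus) / len(corpus)
        second = mean_tf * weight  # with respect to the corpus
        if first - second >= threshold:
            kept.append(term)
    return kept


work = "the whale chased the ship and the whale sank the ship".split()
corpus = [
    "the cat sat on the mat".split(),
    "the ship left the harbor".split(),
    "the dog chased the cat".split(),
]
kept = filter_by_tfidf_difference(["whale", "ship", "the"], work, corpus, 0.1)
print(kept)
```

A word such as "the" is about as frequent in the work as in the corpus, so its difference falls under the threshold and it is removed, while "whale" is distinctive to the work and is kept.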
25 Claims
6. A method comprising:
under control of one or more processors configured with executable instructions, searching a network accessible resource for at least one entry corresponding to at least one noun phrase obtained from a digital work;
identifying the at least one entry;
generating a set of candidate topics from the at least one noun phrase corresponding to the at least one entry identified;
for at least one candidate topic of the set of candidate topics:
comparing a first indication of a frequency of the at least one candidate topic in the digital work with a second indication of a frequency of the at least one candidate topic in a corpus of digital works, and
removing the at least one candidate topic from the set of candidate topics based, at least partly, on a difference between the first indication and the second indication being less than a threshold amount;
generating a digital supplemental information file comprising at least one reference to supplemental information relating to at least one candidate topic remaining in the set of candidate topics;
receiving a request for the digital supplemental information file from an electronic device; and
transmitting the digital supplemental information file to the electronic device, the digital supplemental information file to cause the digital work to include at least one selectable portion that enables display of the at least one reference to supplemental information and a visual representation of at least a location in the digital work of each occurrence of the at least one candidate topic remaining in the set of candidate topics, wherein the visual representation comprises an object with markings corresponding to each occurrence.
19. A system comprising:
one or more processors;
one or more computer-readable media; and
one or more modules maintained on the one or more computer-readable media to be executed by the one or more processors to perform operations including:
obtaining a plurality of noun phrases from a digital work;
searching a network accessible resource having a plurality of entries to identify a set of one or more entries that correspond to one or more noun phrases of the plurality of noun phrases;
generating a set of candidate topics from the one or more noun phrases;
removing, from the set of candidate topics, at least one candidate topic based, at least partly, on a difference between a first indication of a frequency of the at least one candidate topic in the digital work and a second indication of a frequency of the at least one candidate topic in a collection of digital works being within a threshold;
generating a digital supplemental information file comprising at least one reference to supplemental information relating to at least one candidate topic remaining in the set of candidate topics;
receiving a request for the digital supplemental information file from an electronic device; and
transmitting the digital supplemental information file to the electronic device, the digital supplemental information file to cause the digital work to include at least one selectable portion that enables display of the at least one reference to supplemental information and a visual representation of at least a location in the digital work of each occurrence of the at least one candidate topic remaining in the set of candidate topics, wherein the visual representation comprises an object with markings corresponding to each occurrence.
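The digital supplemental information file recited in the claims could, under some assumptions, be a small structured document that pairs each remaining topic with a reference and the locations of its occurrences. The sketch below assumes a JSON layout, character offsets as locations, and case-insensitive whole-word matching. The field names, the helper functions, and the example.org URL are hypothetical and are not taken from the patent.

```python
import json
import re
from typing import Dict, List


def find_occurrences(text: str, topic: str) -> List[int]:
    """Return the character offset of each whole-word occurrence of a topic."""
    pattern = re.compile(r"\b" + re.escape(topic) + r"\b", re.IGNORECASE)
    return [match.start() for match in pattern.finditer(text)]


def build_supplemental_file(text: str, references: Dict[str, str]) -> str:
    """Serialize topics, their references, and occurrence offsets as JSON.

    A reader application could scale each offset by the work length to place
    a marking on a bar-like object representing the whole work.
    """
    entries = [
        {
            "topic": topic,
            "reference": reference,
            "occurrences": find_occurrences(text, topic),
        }
        for topic, reference in references.items()
    ]
    return json.dumps({"work_length": len(text), "topics": entries}, indent=2)


text = "Ahab hunted the whale. The whale escaped Ahab."
payload = build_supplemental_file(text, {"whale": "https://example.org/whale"})
print(payload)
```

Storing offsets together with the total work length keeps the file independent of any particular display size, which is one possible way to support the per-occurrence markings the claims describe.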
Specification