Scalable mining of trending insights from text
First Claim
1. A method comprising:
storing, in an electronic data store, a plurality of digital documents;
accessing the electronic data store to identify a first plurality of topics in the plurality of digital documents;
determining a co-occurrence of each pair of topics in a plurality of pairs of topics in the first plurality of topics;
based on a deduplication threshold and the co-occurrence of each pair of topics in the plurality of pairs of topics, identifying a strict subset of the plurality of pairs of topics;
based on the strict subset of the plurality of pairs of topics, removing multiple topics from the first plurality of topics to identify a second plurality of topics that includes fewer topics than the first plurality of topics;
for each topic in the second plurality of topics:
determining one or more frequencies of said each topic, wherein determining the one or more frequencies comprises, for each time period of one or more time periods, determining a frequency of said each topic during said each time period;
determining a particular frequency of said each topic in a particular time period that is subsequent to the one or more time periods;
generating a trending score for said each topic based on the one or more frequencies and the particular frequency;
generating a ranking of the second plurality of topics based on the trending score for each topic in the second plurality of topics;
causing the second plurality of topics to be arranged on a screen of a computing device based on the ranking of the second plurality of topics;
wherein the method is performed by one or more computing devices.
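The co-occurrence and deduplication steps recited above can be illustrated with a minimal Python sketch. The claim does not define how co-occurrence is measured or which topic of a flagged pair is removed, so everything specific here is an assumption: co-occurrence is taken as the share of documents containing either topic that contain both, a pair is flagged when that share meets `threshold`, and the less frequent topic of a flagged pair is dropped. The name `deduplicate_topics` and its parameters are hypothetical.

```python
from collections import Counter
from itertools import combinations


def deduplicate_topics(doc_topics, threshold=0.8):
    """Return the topics that survive co-occurrence-based deduplication.

    doc_topics: list of sets, one set of topic strings per document.
    threshold:  assumed deduplication threshold on the co-occurrence ratio.
    """
    # Number of documents in which each topic appears.
    doc_freq = Counter(t for topics in doc_topics for t in topics)
    # Number of documents in which each unordered pair appears together.
    pair_freq = Counter(
        pair
        for topics in doc_topics
        for pair in combinations(sorted(topics), 2)
    )

    # Assumed measure: documents with both topics / documents with either.
    flagged = []
    for (a, b), both in pair_freq.items():
        either = doc_freq[a] + doc_freq[b] - both
        if both / either >= threshold:
            flagged.append((a, b))

    # Assumed removal rule: drop the less frequent topic of each flagged pair;
    # on a tie, drop the alphabetically later one.
    removed = set()
    for a, b in sorted(flagged):
        if a in removed or b in removed:
            continue
        removed.add(a if doc_freq[a] < doc_freq[b] else b)

    return set(doc_freq) - removed


docs = [
    {"machine learning", "ml"},
    {"machine learning", "ml"},
    {"machine learning", "ml", "hiring"},
    {"hiring"},
    {"machine learning"},
]
print(deduplicate_topics(docs, threshold=0.7))
```

In this toy corpus only the pair ("machine learning", "ml") meets the threshold, so "ml" is removed and the other two topics remain. Other co-occurrence measures, such as conditional probability or pointwise mutual information, could be substituted without changing the overall shape of the step.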
Abstract
A system and method for identifying trending topics in a document corpus are provided. First, multiple topics are identified, some of which may be filtered or removed based on co-occurrence. Then, for each remaining topic, a frequency of the topic in the document corpus is determined, one or more frequencies of the topic in one or more other document corpora are determined, and a trending score of the topic is generated based on the determined frequencies. Lastly, the remaining topics are ranked based on the generated trending scores.
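The scoring and ranking steps summarized above can be sketched as follows. The abstract does not give a formula for the trending score, so this example assumes one possible choice: the topic's frequency in the current corpus divided by its mean frequency in the earlier corpora, with additive smoothing so that topics with a zero baseline do not cause division by zero. The names `trending_score`, `rank_topics`, and `smoothing` are hypothetical, and frequencies are assumed to be raw counts.

```python
from statistics import mean


def trending_score(past_freqs, current_freq, smoothing=1.0):
    """Assumed score: smoothed ratio of current frequency to the past mean."""
    baseline = mean(past_freqs)
    return (current_freq + smoothing) / (baseline + smoothing)


def rank_topics(history, current, smoothing=1.0):
    """Return topics ordered from most to least trending.

    history: dict mapping topic -> list of frequencies in earlier corpora.
    current: dict mapping topic -> frequency in the current corpus.
    """
    scores = {
        topic: trending_score(past, current.get(topic, 0), smoothing)
        for topic, past in history.items()
    }
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))


history = {
    "remote work": [2, 4, 3],
    "hiring": [10, 10, 10],
    "layoffs": [5, 5, 5],
}
current = {"remote work": 11, "hiring": 10, "layoffs": 2}
print(rank_topics(history, current))
```

A ratio rewards topics that grow relative to their own baseline, so a small topic that triples can outrank a large topic that stays flat. A difference-based or z-score-based measure would be an equally valid reading of "based on the determined frequencies".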
20 Claims
1. A method comprising the steps recited under First Claim above. Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11.
12. A system comprising:
one or more processors;
one or more storage media storing instructions which, when executed by the one or more processors, cause:
storing, in a database, a plurality of digital documents;
accessing the database to identify a first plurality of topics within digital text of the plurality of digital documents;
determining a co-occurrence of each pair of topics in a plurality of pairs of topics in the first plurality of topics;
based on a deduplication threshold and the co-occurrence of each pair of topics in the plurality of pairs of topics, identifying a strict subset of the plurality of pairs of topics;
based on the strict subset of the plurality of pairs of topics, removing multiple topics from the first plurality of topics to identify a second plurality of topics that includes fewer topics than the first plurality of topics;
for each topic in the second plurality of topics:
determining one or more frequencies of said each topic, wherein determining the one or more frequencies comprises, for each time period of one or more time periods, determining a frequency of said each topic during said each time period;
determining a particular frequency of said each topic in a particular time period that is subsequent to the one or more time periods;
generating a trending score for said each topic based on the one or more frequencies and the particular frequency;
ranking the second plurality of topics based on the trending score for each topic in the second plurality of topics.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20.
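The per-period frequency steps in this claim can be illustrated with a short Python sketch. The claim does not say how documents are assigned to time periods or how frequency is counted, so this example assumes each document carries a date, a caller-supplied `period_of` function maps that date to a sortable period key, and frequency means the number of documents in the period that mention the topic. The function names are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date


def topic_frequencies_by_period(docs, period_of):
    """Count, per period, how many documents mention each topic.

    docs:      iterable of (date, set_of_topics) pairs.
    period_of: function mapping a date to a sortable period key.
    """
    counts = defaultdict(Counter)
    for day, topics in docs:
        counts[period_of(day)].update(topics)
    return counts


def split_history(counts, topic):
    """Return (frequencies in earlier periods, frequency in the latest period)."""
    *earlier, latest = sorted(counts)
    return [counts[p][topic] for p in earlier], counts[latest][topic]


docs = [
    (date(2024, 1, 5), {"hiring"}),
    (date(2024, 1, 20), {"hiring", "remote work"}),
    (date(2024, 2, 3), {"remote work"}),
    (date(2024, 3, 1), {"remote work"}),
    (date(2024, 3, 9), {"remote work", "hiring"}),
]
# Assumed bucketing: one period per calendar month.
counts = topic_frequencies_by_period(docs, lambda d: (d.year, d.month))
print(split_history(counts, "remote work"))
print(split_history(counts, "hiring"))
```

The latest period plays the role of the "particular time period that is subsequent to the one or more time periods", and the earlier periods supply the baseline frequencies from which a trending score would be computed.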
Specification