Scalable mining of trending insights from text

US 10,733,221 B2
Filed: 03/30/2016
Issued: 08/04/2020
Est. Priority Date: 03/30/2016
Status: Active Grant



Abstract

A system and method for identifying trending topics in a document corpus are provided. First, multiple topics are identified, some of which may be filtered or removed based on co-occurrence. Then, for each remaining topic, a frequency of the topic in the document corpus is determined, one or more frequencies of the topic in one or more other document corpora are determined, and a trending score of the topic is generated based on the determined frequencies. Lastly, the remaining topics are ranked based on the generated trending scores.
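
The flow described in the abstract (count each topic per corpus, score it, rank) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the substring-based topic matching, and the `+ 1` in the denominator are all assumptions made for the example, and topics are assumed to be given in lowercase.

```python
from collections import Counter
from statistics import mean


def topic_frequencies(corpus, topics):
    """Count the documents in `corpus` that mention each topic.

    Matching is a case-insensitive substring test (an assumption for this sketch).
    """
    counts = Counter()
    for doc in corpus:
        text = doc.lower()
        for topic in topics:
            if topic in text:
                counts[topic] += 1
    return counts


def rank_trending(past_corpora, current_corpus, topics):
    """Score each topic against its average past frequency and rank by score.

    `past_corpora` must contain at least one corpus.
    """
    past = [topic_frequencies(corpus, topics) for corpus in past_corpora]
    current = topic_frequencies(current_corpus, topics)
    scores = {}
    for topic in topics:
        baseline = mean(counts[topic] for counts in past)
        # Hypothetical scoring rule: relative change over the baseline.
        # The "+ 1" only guards against division by zero.
        scores[topic] = (current[topic] - baseline) / (baseline + 1)
    ranking = sorted(topics, key=lambda t: scores[t], reverse=True)
    return ranking, scores


past_corpora = [
    ["job search tips", "salary talk"],
    ["job search", "job search again"],
]
current_corpus = ["salary negotiation", "salary bands", "job search"]
ranking, scores = rank_trending(past_corpora, current_corpus, ["job search", "salary"])
print(ranking)
```

Here each corpus stands for one time period, so the last corpus plays the role of the most recent period being compared against the earlier ones.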


20 Claims

1. A method comprising:
- storing, in an electronic data store, a plurality of digital documents;
  
  accessing the electronic data store to identify a first plurality of topics in the plurality of digital documents;
  
  determining a co-occurrence of each pair of topics in a plurality of pairs of topics in the first plurality of topics;
  
  based on a deduplication threshold and the co-occurrence of each pair of topics in the plurality of pairs of topics, identifying a strict subset of the plurality of pairs of topics;
  
  based on the strict subset of the plurality of pairs of topics, removing multiple topics from the first plurality of topics to identify a second plurality of topics that includes fewer topics than the first plurality of topics;
  
  for each topic in the second plurality of topics:
  
  determining one or more frequencies of said each topic, wherein determining the one or more frequencies comprises, for each time period of one or more time periods, determining a frequency of said each topic during said each time period;
  
  determining a particular frequency of said each topic in a particular time period that is subsequent to the one or more time periods;
  
  generating a trending score for said each topic based on the one or more frequencies and the particular frequency;
  
  generating a ranking of the second plurality of topics based on the trending score for each topic in the second plurality of topics;
  
  causing the second plurality of topics to be arranged on a screen of a computing device based on the ranking of the second plurality of topics;
  
  wherein the method is performed by one or more computing devices.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
  - 2. The method of claim 1, further comprising:
    - storing a plurality of document corpora, wherein each document corpus of the plurality of document corpora is associated with a different time period of a plurality of time periods that includes the one or more time periods and the particular time period;
      
      for a first document corpus of the plurality of document corpora:
      
      analyzing the first document corpus to identify a first set of topics, and for each topic in the first set of topics, determining a number of instances, in the first document corpus, of said each topic;
      
      for a second document corpus of the plurality of document corpora:
      
      analyzing the second document corpus to identify a second set of topics, and for each topic in the second set of topics, determining a number of instances, in the second document corpus, of said each topic.
  - 3. The method of claim 1, wherein:
    - the one or more periods are a plurality of periods;
      
      the one or more frequencies are a plurality of frequencies;
      
      each frequency in the plurality of frequencies corresponds to a different period of the plurality of periods;
      
      generating the trending score comprises generating the trending score based on each individual frequency in the plurality of frequencies and the particular frequency.
  - 4. The method of claim 3, wherein:
    - generating the trending score comprises calculating a difference between the particular frequency and an aggregation of the plurality of frequencies, wherein the aggregation involves computing an average or a median of multiple frequency-related values.
  - 5. The method of claim 4, wherein:
    - generating the trending score comprises calculating a ratio of the difference and the aggregation.
  - 6. The method of claim 1, wherein generating the trending score comprises:
    - selecting, based on the one or more frequencies, a smoother coefficient that reduces the sensitivity of a normalized difference between the particular frequency and a past frequency that is based on the one or more frequencies;
      
      generating the trending score based on the smoother coefficient and a difference between the particular frequency and the past frequency.
  - 7. The method of claim 6, wherein generating the trending score comprises:
    - for a first topic in the plurality of topics:
      
      determining one or more first frequencies of the first topic;
      
      determining a first current frequency of the first topic;
      
      selecting, based on the one or more first frequencies, a first smoother coefficient that reduces the sensitivity of a first normalized difference between the first current frequency and a first past frequency that is based on the one or more first frequencies;
      
      generating a first trending score based on the first smoother coefficient and a difference between the first current frequency and the first past frequency;
      
      for a second topic, in the plurality of topics, that is different than the first topic:
      
      determining one or more second frequencies of the second topic;
      
      determining a second current frequency of the second topic;
      
      selecting, based on the one or more second frequencies, a second smoother coefficient that is different than the first smoother coefficient that reduces the sensitivity of a second normalized difference between the second current frequency and a second past frequency that is based on the one or more second frequencies;
      
      generating a second trending score based on the second smoother coefficient and a difference between the second current frequency and the second past frequency.
  - 8. The method of claim 6, further comprising:
    - determining which topics in the plurality of topics were selected based on user input;
      
      based on the user input, adjusting a smoother function that generates the smoother coefficient.
  - 9. The method of claim 1, wherein determining the co-occurrence of pairs of topics in the first plurality of topics comprises limiting the determining to the same sentence, wherein a pair of topics co-occur only if both topics appear in the same sentence.
  - 10. The method of claim 1, wherein a document in the plurality of digital documents is a blog post, a comment on an online posting, or a tweet.
  - 11. The method of claim 1, further comprising:
    - for each topic of the first plurality of topics:
      
      storing, in a second electronic data store, in association with said each topic, (1) a list of document identifiers, each of which identifies a digital document in which said each topic was detected and (2) a list of section identifiers that correspond to the list of document identifiers and identifies a section, of one of the digital documents identified by a document identifier in the list, in which said each topic was detected;
      
      wherein determining the co-occurrence of each pair of topics in the plurality of pairs of topics in the first plurality of topics comprises, for each pair of topics in the plurality of pairs of topics:
      
      identifying a first document identifier and a first section identifier of a first topic in said each pair of topics;
      
      identifying a second document identifier and a second section identifier of a second topic in said each pair of topics;
      
      determining that the first topic and the second topic co-occur in a digital document in response to determining that the first document identifier matches the second document identifier and that the first section identifier matches the second section identifier.
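
The co-occurrence test of claim 11 and the threshold-based pruning of claim 1 can be illustrated with a short Python sketch. The claims do not say how co-occurrence is normalized or which topic of a flagged pair is dropped, so the overlap ratio, the "drop the topic seen in fewer places" rule, and all names below are assumptions for illustration only.

```python
from itertools import combinations


def cooccurrence(postings, a, b):
    """Number of (doc_id, section_id) locations where both topics were detected."""
    return len(postings[a] & postings[b])


def deduplicate(postings, threshold):
    """Flag highly co-occurring topic pairs and drop one topic from each.

    `postings` maps a topic to the set of (doc_id, section_id) pairs in which
    it was detected. Returns (kept_topics, flagged_pairs).
    """
    flagged = []
    for a, b in combinations(sorted(postings), 2):
        smaller = min(len(postings[a]), len(postings[b]))
        # Assumed normalization: shared locations over the rarer topic's total.
        if smaller and cooccurrence(postings, a, b) / smaller >= threshold:
            flagged.append((a, b))

    removed = set()
    for a, b in flagged:
        if a in removed or b in removed:
            continue
        # Assumed tie-break: drop the topic with fewer locations; on a tie, drop b.
        removed.add(a if len(postings[a]) < len(postings[b]) else b)

    kept = [topic for topic in sorted(postings) if topic not in removed]
    return kept, flagged


postings = {
    "machine learning": {(1, 0), (2, 1), (3, 0)},
    "ml": {(1, 0), (2, 1)},
    "hiring": {(3, 0), (4, 2)},
    "recruiting": {(4, 2)},
}
kept, flagged = deduplicate(postings, threshold=0.8)
print(kept)
print(flagged)
```

Two topics count as co-occurring only when both the document identifier and the section identifier match, which is why the postings are stored as pairs. If each section is a sentence, the same set intersection gives the sentence-level restriction of claim 9.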

12. A system comprising:
- one or more processors;
  
  one or more storage media storing instructions which, when executed by the one or more instructions, cause:
  
  storing, in a database, a plurality of digital documents;
  
  accessing the database to identify a first plurality of topics within digital text of the plurality of digital documents;
  
  determining a co-occurrence of each pair of topics in a plurality of pairs of topics in the first plurality of topics;
  
  based on a deduplication threshold and the co-occurrence of each pair of topics in the plurality of pairs of topics, identifying a strict subset of the plurality of pairs of topics;
  
  based on the strict subset of the plurality of pairs of topics, removing multiple topics from the first plurality of topics to identify a second plurality of topics that includes fewer topics than the first plurality of topics;
  
  for each topic in the second plurality of topics:
  
  determining one or more frequencies of said each topic, wherein determining the one or more frequencies comprises, for each time period of one or more time periods, determining a frequency of said each topic during said each time period;
  
  determining a particular frequency of said each topic in a particular time period that is subsequent to the one or more time periods;
  
  generating a trending score for said each topic based on the one or more frequencies and the particular frequency;
  
  ranking the second plurality of topics based on the trending score for each topic in the second plurality of topics.
- Dependent Claims (13, 14, 15, 16, 17, 18, 19, 20)
  - 13. The system of claim 12, wherein the instructions, when executed by the one or more processors, further cause:
    - storing a plurality of document corpora, wherein each document corpus of the plurality of document corpora is associated with a different time period of a plurality of time periods that includes the one or more time periods and the particular time period;
      
      for a first document corpus of the plurality of document corpora:
      
      analyzing the first document corpus to identify a first set of topics, and for each topic in the first set of topics, determining a number of instances, in the first document corpus, of said each topic;
      
      for a second document corpus of the plurality of document corpora:
      
      analyzing the second document corpus to identify a second set of topics, and for each topic in the second set of topics, determining a number of instances, in the second document corpus, of said each topic.
  - 14. The system of claim 12, wherein:
    - the one or more periods are a plurality of periods;
      
      the one or more frequencies are a plurality of frequencies;
      
      each frequency in the plurality of frequencies corresponds to a different period of the plurality of periods;
      
      generating the trending score comprises generating the trending score based on each individual frequency in the plurality of frequencies and the particular frequency.
  - 15. The system of claim 14, wherein:
    - generating the trending score comprises calculating a difference between the particular frequency and an aggregation of the plurality of frequencies, wherein the aggregation involves computing an average or a median of multiple frequency-related values.
  - 16. The system of claim 15, wherein:
    - generating the trending score comprises calculating a ratio of the difference and the aggregation.
  - 17. The system of claim 12, wherein generating the trending score comprises:
    - selecting, based on the one or more frequencies, a smoother coefficient that reduces the sensitivity of a normalized difference between the particular frequency and the one or more frequencies;
      
      generating the trending score based on the smoother coefficient.
  - 18. The system of claim 17, wherein generating the trending score comprises:
    - for a first topic in the second plurality of topics:
      
      determining one or more first frequencies of the first topic;
      
      determining a first frequency of the first topic;
      
      selecting, based on the one or more first frequencies, a first smoother coefficient;
      
      generating a first trending score based on the one or more first frequencies, the first frequency, and the first smoother coefficient;
      
      for a second topic, in the second plurality of topics, that is different than the first topic:
      
      determining one or more second frequencies of the second topic;
      
      determining a second frequency of the second topic;
      
      selecting, based on the one or more second frequencies, a second smoother coefficient that is different than the first smoother coefficient;
      
      generating a second trending score based on the one or more second frequencies, the second frequency, and the second smoother coefficient.
  - 19. The system of claim 17, wherein the instructions, when executed by the one or more processors, further cause:
    - determining which topics in the second plurality of topics were selected based on user input;
      
      based on the user input, adjusting a smoother function that generates the smoother coefficient.
  - 20. The system of claim 12, wherein determining the co-occurrence of pairs of topics in the first plurality of topics comprises limiting the determining to the same sentence, wherein a pair of topics co-occur only if both topics appear in the same sentence.
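
The smoothed scoring of claims 6 and 7 (and 17 and 18) can be sketched as below. The claims do not give a formula, so everything here is an assumption: the median as the "past frequency", the step-shaped `default_smoother`, its cutoff and coefficient values, and the placement of the coefficient in the denominator are illustrative choices only.

```python
from statistics import median


def default_smoother(past_frequency):
    """Hypothetical smoother function: rare topics get a larger coefficient."""
    return 10.0 if past_frequency < 5 else 1.0


def smoothed_trending_score(history, current, smoother=default_smoother):
    """Normalized difference between current and past frequency, damped by a
    per-topic smoother coefficient selected from the topic's own history."""
    past = median(history)
    coefficient = smoother(past)
    # A larger coefficient makes the ratio less sensitive to small absolute
    # changes, so a jump from 1 to 4 mentions does not outrank 110 to 220.
    return (current - past) / (past + coefficient)


rare = smoothed_trending_score([1, 1, 2], current=4)
common = smoothed_trending_score([100, 120, 110], current=220)
print(rare, common)
```

Because the coefficient is selected from each topic's history, two topics can receive different coefficients, as claims 7 and 18 require. Adjusting the smoother function from user selections (claims 8 and 19) would amount to passing a different `smoother` callable and is not shown.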


Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Inventors
Zhang, Yongzheng; Zhao, Rui; Kuan, Chi-Yi; Zheng, Yi
Primary Examiner(s)
Beausoliel, Jr., Robert W
Assistant Examiner(s)
Rayyan, Susan F

Application Number
US15/085,714
Publication Number
US 20170286531A1
Time in Patent Office
1,588 Days
CPC Class Codes

G06F 16/3346   using probabilistic model

G06F 16/353   into predefined classes

G06F 16/93   Document management systems

G06F 16/951   Indexing; Web crawling tech...
