Cluster-based identification of news stories

US 9,116,995 B2
Filed: 03/29/2012
Issued: 08/25/2015
Est. Priority Date: 03/30/2011
Status: Active Grant


Abstract

Methods, systems, and techniques for cluster-based content recommendation are described. Some embodiments provide a content recommendation system (“CRS”) configured to recommend news stories about events or occurrences. In some embodiments, a news story about an event includes multiple related content items that each include an account of the event and that each reference one or more entities or categories that are represented by the CRS. In one embodiment, the CRS identifies news stories by generating clusters of related content items. Then, in response to a received query that indicates a keyterm, entity, or category, the CRS determines and provides indications of one or more news stories that are relevant to the received query. In some embodiments, at least some of these techniques are employed to implement a news story recommendation facility in an online news service.
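
The query step described above can be illustrated with a minimal sketch. It assumes, following claims 16 and 17, that relevance is scored by how many keyterms, entities, or categories a query shares with a story, with story size as a tie-breaker. The `Story` class, its fields, and the `relevant_stories` function are hypothetical names introduced for illustration; they are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class Story:
    """Hypothetical record for one identified news story."""
    title: str
    terms: set   # lower-cased keyterms, entities, and categories of the story
    size: int    # number of content items in the story's cluster


def relevant_stories(query_terms, stories):
    """Return stories sharing at least one term with the query, best match first.

    Ranking by overlap count, then by story size, is an assumption made for
    this sketch; the abstract only says relevant stories are determined.
    """
    query = {term.lower() for term in query_terms}
    hits = [(len(query & story.terms), story) for story in stories]
    hits = [(score, story) for score, story in hits if score > 0]
    hits.sort(key=lambda pair: (pair[0], pair[1].size), reverse=True)
    return [story for _, story in hits]


stories = [
    Story("Quake", {"quake", "city"}, 12),
    Story("Cup", {"cup", "final"}, 5),
    Story("Aftershock", {"quake"}, 3),
]
titles = [s.title for s in relevant_stories(["Quake", "City"], stories)]
print(titles)
```

A production system would likely weight terms by their importance to each story rather than counting raw overlap, as claim 17 suggests.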

147 Citations

21 Claims

1. A method in a content recommendation system, the method comprising:
- identifying a news story about an event, the news story including multiple related content items that each give an account of the event and that each reference multiple entities or categories that are each electronically represented by the content recommendation system, comprising:
  
  processing content items to determine semantic information that includes identified entities and relations between the identified entities;
  
  storing the identified entities and relations in a repository of the content recommendation system;
  
  generating a cluster that includes the multiple related content items, based at least in part on how many entities each of the multiple related content items has in common with one or more other of the multiple related content items, wherein generating the cluster includes:
  
  finding a candidate cluster of a plurality of clusters that is nearest to one of the multiple related content items by computing a cosine distance between a term vector that represents the one content item and a term vector that represents a content item of the candidate cluster; and
  
  determining whether the candidate cluster is a suitable cluster for the one content item, based on all of:
  
  cosine distances between the one content item and content items of the candidate cluster, a quantity of common keyterms between the one content item and content items of the candidate cluster, and on whether a sufficiently high percentage of content items of the candidate cluster have a cosine distance to the content item that is below a predetermined threshold;
  
  if the candidate cluster is determined to be a suitable cluster, adding the one content item to the candidate cluster; and
  
  if the candidate cluster is not determined to be a suitable cluster, creating a new cluster that includes the one content item as a seed; and
  
  storing an indication of the identified news story and the generated cluster.
- Dependent claims (2-18):
  - 2. The method of claim 1 wherein finding the candidate cluster that is nearest to one of the multiple related content items includes comparing the one content item to content items of the candidate cluster.
  - 3. The method of claim 1 wherein finding the candidate cluster that is nearest to one of the multiple related content items includes comparing the one content item to a centroid of the candidate cluster.
  - 4. The method of claim 1 wherein finding the candidate cluster includes computing a cosine distance between a term vector that represents the one content item and a term vector that represents a content item of the candidate cluster.
  - 5. The method of claim 1 wherein finding the candidate cluster includes finding a cluster that includes a content item that has a cosine distance to the one content item that is lower than cosine distances between the one content item and other content items of other clusters.
  - 6. The method of claim 1 wherein identifying the news story includes processing only content items published during a time interval that is about one day in length.
  - 7. The method of claim 1 wherein identifying the news story includes reassigning content items from clusters that are smaller than a specified size to clusters that are larger than the specified size.
  - 8. The method of claim 1 wherein identifying the news story includes merging two clusters into a single cluster when distances between centroids of the two clusters are lower than a specified threshold.
  - 9. The method of claim 1 wherein identifying the news story includes generating two or more sub-clusters of the generated cluster, each sub-cluster including one or more of the multiple related content items.
  - 10. The method of claim 9 wherein generating the two or more sub-clusters includes decomposing the multiple content items using a k-means process.
  - 11. The method of claim 9 wherein generating the two or more sub-clusters includes discarding a candidate sub-cluster if a distance measured between a centroid of the generated cluster and a centroid of the candidate sub-cluster is lower than a specified threshold.
  - 12. The method of claim 9 wherein generating the two or more sub-clusters includes retaining a candidate sub-cluster if a distance measured between a centroid of the generated cluster and a centroid of the candidate sub-cluster is greater than a specified threshold.
  - 13. The method of claim 1 wherein identifying the news story includes determining a representative content item for the news story by selecting one of the multiple related content items that is nearest to a centroid of the generated cluster.
  - 14. The method of claim 1 wherein storing the indication of the identified news story and the generated cluster includes storing an association between a keyterm, entity, or category and the generated cluster, along with an indicator of relevance of the keyterm, entity, or category to the generated cluster.
  - 15. The method of claim 1 wherein storing the indication of the identified news story and the generated cluster includes storing one or more of:
    - a representative content item for the identified news story;
      
      a representative image for the identified news story;
      
      a centroid of the generated cluster, the centroid including a vector of keyterms and/or entity identifiers;
      
      top categories for the identified news story;
      
      two or more sub-clusters for the identified news story;
      
      a growth rate of the generated cluster; and
      
      a date.
  - 16. The method of claim 1, further comprising:
    - receiving a search query that includes an indication of a keyterm, entity or category;
      
      selecting a news story from a plurality of news stories, the selecting based on how many keyterms, entities, or categories are in common between the received search query and the multiple content items of the selected news story; and
      
      transmitting an indication of the selected news story.
  - 17. The method of claim 16, further comprising:
    - selecting multiple news stories that are each relevant to the received search query; and
      
      sorting the multiple selected news stories based on one or more of:
      
      the number of content items in each news story, a rate of growth of the number of content items in each news story, an importance of the indicated keyterm, entity, or category to content items in each news story, an age of each news story.
  - 18. The method of claim 1, wherein processing the content items to determine semantic information includes determining keyterms, entities, and categories referenced by the content items, wherein the entities and categories are represented in a taxonomic hierarchy that is a graph of nodes connected to one another by links, wherein each node represents an entity or a category, and wherein each link represents a relation between a first entity or category and a second entity or category.
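
The cluster-generation steps of claim 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: term vectors are modeled as `collections.Counter` objects, the three threshold constants are hypothetical values (the claim says only "predetermined" and "sufficiently high"), and the way the three suitability criteria are combined is an assumption.

```python
import math
from collections import Counter


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 1 minus the cosine similarity of two sparse term vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return 1.0 if norm == 0 else 1.0 - dot / norm


# Hypothetical thresholds chosen for illustration only.
MAX_DISTANCE = 0.6         # an item counts as "close" below this distance
MIN_COMMON_KEYTERMS = 2    # required number of shared keyterms
MIN_CLOSE_FRACTION = 0.5   # required share of cluster members that are close


def is_suitable(item: Counter, cluster: list) -> bool:
    """Apply all three tests named in the claim to a candidate cluster."""
    distances = [cosine_distance(item, member) for member in cluster]
    close_fraction = sum(d < MAX_DISTANCE for d in distances) / len(distances)
    common = max(len(set(item) & set(member)) for member in cluster)
    return (min(distances) < MAX_DISTANCE
            and common >= MIN_COMMON_KEYTERMS
            and close_fraction >= MIN_CLOSE_FRACTION)


def assign(item: Counter, clusters: list) -> int:
    """Add item to the nearest suitable cluster, or seed a new cluster."""
    if clusters:
        nearest = min(
            range(len(clusters)),
            key=lambda i: min(cosine_distance(item, m) for m in clusters[i]),
        )
        if is_suitable(item, clusters[nearest]):
            clusters[nearest].append(item)
            return nearest
    clusters.append([item])  # the item becomes the seed of a new cluster
    return len(clusters) - 1


docs = [
    Counter("quake hits city rescue teams".split()),
    Counter("rescue teams search city after quake".split()),
    Counter("team wins cup final".split()),
]
clusters = []
labels = [assign(doc, clusters) for doc in docs]
print(labels)
```

Here the nearest cluster is found by comparing against individual content items, as in claims 2 and 5; comparing against a cluster centroid, as in claim 3, would avoid scanning every member.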

19. A computing system configured to recommend content, comprising:
- a memory;
  
  a module stored on the memory that is configured, when executed, to identify a news story about an event, the news story including multiple related content items that each give an account of the event and that each reference multiple entities or categories that are each electronically represented by the content recommendation system, by:
  
  processing content items to determine semantic information that includes identified entities and relations between the identified entities;
  
  storing the identified entities and relations in a repository of the content recommendation system;
  
  generating a cluster that includes the multiple related content items, based at least in part on how many entities each of the multiple related content items has in common with one or more other of the multiple related content items, wherein generating the cluster includes:
  
  finding a candidate cluster of a plurality of clusters that is nearest to one of the multiple related content items by computing a cosine distance between a term vector that represents the one content item and a term vector that represents a content item of the candidate cluster;
  
  determining whether the candidate cluster is a suitable cluster for the one content item, based on all of:
  
  cosine distances between the one content item and content items of the candidate cluster, a quantity of common keyterms between the one content item and content items of the candidate cluster, and on whether a sufficiently high percentage of content items of the candidate cluster have a cosine distance to the content item that is below a predetermined threshold;
  
  if the candidate cluster is determined to be a suitable cluster, adding the one content item to the candidate cluster; and
  
  if the candidate cluster is not determined to be a suitable cluster, creating a new cluster that includes the one content item as a seed; and
  
  storing an indication of the identified news story and the generated cluster.
- Dependent claims (20, 21):
  - 20. The computing system of claim 19 wherein the computing system is a mobile computing device and the module is a content recommendation module.
  - 21. The computing system of claim 19 wherein the module is configured to recommend news stories to at least one of a personal digital assistant, a smart phone, a laptop computer, a tablet computer, and/or a third-party application.
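
Two of the post-processing steps recited in the dependent method claims, merging clusters whose centroids are close (claim 8) and choosing the content item nearest the centroid as the representative (claim 13), can be sketched as below. The sketch assumes dense, equal-length term vectors stored as Python lists and a greedy single-pass merge; both are simplifications made for illustration, and the function names are hypothetical. It requires Python 3.8 or later for the multi-argument `math.hypot`.

```python
import math


def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return 1.0 if norm == 0 else 1.0 - dot / norm


def merge_close_clusters(clusters, threshold):
    """Greedily merge each cluster into the first earlier cluster whose
    centroid lies within `threshold`; otherwise keep it separate."""
    merged = []
    for cluster in clusters:
        for target in merged:
            if cosine_distance(centroid(target), centroid(cluster)) < threshold:
                target.extend(cluster)
                break
        else:
            merged.append(list(cluster))  # copy so the input is not mutated
    return merged


def representative(cluster):
    """Pick the member vector nearest to the cluster centroid."""
    center = centroid(cluster)
    return min(cluster, key=lambda vector: cosine_distance(vector, center))


clusters = [[[1.0, 0.0], [0.9, 0.1]], [[0.8, 0.2]], [[0.0, 1.0]]]
merged = merge_close_clusters(clusters, threshold=0.1)
print(len(merged), representative(merged[0]))
```

The first two clusters merge because their centroids point in nearly the same direction, while the third stays separate.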

Current Assignee: VCVC III LLC (Vulcan, Inc.)
Original Assignee: VCVC III LLC (Vulcan, Inc.)
Inventors: Koperski, Krzysztof; Bhatti, Satish; Liang, Jisheng; Klein, Adrian
Primary Examiner(s): Nguyen, Loan T
Application Number: US13/434,600
Publication Number: US 20120254188A1
Time in Patent Office: 1,244 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
- G06F 16/285   Clustering or classification
- G06F 16/3334   Selection or weighting of t...
- G06F 16/335   Filtering based on addition...
- G06F 16/35   Clustering; Classification
- G06F 16/353   into predefined classes
- G06F 16/355   Class or cluster creation o...
- G06F 16/38   Retrieval characterised by ...
- G06F 16/9535   Search customisation based ...
- G06F 16/9538   Presentation of query results
