System and method for clustering content according to similarity

US 8,548,969 B2
Filed: 09/30/2010
Issued: 10/01/2013
Est. Priority Date: 06/02/2010
Status: Active Grant

Abstract

Systems and methods for clustering content according to similarity are provided that identify and group similar content using a set of tags associated with the content. A topic model of a group of content is built, producing a probability distribution of topic membership for the content. Individual items of content are then clustered using a clustering algorithm, and a distance matrix from the probability distribution is built. Based on the distance matrix, individual items of content are labeled as “must-link” or “cannot-link” pairs with the group of content. The topic model is then embedded into successively smaller dimensions using a kernel method, until the clustering is stable with respect to both the behavioral and content domains.
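
The labeling step described in the abstract can be illustrated with a minimal sketch. It assumes each content item already has a topic-probability vector, uses the Hellinger distance as the distance measure, and applies two cutoffs (`must_below`, `cannot_above`). The distance measure, the cutoffs, and all function names are illustrative assumptions; the abstract does not specify them.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def distance_matrix(dists):
    """Full pairwise distance matrix over a list of topic distributions."""
    n = len(dists)
    return [[hellinger(dists[i], dists[j]) for j in range(n)] for i in range(n)]

def label_pairs(matrix, must_below=0.2, cannot_above=0.6):
    """Label close pairs as must-link and distant pairs as cannot-link.

    The two thresholds are hypothetical; pairs that fall between them stay unlabeled.
    """
    must, cannot = [], []
    n = len(matrix)
    for i in range(n):
        for j in range(i + 1, n):
            if matrix[i][j] <= must_below:
                must.append((i, j))
            elif matrix[i][j] >= cannot_above:
                cannot.append((i, j))
    return must, cannot

# Toy topic distributions for three items (made-up values).
topics = [
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.05, 0.05, 0.90],
]
must_link, cannot_link = label_pairs(distance_matrix(topics))
print(must_link)    # [(0, 1)]
print(cannot_link)  # [(0, 2), (1, 2)]
```

Leaving a band of unlabeled pairs between the two cutoffs keeps the constraint set small and limited to pairs where the distance signal is strong.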

15 Citations

18 Claims

1. A computer implemented method for clustering content according to similarity, the method comprising:
- receiving a set of features for a plurality of content items;
  
  calculating, by a processor, a distance matrix for the plurality of content items based on data indicating user behavior relative to at least some of the content items, wherein the data includes information associated with one or more users accessing at least one of the content items;
  
  labeling, by a processor, at least some of the content items as pairwise constraints based on the distance matrix; and
  
  creating, by a processor, a boosted cluster by incorporating the pairwise constraints into a clustering algorithm.
  - 2. The method of claim 1, wherein at least one of the plurality of content and the content items is a document.
  - 3. The method of claim 2, wherein the document is a web page.
  - 4. The method of claim 1, wherein the set of features is in the form of a topic model built using latent Dirichlet allocation (LDA).
  - 5. The method of claim 1, further comprising determining a probability distribution of topics for the plurality of content items.
  - 6. The method of claim 1, further comprising:
    - applying a pattern analysis to the boosted cluster; and
      
      modifying the boosted cluster based on relations identified by the pattern analysis.
  - 7. The method of claim 6, wherein the pattern analysis is applied using a kernel method.
  - 8. The method of claim 1, further comprising identifying topical tags from at least one of an activity log of a search query, markup language code, and text mining.
  - 9. The method of claim 1, wherein the data further includes information identifying at least one of the one or more users.
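
One way to picture the distance matrix of claim 1, which is computed from data about users accessing content items, is a Jaccard distance over the sets of users who accessed each item. This is a sketch under assumptions: the claim does not name a distance measure, and the `(user, item)` log format and all function names here are hypothetical.

```python
from collections import defaultdict

def users_by_item(access_log):
    """Group an access log of (user, item) records into item -> set of users."""
    seen = defaultdict(set)
    for user, item in access_log:
        seen[item].add(user)
    return seen

def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; returns 1.0 when neither item has any recorded access."""
    union = a | b
    if not union:
        return 1.0  # assumption: no shared evidence means maximally distant
    return 1.0 - len(a & b) / len(union)

def behavioral_distance_matrix(access_log, items):
    """Pairwise behavioral distances between items, with a zero diagonal."""
    seen = users_by_item(access_log)
    return [
        [0.0 if x == y else jaccard_distance(seen[x], seen[y]) for y in items]
        for x in items
    ]

# Toy log: items "a" and "b" share the same audience, "c" has a different one.
log = [("u1", "a"), ("u2", "a"), ("u1", "b"), ("u2", "b"), ("u3", "c")]
items = ["a", "b", "c"]
matrix = behavioral_distance_matrix(log, items)
print(matrix[0][1])  # 0.0
print(matrix[0][2])  # 1.0
```

Items accessed by the same users end up at distance 0 and items with disjoint audiences at distance 1, which gives a later labeling step a clear signal for pairwise constraints.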

10. A system for clustering content according to similarity, the system comprising:
- a processor configured to:
  
  receive a set of features for a plurality of content items;
  
  calculate a distance matrix for the plurality of content based on data indicating user behavior relative to at least some of the content items, wherein the data includes information associated with one or more users accessing at least one of the content items;
  
  label content items as a pairwise constraint based on the distance matrix; and
  
  create a boosted cluster by incorporating the pairwise constraint into a clustering algorithm; and
  
  a tangible computer readable media configured to store the boosted cluster.
  - 11. The system of claim 10, wherein at least one of the plurality of content items is a document.
  - 12. The system of claim 11, wherein the document is a web page.
  - 13. The system of claim 10, wherein the set of features is in the form of a topic model built using latent Dirichlet allocation (LDA).
  - 14. The system of claim 10, wherein the processor is further configured to determine a probability distribution of topics for the plurality of content items.
  - 15. The system of claim 10, wherein the processor is further configured to:
    - apply a pattern analysis to data points representing the boosted cluster; and
      
      modify the boosted cluster based on relations identified by the pattern analysis.
  - 16. The system of claim 15, wherein the pattern analysis is applied by the processor using a kernel method.
  - 17. The system of claim 10, wherein the processor is further configured to identify topical tags from at least one of an activity log of a search query, markup language code, and text mining.
  - 18. The system of claim 10, wherein the data further includes information identifying at least one of the one or more users.
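
The final step of the independent claims, incorporating pairwise constraints into a clustering algorithm, can be sketched as follows. The claims do not name a specific algorithm, so this example swaps in a COP-k-means-style assignment rule as a stand-in: each point goes to the nearest center that does not break a must-link or cannot-link constraint. The function names, the naive initialization, and the toy data are all assumptions.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def violates(i, cluster, assign, must, cannot):
    """True if putting point i into `cluster` breaks a constraint, given assignments so far."""
    for a, b in must:
        other = b if a == i else a if b == i else None
        if other is not None and assign.get(other, cluster) != cluster:
            return True
    for a, b in cannot:
        other = b if a == i else a if b == i else None
        if other is not None and assign.get(other) == cluster:
            return True
    return False

def cop_kmeans(points, k, must, cannot, iters=20):
    """k-means where each point takes the nearest center that satisfies all constraints."""
    centers = [list(p) for p in points[:k]]  # naive init (assumption): first k points
    assign = {}
    for _ in range(iters):
        new_assign = {}
        for i, p in enumerate(points):
            ranked = sorted(range(k), key=lambda c: sq_dist(p, centers[c]))
            for c in ranked:
                if not violates(i, c, new_assign, must, cannot):
                    new_assign[i] = c
                    break
            else:
                raise ValueError(f"no feasible cluster for point {i}")
        if new_assign == assign:
            break  # assignments stopped changing
        assign = new_assign
        for c in range(k):
            members = [points[i] for i in assign if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return [assign[i] for i in range(len(points))]

# Toy one-dimensional features; item 4 lies slightly closer to the first center.
points = [(0.0,), (5.0,), (1.0,), (4.0,), (2.4,)]
unconstrained = cop_kmeans(points, 2, must=[], cannot=[])
constrained = cop_kmeans(points, 2, must=[(4, 1)], cannot=[])
print(unconstrained)  # [0, 1, 0, 1, 0]
print(constrained)    # [0, 1, 0, 1, 1]
```

In the toy run, the must-link between items 4 and 1 moves item 4 out of the cluster its features alone would select, which is the effect the constraints are meant to have on a content-only clustering.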

Current Assignee: CBS Interactive Inc. (Paramount Global (f/k/a ViacomCBS Inc.))
Original Assignee: CBS Interactive Inc. (Paramount Global (f/k/a ViacomCBS Inc.))
Inventors: Rhinelander, Ned; Lyon, Clifford
Primary Examiner(s): Woo, Isaac M
Application Number: US12/895,075
Publication Number: US 20110302163A1
Time in Patent Office: 1,097 Days
Field of Search: 707/600-899
US Class Current: 707/706
CPC Class Codes:
G06F 16/22 Indexing; Data structures t...
G06F 16/285 Clustering or classification
