System and method for clustering content items from content feeds

US 20070226207A1
Filed: 03/27/2006
Published: 09/27/2007
Est. Priority Date: 03/27/2006
Status: Abandoned Application

Abstract

An improved system and method for clustering text or content described by text is provided. Each text in a set of texts may be represented as a dimensional vector of words. Singleton texts that may not be similar to another text may be excluded from the set of texts for clustering. Texts identified as good nearest neighbors may then be grouped in the same cluster. In addition, metadata describing content may be used for clustering items of aggregated content from content feeds. Metadata describing items of content from content feeds may be converted into a set of texts and texts identified as good nearest neighbors may then be clustered. Items of content feeds described by the clustered texts may then be similarly clustered. Any types of items of content that may be described by text may be clustered, including audio, images, video, multimedia content, and so forth.
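
As a rough illustration of representing each text as a dimensional vector of words, the sketch below builds a word-count vector and compares two vectors with a normalized dot-product. The tokenizer, the tiny stopword list, and the names `to_word_vector` and `normalized_dot_product` are assumptions made for this example and are not taken from the application.

```python
import math
import re
from collections import Counter

# Illustrative stopword list only; the source does not specify one.
STOPWORDS = frozenset({"a", "an", "the", "of", "and", "in", "to", "for"})


def to_word_vector(text: str) -> Counter:
    """Represent a text as a sparse vector: one dimension per distinct word."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def normalized_dot_product(u: Counter, v: Counter) -> float:
    """Dot-product of the two vectors after scaling each to unit length."""
    dot = sum(count * v[word] for word, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_u * norm_v)
```

Under these assumptions, two texts that share most of their words score close to 1, and texts with no words in common score 0, which is the kind of signal the abstract relies on to separate singleton texts from good nearest neighbors.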

20 Claims

1. A computer system for clustering content, comprising:
- content feeds, each having metadata that may be represented by text;
  
  a clustering engine for clustering a text representing metadata of a content feed in a cluster along with one or more other texts representing metadata of the content feeds determined to be a good nearest neighbor of the text; and
  
  a good nearest neighbor analyzer operably coupled to the clustering engine for determining the one or more other texts to be a good nearest neighbor of the text.
- Dependent claims (2, 3, 4):
  - 2. The system of claim 1 further comprising a content parser operably coupled to the content feeds for identifying the metadata that may be represented by text.
  - 3. The system of claim 1 further comprising a metadata converter operably coupled to the content parser for converting the metadata into text.
  - 4. A computer-readable medium having computer-executable components comprising the system of claim 1.
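
The content parser and metadata converter of claims 2 and 3 could look roughly like the following sketch. It assumes an RSS-style XML feed and treats the `title`, `description`, and `category` elements as the metadata of interest; the feed format, the tag selection, and the function names `parse_feed` and `metadata_to_text` are all hypothetical choices for illustration, since the claims do not name them.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Assumed metadata fields; the claims only say "metadata that may be
# represented by text".
METADATA_TAGS = ("title", "description", "category")


def parse_feed(xml_text: str) -> List[Dict[str, List[str]]]:
    """Content parser: collect the text metadata of every <item> in a feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        metadata: Dict[str, List[str]] = {}
        for tag in METADATA_TAGS:
            values = [
                el.text.strip()
                for el in item.findall(tag)
                if el.text and el.text.strip()
            ]
            if values:
                metadata[tag] = values
        items.append(metadata)
    return items


def metadata_to_text(metadata: Dict[str, List[str]]) -> str:
    """Metadata converter: flatten one item's metadata into a single text."""
    return " ".join(
        value for tag in METADATA_TAGS for value in metadata.get(tag, [])
    )
```

Each resulting text can then stand in for its content item during clustering, so non-text items such as audio or video are grouped by the words that describe them.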

5. A computer-implemented method for clustering content, comprising:
- converting metadata describing items of content from content feeds into a set of texts;
  
  determining at least one text in a set of texts to be a good nearest neighbor of an other text in the set of texts;
  
  clustering the other text in a cluster;
  
  clustering the at least one text determined to be the good nearest neighbor of the other text in the cluster; and
  
  outputting a cluster of items of content from the content feeds associated with the other text and the at least one text.
- Dependent claims (6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18):
  - 6. The method of claim 5 further comprising parsing each of the content feeds to identify metadata describing the items of content from content feeds.
  - 7. The method of claim 5 further comprising performing text preprocessing for each text in the set of texts.
  - 8. The method of claim 7 wherein performing text preprocessing for each text in the set of texts comprises removing stopwords from each text.
  - 9. The method of claim 5 wherein determining the at least one text in the set of texts to be the good nearest neighbor of the other text in the set of texts comprises:
    - representing the at least one text as a dimensional vector of words;
      
      representing the other text as a dimensional vector of words; and
      
      determining a dot-product of the dimensional vector of the other text and the dimensional vector of the at least one text.
  - 10. The method of claim 9 further comprising:
    - comparing the dot-product to a threshold; and
      
      if the dot-product exceeds the threshold, determining the at least one text to be the good nearest neighbor of the other text.
  - 11. The method of claim 9 further comprising:
    - comparing the dot-product to a similarity threshold;
      
      determining a measure of a number of words used both by the at least one text and the other text;
      
      comparing the measure of the number of words to an overlap threshold; and
      
      if the dot-product exceeds the similarity threshold and the measure of the number of words also exceeds the overlap threshold, determining the at least one text to be the good nearest neighbor of the other text.
  - 12. The method of claim 5 further comprising:
    - determining a text not to be similar to one or more texts in the set of texts; and
      
      excluding the text determined not to be similar from clustering with the set of texts.
  - 13. The method of claim 12 wherein determining the text not to be similar to the one or more texts in the set of texts comprises:
    - representing the text as a dimensional vector of words;
      
      representing each of the one or more texts as a dimensional vector of words; and
      
      for each of the one or more texts, determining a dot-product of the dimensional vector of the each of the one or more texts and the dimensional vector of the text.
  - 14. The method of claim 13 further comprising:
    - comparing the dot-product for each of the one or more texts to a similarity threshold;
      
      for each of the one or more texts, determining a measure of a number of words used both by the text and the each of the one or more texts;
      
      comparing the measure of the number of words for each of the one or more texts to an overlap threshold; and
      
      if the similarity threshold exceeds the dot-product for each of the one or more texts and the overlap threshold exceeds the measure of the number of words for each of the one or more texts, determining the text not to be similar to the one or more texts in the set of texts.
  - 15. The method of claim 5 further comprising determining a similarity matrix of values representing dot-products of normalized dimensional vectors of texts in the set of texts.
  - 16. The method of claim 5 further comprising determining an overlap matrix of values representing measures of words used in common by texts in the set of texts.
  - 17. The method of claim 5 wherein outputting the cluster of items of content from the content feeds associated with the other text and the at least one text comprises including the cluster of items of content in a web page for display as a group to a user.
  - 18. A computer-readable medium having computer-executable instructions for performing the method of claim 5.
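
The threshold tests of claims 10, 11, and 14 can be sketched as below, taking a precomputed similarity matrix and overlap matrix as input. Two points are assumptions rather than anything the claims state: the overlap measure is taken to be a count of shared words, and texts linked by a good-nearest-neighbor relation are merged with a simple union-find, which is only one plausible way to "cluster in the same cluster". The function names and default thresholds are likewise hypothetical.

```python
from typing import List, Sequence, Tuple


def is_good_neighbor(sim: float, ovl: float,
                     sim_threshold: float, overlap_threshold: float) -> bool:
    """Claim 11: both the dot-product and the overlap exceed their thresholds."""
    return sim > sim_threshold and ovl > overlap_threshold


def is_singleton(i: int, sim: Sequence[Sequence[float]],
                 ovl: Sequence[Sequence[float]],
                 sim_threshold: float, overlap_threshold: float) -> bool:
    """Claim 14: text i falls below both thresholds against every other text."""
    return all(
        sim[i][j] < sim_threshold and ovl[i][j] < overlap_threshold
        for j in range(len(sim)) if j != i
    )


def cluster(sim: Sequence[Sequence[float]], ovl: Sequence[Sequence[float]],
            sim_threshold: float = 0.5,
            overlap_threshold: float = 1) -> Tuple[List[List[int]], List[int]]:
    """Return (clusters of text indices, excluded singleton indices)."""
    n = len(sim)
    singletons = [i for i in range(n)
                  if is_singleton(i, sim, ovl, sim_threshold, overlap_threshold)]
    excluded = set(singletons)
    parent = list(range(n))

    def find(x: int) -> int:
        # Follow parent links to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Assumed grouping rule: merge any two texts that are good neighbors.
    for i in range(n):
        if i in excluded:
            continue
        for j in range(i + 1, n):
            if j not in excluded and is_good_neighbor(
                    sim[i][j], ovl[i][j], sim_threshold, overlap_threshold):
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        if i not in excluded:
            groups.setdefault(find(i), []).append(i)
    return sorted(groups.values()), singletons
```

Note that the two tests are not exact complements: a text that clears only one of the thresholds against some other text is neither a singleton nor a good neighbor, and in this sketch it ends up in a cluster of its own.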

19. A computer system for clustering content, comprising:
- means for converting metadata describing items of content from content feeds into a set of texts;
  
  means for determining at least one text in a set of texts to be a good nearest neighbor of an other text in the set of texts;
  
  means for clustering the other text in a cluster;
  
  means for clustering the at least one text determined to be the good nearest neighbor of the other text in the cluster; and
  
  means for outputting a cluster of items of content from the content feeds associated with the other text and the at least one text.
- Dependent claims (20):
  - 20. The computer system of claim 19 wherein means for determining the at least one text in the set of texts to be the good nearest neighbor of the other text in the set of texts comprises:
    - means for determining a similarity matrix of values representing dot-products of normalized dimensional vectors of texts in the set of texts; and
      
      means for determining an overlap matrix of values representing measures of words used in common by the texts in the set of texts.
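
One way to picture the two matrices named in claim 20 is the sketch below, which fills a similarity matrix with dot-products of normalized word-count vectors and an overlap matrix with counts of distinct shared words. Whitespace tokenization, the shared-word count as the overlap measure, and the name `build_matrices` are assumptions for illustration only.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def build_matrices(
    texts: Sequence[str],
) -> Tuple[List[List[float]], List[List[int]]]:
    """Return (similarity matrix, overlap matrix) for a set of texts."""
    vectors = [Counter(t.lower().split()) for t in texts]
    norms = [math.sqrt(sum(c * c for c in v.values())) for v in vectors]
    n = len(texts)
    similarity = [[0.0] * n for _ in range(n)]
    overlap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            shared = vectors[i].keys() & vectors[j].keys()
            dot = sum(vectors[i][w] * vectors[j][w] for w in shared)
            denom = norms[i] * norms[j]
            sim = dot / denom if denom else 0.0
            # Both matrices are symmetric, so fill both halves at once.
            similarity[i][j] = similarity[j][i] = sim
            overlap[i][j] = overlap[j][i] = len(shared)
    return similarity, overlap
```

Computing both matrices once up front means every pairwise threshold check becomes a lookup, at the cost of storage that grows with the square of the number of texts.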

Current Assignee: Oath Inc. (Verizon Communications Inc.)
Original Assignee: Yahoo! Inc. (Apollo Global Management, Inc.)
Inventors: Tawde, Vivek
Application Number: US11/389,999
Publication Number: US 20070226207A1
US Class Current: 1/1
CPC Class Codes: G06F 16/355 Class or cluster creation o...
