System and method for good nearest neighbor clustering of text

US 7,747,083 B2
Filed: 03/27/2006
Issued: 06/29/2010
Est. Priority Date: 03/27/2006
Status: Expired due to Fees

Abstract

An improved system and method for clustering text or content described by text is provided. Each text in a set of texts may be represented as a dimensional vector of words. Singleton texts that may not be similar to another text may be excluded from the set of texts for clustering. Texts identified as good nearest neighbors may then be grouped in the same cluster. In addition, metadata describing content may be used for clustering items of aggregated content from content feeds. Metadata describing items of content from content feeds may be converted into a set of texts and texts identified as good nearest neighbors may then be clustered. Items of content feeds described by the clustered texts may then be similarly clustered. Any types of items of content that may be described by text may be clustered, including audio, images, video, multimedia content, and so forth.
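
The feed-clustering flow described in the abstract (convert item metadata to texts, cluster the texts, then group the items the same way) can be sketched roughly as follows. This is only an illustration, not the patented implementation: the metadata fields `id`, `title`, `description`, and `tags`, the function names, the default threshold of 0.5, and the greedy single-pass grouping are all assumptions made for the example.

```python
import math
from collections import Counter


def metadata_to_text(item):
    """Flatten an item's metadata into one lowercase text.

    The field names "title", "description" and "tags" are hypothetical.
    """
    parts = [
        item.get("title", ""),
        item.get("description", ""),
        " ".join(item.get("tags", [])),
    ]
    return " ".join(p for p in parts if p).lower()


def cosine(text_a, text_b):
    """Cosine similarity between word-count vectors of two texts."""
    ca, cb = Counter(text_a.split()), Counter(text_b.split())
    num = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    den = math.sqrt(sum(c * c for c in ca.values())) * math.sqrt(
        sum(c * c for c in cb.values())
    )
    return num / den if den else 0.0


def cluster_items(items, threshold=0.5):
    """Group items whose metadata texts are similar; drop items left alone.

    Greedy single pass (an assumption): each text joins the first cluster
    containing a text it is similar to, otherwise it starts a new cluster.
    """
    texts = [metadata_to_text(it) for it in items]
    clusters = []  # each cluster is a list of indices into `items`
    for i, text in enumerate(texts):
        for cluster in clusters:
            if any(cosine(text, texts[j]) > threshold for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    # Items that ended up alone are treated as singletons and left out.
    return [[items[i]["id"] for i in c] for c in clusters if len(c) > 1]


items = [
    {"id": "video-1", "title": "Surfing big waves in Hawaii", "tags": ["surfing", "hawaii"]},
    {"id": "photo-7", "title": "Big waves Hawaii surfing contest", "tags": ["surfing"]},
    {"id": "audio-3", "title": "Jazz trio live recording", "tags": ["jazz"]},
]
print(cluster_items(items))
```

Because only the metadata text is compared, the same routine applies whether the underlying item is audio, an image, or video.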

17 Citations

10 Claims

1. A computer-implemented method for clustering text, comprising:
- representing at least one text in a set of texts as a dimensional vector of words;
  
  representing an other text in the set of texts as a dimensional vector of words;
  
  determining a dot-product of the dimensional vector of the other text and the dimensional vector of the at least one text;
  
  comparing the dot-product to a threshold, wherein the threshold comprises an upper bound of a value in a range from zero to one that represents a cosine similarity between the other text and the at least one text;
  
  if the dot-product exceeds the threshold, determining the at least one text to be the good nearest neighbor of the other text;
  
  clustering the other text in a cluster; and
  
  clustering the at least one text determined to be the good nearest neighbor of the other text in the cluster.
  - 2. The method of claim 1 further comprising:
    - comparing the dot-product to a similarity threshold;
      
      determining a measure of a number of words used both by the at least one text and the other text;
      
      comparing the measure of the number of words to an overlap threshold; and
      
      if the dot-product exceeds the similarity threshold and the measure of the number of words also exceeds the overlap threshold, determining the at least one text to be the good nearest neighbor of the other text.
  - 3. The method of claim 2 wherein the similarity threshold comprises a lower bound of a value in a range from zero to one that represents a cosine similarity between the other text and the at least one text.
  - 4. The method of claim 2 wherein the overlap threshold comprises a value in a range from zero to one that represents a measure of words used in common by the other text and the at least one text.
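
As a rough illustration of the test in claims 1 through 4, the sketch below represents each text as a unit-length word-count vector, so that the dot-product equals the cosine similarity, and then checks both a similarity threshold and an overlap threshold. The claims do not spell out how the overlap measure is computed; the ratio used here (shared distinct words divided by the smaller vocabulary) is an assumption, as are the function names and the default thresholds of 0.5.

```python
import math
from collections import Counter


def to_unit_vector(text):
    """Represent a text as a word-count vector scaled to unit length."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()} if norm else {}


def dot(u, v):
    """Dot-product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    return sum(x * v.get(w, 0.0) for w, x in u.items())


def word_overlap(u, v):
    """Assumed overlap measure in [0, 1]: shared words / smaller vocabulary."""
    if not u or not v:
        return 0.0
    return len(u.keys() & v.keys()) / min(len(u), len(v))


def is_good_nearest_neighbor(text_a, text_b, sim_threshold=0.5, overlap_threshold=0.5):
    """True if both the dot-product and the overlap exceed their thresholds."""
    u, v = to_unit_vector(text_a), to_unit_vector(text_b)
    return dot(u, v) > sim_threshold and word_overlap(u, v) > overlap_threshold


print(is_good_nearest_neighbor("yahoo buys photo site", "yahoo buys photo sharing site"))
print(is_good_nearest_neighbor("yahoo buys photo site", "rain expected this weekend"))
```

Normalizing the vectors first keeps the dot-product in the range from zero to one for non-negative word counts, which is what lets it be compared directly against a cosine-similarity threshold.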

5. A computer-implemented method for clustering text, comprising:
- representing a text in a set of texts as a dimensional vector of words;
  
  representing each of one or more texts in the set of texts as a dimensional vector of words;
  
  for each of the one or more texts, determining a dot-product of the dimensional vector of the each of the one or more texts and the dimensional vector of the text;
  
  comparing the dot-product for each of the one or more texts to a similarity threshold;
  
  for each of the one or more texts, determining a measure of a number of words used both by the text and the each of the one or more texts;
  
  comparing the measure of the number of words for each of the one or more texts to an overlap threshold;
  
  if the similarity threshold exceeds the dot-product for each of the one or more texts and the overlap threshold exceeds the measure of the number of words for each of the one or more texts, determining the text not to be similar to the one or more texts in the set of texts;
  
  excluding the text determined not to be similar from clustering with the set of texts;
  
  determining at least one text of the one or more texts in the set of texts to be a good nearest neighbor of an other text of the one or more texts in the set of texts;
  
  clustering the other text in a cluster; and
  
  clustering the at least one text determined to be the good nearest neighbor of the other text in the cluster.
  - 6. The method of claim 5 wherein the similarity threshold comprises a lower bound of a value in a range from zero to one that represents a cosine similarity between the text and the each of the one or more texts.
  - 7. The method of claim 5 wherein the overlap threshold comprises a value in a range from zero to one that represents the measure of words used in common by the text and the each of the one or more texts.
  - 8. The method of claim 5 further comprising determining a similarity matrix of values representing dot-products of normalized dimensional vectors of texts in the set of texts.
  - 9. The method of claim 5 further comprising determining an overlap matrix of values representing measures of words used in common by texts in the set of texts.
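
The steps of claims 5 through 9 can be sketched as below: build a similarity matrix and an overlap matrix, exclude any text that falls below both thresholds against every other text, and cluster the rest. This is a minimal sketch under stated assumptions, not the patent's implementation. In particular, the overlap formula, the default thresholds, the function names, and the use of union-find to merge texts linked as good nearest neighbors are all choices made for the example.

```python
import math
from collections import Counter


def unit_vector(text):
    """Word-count vector scaled to unit length."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()} if norm else {}


def dot(u, v):
    return sum(x * v.get(w, 0.0) for w, x in u.items())


def overlap(u, v):
    """Assumed overlap measure: shared distinct words / smaller vocabulary."""
    if not u or not v:
        return 0.0
    return len(u.keys() & v.keys()) / min(len(u), len(v))


def pairwise(vectors, measure):
    """Square matrix of `measure` applied to every pair of vectors."""
    n = len(vectors)
    return [[measure(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


def cluster_texts(texts, sim_t=0.5, ov_t=0.5):
    vectors = [unit_vector(t) for t in texts]
    sim = pairwise(vectors, dot)      # similarity matrix (claim 8)
    ov = pairwise(vectors, overlap)   # overlap matrix (claim 9)
    n = len(texts)

    # A text is excluded when both thresholds exceed its values
    # against every other text in the set.
    singletons = [
        i for i in range(n)
        if all(sim_t > sim[i][j] and ov_t > ov[i][j] for j in range(n) if j != i)
    ]
    kept = [i for i in range(n) if i not in singletons]

    # Union-find (an assumption) merges texts linked as good nearest neighbors.
    parent = {i: i for i in kept}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a in kept:
        for b in kept:
            if a < b and sim[a][b] > sim_t and ov[a][b] > ov_t:
                parent[find(a)] = find(b)

    groups = {}
    for i in kept:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values()), singletons


texts = [
    "yahoo buys photo site",
    "yahoo buys photo sharing site",
    "rain expected this weekend",
    "heavy rain expected this weekend",
    "stock markets close higher",
]
clusters, excluded = cluster_texts(texts)
print(clusters, excluded)
```

In this sketch a kept text that passes only one of the two thresholds against its neighbors ends up in a group of its own, since it is neither excluded as a singleton nor linked to a good nearest neighbor.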

10. A computer-readable storage medium having computer-executable instructions for performing the steps of:
- representing at least one text in a set of texts as a dimensional vector of words;
  
  representing an other text in the set of texts as a dimensional vector of words;
  
  determining a dot-product of the dimensional vector of the other text and the dimensional vector of the at least one text;
  
  comparing the dot-product to a threshold, wherein the threshold comprises an upper bound of a value in a range from zero to one that represents a cosine similarity between the other text and the at least one text;
  
  if the dot-product exceeds the threshold, determining the at least one text to be the good nearest neighbor of the other text;
  
  clustering the other text in a cluster; and
  
  clustering the at least one text determined to be the good nearest neighbor of the other text in the cluster.

Current Assignee: R2 Solutions LLC (Acacia Research Corporation)
Original Assignee: Yahoo! Inc. (Apollo Global Management, Inc.)
Inventors: Tawde, Vivek B.
Primary Examiner(s): Mariam; Daniel G
Application Number: US11/390,001
Publication Number: US 20070244874A1
Time in Patent Office: 1,555 Days
Field of Search: 382/209, 382/218, 382/224, 382/225, 382/305, 358/403, 707/1-10
US Class Current: 382/225
CPC Class Codes: G06F 16/355 Class or cluster creation o...
