MERGING SEMANTICALLY SIMILAR CLUSTERS BASED ON CLUSTER LABELS

US 20150161248A1
Filed: 11/08/2013
Published: 06/11/2015
Est. Priority Date: 09/30/2011
Status: Active Grant

Abstract

A server device may receive first label information regarding a first cluster that includes information identifying a first set of documents, where the first label information regarding the first cluster includes a first set of labels that are associated with the first cluster, and second label information regarding a second cluster that includes information identifying a second set of documents, where the second label information regarding the second cluster includes a second set of labels that are associated with the second cluster, where the second set of documents is different from the first set of documents. The server device may determine that the first and second clusters are semantically similar, which may include determining whether a similarity of the first and second clusters is above a similarity threshold. The server device may also form a merged cluster by merging the first and second clusters. The server device may further determine one or more labels for the merged cluster. Furthermore, the server device may assign the one or more labels to the merged cluster.

44 Claims

1-24. (canceled)

25. A method comprising:
- forming, by one or more devices, a merged cluster of documents based on a first cluster of documents and a second cluster of documents;
  
  identifying, by the one or more devices, one or more labels that are associated with at least one of the first cluster of documents or the second cluster of documents;
  
  identifying, by the one or more devices, a first confidence score, associated with a particular label of the one or more labels, with respect to the first cluster of documents;
  
  identifying, by the one or more devices, a second confidence score, associated with the particular label, with respect to the second cluster of documents;
  
  determining, by the one or more devices, an overall confidence score based on the first confidence score and the second confidence score; and
  
  selectively assigning, by the one or more devices, the particular label to the merged cluster based on the overall confidence score.
- - 26. The method of claim 25, further comprising:
    - receiving first information regarding the first cluster of documents; and
      
      receiving second information regarding the second cluster of documents,
      
      identifying the one or more labels including:
      
      identifying the one or more labels based on the first information and the second information.
  - 27. The method of claim 25, where the merged cluster of documents includes all documents in the first cluster of documents and all documents in the second cluster of documents.
  - 28. The method of claim 25, further comprising:
    - identifying a set of labels that is associated with the first cluster of documents; and
      
      discarding one or more particular labels from the set of labels,
      
      identifying the one or more labels including:
      
      identifying the one or more labels based on the set of labels after discarding the one or more particular labels from the set of labels.
  - 29. The method of claim 28, where discarding the one or more particular labels from the set of labels includes:
    - determining that the one or more particular labels are misspelled or that the one or more particular labels are included in a particular list of words, and
      
      discarding the one or more particular labels based on determining that the one or more particular labels are misspelled or that the one or more particular labels are included in the particular list of words.
  - 30. The method of claim 25, where forming the merged cluster comprises:
    - determining that the first cluster of documents and the second cluster of documents are semantically similar; and
      
      forming the merged cluster based on determining that the first cluster of documents and the second cluster of documents are semantically similar.
  - 31. The method of claim 30, where determining that the first cluster of documents and the second cluster of documents are semantically similar comprises:
    - determining a first vector for first labels that are associated with the first cluster of documents,
      
      determining a second vector for second labels that are associated with the second cluster of documents,
      
      determining a measure of similarity of the first vector and the second vector,
      
      determining that the measure of similarity satisfies a particular threshold, and
      
      determining that the first cluster of documents and the second cluster of documents are semantically similar based on determining that the measure of similarity satisfies the particular threshold, and
      
      where the one or more labels include at least one of a group of the first labels or a group of the second labels.
  - 32. The method of claim 30, where the one or more labels include at least one of:
    - a group of first labels that are associated with the first cluster of documents, or
      
      a group of second labels that are associated with the second cluster of documents, and
      
      where determining that the first cluster of documents and the second cluster of documents are semantically similar comprises:
      
      determining that the first labels are a subset of the second labels, and
      
      determining that the first cluster of documents and the second cluster of documents are semantically similar based on determining that the first labels are the subset of the second labels.
  - 33. The method of claim 25, further comprising:
    - providing the merged cluster to a cluster repository server after selectively assigning the particular label to the merged cluster,
      
      the cluster repository server storing information regarding the merged cluster, and
      
      the cluster repository server deleting information regarding the first cluster of documents and the second cluster of documents based on storing the information regarding the merged cluster.
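
The label handling recited in claims 27 through 32 can be illustrated with a minimal Python sketch. The names `Cluster`, `discard_labels`, `labels_are_subset`, and `merge` are hypothetical and do not come from the patent. Treating any label that is absent from a dictionary as "misspelled", and requiring a non-empty label set for the subset test, are assumptions made only for this illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical cluster record: document IDs plus a set of labels."""
    documents: set[str]
    labels: set[str] = field(default_factory=set)


def discard_labels(labels: set[str], stop_words: set[str],
                   dictionary: set[str]) -> set[str]:
    """Drop labels on a word list or judged misspelled (claims 28-29).

    Assumption: a label counts as misspelled when it is not in `dictionary`.
    """
    return {lab for lab in labels
            if lab not in stop_words and lab in dictionary}


def labels_are_subset(first: Cluster, second: Cluster) -> bool:
    """Subset test of claim 32: first labels are a subset of second labels.

    Assumption: an empty label set is not treated as similar to anything.
    """
    return bool(first.labels) and first.labels <= second.labels


def merge(first: Cluster, second: Cluster) -> Cluster:
    """Claim 27: the merged cluster holds all documents of both clusters."""
    return Cluster(documents=first.documents | second.documents)


# Usage with made-up data.
stop_words = {"the", "page"}
dictionary = {"jazz", "music", "concert", "the", "page"}

a = Cluster({"d1", "d2"},
            discard_labels({"jazz", "jaz", "the"}, stop_words, dictionary))
b = Cluster({"d3"}, {"jazz", "music"})

# "jaz" (not in the dictionary) and "the" (on the word list) are discarded,
# so a.labels is {"jazz"}, which is a subset of b.labels.
merged = merge(a, b) if labels_are_subset(a, b) else None
```

Labels for the merged cluster are deliberately left empty here, since claim 25 assigns them selectively based on confidence scores rather than copying them from either source cluster.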

34. A system comprising:
- one or more processors to:
  
  identify a label associated with at least one of a first cluster or a second cluster;
  
  identify a first confidence score, associated with the label, with respect to the first cluster;
  
  identify a second confidence score, associated with the label, with respect to the second cluster;
  
  determine an overall confidence score based on the first confidence score and the second confidence score; and
  
  selectively assign the label to a merged cluster based on the overall confidence score,
  
  the merged cluster including at least a group of the first cluster or the second cluster.
- - 35. The system of claim 34, where the one or more processors are further to:
    - form the merged cluster based on the first cluster and the second cluster.
  - 36. The system of claim 34, where, when identifying the first confidence score, the one or more processors are to:
    - determine that the label is not associated with the first cluster, and
      
      assign the first confidence score to the label, with respect to the first cluster, based on determining that the label is not associated with the first cluster.
  - 37. The system of claim 34, where, when determining the overall confidence score, the one or more processors are to:
    - identify a first weight associated with the first cluster,
      
      identify a second weight associated with the second cluster, and
      
      determine an overall confidence score based on the first confidence score, the first weight, the second confidence score, and the second weight.
  - 38. The system of claim 37, where, when identifying the first weight, the one or more processors are to:
    - determine a quantity of documents in the first cluster, and
      
      identify the first weight based on the quantity of documents.
  - 39. The system of claim 37, where, when identifying the first weight, the one or more processors are to:
    - determine a particular score based on one of a quantity of links to or from documents in the first cluster, an age of the documents, or traffic to or from the documents, and
      
      identify the first weight based on the particular score.
  - 40. The system of claim 34, where, when determining the overall confidence score, the one or more processors are to:
    - determine an average of the first confidence score and the second confidence score, and
      
      determine the overall confidence score based on the average.
  - 41. The system of claim 34, where, when selectively assigning the label to the merged cluster, the one or more processors are to:
    - determine that the overall confidence score satisfies a threshold, and
      
      assign the label to the merged cluster based on determining that the overall confidence score satisfies the threshold.
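
The confidence scoring of claims 34 through 41 can be sketched as a weighted average followed by a threshold test. This is only one possible reading: the claims do not fix a formula, so the weighted mean, the default score of 0.0 for a label missing from a cluster (claim 36), the document-count weights (claim 38), and the threshold value are all assumptions. The function names are hypothetical.

```python
def overall_confidence(label, first_scores, second_scores,
                       first_weight, second_weight, missing_score=0.0):
    """Combine per-cluster confidence scores for one label.

    Assumption: a label not associated with a cluster receives
    `missing_score` for that cluster (claim 36 does not give a value).
    """
    s1 = first_scores.get(label, missing_score)
    s2 = second_scores.get(label, missing_score)
    total = first_weight + second_weight
    if total == 0:
        return 0.0
    # Weighted mean (claim 37); equal weights reduce to the plain
    # average of claim 40.
    return (s1 * first_weight + s2 * second_weight) / total


def assign_labels(first_scores, second_scores,
                  first_docs, second_docs, threshold=0.5):
    """Return {label: overall score} for labels that meet the threshold."""
    # Assumption: each weight is the quantity of documents (claim 38).
    w1, w2 = len(first_docs), len(second_docs)
    assigned = {}
    for label in set(first_scores) | set(second_scores):
        score = overall_confidence(label, first_scores, second_scores, w1, w2)
        if score >= threshold:  # claim 41: score satisfies a threshold
            assigned[label] = score
    return assigned


# Usage with made-up scores: the first cluster has three documents,
# the second has one.
first_scores = {"jazz": 0.9, "blues": 0.4}
second_scores = {"jazz": 0.7}
result = assign_labels(first_scores, second_scores,
                       ["d1", "d2", "d3"], ["d4"])
# "jazz" scores about 0.85 and is kept; "blues" scores about 0.3
# and is dropped.
```

Weighting by document count lets the larger cluster dominate the merged label set. Claim 39 lists other possible bases for the weight, such as link counts, document age, or traffic.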

42. A non-transitory computer-readable medium storing instructions, the instructions comprising:
- one or more instructions that, when executed by at least one processor, cause the at least one processor to:
  
  determine a first vector for a first set of labels that are associated with a first cluster of documents;
  
  determine a second vector for a second set of labels that are associated with a second cluster of documents;
  
  determine a measure of similarity of the first vector and the second vector;
  
  determine that the measure of similarity satisfies a particular threshold;
  
  determine that the first cluster of documents and the second cluster of documents are semantically similar based on determining that the measure of similarity satisfies the particular threshold; and
  
  form a merged cluster based on determining that the first cluster of documents and the second cluster of documents are semantically similar.
- - 43. The non-transitory computer-readable medium of claim 42, where the instructions further comprise:
    - one or more instructions that, when executed by the at least one processor, cause the at least one processor to:
      
      identify one or more particular labels, in the first set of labels, that are misspelled or correspond to words in a particular list of words; and
      
      discard the one or more particular labels from the first set of labels.
  - 44. The non-transitory computer-readable medium of claim 42, where the instructions further comprise:
    - one or more instructions that, when executed by the at least one processor, cause the at least one processor to:
      
      identify a particular label of the first set of labels and the second set of labels;
      
      determine a confidence score for the particular label; and
      
      selectively assign the particular label to the merged cluster based on the confidence score.
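
The vector comparison of claim 42 can be sketched as follows. The claim recites only "a measure of similarity" between two label vectors; using bag-of-labels count vectors, cosine similarity, and a threshold of 0.6 are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def label_vector(labels):
    """Represent a set of labels as a sparse count vector (an assumption)."""
    return Counter(labels)


def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(v1[key] * v2[key] for key in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def merge_if_similar(first_docs, first_labels,
                     second_docs, second_labels, threshold=0.6):
    """Return the merged document set, or None if the clusters are not similar."""
    similarity = cosine_similarity(label_vector(first_labels),
                                   label_vector(second_labels))
    if similarity >= threshold:  # similarity satisfies the threshold
        return set(first_docs) | set(second_docs)
    return None


# Usage with made-up labels: two of three labels are shared, giving a
# similarity of roughly 0.67, which meets the assumed threshold.
merged = merge_if_similar({"d1", "d2"}, ["jazz", "music", "concert"],
                          {"d3"}, ["jazz", "music", "festival"])

# No shared labels: the similarity is 0.0 and the clusters stay separate.
not_merged = merge_if_similar({"d1"}, ["jazz"], {"d4"}, ["cooking"])
```

A real system might weight each vector component by the label's confidence score instead of a raw count; the claim language leaves that choice open.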

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: MAJKOWSKA, Anna Dagna
Granted Patent: US 9,336,301 B2
Field of Search
US Class Current: 1/1
CPC Class Codes:
- G06F 16/35   Clustering; Classification
- G06F 16/355   Class or cluster creation o...
- G06F 16/93   Document management systems
