CLASSIFYING DOCUMENTS BY CLUSTER

US 20160314184A1
Filed: 04/27/2015
Published: 10/27/2016
Est. Priority Date: 04/27/2015
Status: Abandoned Application

Abstract

Methods, apparatus, systems, and computer-readable media are provided for classifying, or “labeling,” documents such as emails en masse based on association with a cluster/template. In various implementations, a corpus of documents may be grouped into a plurality of disjoint clusters of documents based on one or more shared content attributes. A classification distribution associated with a first cluster of the plurality of clusters may be determined based on classifications assigned to individual documents of the first cluster. A classification distribution associated with a second cluster of the plurality of clusters may then be determined based at least in part on the classification distribution associated with the first cluster and a relationship between the first and second clusters.
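
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the application's implementation: the function names, the sparse-dict feature representation, and the linear blending rule in `transfer_distribution` are all assumptions made for illustration, and the similarity is assumed to lie in [0, 1].

```python
from collections import Counter
import math


def classification_distribution(labels):
    """Return {label: fraction} over the labels assigned to a cluster's documents."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def transfer_distribution(first_dist, similarity, prior):
    """Hypothetical rule: blend the first cluster's distribution with a prior.

    The closer the two clusters are (similarity near 1), the more the second
    cluster inherits from the first; otherwise it falls back to the prior.
    """
    labels = set(first_dist) | set(prior)
    return {
        label: similarity * first_dist.get(label, 0.0)
        + (1.0 - similarity) * prior.get(label, 0.0)
        for label in labels
    }
```

For example, a first cluster whose documents carry the labels `["travel", "travel", "receipt", "travel"]` has the distribution 75% travel and 25% receipt; a second cluster with similarity 0.5 to it and a uniform prior would receive 62.5% travel and 37.5% receipt under this blending rule.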

20 Claims

1. A computer-implemented method, comprising:
- grouping, by a computing system, a corpus of documents into a plurality of disjoint clusters of documents based on one or more shared content attributes;
  
  determining, by the computing system, a classification distribution associated with a first cluster of the plurality of clusters, the classification distribution associated with the first cluster being based on classifications assigned to individual documents of the first cluster; and
  
  calculating, by the computing system, a classification distribution associated with a second cluster of the plurality of clusters based at least in part on the classification distribution associated with the first cluster and a relationship between the first and second clusters.
  - 2. The computer-implemented method of claim 1, further comprising classifying, by the computing system, documents of the second cluster based on the classification distribution associated with the second cluster.
  - 3. The computer-implemented method of claim 1, further comprising generating, by the computing system, a graph of nodes, each node connected to one or more other nodes via one or more respective edges, each node representing a cluster and including some indication of one or more content attributes shared by documents of the cluster.
  - 4. The computer-implemented method of claim 3, wherein each edge connecting two nodes is weighted based on a relationship between clusters represented by the two nodes.
  - 5. The computer-implemented method of claim 4, further comprising determining the relationship between clusters represented by the two nodes using cosine similarity or Kullback-Leibler divergence.
  - 6. The computer-implemented method of claim 4, further comprising connecting each node to k nearest neighbor nodes using k edges, wherein the k nearest neighbor nodes have the k strongest relationships with the node, and k is a positive integer.
  - 7. The computer-implemented method of claim 6, wherein each node includes an indication of a classification distribution associated with a cluster represented by that node.
  - 8. The computer-implemented method of claim 7, further comprising altering a classification distribution associated with a particular cluster based on m classification distributions associated with m nodes connected to a particular node representing the particular cluster, wherein m is a positive integer less than or equal to k.
  - 9. The computer-implemented method of claim 8, wherein the altering is further based on m weights assigned to m edges connecting the m nodes to the particular node.
  - 10. The computer-implemented method of claim 1, further comprising calculating centroid vectors for available classifications of at least the classification distribution associated with the first cluster.
  - 11. The computer-implemented method of claim 10, further comprising calculating the classification distribution associated with the second cluster based on a relationship between the second cluster and at least one centroid vector.
  - 12. The computer-implemented method of claim 1, further comprising:
    - generating a first template associated with the first cluster based on one or more content attributes shared among documents of the first cluster; and
      
      generating a second template associated with the second cluster based on one or more content attributes shared among documents of the second cluster.
  - 13. The computer-implemented method of claim 12, wherein the classification distribution associated with the second cluster is further calculated based at least in part on a similarity between the first and second templates.
  - 14. The computer-implemented method of claim 13, further comprising determining the similarity between the first and second templates using cosine similarity or Kullback-Leibler divergence.
  - 15. The computer-implemented method of claim 12, wherein:
    - generating the first template comprises generating a first set of fixed text portions found in at least a threshold fraction of documents of the first cluster; and
      
      generating the second template comprises generating a second set of fixed text portions found in at least a threshold fraction of documents of the second cluster.
  - 16. The computer-implemented method of claim 12, wherein:
    - generating the first template comprises calculating a first set of topics based on content of documents of the first cluster; and
      
      generating the second template comprises calculating a second set of topics based on content of documents of the second cluster;
      
      wherein the first and second sets of topics are calculated using latent Dirichlet allocation.
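
The graph-based variant of claims 3 through 9 can be sketched roughly as follows. This is an illustrative reading, not the application's implementation: the names `build_knn_graph` and `propagate`, the fixed iteration count, the uniform starting distribution, and the choice to hold seed clusters fixed are all assumptions. Edge weights use cosine similarity, one of the two measures named in claim 5.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_knn_graph(features, k):
    """Connect each cluster node to its k most similar clusters.

    features: {cluster_id: sparse feature vector}
    returns:  {cluster_id: [(neighbor_id, edge_weight), ...]}
    """
    graph = {}
    for node, vec in features.items():
        scored = [
            (other, cosine(vec, other_vec))
            for other, other_vec in features.items()
            if other != node
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        graph[node] = scored[:k]
    return graph


def propagate(graph, seeds, labels, iterations=10):
    """Spread classification distributions from seed clusters to the rest.

    seeds: {cluster_id: {label: probability}} for clusters whose distributions
    are already known; these are held fixed (an assumption of this sketch).
    """
    uniform = {label: 1.0 / len(labels) for label in labels}
    dists = {node: dict(seeds.get(node, uniform)) for node in graph}
    for _ in range(iterations):
        updated = {}
        for node, neighbors in graph.items():
            total_weight = sum(weight for _, weight in neighbors)
            if node in seeds or total_weight == 0:
                updated[node] = dists[node]
                continue
            # Weighted average of the neighbors' current distributions.
            updated[node] = {
                label: sum(w * dists[n].get(label, 0.0) for n, w in neighbors)
                / total_weight
                for label in labels
            }
        dists = updated
    return dists
```

In this sketch a cluster with no similar neighbors keeps its uniform starting distribution, while a cluster whose only neighbor is a seed ends up with that seed's distribution.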

17. A system including memory and one or more processors operable to execute instructions stored in the memory, comprising instructions to:
- group a corpus of documents into a plurality of disjoint clusters of documents based on one or more shared content attributes;
  
  determine a classification distribution associated with a first cluster of the plurality of disjoint clusters, the classification distribution associated with the first cluster being based on classifications assigned to individual documents of the first cluster;
  
  calculate a classification distribution associated with a second cluster of the plurality of disjoint clusters based at least in part on the classification distribution associated with the first cluster and a relationship between the first and second clusters; and
  
  classify documents of the second cluster based on the classification distribution associated with the second cluster.
  - 18. The system of claim 17, further comprising instructions to:
    - generate a graph of nodes, each node connected to one or more other nodes via one or more respective edges, wherein each node represents a cluster and each edge connecting two nodes is weighted based on a relationship between clusters represented by the two nodes; and
      
      alter a classification distribution associated with a particular cluster based on:
      
      one or more classification distributions associated with one or more nodes connected to a particular node representing the particular cluster; and
      
      one or more weights assigned to one or more edges connecting the one or more nodes to the particular node.
  - 19. The system of claim 17, further comprising instructions to:
    - calculate one or more centroid vectors for one or more available classifications of at least the classification distribution associated with the first cluster; and
      
      calculate the classification distribution associated with the second cluster based on a relationship between the second cluster and at least one of the one or more centroid vectors.
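
The centroid-based variant of claims 10, 11, and 19 might look something like the sketch below. The claims do not say how centroids are formed or how similarity is turned into a distribution, so both choices here are assumptions: each classification's centroid is a probability-weighted mean of cluster feature vectors, and the second cluster's distribution is its cosine similarities to the centroids, normalized to sum to 1. The function names are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def class_centroids(features, dists):
    """Compute one centroid vector per available classification.

    features: {cluster_id: sparse feature vector}
    dists:    {cluster_id: {label: probability}} for clusters with known
              classification distributions
    """
    sums = defaultdict(lambda: defaultdict(float))
    weights = defaultdict(float)
    for cluster_id, dist in dists.items():
        for label, prob in dist.items():
            weights[label] += prob
            for term, value in features[cluster_id].items():
                sums[label][term] += prob * value
    return {
        label: {term: s / weights[label] for term, s in terms.items()}
        for label, terms in sums.items()
        if weights[label] > 0
    }


def distribution_from_centroids(vec, centroids):
    """Normalize similarities to each centroid into a distribution (assumed rule)."""
    if not centroids:
        return {}
    sims = {label: max(cosine(vec, c), 0.0) for label, c in centroids.items()}
    total = sum(sims.values())
    if total == 0:
        return {label: 1.0 / len(sims) for label in sims}
    return {label: s / total for label, s in sims.items()}
```

One property of this design is that an unlabeled cluster needs no direct edge to a labeled cluster; it is compared only against the per-classification centroids.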

20. At least one non-transitory computer-readable medium comprising instructions that, in response to execution of the instructions by a computing system, cause the computing system to perform the operations of:
- grouping a corpus of documents into a plurality of disjoint clusters of documents based on one or more shared content attributes;
  
  determining a classification distribution associated with a first cluster of the plurality of disjoint clusters, the classification distribution associated with the first cluster being based on classifications assigned to individual documents of the first cluster; and
  
  calculating a classification distribution associated with a second cluster of the plurality of disjoint clusters based at least in part on the classification distribution associated with the first cluster and a relationship between the first and second clusters.

Current Assignee
Google LLC (Alphabet Inc.)
Original Assignee
Google Inc. (Alphabet Inc.)
Inventors
Krka, Ivo; Wendt, James; Pueyo, Luis Garcia; Bendersky, Mike; Yang, Jie; Saikia, Amitabh; Cartright, Marc-Allen; Ravi, Sujith; Miklos, Balint; Josifovski, Vanja

Application Number

US14/697,342
Publication Number

US 20160314184A1
US Class Current

1/1
CPC Class Codes

G06F 16/35 Clustering; Classification

G06Q 10/107 Computer-aided management o...
