Automatic incremental labeling of document clusters

US 9,002,848 B1
Filed: 06/22/2012
Issued: 04/07/2015
Est. Priority Date: 12/27/2011
Status: Active Grant

Abstract

Methods and systems for use in labeling documents within a cluster are provided. One example method includes assembling a set of documents including a first plurality of previously clustered documents and a second plurality of documents. Each of the first plurality of previously clustered documents has at least one label identifying a topic to which content of the document relates. The method includes partitioning documents from the set of documents into multiple clusters, determining if a dominant topic exists within one of the multiple clusters, determining a metric value for one of the multiple clusters based on the number of documents within the one of the multiple clusters having a label identifying the determined dominant topic, and labeling at least documents from the second plurality of documents within the one of the multiple clusters with the label identifying the dominant topic when the metric value exceeds a predetermined threshold.
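The abstract leaves the partitioning step open. Claims 8 and 16 describe one option: parse each document into textual elements, find elements common to two or more documents, and group those documents. The following is a minimal sketch of that idea, not the patented implementation. The function names, the word-token definition of a "textual element", the `min_shared` cutoff, and the transitive (single-linkage) grouping are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def extract_terms(text):
    """Parse a document into a set of lowercase word tokens (assumed 'textual elements')."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def group_by_shared_terms(docs, min_shared=2):
    """Group documents that share at least `min_shared` terms.

    docs: dict mapping a document id to its text.
    Grouping is transitive: if A matches B and B matches C, all three
    land in one cluster (an assumption; the claims do not specify this).
    """
    ids = list(docs)
    terms = {d: extract_terms(docs[d]) for d in ids}
    parent = {d: d for d in ids}  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if len(terms[a] & terms[b]) >= min_shared:
                parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for d in ids:
        clusters[find(d)].append(d)
    return sorted(sorted(members) for members in clusters.values())


docs = {
    "a": "refund for broken charger",
    "b": "broken charger needs refund",
    "c": "login page error",
    "d": "error on login page today",
}
clusters = group_by_shared_terms(docs, min_shared=2)
```

The pairwise comparison is quadratic in the number of documents, which is fine for a sketch but would need an inverted index or similar structure for a large corpus.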

20 Claims

1. A computer implemented method comprising:
- assembling a set of documents, the set of documents including a first plurality of previously clustered documents and a second plurality of documents, each of the first plurality of previously clustered documents having at least one label identifying a topic to which content of the document relates;
  
  partitioning, by a non-transitory computing device, documents from the set of documents into multiple clusters;
  
  determining, by the non-transitory computing device, that a dominant topic exists within a first cluster of said multiple clusters;
  
  determining, by the computing device, (i) a purity score representing a first ratio of a number of documents having a label identifying the dominant topic in the first cluster to a total number of previously clustered documents within the first cluster and (ii) a confidence measure representing a second ratio of the total number of previously clustered documents in the first cluster to a size of the first cluster, wherein the size of the first cluster equals a total number of documents included within the first cluster; and
  
  labeling, by the computing device, at least documents from the second plurality of documents within said one of the multiple clusters with the label identifying the dominant topic when the purity score exceeds a first predetermined threshold and the confidence score exceeds a second predetermined threshold.
- Dependent claims (2, 3, 4, 6, 7, 8, 9):
  - 2. The method of claim 1, further comprising determining that the purity score exceeds the first predetermined threshold upon determining that the confidence measure exceeds the second predetermined threshold.
  - 3. The method of claim 1, wherein the second predetermined threshold is at least 80%.
  - 4. The method of claim 1, further comprising designating the first cluster for further processing when the purity score fails to exceed the first predetermined threshold or when the confidence measure fails to exceed the second predetermined threshold.
  - 6. The method of claim 1, wherein the set of documents includes at least one of email documents, social media documents, and customer input documents.
  - 7. The method of claim 1, wherein the at least one label is a plurality of labels, said method further comprising:
    - assigning a respective weight to each label in the plurality of labels; and
      
      determining the dominant topic based at least in part on the respective weights.
  - 8. The method of claim 1, wherein partitioning documents from the set of documents into multiple clusters comprises parsing documents from the set of documents to extract occurrences of a plurality of textual elements, identifying at least one of the plurality of textual elements that are common to two or more documents, and grouping the two or more documents based on the identified at least one textual elements.
  - 9. The method of claim 1, further comprising receiving the second plurality of documents, after assigning the at least one label to each of the first plurality of previously clustered documents and prior to assembling the set of documents.

5. The method of claim 4, wherein designating the first cluster for further processing includes designating the first cluster for manual processing.
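The two ratios in claim 1 and the gating described in claims 2, 4 and 5 can be sketched as follows. This is an illustrative reading of the claim language, not the patented implementation. The names `Decision`, `evaluate_cluster` and `apply_labels` are hypothetical, the dominant topic is assumed to be the most frequent existing label, and the purity threshold of 0.8 is an assumption (claim 3 only constrains the confidence threshold to at least 80%).

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    action: str                # "label" or "further_processing"
    topic: Optional[str]       # dominant topic, if one was computed
    purity: Optional[float]    # None when the confidence check failed first
    confidence: float


def evaluate_cluster(cluster, labels, purity_threshold=0.8, confidence_threshold=0.8):
    """cluster: list of document ids; labels: dict of id -> topic for
    previously clustered (already labeled) documents."""
    labeled = [d for d in cluster if d in labels]
    # Confidence: previously clustered documents / all documents in the cluster.
    confidence = len(labeled) / len(cluster) if cluster else 0.0
    # Claim 2 ordering: purity is only examined once confidence has passed.
    if not labeled or confidence <= confidence_threshold:
        return Decision("further_processing", None, None, confidence)
    topic, count = Counter(labels[d] for d in labeled).most_common(1)[0]
    # Purity: documents labeled with the dominant topic / previously clustered documents.
    purity = count / len(labeled)
    if purity <= purity_threshold:
        return Decision("further_processing", topic, purity, confidence)
    return Decision("label", topic, purity, confidence)


def apply_labels(cluster, labels, decision):
    """Return new labels for the not-yet-labeled documents in the cluster."""
    if decision.action != "label":
        return {}
    return {d: decision.topic for d in cluster if d not in labels}


cluster = [f"d{i}" for i in range(10)]
labels = {f"d{i}": "billing" for i in range(8)}
labels["d8"] = "shipping"          # d9 is the only new, unlabeled document
decision = evaluate_cluster(cluster, labels)
new_labels = apply_labels(cluster, labels, decision)
```

In this example nine of ten documents are already labeled (confidence 0.9) and eight of those nine carry "billing" (purity about 0.89), so the new document inherits "billing". A cluster that fails either check is returned with the action "further_processing", which a caller could route to manual review as in claim 5.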

10. A system comprising:
- a non-transitory storage device configured to store a set of documents, the set of documents including a first plurality of previously clustered documents and a second plurality of documents, each of the first plurality of previously clustered documents having at least one label identifying a topic to which content of the document relates;
  
  a clustering engine configured to:
  
  partition documents from the set of documents into multiple clusters; and
  
  determine that a dominant topic exists within a first cluster of said multiple clusters;
  
  determine (i) a purity score representing a first ratio of a number of documents having a label identifying the dominant topic in the first cluster to a total number of previously clustered documents within the first cluster and (ii) a confidence measure representing a second ratio of the total number of previously clustered documents in the first cluster to a size of the first cluster, wherein the size of the first cluster equals a total number of documents included within the first cluster; and
  
  a labeling engine configured to assign a label identifying the dominant topic to at least documents from the second plurality of documents within said one of the multiple clusters when the purity score exceeds a first predetermined threshold and the confidence score exceeds a second predetermined threshold.
- Dependent claims (11, 12, 13, 14, 15, 16, 17, 18):
  - 11. The system of claim 10, wherein said clustering engine is further configured to determine that the purity score exceeds the first predetermined threshold upon determining that the confidence measure exceeds the second predetermined threshold.
  - 12. The system of claim 10, wherein the second predetermined threshold is at least 80%.
  - 13. The system of claim 10, wherein the set of documents includes at least one of email documents, social media documents, and customer input documents.
  - 14. The system of claim 10, wherein the labeling engine is further configured to assign a label identifying the dominant topic to each document in the first cluster.
  - 15. The system of claim 10, wherein the labeling engine is configured to designate the first cluster for further processing when the purity score fails to exceed the first predetermined threshold.
  - 16. The system of claim 10, wherein the clustering engine is configured to parse documents from the set of documents to extract occurrences of a plurality of textual elements, to identify at least one of the plurality of textual elements that are common to two or more documents, and to group the two or more documents based on the identified at least one textual elements.
  - 17. The system of claim 10, wherein the at least one label is a plurality of labels, said system is further configured to:
    - assign a respective weight to each label in the plurality of labels; and
      
      determine the dominant topic based at least in part on the respective weights.
  - 18. The system of claim 10, further configured to receive the second plurality of documents, after the at least one label is assigned to each of the first plurality of previously clustered documents.
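Claims 7 and 17 say only that each label receives a weight and that the dominant topic is determined "based at least in part on" those weights. One possible reading, sketched below under stated assumptions, sums the weights per topic across the labeled documents in a cluster and accepts the heaviest topic only if it holds more than a given share of the total weight. The function name `dominant_topic` and the `min_share` parameter are hypothetical and do not come from the patent.

```python
from collections import defaultdict


def dominant_topic(doc_labels, min_share=0.5):
    """doc_labels: list of dicts, one per labeled document, mapping topic -> weight.

    Returns (topic, share_of_total_weight), or None when no topic
    exceeds `min_share` (an assumed criterion for 'dominant').
    """
    totals = defaultdict(float)
    for weights in doc_labels:
        for topic, weight in weights.items():
            totals[topic] += weight
    grand_total = sum(totals.values())
    if grand_total <= 0:
        return None
    topic = max(totals, key=totals.get)
    share = totals[topic] / grand_total
    return (topic, share) if share > min_share else None


cluster_labels = [
    {"billing": 0.7, "refund": 0.3},
    {"billing": 1.0},
    {"refund": 0.6, "billing": 0.4},
]
result = dominant_topic(cluster_labels)
```

Here "billing" accumulates 2.1 of 3.0 total weight, a share of about 0.7, so it is reported as dominant. With two equally weighted topics no share exceeds 0.5 and the function returns `None`, which corresponds to a cluster with no dominant topic.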

19. A non-transitory computer-readable storage device having encoded thereon computer readable instructions, which when executed by a processor, cause the processor to:
- provide a set of documents, the set of documents including a first plurality of previously clustered documents and a second plurality of documents, each of the first plurality of previously clustered documents having at least one label identifying a topic to which content of the document relates;
  
  partition documents from the set of documents into multiple clusters;
  
  determine that a dominant topic exists within a first cluster of said multiple clusters;
  
  determine (i) a purity score representing a first ratio of a number of documents having a label identifying the dominant topic in the first cluster to a total number of previously clustered documents within the first cluster and (ii) a confidence measure representing a second ratio of the total number of previously clustered documents in the first cluster to a size of the first cluster, wherein the size of the first cluster equals a total number of documents included within the first cluster; and
  
  assign the label identifying the dominant topic to at least documents from the second plurality of documents within said one of the multiple clusters when the purity score exceeds a first predetermined threshold and the confidence score exceeds a second predetermined threshold.
- Dependent claims (20):
  - 20. The computer-readable storage device in accordance with claim 19, wherein said instructions additionally cause the processor to determine that the purity score exceeds the first predetermined threshold upon determining that the confidence measure exceeds the second predetermined threshold.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Peng, Jun; Ben-Artzi, Aner; Buryak, Kirill; Lewis, Glenn M.
Primary Examiner(s): DWIVEDI, MAHESH H
Application Number: US13/530,764
Time in Patent Office: 1,019 Days
Field of Search: 707/737, 707/738
US Class Current: 707/737
CPC Class Codes:
G06F 16/35 Clustering; Classification
G06F 16/355 Class or cluster creation o...
