Automatic incremental labeling of document clusters
First Claim
1. A computer implemented method comprising:
assembling a set of documents, the set of documents including a first plurality of previously clustered documents and a second plurality of documents, each of the first plurality of previously clustered documents having at least one label identifying a topic to which content of the document relates;
partitioning, by a non-transitory computing device, documents from the set of documents into multiple clusters;
determining, by the non-transitory computing device, that a dominant topic exists within a first cluster of said multiple clusters;
determining, by the computing device, (i) a purity score representing a first ratio of a number of documents having a label identifying the dominant topic in the first cluster to a total number of previously clustered documents within the first cluster and (ii) a confidence measure representing a second ratio of the total number of previously clustered documents in the first cluster to a size of the first cluster, wherein the size of the first cluster equals a total number of documents included within the first cluster; and
labeling, by the computing device, at least documents from the second plurality of documents within said one of the multiple clusters with the label identifying the dominant topic when the purity score exceeds a first predetermined threshold and the confidence score exceeds a second predetermined threshold.
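The two ratios and the threshold test recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes each document carries at most one label (the claim allows more), that unlabeled documents from the second plurality are represented by None, and that the dominant topic is simply the most frequent label in the cluster (the claim does not say how the dominant topic is determined). The function names and the default threshold values are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def score_cluster(labels: Sequence[Optional[str]]) -> Tuple[Optional[str], float, float]:
    """Return (dominant_topic, purity, confidence) for one cluster.

    labels[i] is the topic label of document i in the cluster, or None
    when the document has no label yet (assumed representation).
    """
    known = [lab for lab in labels if lab is not None]
    if not known:
        # No previously clustered documents: no dominant topic to propagate.
        return None, 0.0, 0.0
    # Assumption: the dominant topic is the most frequent existing label.
    dominant, count = Counter(known).most_common(1)[0]
    # Purity: documents labeled with the dominant topic / labeled documents.
    purity = count / len(known)
    # Confidence: labeled documents / all documents in the cluster.
    confidence = len(known) / len(labels)
    return dominant, purity, confidence


def label_cluster(
    labels: Sequence[Optional[str]],
    purity_threshold: float = 0.8,      # hypothetical value
    confidence_threshold: float = 0.3,  # hypothetical value
) -> List[Optional[str]]:
    """Assign the dominant topic to unlabeled documents when both scores
    exceed their thresholds; otherwise return the labels unchanged."""
    dominant, purity, confidence = score_cluster(labels)
    if (
        dominant is not None
        and purity > purity_threshold
        and confidence > confidence_threshold
    ):
        return [lab if lab is not None else dominant for lab in labels]
    return list(labels)


cluster = ["sports", "sports", "sports", "politics", None, None, None, None]
print(score_cluster(cluster))
print(label_cluster(cluster, purity_threshold=0.7, confidence_threshold=0.4))
```

In this example cluster, three of the four labeled documents share one label and four of the eight documents are labeled, so the purity is 3/4 and the confidence is 4/8. A cluster that fails either test is returned unchanged.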
Abstract
Methods and systems for use in labeling documents within a cluster are provided. One example method includes assembling a set of documents including a first plurality of previously clustered documents and a second plurality of documents. Each of the first plurality of previously clustered documents has at least one label identifying a topic to which content of the document relates. The method includes partitioning documents from the set of documents into multiple clusters, determining if a dominant topic exists within one of the multiple clusters, determining a metric value for one of the multiple clusters based on the number of documents within the one of the multiple clusters having a label identifying the determined dominant topic, and labeling at least documents from the second plurality of documents within the one of the multiple clusters with the label identifying the dominant topic when the metric value exceeds a predetermined threshold.
20 Claims
5. The method of claim 4, wherein designating the first cluster for further processing includes designating the first cluster for manual processing.
10. A system comprising:
a non-transitory storage device configured to store a set of documents, the set of documents including a first plurality of previously clustered documents and a second plurality of documents, each of the first plurality of previously clustered documents having at least one label identifying a topic to which content of the document relates;
a clustering engine configured to:
partition documents from the set of documents into multiple clusters; and
determine that a dominant topic exists within a first cluster of said multiple clusters;
determine (i) a purity score representing a first ratio of a number of documents having a label identifying the dominant topic in the first cluster to a total number of previously clustered documents within the first cluster and (ii) a confidence measure representing a second ratio of the total number of previously clustered documents in the first cluster to a size of the first cluster, wherein the size of the first cluster equals a total number of documents included within the first cluster; and
a labeling engine configured to assign a label identifying the dominant topic to at least documents from the second plurality of documents within said one of the multiple clusters when the purity score exceeds a first predetermined threshold and the confidence score exceeds a second predetermined threshold.
19. A non-transitory computer-readable storage device having encoded thereon computer readable instructions, which when executed by a processor, cause the processor to:
provide a set of documents, the set of documents including a first plurality of previously clustered documents and a second plurality of documents, each of the first plurality of previously clustered documents having at least one label identifying a topic to which content of the document relates;
partition documents from the set of documents into multiple clusters;
determine that a dominant topic exists within a first cluster of said multiple clusters;
determine (i) a purity score representing a first ratio of a number of documents having a label identifying the dominant topic in the first cluster to a total number of previously clustered documents within the first cluster and (ii) a confidence measure representing a second ratio of the total number of previously clustered documents in the first cluster to a size of the first cluster, wherein the size of the first cluster equals a total number of documents included within the first cluster; and
assign the label identifying the dominant topic to at least documents from the second plurality of documents within said one of the multiple clusters when the purity score exceeds a first predetermined threshold and the confidence score exceeds a second predetermined threshold.
Specification