Thematic clustering
First Claim
1. A system, comprising:
- a processor configured to:
cluster a data set into one or more initial clusters using a first term space comprising a plurality of keywords;
determine an initial theme for each initial cluster, wherein the initial theme for each initial cluster is determined based on at least one keyword in the first term space;
reduce the first term space to create a reduced term space, wherein reducing the first term space includes removing from the first term space a keyword term that is determined to be present in a first document clustered into a first initial cluster and is also determined to be present in a second document clustered into a second initial cluster, and wherein a term frequency for the keyword at least meets a predetermined threshold;
recluster at least a portion of the data set into one or more baby clusters using the reduced term space, wherein after reclustering, at least one singleton is present, wherein a singleton is an element from the data set that was not assigned to any baby clusters during the reclustering;
assign at least one singleton to a baby cluster to form one or more renovated clusters;
determine a renovated theme for at least one of the renovated clusters; and
provide as output one or more of the renovated clusters and their respective themes; and
a memory coupled to the processor and configured to provide the processor with instructions.
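The term-space reduction step recited above can be illustrated with a minimal Python sketch. It assumes documents are lists of tokens and that "term frequency" means the total number of occurrences of a keyword across the data set. The function name `reduce_term_space` and its parameters are hypothetical, since the claim does not fix any of these details.

```python
from collections import Counter, defaultdict


def reduce_term_space(term_space, docs, assignments, threshold):
    """Return a reduced term space (hypothetical helper, not from the patent).

    term_space:  set of keywords used for the initial clustering
    docs:        list of token lists, one per document
    assignments: initial cluster id for each document, same order as docs
    threshold:   minimum term frequency for a keyword to be removable
    """
    clusters_with_term = defaultdict(set)  # keyword -> initial clusters containing it
    term_frequency = Counter()             # keyword -> occurrences over the data set

    for tokens, cluster_id in zip(docs, assignments):
        for token in tokens:
            if token in term_space:
                term_frequency[token] += 1
                clusters_with_term[token].add(cluster_id)

    # Remove a keyword only if it appears in documents of at least two
    # different initial clusters AND its frequency meets the threshold.
    removed = {
        term
        for term in term_space
        if len(clusters_with_term[term]) >= 2 and term_frequency[term] >= threshold
    }
    return set(term_space) - removed


# Toy data (assumed for illustration only).
docs = [
    ["laptop", "battery", "review"],
    ["laptop", "screen", "review"],
    ["phone", "battery", "review"],
    ["phone", "camera"],
]
assignments = [0, 0, 1, 1]
term_space = {"laptop", "battery", "screen", "phone", "camera", "review"}

reduced = reduce_term_space(term_space, docs, assignments, threshold=3)
print(sorted(reduced))  # ['battery', 'camera', 'laptop', 'phone', 'screen']
```

In this toy run, "review" spans both initial clusters and occurs three times, so it is dropped. "battery" also spans both clusters but occurs only twice, so it stays below the assumed threshold and is kept.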
Abstract
A data set is clustered into one or more initial clusters using a first term space. Initial themes for the initial clusters are determined. The first term space is reduced to create a reduced term space. At least a portion of the data set is reclustered into one or more baby clusters using the reduced term space. One or more singletons are reassigned to form one or more renovated clusters from the baby clusters. A renovated theme is determined for at least some of the renovated clusters. One or more of the renovated clusters and their respective themes are provided as output.
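The reclustering and singleton-reassignment steps summarized above could look roughly like the following Python sketch. The greedy Jaccard-similarity clustering, the `min_sim` cutoff, the treatment of size-one groups as singletons, the use of full token sets when reassigning, and the most-frequent-term theme are all illustrative assumptions rather than the method the patent discloses.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recluster(docs, reduced_space, min_sim=0.3):
    """Greedy single-pass clustering over the reduced term space (assumed method)."""
    vectors = [set(d) & reduced_space for d in docs]
    groups = []  # each group is a list of document indices; groups[k][0] is its seed
    for i, vec in enumerate(vectors):
        for group in groups:
            if jaccard(vec, vectors[group[0]]) >= min_sim:
                group.append(i)
                break
        else:
            groups.append([i])
    # Assumption: a group of size one is treated as an unassigned singleton.
    baby = [g for g in groups if len(g) > 1]
    singletons = [g[0] for g in groups if len(g) == 1]
    return vectors, baby, singletons


def renovate(docs, baby, singletons):
    """Attach each singleton to the baby cluster holding its most similar document."""
    full = [set(d) for d in docs]  # assumption: compare on the unreduced tokens
    renovated = [list(b) for b in baby]
    for s in singletons:
        if not baby:
            break
        best = max(
            range(len(baby)),
            key=lambda k: max(jaccard(full[s], full[i]) for i in baby[k]),
        )
        renovated[best].append(s)
    return renovated


def theme(members, vectors):
    """Most frequent reduced-space term in the cluster, ties broken alphabetically."""
    counts = Counter(t for i in members for t in vectors[i])
    return min(counts, key=lambda t: (-counts[t], t)) if counts else None


# Toy data (assumed); "review" is taken to have been removed from the term space.
docs = [
    ["laptop", "battery", "review"],
    ["laptop", "screen", "review"],
    ["phone", "charger", "review"],
    ["phone", "camera"],
    ["phone", "charger"],
    ["camera", "review"],
]
reduced_space = {"laptop", "battery", "screen", "phone", "charger", "camera"}

vectors, baby, singletons = recluster(docs, reduced_space)
renovated = renovate(docs, baby, singletons)
themes = [theme(members, vectors) for members in renovated]
print(renovated, themes)  # [[0, 1], [2, 3, 4, 5]] ['laptop', 'phone']
```

Document 5 shares no reduced-space term with either cluster seed, so it is left as a singleton after reclustering and then joins the phone cluster, whose member `["phone", "camera"]` is its closest match.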
20 Claims
11. A method, comprising:
clustering a data set into one or more initial clusters using a first term space comprising a plurality of keywords;
determining an initial theme for each initial cluster, wherein the initial theme for each initial cluster is determined based on at least one keyword in the first term space;
reducing the first term space to create a reduced term space, wherein reducing the first term space includes removing from the first term space a keyword that is determined to be present in a first document clustered into a first initial cluster and is also determined to be present in a second document clustered into a second initial cluster, and wherein a term frequency for the keyword at least meets a predetermined threshold;
reclustering at least a portion of the data set into one or more baby clusters using the reduced term space, wherein after reclustering, at least one singleton is present, wherein a singleton is an element from the data set that was not assigned to any baby clusters during the reclustering;
assigning at least one singleton to a baby cluster to form one or more renovated clusters;
determining a renovated theme for at least one of the renovated clusters; and
providing as output one or more of the renovated clusters with their respective themes.
12. A computer program product embodied in a non-transitory computer readable storage medium and comprising computer instructions for:
clustering a data set into one or more initial clusters using a first term space comprising a plurality of keywords;
determining an initial theme for each initial cluster, wherein the initial theme for each initial cluster is determined based on at least one keyword in the first term space;
reducing the first term space to create a reduced term space, wherein reducing the first term space includes removing from the first term space a keyword term that is determined to be present in a first document clustered into a first initial cluster and is also determined to be present in a second document clustered into a second initial cluster, and wherein a term frequency for the keyword at least meets a predetermined threshold;
reclustering at least a portion of the data set into one or more baby clusters using the reduced term space, wherein after reclustering, at least one singleton is present, wherein a singleton is an element from the data set that was not assigned to any baby clusters during the reclustering;
assigning at least one singleton to a baby cluster to form one or more renovated clusters;
determining a renovated theme for at least one of the renovated clusters; and
providing as output one or more of the renovated clusters and their respective themes.
Specification