System and method for adaptive categorization for use with dynamic taxonomies
First Claim
1. A computer-implemented method for categorizing data points belonging to a data set, said method comprising:
- matching a textual description of a data point of said data set to category descriptions relating to one or more pre-defined set of categories;
for said data point having said textual description, generating, using a processor device, one or more preliminary soft seed labels corresponding to one of the one or more pre-defined set of categories and corresponding seed score based on a result of said matching; and
assigning each of said data points into a predefined number of clusters corresponding to the one or more predefined set of categories using the generated one or more preliminary soft seed labels, said cluster assigning using semi-supervised soft-seeded k-means clustering including;
assigning an initial centroid to each predefined cluster, wherein, an initial centroid of a particular cluster is computed based on said one or more preliminary soft seed labels, and for a pre-defined cluster not covered by said soft seed labels, computing an initial centroid based on random sampling from all un-labeled data points;
assigning each of labeled and said un-labeled data points to a cluster in a manner to minimize a distortion measure, said distortion measure including a seed re-assignment penalty value component as a function of said corresponding seed score, said seed re-assignment penalty value assessed upon determining a labeled or un-labeled data point assignment to a category different from the generated preliminary soft seed label;
updating a centroid value for each said cluster to which a labeled or un-labeled data point has been assigned, and
repeating said labeled or un-labeled data point assigning and centroid value updating until no re-assignment of soft seed labels to clusters of different categories occurs.
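The clustering steps recited in the claim above can be illustrated with a minimal NumPy sketch. It is not the patented implementation; several choices are assumptions made only for illustration: squared Euclidean distance as the base distortion, a penalty of `penalty_weight * seed_score` for leaving the seeded category, a score-weighted mean for seeded centroids, and the hypothetical names `soft_seeded_kmeans`, `penalty_weight`, and `sample_size`. Un-labeled points are marked with a seed label of `-1`. The loop stops when no point at all changes cluster, which is a stricter condition than the claim's "no re-assignment of soft seed labels".

```python
import numpy as np

def soft_seeded_kmeans(X, seed_labels, seed_scores, k,
                       penalty_weight=1.0, sample_size=10, max_iter=100, seed=0):
    """Illustrative soft-seeded k-means.

    X           : (n, d) float array of data points
    seed_labels : (n,) int array, category index or -1 when un-labeled
    seed_scores : (n,) float array, confidence of each seed label
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    lab_idx = np.flatnonzero(seed_labels >= 0)   # soft-seeded points
    unl_idx = np.flatnonzero(seed_labels < 0)    # un-labeled points

    # Initial centroids: from seeds where available, otherwise from a
    # random sample of un-labeled points.
    centroids = np.empty((k, d))
    for c in range(k):
        members = lab_idx[seed_labels[lab_idx] == c]
        if members.size:
            # Assumption: weight each seed by its score (epsilon avoids zero weights).
            weights = seed_scores[members] + 1e-12
            centroids[c] = np.average(X[members], axis=0, weights=weights)
        else:
            pool = unl_idx if unl_idx.size else np.arange(n)
            pick = rng.choice(pool, size=min(sample_size, pool.size), replace=False)
            centroids[c] = X[pick].mean(axis=0)

    # Assumed penalty form: proportional to the seed score.
    penalty = penalty_weight * seed_scores[lab_idx]
    assign = np.full(n, -1)
    for _ in range(max_iter):
        # Squared distance of every point to every centroid, shape (n, k).
        cost = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # Penalize every cluster except the seeded one for seeded points.
        cost[lab_idx] += penalty[:, None]
        cost[lab_idx, seed_labels[lab_idx]] -= penalty
        new_assign = cost.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break                                 # no point changed cluster
        assign = new_assign
        # Update the centroid of every non-empty cluster.
        for c in range(k):
            mask = assign == c
            if mask.any():
                centroids[c] = X[mask].mean(axis=0)
    return assign, centroids
```

With a large `penalty_weight`, a high-scoring seed stays in its seeded category even when another centroid is closer; with `penalty_weight=0` the sketch reduces to ordinary k-means with seeded initialization.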
Abstract
A system, method and computer program product provide a solution to a class of categorization problems using a semi-supervised clustering approach. The method performs a Soft Seeded k-means algorithm, which makes effective use of the side information provided by seeds with a wide range of confidence levels, even when the seeds do not provide complete coverage of the pre-defined categories. The semi-supervised clustering is achieved through the introduction of a seed re-assignment penalty measure and a model selection measure.
36 Citations
16 Claims
1. A computer-implemented method for categorizing data points belonging to a data set (full text under First Claim above).
- Dependent Claims (2, 3, 4, 5, 6, 7)
8. A computer program product for categorizing data points belonging to a data set, said computer program product comprising:
a non-transitory computer usable medium having computer usable program code embodied therewith, the computer usable program code comprising;
computer usable program code configured to match a textual description of a data point of said data set to category descriptions relating to one or more pre-defined set of categories;
computer usable program code configured to generate, for said data point having said textual description, one or more preliminary soft seed labels corresponding to one of the one or more pre-defined set of categories and corresponding seed score based on a result of said matching;
computer usable program code configured to assign each of said data points into a predefined number of clusters corresponding to the one or more predefined set of categories using the generated one or more preliminary soft seed labels, said cluster assigning using semi-supervised soft-seeded k-means clustering including;
assigning an initial centroid to each predefined cluster, wherein, an initial centroid of a particular cluster is computed based on said one or more preliminary soft seed labels, and for a pre-defined cluster not covered by said soft seed labels, computing an initial centroid based on random sampling from all un-labeled data points;
assigning each of labeled and said un-labeled data points to a cluster in a manner to minimize a distortion measure, said distortion measure including a seed re-assignment penalty value component as a function of said corresponding seed score, said seed re-assignment penalty value assessed upon determining a labeled or un-labeled data point assignment to a category different from the generated preliminary soft seed label;
updating a centroid value for each said cluster to which a labeled or un-labeled data point has been assigned, and
repeating said labeled or un-labeled data point assigning and centroid value updating until no re-assignment of soft seed labels to clusters of different categories occurs.
- Dependent Claims (9, 10, 11, 12, 13, 14)
15. A system for categorizing data points belonging to a data set comprising:
at least one processor; and
at least one memory device connected to the at least one processor, wherein the at least one processor is programmed to;
match a textual description of a data point of said data set to category descriptions relating to one or more pre-defined set of categories;
generate, for said data point having said textual description, one or more preliminary soft seed labels corresponding to one of the one or more pre-defined set of categories and corresponding seed score based on a result of said matching; and
assign each of said data points into a predefined number of clusters corresponding to the one or more predefined set of categories using the generated preliminary soft seed labels, said cluster assigning using semi-supervised soft-seeded k-means clustering that;
assigns an initial centroid to each predefined cluster, wherein, an initial centroid of a particular cluster is computed based on said one or more preliminary soft seed labels, and for a pre-defined cluster not covered by said soft seed labels, computing an initial centroid based on random sampling from all un-labeled data points;
assigns each of labeled and said un-labeled data points to a cluster in a manner to minimize a distortion measure, said distortion measure including a seed re-assignment penalty value component as a function of said corresponding seed score, said seed re-assignment penalty value assessed upon determining a labeled or un-labeled data point assignment to a category different from the generated preliminary soft seed label;
updates a centroid value for each said cluster to which a labeled or un-labeled data point has been assigned, and
repeats said labeled or un-labeled data point assigning and centroid value updating until no re-assignment of soft seed labels to clusters of different categories occurs.
- Dependent Claims (16)
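The matching and seed-generation steps shared by the independent claims could be sketched as follows. The claims do not say how a textual description is matched to category descriptions, so this example assumes a simple token-overlap (Jaccard) similarity and uses it directly as the seed score; the function names `tokenize` and `soft_seeds` and the `min_score` threshold are hypothetical.

```python
import re

def tokenize(text):
    """Lower-case a description and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def soft_seeds(description, category_descriptions, min_score=0.1):
    """Return preliminary soft seed labels as (category, score) pairs.

    Assumption: the seed score is the Jaccard similarity between the
    data point's description and each category description. Categories
    scoring below min_score yield no seed, so a data point may end up
    un-labeled and a category may end up with no seeds at all.
    """
    doc = tokenize(description)
    seeds = []
    for category, cat_text in category_descriptions.items():
        cat = tokenize(cat_text)
        union = doc | cat
        score = len(doc & cat) / len(union) if union else 0.0
        if score >= min_score:
            seeds.append((category, score))
    # Highest-confidence seed first.
    return sorted(seeds, key=lambda pair: pair[1], reverse=True)
```

A data point whose description matches no category returns an empty list, which corresponds to an un-labeled point in the clustering steps.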
Specification