System and method for adaptive categorization for use with dynamic taxonomies

US 8,161,028 B2
Filed: 12/05/2008
Issued: 04/17/2012
Est. Priority Date: 12/05/2008
Status: Active Grant

Abstract

A system, method and computer program product provide a solution to a class of categorization problems using a semi-supervised clustering approach. The method performs a Soft Seeded k-means algorithm, which makes effective use of the side information provided by seeds with a wide range of confidence levels, even when the seeds do not provide complete coverage of the pre-defined categories. The semi-supervised clustering is achieved through the introduction of a seed re-assignment penalty measure and a model selection measure.
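
The clustering loop summarized in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the squared Euclidean distance, the `weight` parameter, and the exact sigmoid form of the penalty are all assumptions made for illustration. The stopping rule is also simplified to "no assignment changes", and the sketch assumes at least one un-labeled point exists for any cluster that has no seed.

```python
import math
import random

def sq_dist(a, b):
    """Squared Euclidean distance (an assumed choice of distance measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def penalty(cluster, label, score, weight=1.0):
    """Hypothetical seed re-assignment penalty: zero for un-labeled points
    (label is None) or when the point stays in its seeded cluster; otherwise
    a sigmoid of the seed score, so confident seeds cost more to move."""
    if label is None or cluster == label:
        return 0.0
    return weight / (1.0 + math.exp(-score))

def soft_seeded_kmeans(points, labels, scores, k, rng, max_iter=100):
    # Initial centroids: mean of the seeded points where seeds exist,
    # otherwise a randomly sampled un-labeled point.
    unlabeled = [p for p, l in zip(points, labels) if l is None]
    centroids = []
    for j in range(k):
        seeded = [p for p, l in zip(points, labels) if l == j]
        centroids.append(mean(seeded) if seeded else rng.choice(unlabeled))

    assign = [None] * len(points)
    for _ in range(max_iter):
        # Assign each point to the cluster minimizing distance plus penalty.
        new_assign = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]) + penalty(j, l, s))
            for p, l, s in zip(points, labels, scores)
        ]
        if new_assign == assign:  # simplified convergence test
            break
        assign = new_assign
        # Update the centroid of every cluster that received points.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = mean(members)
    return assign, centroids

# Toy data: one soft seed per category, four un-labeled points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [0, None, None, 1, None, None]
scores = [0.9, 0.0, 0.0, 0.8, 0.0, 0.0]
assign, centroids = soft_seeded_kmeans(points, labels, scores, 2, random.Random(0))
```

Because the penalty is added to the distance rather than enforced as a hard constraint, a low-confidence seed can still be moved to another cluster when the data strongly disagree with it.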

36 Citations

16 Claims

1. A computer-implemented method for categorizing data points belonging to a data set, said method comprising:
- matching a textual description of a data point of said data set to category descriptions relating to one or more pre-defined set of categories;
  
  for said data point having said textual description, generating, using a processor device, one or more preliminary soft seed labels corresponding to one of the one or more pre-defined set of categories and corresponding seed score based on a result of said matching; and
  
  assigning each of said data points into a predefined number of clusters corresponding to the one or more predefined set of categories using the generated one or more preliminary soft seed labels, said cluster assigning using semi-supervised soft-seeded k-means clustering including;
  
  assigning an initial centroid to each predefined cluster, wherein, an initial centroid of a particular cluster is computed based on said one or more preliminary soft seed labels, and for a pre-defined cluster not covered by said soft seed labels, computing an initial centroid based on random sampling from all un-labeled data points;
  
  assigning each of labeled and said un-labeled data points to a cluster in a manner to minimize a distortion measure, said distortion measure including a seed re-assignment penalty value component as a function of said corresponding seed score, said seed re-assignment penalty value assessed upon determining a labeled or un-labeled data point assignment to a category different from the generated preliminary soft seed label;
  
  updating a centroid value for each said cluster to which a labeled or un-labeled data point has been assigned, and repeating said labeled or un-labeled data point assigning and centroid value updating until no re-assignment of soft seed labels to clusters of different categories occurs.
- - 2. The computer-implemented method of claim 1, wherein said processor device further performs:
    - determining a resulting cluster model for said assigned labeled and un-labeled data points, and generating a fitness score based on said resulting cluster model;
      
      repeating by said processor device, said cluster assigning of labeled and un-labeled data points and said centroid updating of said semi-supervised soft-seeded k-means clustering using different re-initialized centroid assignments at each of a plurality of iterations, and determining a resulting cluster model and corresponding fitness score for said assigned labeled and un-labeled data points at each iteration; and
      
      selecting the resulting cluster model corresponding to a highest fitness score as a final assignment of all data points to the predefined set of categories.
  - 3. The computer-implemented method of claim 2, wherein said distortion measure comprises a pairwise distance measure D(x_i, ξ_j) between a data point x_i of the data set and a calculated centroid ξ_j of a particular cluster “j”.
  - 4. The computer-implemented method of claim 2, further including:
    - computing each initial centroid covered by seeds using the labeled data points having associated seeds; and
      
      computing each initial centroid of remaining clusters through random sampling of un-labeled data points having no associated seeds.
  - 5. The computer-implemented method of claim 2, wherein at each of said plurality of iterations, computing each initial centroid of remaining clusters according to a different random sampling of un-labeled data points.
  - 6. The computer-implemented method of claim 2, wherein said penalty value comprises a value determined in accordance with a sigmoid function P( ) of said seed score, said sigmoid function P( ) computed, at said computing device, according to:
    - P(j, l_i, s_i) = 0 if j = 0 or j = l_i;
      
      otherwise
  - 7. The computer-implemented method of claim 2, wherein said fitness score is based on said cluster model according to:
    - (D_l * |D_l − D_ul|)^−1, where D_l represents a distortion measure of the labeled data points, and D_u represents a distortion measure of the un-labeled data points.
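
The model selection recited in claims 2 and 7 can be illustrated with a short Python sketch. The function names and the tuple layout of `runs` are hypothetical. How the labeled and un-labeled distortions are aggregated (sum or mean, for example) is not stated here, so the sketch simply takes them as inputs, and it assumes the product in the denominator is non-zero.

```python
def fitness(d_labeled, d_unlabeled):
    """Fitness score (D_l * |D_l - D_ul|)^-1 as written in claim 7.
    Assumes d_labeled != 0 and d_labeled != d_unlabeled."""
    return 1.0 / (d_labeled * abs(d_labeled - d_unlabeled))

def select_best(runs):
    """Pick the clustering run with the highest fitness score.
    Each run is a hypothetical (name, D_l, D_ul) tuple produced by one
    re-initialized pass of the clustering."""
    return max(runs, key=lambda r: fitness(r[1], r[2]))

# Three illustrative runs with made-up distortion values.
runs = [("run_a", 2.0, 3.0), ("run_b", 1.0, 1.5), ("run_c", 4.0, 4.5)]
best = select_best(runs)
```

Under this score, a run is preferred when its labeled distortion is small and the labeled and un-labeled distortions are close to each other.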

8. A computer program product for categorizing data points belonging to a data set, said computer program product comprising:
- a non-transitory computer usable medium having computer usable program code embodied therewith, the computer usable program code comprising;
  
  computer usable program code configured to match a textual description of a data point of said data set to category descriptions relating to one or more pre-defined set of categories;
  
  computer usable program code configured to generate, for said data point having said textual description, one or more preliminary soft seed labels corresponding to one of the one or more pre-defined set of categories and corresponding seed score based on a result of said matching;
  
  computer usable program code configured to assign each of said data points into a predefined number of clusters corresponding to the one or more predefined set of categories using the generated one or more preliminary soft seed labels, said cluster assigning using semi-supervised soft-seeded k-means clustering including;
  
  assigning an initial centroid to each predefined cluster, wherein, an initial centroid of a particular cluster is computed based on said one or more preliminary soft seed labels, and for a pre-defined cluster not covered by said soft seed labels, computing an initial centroid based on random sampling from all un-labeled data points;
  
  assigning each of labeled and said un-labeled data points to a cluster in a manner to minimize a distortion measure, said distortion measure including a seed re-assignment penalty value component as a function of said corresponding seed score, said seed re-assignment penalty value assessed upon determining a labeled or un-labeled data point assignment to a category different from the generated preliminary soft seed label;
  
  updating a centroid value for each said cluster to which a labeled or un-labeled data point has been assigned, and repeating said labeled or un-labeled data point assigning and centroid value updating until no re-assignment of soft seed labels to clusters of different categories occurs.
- - 9. The computer program product of claim 8, further comprising:
    - computer usable program code configured to determine a resulting cluster model for said assigned labeled and un-labeled data points, and generating a fitness score based on said resulting cluster model;
      
      computer usable program code configured for repeating said cluster assigning of labeled and un-labeled data points and said centroid updating of said semi-supervised soft-seeded k-means clustering using different re-initialized centroid assignments at each of a plurality of iterations, and determining a resulting cluster model and corresponding fitness score for said assigned labeled and un-labeled data points at each iteration; and
      
      computer usable program code configured for selecting the resulting cluster model corresponding to a highest fitness score as a final assignment of all data points to the predefined set of categories.
  - 10. The computer program product of claim 9, wherein said distortion measure comprises a pairwise distance measure D(x_i, ξ_j) between a data point x_i of the data set and a calculated centroid ξ_j of a particular cluster “j”.
  - 11. The computer program product of claim 9, further including:
    - computer usable program code configured to compute each initial centroid covered by seeds using the labeled data points having associated seeds; and
      
      computer usable program code configured to compute each initial centroid of remaining clusters through random sampling of un-labeled data points having no associated seeds.
  - 12. The computer program product of claim 9, wherein at each of said plurality of iterations, computer usable program code configured to compute the initial centroid further computes each initial centroid of remaining clusters according to a different random sampling of un-labeled data points.
  - 13. The computer program product of claim 9, wherein said penalty value comprises a value determined in accordance with a sigmoid function P( ) of said seed score, said sigmoid function P( ) computed, at said computing device, according to:
    - P(j, l_i, s_i) = 0 if j = 0 or j = l_i;
      
      otherwise
  - 14. The computer program product of claim 9, wherein said fitness score is based on said cluster model according to:
    - (D_l * |D_l − D_ul|)^−1, where D_l represents a distortion measure of the labeled data points, and D_u represents a distortion measure of the un-labeled data points.

15. A system for categorizing data points belonging to a data set comprising:
- at least one processor; and
  
  at least one memory device connected to the at least one processor, wherein the at least one processor is programmed to;
  
  match a textual description of a data point of said data set to category descriptions relating to one or more pre-defined set of categories;
  
  generate, for said data point having said textual description, one or more preliminary soft seed labels corresponding to one of the one or more pre-defined set of categories and corresponding seed score based on a result of said matching; and
  
  assign each of said data points into a predefined number of clusters corresponding to the one or more predefined set of categories using the generated preliminary soft seed labels, said cluster assigning using semi-supervised soft-seeded k-means clustering that;
  
  assigns an initial centroid to each predefined cluster, wherein, an initial centroid of a particular cluster is computed based on said one or more preliminary soft seed labels, and for a pre-defined cluster not covered by said soft seed labels, computing an initial centroid based on random sampling from all un-labeled data points;
  
  assigns each of labeled and said un-labeled data points to a cluster in a manner to minimize a distortion measure, said distortion measure including a seed re-assignment penalty value component as a function of said corresponding seed score, said seed re-assignment penalty value assessed upon determining a labeled or un-labeled data point assignment to a category different from the generated preliminary soft seed label;
  
  updates a centroid value for each said cluster to which a labeled or un-labeled data point has been assigned, and repeats said labeled or un-labeled data point assigning and centroid value updating until no re-assignment of soft seed labels to clusters of different categories occurs.
- - 16. The system of claim 15, wherein the at least one processor is further programmed to:
    - determine a resulting cluster model for said assigned labeled and un-labeled data points, and generating a fitness score based on said resulting cluster model;
      
      repeat said cluster assigning of labeled and un-labeled data points and said centroid updating of said semi-supervised soft-seeded k-means clustering using different re-initialized centroid assignments at each of a plurality of iterations, and determining a resulting cluster model and corresponding fitness score for said assigned labeled and un-labeled data points at each iteration; and
      
      select the resulting cluster model corresponding to a highest fitness score as a final assignment of all data points to the predefined set of categories.
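
The first step shared by the independent claims, matching a textual description against category descriptions to produce a soft seed label and seed score, might look like the following Python sketch. The claims do not specify a matching technique. The token-overlap (Jaccard) similarity, the `min_score` threshold, and all names below are assumptions, and the sketch returns only the single best label even though the claims allow one or more.

```python
def tokens(text):
    """Lower-cased whitespace tokens as a set (an assumed, very simple tokenizer)."""
    return set(text.lower().split())

def soft_seed(description, categories, min_score=0.2):
    """Return (label, score) for the best-matching category, or (None, 0.0)
    when no category description overlaps enough with the data point's text."""
    words = tokens(description)
    best_label, best_score = None, 0.0
    for label, cat_desc in categories.items():
        cat_words = tokens(cat_desc)
        union = words | cat_words
        # Jaccard similarity: shared tokens divided by all distinct tokens.
        score = len(words & cat_words) / len(union) if union else 0.0
        if score > best_score:
            best_label, best_score = label, score
    if best_score < min_score:
        return None, 0.0
    return best_label, best_score

# Hypothetical category descriptions keyed by cluster index.
categories = {
    0: "database administration services",
    1: "network hardware maintenance",
}
seed = soft_seed("network hardware repair", categories)
unmatched = soft_seed("office catering", categories)
```

Points that come back with a `None` label would be treated as un-labeled in the subsequent clustering, while the returned score plays the role of the seed score.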

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Hu, Jianying; Mojsilovic, Aleksandra; Singh, Moninder
Primary Examiner(s): DANG, THANH HA T
Application Number: US12/315,724
Publication Number: US 20100145961A1
Time in Patent Office: 1,229 Days
Field of Search: 707/706, 707/737, 707/738, 707/740, 707/749, 707/750, 707/751, 707/752, 707/754, 707/763, 707/771
US Class Current: 707/706
CPC Class Codes: G06F 16/355 Class or cluster creation o...
