Method and system for model-based clustering and signal-bearing medium for storing program of same

US 6,397,166 B1
Filed: 11/06/1998
Issued: 05/28/2002
Est. Priority Date: 11/06/1998
Status: Expired due to Fees

First Claim

Patent Images

1. A method of grouping multiple data points, each data point being a set comprising a measured dependent value and at least one related independent variable value, comprising:

fitting the data points into a model relating the independent and dependent variables of the data points;

calculating similarity and distance between said data points and groups of said data points; and

based on calculated similarity and distance, determining whether to group the multiple data points.

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A method and system for grouping multiple data points, each data point being a set (e.g., a vector, a tuple, etc.) including a measured dependent value and at least one related independent variable value, include fitting the data into a model relating the independent and dependent variables of the data, and calculating similarity and distance between the data points and groups of the data points, thereby to group the multiple data points.

208 Citations

31 Claims

1. A method of grouping multiple data points, each data point being a set comprising a measured dependent value and at least one related independent variable value, comprising:
- fitting the data points into a model relating the independent and dependent variables of the data points;
  
  calculating similarity and distance between said data points and groups of said data points; and
  
  based on calculated similarity and distance, determining whether to group the multiple data points.
- View Dependent Claims (2, 3, 4, 5, 6, 31)
- - 2. The method according to claim 1, wherein said calculating similarity between first and second data points includes:
3. The method according to claim 1, further comprising determining centers of groups of said data points.
4. The method according to claim 3, wherein said determining said centers of said groups includes:
- appending together the data sets corresponding to all the entities assigned to a group, and performing a least square regression;
  
  assuming m entities with n data elements each, using a model fitted in log space of lnY=lnA+γ
  
  M+β
  
  +lnε
  
  , where Y is a vector of size mn, A is a matrix of [a₁a₂. . . a_m], with a₁being a base sale for data set 1 and a_mbeing a base sale for data set m, M is a corresponding vector of markdowns, γ
  
  is a shared price sensitivity factor, and β
  
  is a vector of n shared seasonal indices;
  
  determining R²=1−
  
  (1−
  
  R²)*(n−
  
  1)/(n−
  
  1−
  
  c), where n is a number of observations, and c is a number of coefficients estimated, as a measure of average group score; and
  
  determining the similarity measure between the element and the center by one of using an array [β
  
  , γ
  
  ] to define the group center, and scoring each element of the group against the center, such that the element with the highest measure is designated as the center.
5. The method according to claim 3, wherein said calculating a similarity measure between first and second groups of data points, includes:
- scoring each element of each group against the center thereof by finding the similarity measure between the element and the center, and using the element with the highest measure as the center; and
  
  calculating the similarity measure between the centers of two groups.
6. The method according to claim 1, wherein a distance between groups is determined by 100−
- the similarity measure.
31. The method according to claim 1, wherein said model is defined by

7. A system for grouping multiple data points, each data point being a set comprising a measured dependent value and at least one related independent variable value, comprising:
- means for fitting the data into a model relating the independent and dependent variables of the data;
  
  means for calculating similarity and distance between said data points and groups of said data points; and
  
  means for determining, based on calculated similarity and distance, whether to group the multiple data points.
- View Dependent Claims (8, 9, 10, 11, 12)
- - 8. The system according to claim 7, wherein said means for calculating said similarity measure between first and second data points includes:
9. The system according to claim 7, further comprising means for determining centers of groups of said data points.
10. The system according to claim 9, wherein said means for calculating said centers of said groups includes:
- means for appending together the data sets corresponding to all the entities assigned to a cluster, and performing a least square regression;
  
  assuming m entities with n data elements each, means for using a model fitted in log space of in lnY=lnA+γ
  
  M+β
  
  +lnε
  
  , where Y is a vector of size mn, A is a matrix of [a₁a₂. . . a_m], with a₁being a base sale for data set 1 and a_mbeing a base sale for data set m, M is a corresponding vector of markdowns, γ
  
  is a shared price sensitivity factor, and β
  
  is a vector of n shared seasonal indices;
  
  means for determining R²=1−
  
  (1−
  
  R²)*(n−
  
  1)/(n−
  
  1−
  
  c), where n is a number of observations, and c is a number of coefficients estimated, as a measure of average group score; and
  
  means for determining the similarity measure between the element and the center by one of using an array [β
  
  , γ
  
  ] to define the group center, and scoring each element of the group against the center, such that the element with the highest measure is designated as the center.
11. The system according to claim 7, wherein said means for calculating a similarity measure between first and second groups of data points, includes:
- means for scoring each element of each group against the center thereof by finding the similarity measure between the element and the center, and using the element with the highest measure as the center; and
  
  means for calculating the similarity measure between the centers of two groups.
12. The system according to claim 7, wherein a distance between groups is determined by 100−
- the similarity measure.

13. A signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to perform a method for computer-implemented model-based grouping of multiple data points, each data point being a set comprising a measured dependent value and at least one related independent variable value, said method comprising:
- fitting the data into a model relating the independent and dependent variables of the data;
  
  calculating similarity and distance between said data points and groups of said data points; and
  
  based on calculated similarity and distance, determining whether to group the multiple data points.
- View Dependent Claims (14, 15, 16, 17, 18)
- - 14. The signal-bearing medium according to claim 13, wherein said determining said similarity measure between first and second data points includes:
15. The signal-bearing medium according to claim 13, further comprising determining centers of groups of said data points.
16. The signal-bearing medium according to claim 15, wherein said determining said centers of said groups includes:
- appending together the data sets corresponding to all entities assigned to a cluster, and performing a least square regression;
  
  assuming m entities with n data elements each, using a model fitted in log space of lnY=lnA+γ
  
  M+β
  
  +lnε
  
  , where Y is a vector of size mn, A is a matrix of [a₁a₂. . . a_m], with a₁being a base sale for data set 1 and a_mbeing a base sale for data set m, M is a corresponding vector of markdowns, γ
  
  is a shared price sensitivity factor, and β
  
  is a vector of n shared seasonal indices;
  
  determining R²=1−
  
  (1−
  
  R²)*(n−
  
  1)/(n−
  
  1−
  
  c), where n is a number of observations, and c is a number of coefficients estimated, as a measure of average group score; and
  
  determining the similarity measure between the element and the center by one of using an array [β
  
  , γ
  
  ] to define the group center, and scoring each element of the group against the center, such that the element with the highest measure is designated as the center.
17. The signal-bearing medium according to claim 15, wherein said calculating a similarity measure between first and second groups of data points, includes:
- scoring each element of each group against the center thereof by finding the similarity measure between the element and the center, and using the element with the highest measure as the center; and
  
  calculating the similarity measure between the centers of two groups.
18. The signal-bearing medium according to claim 13, wherein a distance between groups is determined by 100−
- the similarity measure.

19. A method of model-based clustering, comprising:
- initializing clustering parameters for a plurality of items;
  
  providing a data set for clustering, and cluster center seeds, and calculating an target number of clusters;
  
  incrementing an iteration counter;
  
  scoring each item in the data set against all available cluster centers using a similarity measure process, wherein if a similarity measure value of the item being examined is greater than a minimum first parameter, no further search is performed for the item, and the item is assigned to a particular cluster, and when the similarity measure value is less than said minimum first parameter, the item is assigned to the cluster against which the item scores the highest;
  
  removing clusters having a predetermined low number of assigned items, said removed clusters including items which are unassigned;
  
  updating cluster centers for all remaining clusters;
  
  calculating an overall average cluster score as the average of all the average cluster scores to determine an overall distance, an overall distance being recorded for each iteration performed;
  
  determining whether an iteration is an odd-numbered iteration, wherein if it is determined that the iteration is an odd numbered iteration and that the remaining number of clusters is less than twice the target number calculated, then for each cluster checking a splitting criterion; and
  
  determining whether a cluster is a candidate for splitting based on whether 100−
  
  average cluster score is greater than the overall distance, and whether the cluster has more than twice the minimum number of items needed, wherein an item which scores the least by having a lowest similarity measure against the cluster center is used as a seed for a new cluster to be formed.
- View Dependent Claims (20, 21, 22, 23, 24, 25, 26)
- - 20. The method according to claim 19, further comprising:
21. The method according to claim 19, wherein said initializing comprises:
- providing user-input values for the parameters including at least one of the maximum number of clustering iterations to be performed, the minimum number of items needed to form a cluster, the minimum score needed to stop searching and assign an item to a cluster, and the minimum score needed to combine two clusters; and
  
  initializing dimensions of arrays used to read the data set.
22. The method according to claim 19, wherein said cluster center seeds are used as initial cluster centers.
23. The method according to claim 19, wherein said method is applied to retail transactions, andwherein said determining said similarity measure between first and second entities includes:
- appending two data sets together and performing a least square regression fit for an assumed model form using the logarithm of observed sales data and markdown data, wherein if the data sets have n entries each, then the model being fitted in the log space is lnY=lnA+γ
  
  m+β
  
  +lnε
  
  , where Y is a vector of size 2n, A is a matrix of [a₁a₂], where a₁is a base sale for data set 1 and a₂is a base sale for data set 2, m is a corresponding vector of markdowns, γ
  
  is a shared price sensitivity factor, and β
  
  is a vector of n shared seasonal indices;
  
  using an adjusted regression coefficient R²obtained from the fit for the similarity measure, wherein R²is defined as 1-SSE/SS_yy, where SS_yy=Σ
  
  (y_i−
  
  {overscore (y)})², y being a mean of all observations of y, and SSE=Σ
  
  (y_i−
  
  ŷ
  
  )², where ŷ
  
  is a predicted value of y, based on the least square model fit, and adjusted R²=1−
  
  (1−
  
  R²)*(n−
  
  1)/(n−
  
  1−
  
  c), where n is a number of observations, and c is a number of coefficients estimated; and
  
  using the value 100* R²as the similarity measure between the two data sets.
24. The method according to claim 19, wherein said updating cluster centers for remaining clusters includes:
- appending together data sets corresponding to all the entities assigned to a cluster and performing a least square regression, wherein if there are m entities with n data elements each, then the model fitted in the log space is lnY=lnA+γ
  
  M+β
  
  +lnε
  
  , where Y is a vector of size mn, A is a matrix of [a₁a₂. . . a_m], a₁being a base sale for data set 1 and a_mbeing a base sale for data set m), M is a corresponding vector of markdowns, γ
  
  is a shared price sensitivity factor, and β
  
  is a vector of n shared seasonal indices;
  
  determining adjusted , R²=1−
  
  (1−
  
  R²)*(n−
  
  1)/(n−
  
  1−
  
  c), where n is a number of observations, and c is a number of coefficients estimated, R²being one of a measure of compactness of the cluster and an average cluster score; and
  
  one of using an array [β
  
  , γ
  
  ] to define the cluster center, and scoring each element of the cluster against the center by determining a similarity measure between the element and the center, and selecting the item with the highest measure as the center.
25. The method according to claim 24, wherein said process of finding a similarity measure between first and second clusters, includes:
- scoring each element of each cluster against the center by finding the similarity measure between the element and the center, and using the item with the highest measure as the center; and
  
  calculating the similarity measure between the centers of two clusters.
26. The method according to claim 19, wherein a distance between clusters is determined by 100−
- the similarity measure.

27. In a model-based clustering process for a plurality of data points, a method of determining a similarity measure between first and second data points of said plurality of data points includes:
- appending two data sets together and performing a least square regression fit for an assumed model form using the logarithm of the observed sales data and the markdown data, wherein if the data sets have n entries each, then the model being fitted in the log space is lnY=lnA+γ
  
  m+β
  
  +lnε
  
  , where Y is a vector of size 2n, A is a matrix of [a₁a₂], where a₁is a base sale for data set 1 and a₂is a base sale for data set 2, m is a corresponding vector of markdowns, γ
  
  is a shared price sensitivity factor, and β
  
  is a vector of n shared seasonal indices;
  
  using an adjusted regression coefficient R²obtained from the fit for the similarity measure, wherein R²is defined as 1-SSE/SS_yy, where SS_yy=Σ
  
  (y_i−
  
  {overscore (y)})², y being a mean of all the observations of y, and SSE=Σ
  
  (y_i−
  
  ŷ
  
  )², where ŷ
  
  is a predicted value of y, based on the least square model fit, and adjusted R²=1−
  
  (1−
  
  R²)*(n−
  
  1)/(n−
  
  1−
  
  c), where n is a number of observations, and c is a number of coefficients estimated; and
  
  using the value 100* R²as the similarity measure between the two data sets.

28. In a model-based clustering process for a plurality of clusters of data, a method of calculating a similarity measure between first and second clusters of data, includes:
- determining centers of each of said first and second clusters;
  
  scoring each element of each cluster against a center by finding the similarity measure between the element and the center, and using the element with the highest measure as the center; and
  
  calculating the similarity measure between the centers of two clusters.
- View Dependent Claims (29, 30)
- - 29. The method according to claim 28, wherein said determining said centers of said groups includes:
30. The method according to claim 28, wherein a distance between clusters is determined by 100−
- the similarity measure.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Ramaswamy, Sanjay E., Leung, Ying Tat, Levanoni, Menachem
Primary Examiner(s)
ASSOUAD, PATRICK J

Application Number

US09/187,406
Time in Patent Office

1,299 Days
Field of Search

702/179,181 705/7,10,30 703/2 707/7,3,6
US Class Current

702/179
CPC Class Codes

G06F 17/18 for evaluating statistical ...

G06F 18/2321 using statistics or functio...

Method and system for model-based clustering and signal-bearing medium for storing program of same

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

208 Citations

31 Claims

Specification

Solutions

Use Cases

Quick Links

Method and system for model-based clustering and signal-bearing medium for storing program of same

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

208 Citations

31 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links