Method and system for model-based clustering and signal-bearing medium for storing program of same
First Claim
Patent Images
1. A method of grouping multiple data points, each data point being a set comprising a measured dependent value and at least one related independent variable value, comprising:
- fitting the data points into a model relating the independent and dependent variables of the data points;
calculating similarity and distance between said data points and groups of said data points; and
based on calculated similarity and distance, determining whether to group the multiple data points.
1 Assignment
0 Petitions
Accused Products
Abstract
A method and system for grouping multiple data points, each data point being a set (e.g., a vector, a tuple, etc.) including a measured dependent value and at least one related independent variable value, include fitting the data into a model relating the independent and dependent variables of the data, and calculating similarity and distance between the data points and groups of the data points, thereby to group the multiple data points.
208 Citations
31 Claims
-
1. A method of grouping multiple data points, each data point being a set comprising a measured dependent value and at least one related independent variable value, comprising:
-
fitting the data points into a model relating the independent and dependent variables of the data points;
calculating similarity and distance between said data points and groups of said data points; and
based on calculated similarity and distance, determining whether to group the multiple data points. - View Dependent Claims (2, 3, 4, 5, 6, 31)
appending two data sets together and performing a least square regression fit for an assumed model form using the logarithm of observed sales data and markdown data, wherein if the data sets have n entries each, then the model being fitted in the log space is lnY=lnA+γ
m+β
+lnε
, where Y is a vector of size 2n, A is a matrix of [a1 a2], where a1 is a base sale for data set 1 and a2 is a base sale for data set 2, m is a corresponding vector of markdowns, γ
is a shared price sensitivity factor, and β
is a vector of n shared seasonal indices;
using an adjusted regression coefficient R2 obtained from the fit for the similarity measure, wherein R2 is defined as 1-SSE/SSyy, where SSyy=Σ
(yi−
{overscore (y)})2, y being a mean of all observations of y, and SSE=Σ
(yi−
ŷ
)2, where ŷ
is a predicted value of y, based on the least square model fit, and adjusted R2=1−
(1−
R2)*(n−
1)/(n−
1−
c), where n is a number of observations, and c is a number of coefficients estimated; and
using a value 100* R2 as the similarity measure between the two data sets.
-
-
3. The method according to claim 1, further comprising determining centers of groups of said data points.
-
4. The method according to claim 3, wherein said determining said centers of said groups includes:
-
appending together the data sets corresponding to all the entities assigned to a group, and performing a least square regression;
assuming m entities with n data elements each, using a model fitted in log space of lnY=lnA+γ
M+β
+lnε
, where Y is a vector of size mn, A is a matrix of [a1 a2 . . . am], with a1 being a base sale for data set 1 and am being a base sale for data set m, M is a corresponding vector of markdowns, γ
is a shared price sensitivity factor, and β
is a vector of n shared seasonal indices;
determining R2=1−
(1−
R2)*(n−
1)/(n−
1−
c), where n is a number of observations, and c is a number of coefficients estimated, as a measure of average group score; and
determining the similarity measure between the element and the center by one of using an array [β
, γ
] to define the group center, and scoring each element of the group against the center, such that the element with the highest measure is designated as the center.
-
-
5. The method according to claim 3, wherein said calculating a similarity measure between first and second groups of data points, includes:
-
scoring each element of each group against the center thereof by finding the similarity measure between the element and the center, and using the element with the highest measure as the center; and
calculating the similarity measure between the centers of two groups.
-
-
6. The method according to claim 1, wherein a distance between groups is determined by 100−
- the similarity measure.
-
31. The method according to claim 1, wherein said model is defined by
-
7. A system for grouping multiple data points, each data point being a set comprising a measured dependent value and at least one related independent variable value, comprising:
-
means for fitting the data into a model relating the independent and dependent variables of the data;
means for calculating similarity and distance between said data points and groups of said data points; and
means for determining, based on calculated similarity and distance, whether to group the multiple data points. - View Dependent Claims (8, 9, 10, 11, 12)
means for appending two data sets together and performing a least square regression fit for an assumed model form using the logarithm of observed sales data and markdown data, wherein if the data sets have n entries each, then the model being fitted in the log space is lnY=lnA+γ
m+β
+lnε
, where Y is a vector of size 2n, A is a matrix of [a1 a2], where a1 is a base sale for data set 1 and a2 is a base sale for data set.2, m is a corresponding vector of markdowns, γ
is a shared price sensitivity factor, and β
is a vector of n shared seasonal indices;
means for using an adjusted regression coefficient R2 obtained from the fit for the similarity measure, wherein R2 is defined as 1-SSE/SSyy, where SSyy=Σ
(yi−
{overscore (y)})2, y being a mean of all observations of y, and SSE=Σ
(yi−
ŷ
)2, where ŷ
is a predicted value of y, based on the least square model fit, and adjusted R2=1−
(1−
R2)*(n−
1)/(n−
1−
c), where n is a number of observations, and c is a number of coefficients estimated; and
means for using a value 100* R2 is used as the similarity measure between the two data sets.
-
-
9. The system according to claim 7, further comprising means for determining centers of groups of said data points.
-
10. The system according to claim 9, wherein said means for calculating said centers of said groups includes:
-
means for appending together the data sets corresponding to all the entities assigned to a cluster, and performing a least square regression;
assuming m entities with n data elements each, means for using a model fitted in log space of in lnY=lnA+γ
M+β
+lnε
, where Y is a vector of size mn, A is a matrix of [a1 a2 . . . am], with a1 being a base sale for data set 1 and am being a base sale for data set m, M is a corresponding vector of markdowns, γ
is a shared price sensitivity factor, and β
is a vector of n shared seasonal indices;
means for determining R2=1−
(1−
R2)*(n−
1)/(n−
1−
c), where n is a number of observations, and c is a number of coefficients estimated, as a measure of average group score; and
means for determining the similarity measure between the element and the center by one of using an array [β
, γ
] to define the group center, and scoring each element of the group against the center, such that the element with the highest measure is designated as the center.
-
-
11. The system according to claim 7, wherein said means for calculating a similarity measure between first and second groups of data points, includes:
-
means for scoring each element of each group against the center thereof by finding the similarity measure between the element and the center, and using the element with the highest measure as the center; and
means for calculating the similarity measure between the centers of two groups.
-
-
12. The system according to claim 7, wherein a distance between groups is determined by 100−
- the similarity measure.
-
13. A signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to perform a method for computer-implemented model-based grouping of multiple data points, each data point being a set comprising a measured dependent value and at least one related independent variable value, said method comprising:
-
fitting the data into a model relating the independent and dependent variables of the data;
calculating similarity and distance between said data points and groups of said data points; and
based on calculated similarity and distance, determining whether to group the multiple data points. - View Dependent Claims (14, 15, 16, 17, 18)
appending two data sets together and performing a least square regression fit for an assumed model form using the logarithm of observed sales data and markdown data, wherein if the data sets have n entries each, then the model being fitted in the log space is lnY=lnA+γ
m+β
+lnε
, where Y is a vector of size 2n, A is a matrix of [a1 a2], where a1 is a base sale for data set 1 and a2 is a base sale for data set 2, m is a corresponding vector of markdowns, γ
is a shared price sensitivity factor, and β
is a vector of n shared seasonal indices;
using an adjusted regression coefficient R2 obtained from the fit for the similarity measure, wherein R2 is defined as 1-SSE/SSyy, where SSyy=Σ
(yi−
{overscore (y)})2, y being a mean of all observations of y, and SSE=Σ
(yi−
ŷ
)2, where ŷ
is a predicted value of y, based on the least square model fit, and adjusted R2=1−
(1−
R2)*(n−
1)/(n−
1−
c), where n is a number of observations, and c is a number of coefficients estimated; and
using the value 100* R2 as the similarity measure between the two data sets.
-
-
15. The signal-bearing medium according to claim 13, further comprising determining centers of groups of said data points.
-
16. The signal-bearing medium according to claim 15, wherein said determining said centers of said groups includes:
-
appending together the data sets corresponding to all entities assigned to a cluster, and performing a least square regression;
assuming m entities with n data elements each, using a model fitted in log space of lnY=lnA+γ
M+β
+lnε
, where Y is a vector of size mn, A is a matrix of [a1 a2 . . . am], with a1 being a base sale for data set 1 and am being a base sale for data set m, M is a corresponding vector of markdowns, γ
is a shared price sensitivity factor, and β
is a vector of n shared seasonal indices;
determining R2=1−
(1−
R2)*(n−
1)/(n−
1−
c), where n is a number of observations, and c is a number of coefficients estimated, as a measure of average group score; and
determining the similarity measure between the element and the center by one of using an array [β
, γ
] to define the group center, and scoring each element of the group against the center, such that the element with the highest measure is designated as the center.
-
-
17. The signal-bearing medium according to claim 15, wherein said calculating a similarity measure between first and second groups of data points, includes:
-
scoring each element of each group against the center thereof by finding the similarity measure between the element and the center, and using the element with the highest measure as the center; and
calculating the similarity measure between the centers of two groups.
-
-
18. The signal-bearing medium according to claim 13, wherein a distance between groups is determined by 100−
- the similarity measure.
-
19. A method of model-based clustering, comprising:
-
initializing clustering parameters for a plurality of items;
providing a data set for clustering, and cluster center seeds, and calculating an target number of clusters;
incrementing an iteration counter;
scoring each item in the data set against all available cluster centers using a similarity measure process, wherein if a similarity measure value of the item being examined is greater than a minimum first parameter, no further search is performed for the item, and the item is assigned to a particular cluster, and when the similarity measure value is less than said minimum first parameter, the item is assigned to the cluster against which the item scores the highest;
removing clusters having a predetermined low number of assigned items, said removed clusters including items which are unassigned;
updating cluster centers for all remaining clusters;
calculating an overall average cluster score as the average of all the average cluster scores to determine an overall distance, an overall distance being recorded for each iteration performed;
determining whether an iteration is an odd-numbered iteration, wherein if it is determined that the iteration is an odd numbered iteration and that the remaining number of clusters is less than twice the target number calculated, then for each cluster checking a splitting criterion; and
determining whether a cluster is a candidate for splitting based on whether 100−
average cluster score is greater than the overall distance, and whether the cluster has more than twice the minimum number of items needed, wherein an item which scores the least by having a lowest similarity measure against the cluster center is used as a seed for a new cluster to be formed.- View Dependent Claims (20, 21, 22, 23, 24, 25, 26)
assigning all items to clusters using the similarity measure calculation;
determining whether the iteration is an even-numbered iteration, wherein for even-numbered iterations, joining of clusters is attempted, and each cluster is scored against another by using finding a similarity measure between two clusters, and for each cluster a most similar cluster is found;
checking the similarity measure against a parameter, wherein if the similarity score is higher, then that pair of clusters are combined into one cluster by using any one of the centers;
assigning all items to clusters based on said similarity measure; and
checking the iteration number against a maximum iteration parameter, wherein if the iteration number is less than said maximum iteration parameter, the iteration number is incremented, and a sequence is repeated, and wherein if it is determined that the iteration is greater than the maximum iteration parameter, then the process terminates, wherein the iteration with the lowest overall distance is selected, and corresponding assignments of items to clusters, the cluster scores and parameter estimates are used.
-
-
21. The method according to claim 19, wherein said initializing comprises:
-
providing user-input values for the parameters including at least one of the maximum number of clustering iterations to be performed, the minimum number of items needed to form a cluster, the minimum score needed to stop searching and assign an item to a cluster, and the minimum score needed to combine two clusters; and
initializing dimensions of arrays used to read the data set.
-
-
22. The method according to claim 19, wherein said cluster center seeds are used as initial cluster centers.
-
23. The method according to claim 19, wherein said method is applied to retail transactions, and
wherein said determining said similarity measure between first and second entities includes: -
appending two data sets together and performing a least square regression fit for an assumed model form using the logarithm of observed sales data and markdown data, wherein if the data sets have n entries each, then the model being fitted in the log space is lnY=lnA+γ
m+β
+lnε
, where Y is a vector of size 2n, A is a matrix of [a1 a2], where a1 is a base sale for data set 1 and a2 is a base sale for data set 2, m is a corresponding vector of markdowns, γ
is a shared price sensitivity factor, and β
is a vector of n shared seasonal indices;
using an adjusted regression coefficient R2 obtained from the fit for the similarity measure, wherein R2 is defined as 1-SSE/SSyy, where SSyy=Σ
(yi−
{overscore (y)})2, y being a mean of all observations of y, and SSE=Σ
(yi−
ŷ
)2, where ŷ
is a predicted value of y, based on the least square model fit, and adjusted R2=1−
(1−
R2)*(n−
1)/(n−
1−
c), where n is a number of observations, and c is a number of coefficients estimated; and
using the value 100* R2 as the similarity measure between the two data sets.
-
-
24. The method according to claim 19, wherein said updating cluster centers for remaining clusters includes:
-
appending together data sets corresponding to all the entities assigned to a cluster and performing a least square regression, wherein if there are m entities with n data elements each, then the model fitted in the log space is lnY=lnA+γ
M+β
+lnε
, where Y is a vector of size mn, A is a matrix of [a1 a2 . . . am], a1 being a base sale for data set 1 and am being a base sale for data set m), M is a corresponding vector of markdowns, γ
is a shared price sensitivity factor, and β
is a vector of n shared seasonal indices;
determining adjusted , R2=1−
(1−
R2)*(n−
1)/(n−
1−
c), where n is a number of observations, and c is a number of coefficients estimated, R2 being one of a measure of compactness of the cluster and an average cluster score; and
one of using an array [β
, γ
] to define the cluster center, and scoring each element of the cluster against the center by determining a similarity measure between the element and the center, and selecting the item with the highest measure as the center.
-
-
25. The method according to claim 24, wherein said process of finding a similarity measure between first and second clusters, includes:
-
scoring each element of each cluster against the center by finding the similarity measure between the element and the center, and using the item with the highest measure as the center; and
calculating the similarity measure between the centers of two clusters.
-
-
26. The method according to claim 19, wherein a distance between clusters is determined by 100−
- the similarity measure.
-
27. In a model-based clustering process for a plurality of data points, a method of determining a similarity measure between first and second data points of said plurality of data points includes:
-
appending two data sets together and performing a least square regression fit for an assumed model form using the logarithm of the observed sales data and the markdown data, wherein if the data sets have n entries each, then the model being fitted in the log space is lnY=lnA+γ
m+β
+lnε
, where Y is a vector of size 2n, A is a matrix of [a1 a2], where a1 is a base sale for data set 1 and a2 is a base sale for data set 2, m is a corresponding vector of markdowns, γ
is a shared price sensitivity factor, and β
is a vector of n shared seasonal indices;
using an adjusted regression coefficient R2 obtained from the fit for the similarity measure, wherein R2 is defined as 1-SSE/SSyy, where SSyy=Σ
(yi−
{overscore (y)})2, y being a mean of all the observations of y, and SSE=Σ
(yi−
ŷ
)2, where ŷ
is a predicted value of y, based on the least square model fit, and adjusted R2=1−
(1−
R2)*(n−
1)/(n−
1−
c), where n is a number of observations, and c is a number of coefficients estimated; and
using the value 100* R2 as the similarity measure between the two data sets.
-
-
28. In a model-based clustering process for a plurality of clusters of data, a method of calculating a similarity measure between first and second clusters of data, includes:
-
determining centers of each of said first and second clusters;
scoring each element of each cluster against a center by finding the similarity measure between the element and the center, and using the element with the highest measure as the center; and
calculating the similarity measure between the centers of two clusters. - View Dependent Claims (29, 30)
appending together the data sets corresponding to all the entities assigned to a cluster, and performing a least square regression;
assuming m entities with n data elements each, using a model fitted in log space of lnY=lnA+γ
M+β
+lnε
, where Y is a vector of size mn, A is a matrix of [a1 a2 . . . am], with a1 being a base sale for data set 1 and am being a base sale for data set m, M is a corresponding vector of markdowns, γ
is a shared price sensitivity factor, and β
is a vector of n shared seasonal indices;
determining R2=1−
(1−
R2)*(n−
1)/(n−
1−
c), where n is a number of observations, and c is a number of coefficients estimated, as a measure of average cluster score; and
determining the similarity measure between the element and the center by one of using an array [β
, γ
] to define the cluster center, and scoring each element of the cluster against the center, such that the element with the highest measure is designated as the center.
-
-
30. The method according to claim 28, wherein a distance between clusters is determined by 100−
- the similarity measure.
Specification