Data mining method and system using regression clustering

US 7,539,690 B2
Filed: 10/27/2003
Issued: 05/26/2009
Est. Priority Date: 10/27/2003
Status: Expired due to Fees

Abstract

A method and a system are provided which regressively cluster datapoints from a plurality of data sources without transferring data between the plurality of data sources. In addition, a method and a system are provided which mine data from a dataset by iteratively applying a regression algorithm and a K-Harmonic Means performance function on a set number of functions derived from the dataset.

22 Claims

1. A method, comprising:
- a processor which performs the following;
  
  selecting a set number of functions correlating variable parameters of a dataset; and
  
  clustering the dataset by iteratively applying a regression algorithm and a K-Harmonic Means performance function on the set number of functions to determine a pattern in said dataset;
  
  wherein said clustering comprises determining distances between data points of the dataset and values correlated with the set number of functions, regressing the set number of functions using data point probability and weighting factors associated with the determined distances, calculating a difference of harmonic averages for the distances determined prior to and subsequent to said regressing, and repeating said regressing, determining and calculating upon determining the difference of harmonic averages is greater than a predetermined value.
  - 2. The processor-based method of claim 1, wherein said determining the distances comprises determining distances from each datapoint of the dataset to values within each function of the set number of functions.
  - 3. The processor-based method of claim 1, wherein said selecting and said clustering are conducted for a plurality of datasets each from a different data source.
  - 4. The processor-based method of claim 3, wherein said selecting and said clustering are conducted in parallel for each of the plurality of datasets.
  - 5. The processor-based method of claim 3, further comprising determining a common coefficient vector to compensate for variations between similar sets of functions within the different data sources.
  - 6. The processor-based method of claim 5, wherein said determining the common coefficient vector comprises:
    - developing matrices from the dataset datapoints and the probability and weighting factors for each of the datasets prior to said reiterating; and
      
      determining the common coefficient vector from a composite of the developed matrices.
  - 7. The processor-based method of claim 6, further comprising multiplying the similar sets of functions within the different data sources by the common coefficient vector.
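
As an illustration only, the loop recited in claim 1 might be sketched in Python as follows. The specification is not reproduced on this page, so the membership and weighting formulas below are borrowed from the commonly published K-Harmonic Means formulation and are an assumption. The straight-line model, the exponent `p`, the tolerance `tol`, and every function name are likewise hypothetical choices made for the sketch.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Line = Tuple[float, float]  # (slope, intercept) of y = slope * x + intercept

def distances(points: List[Point], lines: List[Line], eps: float = 1e-9):
    """Residual |f_k(x) - y| for each point/function pair, floored at eps."""
    return [[max(abs(a * x + b - y), eps) for (a, b) in lines] for (x, y) in points]

def harmonic_performance(dist, p: float = 2.0) -> float:
    """Sum over points of the harmonic average of d**p across the K functions."""
    return sum(len(row) / sum(d ** -p for d in row) for row in dist)

def memberships_and_weights(dist, p: float = 2.0):
    """Assumed KHM-style soft membership per pair and weight per point."""
    probs, weights = [], []
    for row in dist:
        s_p2 = sum(d ** -(p + 2) for d in row)
        s_p = sum(d ** -p for d in row)
        probs.append([d ** -(p + 2) / s_p2 for d in row])
        weights.append(s_p2 / (s_p * s_p))
    return probs, weights

def weighted_line_fit(points: List[Point], w: List[float], fallback: Line) -> Line:
    """Weighted least-squares line; keeps the previous line if degenerate."""
    sw = sum(w)
    if sw <= 0.0:
        return fallback
    mx = sum(wi * x for wi, (x, _) in zip(w, points)) / sw
    my = sum(wi * y for wi, (_, y) in zip(w, points)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, (x, _) in zip(w, points))
    if sxx < 1e-12:
        return fallback
    sxy = sum(wi * (x - mx) * (y - my) for wi, (x, y) in zip(w, points))
    slope = sxy / sxx
    return (slope, my - slope * mx)

def regression_cluster(points, lines, p=2.0, tol=1e-6, max_iter=200):
    """Alternate weighted regression and distance updates until the change
    in the harmonic performance value is no greater than tol."""
    dist = distances(points, lines)
    perf = harmonic_performance(dist, p)
    iteration = 0
    for iteration in range(1, max_iter + 1):
        probs, weights = memberships_and_weights(dist, p)
        lines = [
            weighted_line_fit(
                points,
                [probs[i][k] * weights[i] for i in range(len(points))],
                lines[k],
            )
            for k in range(len(lines))
        ]
        dist = distances(points, lines)
        new_perf = harmonic_performance(dist, p)
        change = abs(perf - new_perf)
        perf = new_perf
        if change <= tol:  # difference no longer greater than the threshold
            break
    return lines, perf, iteration

# Toy dataset mixing two linear relationships (made-up values).
points = [(float(x), 2.0 * x + 1.0) for x in range(10)]
points += [(float(x), -1.0 * x + 10.0) for x in range(10)]
fitted, perf, n_iter = regression_cluster(points, [(1.0, 0.0), (-0.5, 8.0)])
```

The sketch stops when the change in the performance value is at or below `tol`, which mirrors the claim's condition of repeating only while the difference is greater than a predetermined value.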

8. A storage medium comprising program instructions executable by a processor for:
- selecting a set number of functions correlating variable parameters of a dataset;
  
  determining distances between datapoints of the dataset and values correlated with the set number of functions;
  
  calculating harmonic averages of the distances;
  
  regressing the set number of functions using datapoint probability and weighting factors associated with the determined distances;
  
  repeating said determining and calculating for the regressed set of functions;
  
  computing a change in harmonic averages for the set number of functions prior to and subsequent to said regressing; and
  
  reiterating said regressing, repeating and computing upon determining the change in harmonic averages is greater than a predetermined value to thereby determine a pattern in said dataset.
  - 9. The storage medium of claim 8, wherein the program instructions are executable using a processor for computing the datapoint probability and weighting factors.
  - 10. The storage medium of claim 8, wherein the program instructions are executable using a processor for developing matrices from the dataset datapoints and the probability and weighting factors prior to said reiterating.
  - 11. The storage medium of claim 10, wherein the program instructions are executable using a processor for amassing matrices developed from a plurality of datasets each from a different data source.
  - 12. The storage medium of claim 10, wherein the program instructions are executable using a processor for determining a common coefficient vector from the composite of matrices.
  - 13. The method of claim 12, wherein the program instructions are executable using a processor for multiplying similar sets of functions within the different data sources by the common coefficient vector.

14. A system, comprising:
- an input port configured to receive data; and
  
  a processor configured to;
  
  regress functions correlating variable parameters of a set of the data;
  
  cluster the functions using a K-Harmonic Mean performance function; and
  
  repeat said regress and cluster sequentially to thereby determine a pattern in said set of data;
  
  wherein the processor clusters the functions by determining distances between data points of the dataset and values correlated with a set number of functions, regressing the set number of functions using data point probability and weighting factors associated with the determined distances, calculating a difference of harmonic averages for the distances determined prior to and subsequent to said regressing.
  - 15. The system of claim 14, wherein the processor is arranged within one of a plurality of data sources each comprising a processor configured to:
    - regress the functions on a dataset of the respective data source;
      
      cluster the functions using a K-Harmonic Mean performance function; and
      
      repeat said regress and cluster sequentially.
  - 16. The system of claim 14, further comprising a central station coupled to the plurality of data sources, wherein the central station comprises a processor configured to compute common coefficient vectors which compensate for variations between the regressively clustered functions representing the datasets, and wherein each of the processors of the data sources is configured to alter the functions by the common coefficient vectors.

17. A system, comprising:
- a plurality of data sources; and
  
  a means for regressively clustering datapoints from the plurality of data sources without transferring data between the plurality of data sources to thereby determine a pattern in data contained in said data sources and for applying a K-Harmonic Means performance function on the data;
  
  wherein the means for regressively clustering the datasets comprises a storage medium with program instructions executable using a processor for selecting a set number of functions correlating variable parameters of a dataset, determining distances between data points of the dataset and values correlated with the set number of functions, regressing the set number of functions using data point probability and weighting factors associated with the determined distances, calculating a difference of harmonic averages for the distances determined prior to and subsequent to said regressing; and
  
  reiterating said regressing, determining and calculating upon determining the difference of harmonic averages is less than a predetermined value.
  - 18. The system of claim 17, further comprising a central station communicably coupled to the plurality of data sources, wherein the means is further for:
    - collecting dataset information at the central station from the plurality of data sources;
      
      determining a common coefficient vector from the collected dataset information; and
      
      altering datasets within the plurality of data sources by the common coefficient vector.

19. A system, comprising:
- a plurality of data sources each having a processor configured to access datapoints within the respective data source; and
  
  a central station coupled to the plurality of data sources and comprising a processor, wherein the processors of the central station and plurality of data sources are collectively configured to mine the datapoints of the data sources as a whole without transferring all of the datapoints between the data sources and the central station to thereby determine a pattern in datapoints contained in said data sources;
  
  wherein the each of the processors within the plurality of data sources is configured to regressively cluster a dataset within the respective data source;
  
  wherein the processor within the central station is configured to;
  
  collect information pertaining to the regressively clustered datasets;
  
  based upon the collected information, calculate common coefficient vectors which balance variations between functions correlating similar variable parameters of the regressively clustered datasets;
  
  compute a residual error from the common coefficient vectors;
  
  propagate the common coefficient vectors to the data sources upon computing a residual error value greater than a predetermined value; and
  
  send a message to the data sources to terminate the regression clustering of the datasets upon computing a residual error value less than a predetermined value.
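
The propagate-or-terminate decision at the central station in claim 19 could look roughly like the Python sketch below. The claim does not say how the residual error is computed, so the sketch assumes, purely for illustration, that it is the largest absolute change in any coefficient since the previous round. The `Message` type, the `central_station_step` name, and the sample coefficient values are all hypothetical.

```python
from typing import List, NamedTuple, Sequence

class Message(NamedTuple):
    action: str                      # "propagate" or "terminate"
    coefficients: List[List[float]]  # one coefficient vector per function
    residual: float

def central_station_step(
    new_coeffs: Sequence[Sequence[float]],
    old_coeffs: Sequence[Sequence[float]],
    threshold: float,
) -> Message:
    """Decide what the central station sends back to the data sources.

    Assumed residual: largest absolute coefficient change between rounds.
    """
    residual = max(
        abs(n - o)
        for new, old in zip(new_coeffs, old_coeffs)
        for n, o in zip(new, old)
    )
    # Greater than the threshold: propagate the vectors and keep iterating.
    # Otherwise: tell the sources to stop. The claim leaves the equal case
    # unspecified; this sketch treats it as termination.
    action = "propagate" if residual > threshold else "terminate"
    return Message(action, [list(c) for c in new_coeffs], residual)

# Made-up sequence of common coefficient vectors from successive rounds.
rounds = [[[1.0, 0.0]], [[1.8, 0.9]], [[1.99, 1.0]], [[1.9901, 1.0]]]
actions = []
previous = rounds[0]
for current in rounds[1:]:
    msg = central_station_step(current, previous, threshold=1e-3)
    actions.append(msg.action)
    if msg.action == "terminate":
        break
    previous = current
```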

20. A processor-based method for mining data, comprising:
- independently applying a regression clustering algorithm to a plurality of distributed datasets by determining distances between data points of each dataset and values correlated with a set number of functions, regressing the set number of functions using data point probability and weighting factors associated with the determined distances, calculating a difference of harmonic averages for the distances determined prior to and subsequent to application of said regression algorithm, and repeating said regressing, determining and calculating upon determining the difference of harmonic averages is greater than a predetermined value;
  
  developing matrices from probability and weighting factors computed from the regression clustering algorithm, wherein the matrices individually represent the distributed datasets without including all datapoints within the datasets;
  
  determining global coefficient vectors from a composite of the matrices; and
  
  multiplying functions correlating similar variable parameters of the distributed datasets by the global coefficient vectors to thereby determine a pattern in said datasets.
  - 21. The processor-based method of claim 20, further comprising repeating said independently applying, said developing, said determining and said multiplying.
  - 22. The processor-based method of claim 20, further comprising calculating a residue error associated with the global coefficients prior to said multiplying.
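
One plausible reading of "developing matrices" and "determining global coefficient vectors from a composite of the matrices" in claim 20 (and of the common coefficient vector in claims 5 and 6) is that each data source ships only small weighted normal-equation summaries, which are then summed and solved centrally. The Python sketch below illustrates that reading for a single straight-line function. It is an assumption, not the specification's method; the helper names are hypothetical, and the unit weights stand in for the products of probability and weighting factors.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Matrix2 = Tuple[Tuple[float, float], Tuple[float, float]]
Vector2 = Tuple[float, float]

def local_matrices(points: Sequence[Point], w: Sequence[float]) -> Tuple[Matrix2, Vector2]:
    """Per-source summary A = X^T W X and b = X^T W y for design rows [x, 1].

    Only these five sums leave the data source, not the datapoints.
    """
    sxx = sx = s1 = sxy = sy = 0.0
    for wi, (x, y) in zip(w, points):
        sxx += wi * x * x
        sx += wi * x
        s1 += wi
        sxy += wi * x * y
        sy += wi * y
    return ((sxx, sx), (sx, s1)), (sxy, sy)

def combine(parts: List[Tuple[Matrix2, Vector2]]) -> Tuple[Matrix2, Vector2]:
    """Composite of the per-source matrices: element-wise sums."""
    a00 = sum(A[0][0] for A, _ in parts)
    a01 = sum(A[0][1] for A, _ in parts)
    a11 = sum(A[1][1] for A, _ in parts)
    b0 = sum(b[0] for _, b in parts)
    b1 = sum(b[1] for _, b in parts)
    return ((a00, a01), (a01, a11)), (b0, b1)

def solve_2x2(A: Matrix2, b: Vector2) -> Vector2:
    """Solve A c = b by Cramer's rule; c = (slope, intercept)."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    if abs(det) < 1e-12:
        raise ValueError("composite matrix is singular")
    slope = (b[0] * A[1][1] - A[0][1] * b[1]) / det
    intercept = (A[0][0] * b[1] - A[1][0] * b[0]) / det
    return (slope, intercept)

# Two hypothetical data sources holding points from y = 3x - 2.
source_a = [(float(x), 3.0 * x - 2.0) for x in range(5)]
source_b = [(float(x), 3.0 * x - 2.0) for x in range(5, 10)]
parts = [
    local_matrices(source_a, [1.0] * len(source_a)),
    local_matrices(source_b, [1.0] * len(source_b)),
]
common = solve_2x2(*combine(parts))
```

Because the summaries add element-wise, the vector solved from the composite matches a fit over the pooled datapoints, which is consistent with the claim's statement that the matrices represent the datasets without including all datapoints.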

Current Assignee: Micro Focus LLC (Open Text Corporation)
Original Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Inventors: Zhang, Bin
Primary Examiner(s): Vo; Tim
Assistant Examiner(s): Morrison; Jay A

Application Number: US10/694,367
Publication Number: US 20050091189A1
Time in Patent Office: 2,038 Days
Field of Search: 702/181, 702/179
US Class Current: 1/1

CPC Class Codes
G06F 16/35   Clustering; Classification
G06F 16/90   Details of database functio...
G06F 18/2321   using statistics or functio...
Y10S 707/99937   Sorting
Y10S 707/99942   Manipulating data structure...
Y10S 707/99943   Generating database or data...
