Iterative validation and sampling-based clustering using error-tolerant frequent item sets

US 6,490,582 B1
Filed: 02/08/2000
Issued: 12/03/2002
Est. Priority Date: 02/08/2000
Status: Expired due to Fees

Abstract

Iterative validation for efficiently determining error-tolerant frequent itemsets is disclosed. A description of the application of error-tolerant frequent itemsets to efficiently determining clusters, as well as to initializing clustering algorithms, is also given. In one embodiment, a method determines a sample set of error-tolerant frequent itemsets (ETF's) within a uniform random sample of data within a database. This sample set of ETF's is independently validated, so that, for example, spurious ETF's and spurious dimensions within the ETF's can be removed. The validated sample set of ETF's is added to the set of ETF's for the database. This process is repeated with additional uniform samples that are mutually exclusive from prior uniform samples, to continue building the database's set of ETF's, until no new sample sets can be found. The method is significantly more efficient than disk-based methods in the prior art, and the data clusters found are often not discovered by traditional clustering algorithms in the prior art.
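The loop described in the abstract can be sketched as follows. This is only an illustrative outline, not the patented implementation: the function name `iterative_etf`, the callables `find_etfs` and `validate`, and the choice to draw both the mining sample and the validation sample as disjoint slices of one random permutation are all assumptions made for the example. The stopping rule ("no new validated ETF's") is one reading of "until no new sample sets can be found".

```python
import random
from typing import Callable, FrozenSet, Sequence, Set

Record = FrozenSet[int]   # assumed encoding: the dimensions whose value is 1
ItemSet = FrozenSet[int]  # an ETF, represented by its defining dimensions


def iterative_etf(
    records: Sequence[Record],
    sample_size: int,
    find_etfs: Callable[[Sequence[Record]], Set[ItemSet]],
    validate: Callable[[Set[ItemSet], Sequence[Record]], Set[ItemSet]],
    seed: int = 0,
) -> Set[ItemSet]:
    """Accumulate validated ETF's over mutually exclusive uniform samples.

    `find_etfs` and `validate` are hypothetical placeholders for the
    in-memory mining step and the validation step, respectively.
    """
    rng = random.Random(seed)
    order = list(range(len(records)))
    # Disjoint slices of one random permutation are uniform samples
    # that are mutually exclusive from each other.
    rng.shuffle(order)

    result: Set[ItemSet] = set()
    pos = 0
    while pos + 2 * sample_size <= len(order):
        sample = [records[i] for i in order[pos:pos + sample_size]]
        holdout = [records[i] for i in order[pos + sample_size:pos + 2 * sample_size]]
        pos += 2 * sample_size

        candidates = find_etfs(sample) - result    # keep only new candidates
        validated = validate(candidates, holdout)  # drop spurious ones
        if not validated:                          # nothing new survived: stop
            break
        result |= validated
    return result
```

The sketch also stops when the remaining data cannot supply another pair of samples, a practical detail the abstract does not address.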

31 Claims

1. A computer-implemented method for determining a set of error-tolerant frequent item sets within a database of data organized into records and dimensions comprising:
- determining a sample set of error-tolerant frequent item sets comprising a set of defining dimensions within a uniform random sample of the data within the database;
  
  validating the sample set of error-tolerant frequent item sets;
  
  determining the set of error-tolerant frequent item sets as including the sample set of error-tolerant frequent item sets as validated;
  
  repeating determining an additional sample set of error-tolerant frequent item sets within additional uniform samples mutually exclusive from prior uniform samples from which sample sets of error-tolerant frequent item sets were determined, validating the additional sample set, and determining the set of error-tolerant frequent item sets as including the additional sample set as validated, until the additional sample set is empty.
  - 2. The method of claim 1, further comprising initially determining the uniform sample of the data within the database.
  - 3. The method of claim 1, wherein the uniform sample fits into memory of a computer on which the method is being implemented.
  - 4. The method of claim 1, wherein validating the sample set of error-tolerant frequent item sets comprises testing the sample set of error-tolerant frequent item sets against a validation random sample of the data within the database that is mutually exclusive with the random sample.
  - 5. The method of claim 4, wherein testing the sample set of error-tolerant frequent item sets against a validation random sample comprises determining the validation random sample.
  - 6. The method of claim 4, wherein testing the sample set of error-tolerant frequent item sets against a validation random sample comprises identifying spurious sets of items within the sample set of error-tolerant frequent item sets, and upon so identifying, removing the spurious sets of items from the sample set of error-tolerant frequent item sets.
  - 7. The method of claim 4, wherein testing the sample set of error-tolerant frequent item sets against a validation random sample comprises identifying spurious defining dimensions within the error-tolerant frequent item sets of the sample set of error-tolerant frequent item sets, and upon so identifying, removing the spurious defining dimensions from the error-tolerant frequent item sets of the sample set of error-tolerant frequent item sets.
  - 8. The method of claim 1, wherein the data of the database comprises at least one of transactional and binary data.
  - 9. The method of claim 1, wherein the data comprises non-binary data, and the method initially comprises transforming the non-binary data into binary data.
  - 10. The method of claim 9, wherein the data comprises categorical discrete data.
  - 11. The method of claim 9, wherein the data comprises continuous data.
  - 12. The method of claim 1, wherein an error-tolerant frequent item set comprises a cluster, such that the method is for clustering the database.
  - 13. The method of claim 1, wherein an error-tolerant frequent item set comprises a cluster defined as a set of records such that the set of records includes at least a predetermined minimum threshold number of records, and for each of the set of records, the fraction of values not equal to a predetermined value over the defining dimensions is not greater than a predetermined maximum error threshold.
  - 14. The method of claim 13, wherein the data of the database comprises binary data, and the predetermined value comprises one of zero and one.
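The cluster definition in claims 13 and 14 can be illustrated with a small check over binary data. This is a hedged sketch: the helper names (`error_fraction`, `supporting_records`, `is_etf`) and the parameter names (`min_records`, `max_error`, `target`) are hypothetical, and records are assumed to be lists of 0/1 values indexed by dimension.

```python
from typing import FrozenSet, List, Sequence


def error_fraction(record: Sequence[int], dims: FrozenSet[int], target: int = 1) -> float:
    """Fraction of the defining dimensions where the record differs from `target`."""
    if not dims:
        raise ValueError("dims must contain at least one defining dimension")
    return sum(1 for d in dims if record[d] != target) / len(dims)


def supporting_records(
    data: Sequence[Sequence[int]],
    dims: FrozenSet[int],
    max_error: float,
    target: int = 1,
) -> List[int]:
    """Indices of records whose error fraction over `dims` is at most `max_error`."""
    return [i for i, r in enumerate(data) if error_fraction(r, dims, target) <= max_error]


def is_etf(
    data: Sequence[Sequence[int]],
    dims: FrozenSet[int],
    min_records: int,
    max_error: float,
    target: int = 1,
) -> bool:
    """True if at least `min_records` records stay within the error threshold."""
    if not dims:
        return False
    return len(supporting_records(data, dims, max_error, target)) >= min_records


data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],  # one error over dimensions {0, 1, 2}
    [1, 0, 1, 0],  # one error over dimensions {0, 1, 2}
    [0, 0, 0, 1],  # three errors over dimensions {0, 1, 2}
]
dims = frozenset({0, 1, 2})
print(supporting_records(data, dims, max_error=0.34))    # [0, 1, 2]
print(is_etf(data, dims, min_records=3, max_error=0.34))  # True
```

With `max_error=0.0` the check reduces to an exact frequent item set test, and only the first record supports the three dimensions.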

15. A computer-implemented method for clustering a database of data organized into records and dimensions comprising:
- determining a first sample set of clusters within a uniform sample of the data within the database;
  
  validating the sample set of clusters by testing the first sample set of clusters against a validation random sample of the data within the database;
  
  determining a result set of clusters as including the first sample set of clusters as validated;
  
  repeating determining an additional sample set of clusters within an additional uniform sample that is mutually exclusive from prior uniform samples from which the result set of clusters were determined, validating the additional sample set, and determining the result set of clusters as including both the first and any additional sample sets as validated, until the additional sample set is empty.
  - 16. The method of claim 15, further comprising initially determining the uniform sample of the data within the database.
  - 17. The method of claim 15, wherein the uniform sample fits into memory of a computer on which the method is being implemented.
  - 18. The method of claim 15, wherein the validation random sample of the data within the database that is mutually exclusive with the random sample.
  - 19. The method of claim 18, wherein testing the sample set of clusters against a validation random sample comprises determining the validation random sample.
  - 20. The method of claim 18, wherein testing the sample set of clusters against a validation random sample comprises identifying spurious clusters within the sample set of clusters, and upon so identifying, removing the spurious clusters from the sample set of clusters.
  - 21. The method of claim 18, wherein testing the sample set of clusters against a validation random sample comprises identifying spurious defining dimensions within the clusters of the sample set of clusters, and upon so identifying, removing the spurious defining dimensions from the clusters of the sample set of clusters.
  - 22. The method of claim 15, wherein a cluster is defined as a set of records and a set of defining dimensions within the data of the database such that the set of records includes at least a predetermined minimum threshold number of records, and for each of the set of records, the fraction of values not equal to a predetermined value (errors) over the defining dimensions is not greater than a predetermined maximum error threshold.
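Claims 20 and 21 describe removing spurious clusters and spurious defining dimensions by testing against a validation sample, but the claims do not state how "spurious" is decided. The sketch below assumes one plausible criterion purely for illustration: a cluster is dropped when too few validation records support it, and a dimension is dropped when too few of the supporting records have a 1 in it. The function `validate_clusters` and the thresholds `min_support`, `max_error`, and `min_density` are hypothetical.

```python
from typing import FrozenSet, Sequence, Set

Row = Sequence[int]  # assumed binary record, indexed by dimension


def _supports(row: Row, dims: FrozenSet[int], max_error: float) -> bool:
    """True if the row's fraction of non-1 values over `dims` is within `max_error`."""
    errors = sum(1 for d in dims if row[d] != 1)
    return errors / len(dims) <= max_error


def validate_clusters(
    candidates: Set[FrozenSet[int]],
    validation: Sequence[Row],
    min_support: float,   # assumed: minimum fraction of validation rows supporting a cluster
    max_error: float,     # maximum error fraction per record
    min_density: float,   # assumed: minimum fraction of 1s a dimension needs among supporters
) -> Set[FrozenSet[int]]:
    kept: Set[FrozenSet[int]] = set()
    for dims in candidates:
        if not dims:
            continue
        support = [r for r in validation if _supports(r, dims, max_error)]
        # Spurious cluster: too little support in the independent sample.
        if not support or len(support) < min_support * len(validation):
            continue
        # Spurious dimension: rarely 1 even among the supporting records.
        pruned = frozenset(
            d for d in dims
            if sum(r[d] for r in support) / len(support) >= min_density
        )
        if pruned:
            kept.add(pruned)
    return kept


validation = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 1],
]
candidates = {frozenset({0, 1, 2}), frozenset({2, 3})}
print(validate_clusters(candidates, validation, 0.6, 0.34, 0.5))
# {frozenset({0, 1})}
```

In this toy run the candidate over dimensions 2 and 3 has no supporting validation records and is removed as spurious, while dimension 2 is pruned from the other candidate because only one of its four supporters has it set.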

23. A machine-readable medium having instructions stored thereon for execution by a processor of a computer to perform a method for determining a set of error-tolerant frequent item sets within a database of data organized into records and dimensions comprising:
- determining a uniform sample of the data within the database that fits into memory of the computer;
  
  determining a sample set of error-tolerant frequent item sets comprising a set of defining dimensions within the uniform sample;
  
  validating the sample set of error-tolerant frequent item sets by testing the sample set of error-tolerant frequent item sets against a validation random sample of the data within the database that is mutually exclusive with the random sample;
  
  determining the set of error-tolerant frequent item sets as including the sample set of error tolerant frequent item sets as validated; and
  
  repeating determining an additional uniform sample that is mutually exclusive from prior uniform samples from which sample sets of error-tolerant frequent item sets were determined, determining an additional sample set of error-tolerant frequent item sets within the additional uniform sample, validating the additional sample set, and determining the set of error-tolerant frequent item sets as including the additional sample set as validated, until the additional sample set is empty.
  - 24. The medium of claim 23, wherein testing the sample set of error-tolerant frequent item sets against a validation random sample comprises determining the validation random sample.
  - 25. The medium of claim 23, wherein testing the sample set of error-tolerant frequent item sets against a validation random sample comprises identifying spurious error-tolerant frequent item sets within the sample set of error-tolerant frequent item sets, and upon so identifying, removing the spurious error-tolerant frequent item sets from the sample set of error-tolerant frequent item sets.
  - 26. The medium of claim 23, wherein testing the sample set of error-tolerant frequent item sets against a validation random sample comprises identifying spurious defining dimensions within the error-tolerant frequent item sets of the sample set of error-tolerant frequent item sets, and upon so identifying, removing the spurious defining dimensions from the error-tolerant frequent item sets of the sample set of error-tolerant frequent item sets.

27. A machine-readable medium having instructions stored thereon for execution by a processor of a computer to perform a method for clustering a database of data organized into records and dimensions comprising:
- determining a uniform sample of the data within the database that fits into memory of the computer;
  
  determining a first sample set of clusters within the uniform sample;
  
  validating the sample set of clusters by testing the sample set of clusters against a validation random sample of the data within the database that is mutually exclusive with the random sample;
  
  determining a result set of clusters as including the sample set of clusters as validated;
  
  repeating determining an additional uniform sample that is mutually exclusive from prior uniform samples from which the result set of clusters were determined, determining an additional sample set of clusters within the additional uniform sample, validating the additional sample set, and determining the result set of clusters as including both the first and any additional sample sets as validated, until the additional sample set is empty.
  - 28. The medium of claim 27, wherein testing the sample set of clusters against a validation random sample comprises determining the validation random sample.
  - 29. The medium of claim 27, wherein testing the sample set of clusters against a validation random sample comprises identifying spurious clusters within the sample set of clusters, and upon so identifying, removing the spurious clusters from the sample set of clusters.
  - 30. The medium of claim 27, wherein testing the sample set of clusters against a validation random sample comprises identifying spurious defining dimensions within the clusters of the sample set of clusters, and upon so identifying, removing the spurious defining dimensions from the clusters of the sample set of clusters.
  - 31. The medium of claim 27, wherein a cluster is defined as a set of records and a set of defining dimensions within the data of the database such that the set of records includes at least a predetermined minimum threshold number of records, and for each of the set of records, the fraction of a predetermined value over the defining dimensions is not greater than a predetermined maximum error threshold.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Yang, Cheng; Fayyad, Usama M.; Bradley, Paul S.
Primary Examiner(s): Breene, John
Assistant Examiner(s): Wong, Leslie
Application Number: US09/500,172
Time in Patent Office: 1,029 Days
Field of Search: 707/3, 707/6
US Class Current: 707/777
CPC Class Codes:
G06F 16/35 Clustering; Classification
Y10S 707/99936 Pattern matching access
