Iterative validation and sampling-based clustering using error-tolerant frequent item sets
First Claim
1. A computer-implemented method for determining a set of error-tolerant frequent item sets within a database of data organized into records and dimensions comprising:
- determining a sample set of error-tolerant frequent item sets comprising a set of defining dimensions within a uniform random sample of the data within the database;
validating the sample set of error-tolerant frequent item sets;
determining the set of error-tolerant frequent item sets as including the sample set of error-tolerant frequent item sets as validated;
repeating determining an additional sample set of error-tolerant frequent item sets within additional uniform samples mutually exclusive from prior uniform samples from which sample sets of error-tolerant frequent item sets were determined, validating the additional sample set, and determining the set of error-tolerant frequent item sets as including the additional sample set as validated, until the additional sample set is empty.
2 Assignments
0 Petitions
Accused Products
Abstract
Iterative validation for efficiently determining error-tolerant frequent itemsets is disclosed. A description of the application of error-tolerant frequent itemsets to efficiently determining clusters as well as initializing clustering algorithms are also given. In one embodiment, a method determines a sample set of error-tolerant frequent itemsets (ETF'"'"'s) within a uniform random sample of data within a database. This sample set of ETF'"'"'s is independently validated, so that, for example, spurious ETF'"'"'s and spurious dimensions within the ETF'"'"'s can be removed. The validated sample set of ETF'"'"'s, is added to the set of ETF'"'"'s for the database. This process is repeated with additional uniform samples that are mutually exclusive from prior uniform samples, to continue building the database'"'"'s set of ETF'"'"'s, until no new sample sets can be found. The method is significantly more efficient than disk-based methods in the prior art, and the data clusters found are often not discovered by traditional clustering algorithm in the prior art.
-
Citations
31 Claims
-
1. A computer-implemented method for determining a set of error-tolerant frequent item sets within a database of data organized into records and dimensions comprising:
-
determining a sample set of error-tolerant frequent item sets comprising a set of defining dimensions within a uniform random sample of the data within the database;
validating the sample set of error-tolerant frequent item sets;
determining the set of error-tolerant frequent item sets as including the sample set of error-tolerant frequent item sets as validated;
repeating determining an additional sample set of error-tolerant frequent item sets within additional uniform samples mutually exclusive from prior uniform samples from which sample sets of error-tolerant frequent item sets were determined, validating the additional sample set, and determining the set of error-tolerant frequent item sets as including the additional sample set as validated, until the additional sample set is empty. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
-
-
15. A computer-implemented method for clustering a database of data organized into records and dimensions comprising:
-
determining a first sample set of clusters within a uniform sample of the data within the database;
validating the sample set of clusters by testing the first sample set of clusters against a validation random sample of the data within the database;
determining a result set of clusters as including the first sample set of clusters as validated;
repeating determining an additional sample set of clusters within an additional uniform sample that is mutually exclusive from prior uniform samples from which the result set of clusters were determined, validating the additional sample set, and determining the result set of clusters as including both the first and any additional sample sets as validated, until the additional sample set is empty. - View Dependent Claims (16, 17, 18, 19, 20, 21, 22)
-
-
23. A machine-readable medium having instructions stored thereon for execution by a processor of a computer to perform a method for determining a set of error-tolerant frequent item sets within a database of data organized into records and dimensions comprising:
-
determining a uniform sample of the data within the database that fits into memory of the computer;
determining a sample set of error-tolerant frequent item sets comprising a set of defining dimensions within the uniform sample;
validating the sample set of error-tolerant frequent item sets by testing the sample set of error-tolerant frequent item sets against a validation random sample of the data within the database that is mutually exclusive with the random sample;
determining the set of error-tolerant frequent item sets as including the sample set of error tolerant frequent item sets as validated; and
,repeating determining an additional uniform sample that is mutually exclusive from prior uniform samples from which sample sets of error-tolerant frequent item sets were determined, determining an additional sample set of error-tolerant frequent item sets within the additional uniform sample, validating the additional sample set, and determining the set of error-tolerant frequent item sets as including the additional sample set as validated, until the additional sample set is empty. - View Dependent Claims (24, 25, 26)
-
-
27. A machine-readable medium having instructions stored thereon for execution by a processor of a computer to perform a method for clustering a database of data organized into records and dimensions comprising:
-
determining a uniform sample of the data within the database that fits into memory of the computer;
determining a first sample set of clusters within the uniform sample;
validating the sample set of clusters by testing the sample set of clusters against a validation random sample of the data within the database that is mutually exclusive with the random sample;
determining a result set of clusters as including the sample set of clusters as validated;
repeating determining an additional uniform sample that is mutually exclusive from prior uniform samples from which the result set of clusters were determined, determining an additional sample set of clusters within the additional uniform sample, validating the additional sample set, and determining the result set of clusters as including both the first and any additional sample sets as validated, until the additional sample set is empty. - View Dependent Claims (28, 29, 30, 31)
-
Specification