Method and apparatus for rapid identification of column heterogeneity
Abstract
A method and apparatus for rapid identification of column heterogeneity in databases are disclosed. For example, the method receives data associated with a column in a database. The method computes a cluster entropy for the data as a measure of data heterogeneity and then determines whether said data is heterogeneous in accordance with the cluster entropy.
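The entropy step recited in claim 1 can be illustrated with a minimal Python sketch. It assumes the soft clusters are already available as a membership matrix in which each row gives one data value's fractional assignment across the clusters and sums to 1. The function names, the matrix layout, and the default threshold of 0.5 bits are hypothetical choices made for illustration and are not taken from the patent; how the soft clusters themselves are determined is outside the scope of this sketch.

```python
import math


def cluster_entropy(memberships):
    """Entropy (in bits) of the soft-cluster distribution of one column.

    memberships[i][k] is the assumed soft assignment of data point i to
    cluster k; each row is expected to sum to 1.
    """
    n = len(memberships)
    k = len(memberships[0])
    # Probability of each cluster = fraction of data points it contains.
    probs = [sum(row[j] for row in memberships) / n for j in range(k)]
    # Shannon entropy of the resulting distribution; empty clusters add 0.
    return sum(-p * math.log2(p) for p in probs if p > 0)


def is_heterogeneous(memberships, threshold=0.5):
    """Hypothetical decision rule: flag the column when entropy exceeds a threshold."""
    return cluster_entropy(memberships) > threshold


# Every value falls in one cluster: entropy is 0 bits, column looks homogeneous.
uniform_column = [[1.0, 0.0]] * 4
# Values split evenly over two clusters: entropy is 1 bit.
mixed_column = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.5, 0.5]]

print(cluster_entropy(uniform_column), is_heterogeneous(uniform_column))
print(cluster_entropy(mixed_column), is_heterogeneous(mixed_column))
```

A column whose values all land in one cluster yields an entropy of zero, while values spread evenly across more clusters push the entropy toward its maximum of log2 of the cluster count, which is why a single scalar can serve as the heterogeneity measure.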
16 Claims
1. A computer-implemented method for identifying data heterogeneity, the method comprising:
receiving data associated with a column in a database;
computing a cluster entropy solely for the data of the column as a measure of data heterogeneity, wherein the cluster entropy is computed by:
determining a plurality of soft clusters from the data;
assigning a probability to each of the plurality of soft clusters equal to a fraction of data points of the data that each of the plurality of soft clusters contains; and
computing the cluster entropy based on a resulting distribution of the plurality of soft clusters, wherein the entropy of the resulting distribution comprises the cluster entropy;
determining, via a processor, whether the data of the column is heterogeneous in accordance with the cluster entropy; and
providing a determination of whether the data of the column is heterogeneous as an output to a user.
Dependent claims: 2, 3, 4, 5
6. A non-transitory computer-readable storage medium having stored thereon a plurality of instructions, the plurality of instructions including instructions which, when executed by a processor, cause the processor to perform a method for identifying data heterogeneity, comprising:
receiving data associated with a column in a database;
computing a cluster entropy solely for the data of the column as a measure of data heterogeneity, wherein the cluster entropy is computed by:
determining a plurality of soft clusters from the data;
assigning a probability to each of the plurality of soft clusters equal to a fraction of data points of the data that each of the plurality of soft clusters contains; and
computing an entropy of a resulting distribution of the plurality of soft clusters, wherein the entropy of the resulting distribution comprises the cluster entropy; and
determining whether the data of the column is heterogeneous in accordance with the cluster entropy.
Dependent claims: 7, 8, 9, 10, 11
12. An apparatus for identifying data heterogeneity, the apparatus comprising:
a processor; and
a non-transitory computer-readable storage medium in communication with the processor, the computer-readable medium having stored thereon a plurality of instructions, the plurality of instructions including instructions which, when executed by the processor, cause the processor to perform a method comprising:
receiving data associated with a column in a database;
computing a cluster entropy solely for the data of the column as a measure of data heterogeneity, wherein the cluster entropy is computed by:
determining a plurality of soft clusters from the data;
assigning a probability to each of the plurality of soft clusters equal to a fraction of data points of the data that each of the plurality of soft clusters contains; and
computing an entropy of a resulting distribution of the plurality of soft clusters, wherein the entropy of the resulting distribution comprises the cluster entropy; and
determining whether the data of the column is heterogeneous in accordance with the cluster entropy.
Dependent claims: 13, 14, 15, 16
Specification