Data clustering based on candidate queries
First Claim
1. A method, including:
receiving data records, the received data records each including one or more values in one or more fields; and
processing the received data records to identify at least one matched data cluster to associate with each received data record, the processing including:
for at least one selected data record from the received data records, generating a query from the one or more values included in the selected data record and performing at least a first comparison, a second comparison, and a third comparison using the generated query;
identifying, in the first comparison, one or more candidate data records from the received data records using the query and an approximate distance measure;
determining, in the second comparison performed after the first comparison, whether or not the selected data record satisfies a growth criterion for at least one candidate data cluster of one or more existing data clusters containing the candidate records, wherein the growth criterion is different from any cluster membership criterion for any candidate data cluster and uses the query and a first threshold associated with a boundary around a respective predetermined member of a candidate data cluster;
determining, in the third comparison performed after the second comparison, whether or not the selected data record satisfies a cluster membership criterion for at least one candidate data cluster of one or more existing data clusters containing the candidate records using the query and a second threshold associated with a detailed distance measure more accurate than the approximate distance measure; and
selecting the matched data cluster from among one or more candidate data clusters if the selected data record satisfies both the cluster membership criterion and the growth criterion for the matched data cluster, or initializing the matched data cluster with the selected data record if the selected data record does not satisfy the growth criterion for any of the existing data clusters or if the selected data record does satisfy the growth criterion for at least one of the existing data clusters but does not satisfy a cluster membership criterion for any of the existing data clusters.
Abstract
Received data records, each including one or more values in one or more fields, are processed to identify a matched data cluster. The processing includes: for selected data records, generating a query from one or more values; identifying one or more candidate data records from the received data records using the query; determining whether or not the selected data record satisfies a cluster membership criterion for at least one candidate data cluster of one or more existing data clusters containing the candidate records; and selecting the matched data cluster from among one or more candidate data clusters based at least in part on a growth criterion for the candidate data clusters, or initializing the matched data cluster with the selected data record if the selected data record does not satisfy a cluster membership criterion for any of the existing data clusters or based on a result of the growth criterion.
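The three-comparison flow recited in the first claim can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `make_query`, `approx_distance`, `detailed_distance`, `Cluster`, and `assign`, the choice of token Jaccard distance as the approximate measure and Levenshtein distance as the detailed measure, the use of the first record of a cluster as its predetermined member, and all threshold values are hypothetical.

```python
from dataclasses import dataclass, field


def make_query(record):
    """Build a query string from a record's field values (assumed normalization)."""
    return " ".join(str(v).lower().strip() for v in record.values() if v)


def approx_distance(q1, q2):
    """Cheap token-level Jaccard distance, standing in for the approximate measure."""
    a, b = set(q1.split()), set(q2.split())
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def detailed_distance(s, t):
    """Levenshtein edit distance, standing in for the more accurate detailed measure."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


@dataclass
class Cluster:
    primary: str                                  # query of the predetermined member (assumed: first record)
    members: list = field(default_factory=list)   # queries of all member records


def assign(records, approx_max=0.7, growth_max=6, member_max=3):
    """Assign each record to a matched cluster, creating new clusters as needed."""
    clusters, labels = [], []
    for record in records:
        q = make_query(record)
        # First comparison: clusters containing a record close to the query
        # under the approximate measure become candidate clusters.
        candidates = [i for i, c in enumerate(clusters)
                      if any(approx_distance(q, m) <= approx_max for m in c.members)]
        # Second comparison: growth criterion, a boundary (first threshold)
        # around the cluster's predetermined member.
        growable = [i for i in candidates
                    if detailed_distance(q, clusters[i].primary) <= growth_max]

        def nearest(i):
            return min(detailed_distance(q, m) for m in clusters[i].members)

        # Third comparison: membership criterion, detailed distance to the
        # nearest member within the second threshold.
        matched = [i for i in growable if nearest(i) <= member_max]
        if matched:
            best = min(matched, key=nearest)      # assumed tie-breaking rule
            clusters[best].members.append(q)
            labels.append(best)
        else:
            # Growth or membership failed for every candidate: start a new cluster.
            clusters.append(Cluster(primary=q, members=[q]))
            labels.append(len(clusters) - 1)
    return labels, clusters
```

In this sketch the growth criterion is what stops a cluster from drifting: a record may be within the membership threshold of some recently added member yet still start a new cluster because it lies outside the boundary around the predetermined member.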
101 Claims
1. A method, including:
receiving data records, the received data records each including one or more values in one or more fields; and
processing the received data records to identify at least one matched data cluster to associate with each received data record, the processing including:
for at least one selected data record from the received data records, generating a query from the one or more values included in the selected data record and performing at least a first comparison, a second comparison, and a third comparison using the generated query;
identifying, in the first comparison, one or more candidate data records from the received data records using the query and an approximate distance measure;
determining, in the second comparison performed after the first comparison, whether or not the selected data record satisfies a growth criterion for at least one candidate data cluster of one or more existing data clusters containing the candidate records, wherein the growth criterion is different from any cluster membership criterion for any candidate data cluster and uses the query and a first threshold associated with a boundary around a respective predetermined member of a candidate data cluster;
determining, in the third comparison performed after the second comparison, whether or not the selected data record satisfies a cluster membership criterion for at least one candidate data cluster of one or more existing data clusters containing the candidate records using the query and a second threshold associated with a detailed distance measure more accurate than the approximate distance measure; and
selecting the matched data cluster from among one or more candidate data clusters if the selected data record satisfies both the cluster membership criterion and the growth criterion for the matched data cluster, or initializing the matched data cluster with the selected data record if the selected data record does not satisfy the growth criterion for any of the existing data clusters or if the selected data record does satisfy the growth criterion for at least one of the existing data clusters but does not satisfy a cluster membership criterion for any of the existing data clusters.
Dependent claims: 2-19, 23-28.
20. A computer program stored on a non-transitory computer-readable medium, the computer program including instructions for causing a computing system to:
receive data records, the received data records each including one or more values in one or more fields; and
process the received data records to identify at least one matched data cluster to associate with each received data record, the processing including:
for at least one selected data record from the received data records, generating a query from the one or more values included in the selected data record and performing at least a first comparison, a second comparison, and a third comparison using the generated query;
identifying, in the first comparison, one or more candidate data records from the received data records using the query and an approximate distance measure;
determining, in the second comparison performed after the first comparison, whether or not the selected data record satisfies a growth criterion for at least one candidate data cluster of one or more existing data clusters containing the candidate records, wherein the growth criterion is different from any cluster membership criterion for any candidate data cluster and uses the query and a first threshold associated with a boundary around a respective predetermined member of a candidate data cluster;
determining, in the third comparison performed after the second comparison, whether or not the selected data record satisfies a cluster membership criterion for at least one candidate data cluster of one or more existing data clusters containing the candidate records using the query and a second threshold associated with a detailed distance measure more accurate than the approximate distance measure; and
selecting the matched data cluster from among one or more candidate data clusters if the selected data record satisfies both the cluster membership criterion and the growth criterion for the matched data cluster, or initializing the matched data cluster with the selected data record if the selected data record does not satisfy the growth criterion for any of the existing data clusters or if the selected data record does satisfy the growth criterion for at least one of the existing data clusters but does not satisfy a cluster membership criterion for any of the existing data clusters.
Dependent claims: 37-60.
21. A computing system, including:
an input device or port configured to receive data records, the received data records each including one or more values in one or more fields; and
at least one processor coupled to memory storing at least some data records, the processor configured to process the received data records to identify at least one matched data cluster to associate with each received data record, the processing including:
for at least one selected data record from the received data records, generating a query from the one or more values included in the selected data record and performing at least a first comparison, a second comparison, and a third comparison using the generated query;
identifying, in the first comparison, one or more candidate data records from the received data records using the query and an approximate distance measure;
determining, in the second comparison performed after the first comparison, whether or not the selected data record satisfies a growth criterion for at least one candidate data cluster of one or more existing data clusters containing the candidate records, wherein the growth criterion is different from any cluster membership criterion for any candidate data cluster and uses the query and a first threshold associated with a boundary around a respective predetermined member of a candidate data cluster;
determining, in the third comparison performed after the second comparison, whether or not the selected data record satisfies a cluster membership criterion for at least one candidate data cluster of one or more existing data clusters containing the candidate records using the query and a second threshold associated with a detailed distance measure more accurate than the approximate distance measure; and
selecting the matched data cluster from among one or more candidate data clusters if the selected data record satisfies both the cluster membership criterion and the growth criterion for the matched data cluster, or initializing the matched data cluster with the selected data record if the selected data record does not satisfy the growth criterion for any of the existing data clusters or if the selected data record does satisfy the growth criterion for at least one of the existing data clusters but does not satisfy a cluster membership criterion for any of the existing data clusters.
Dependent claims: 61-84.
22. A computing system, including:
means for receiving data records, the received data records each including one or more values in one or more fields; and
means for processing the received data records to identify at least one matched data cluster to associate with each received data record, the processing including:
for at least one selected data record from the received data records, generating a query from the one or more values included in the selected data record and performing at least a first comparison, a second comparison, and a third comparison using the generated query;
identifying, in the first comparison, one or more candidate data records from the received data records using the query and an approximate distance measure;
determining, in the second comparison performed after the first comparison, whether or not the selected data record satisfies a growth criterion for at least one candidate data cluster of one or more existing data clusters containing the candidate records, wherein the growth criterion is different from any cluster membership criterion for any candidate data cluster and uses the query and a first threshold associated with a boundary around a respective predetermined member of a candidate data cluster;
determining, in the third comparison performed after the second comparison, whether or not the selected data record satisfies a cluster membership criterion for at least one candidate data cluster of one or more existing data clusters containing the candidate records using the query and a second threshold associated with a detailed distance measure more accurate than the approximate distance measure; and
selecting the matched data cluster from among one or more candidate data clusters if the selected data record satisfies both the cluster membership criterion and the growth criterion for the matched data cluster, or initializing the matched data cluster with the selected data record if the selected data record does not satisfy the growth criterion for any of the existing data clusters or if the selected data record does satisfy the growth criterion for at least one of the existing data clusters but does not satisfy a cluster membership criterion for any of the existing data clusters.
29. A method, including:
receiving data records, the received data records each including one or more values in one or more fields;
processing the received data records to identify at least one matched data cluster to associate with each received data record, the processing including:
for at least one selected data record from the received data records, generating a query from the one or more values included in the selected data record and performing at least a first comparison, and a second comparison using the generated query;
identifying, in the first comparison, a plurality of candidate data records from the received data records using the query and a first distance measure;
determining, in the second comparison performed after the first comparison, whether or not the selected data record satisfies cluster membership criteria for a plurality of candidate data clusters of a plurality of existing data clusters containing the candidate records using the query and a threshold associated with a second distance measure different from the first distance measure; and
determining an ambiguous match to at least two matched data clusters for the selected data record, based on the determination of cluster membership for the plurality of candidate data clusters; and
receiving, in a user interface displaying results of processing the received data records including displaying an indication of the ambiguous match, user input for resolving the ambiguous match to a single matched data cluster of the at least two matched data clusters for the selected data record or for resolving the ambiguous match to a plurality of matched data clusters with a weight associated with each matched data cluster.
Dependent claims: 30-36.
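The ambiguous-match handling recited in claim 29 can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names `find_matches`, `is_ambiguous`, and `resolve` are hypothetical, the user interface is reduced to a `choice` argument, and normalizing the weights so they sum to 1 is a design choice the claim does not specify.

```python
def find_matches(distances, threshold):
    """Return ids of candidate clusters whose second-measure distance meets the threshold.

    distances: mapping of cluster id -> distance between the query and that cluster.
    """
    return sorted(cid for cid, d in distances.items() if d <= threshold)


def is_ambiguous(matches):
    """A match is ambiguous when at least two clusters satisfy membership."""
    return len(matches) >= 2


def resolve(matches, choice):
    """Apply user input to an ambiguous match.

    choice is either a single cluster id (resolve to one cluster) or a mapping
    of cluster id -> weight (partial membership in several clusters).
    Returns a mapping of cluster id -> normalized weight.
    """
    if isinstance(choice, dict):
        unknown = set(choice) - set(matches)
        if unknown:
            raise ValueError(f"not a matched cluster: {sorted(unknown)}")
        total = sum(choice.values())
        if total <= 0:
            raise ValueError("weights must sum to a positive value")
        return {cid: w / total for cid, w in choice.items()}
    if choice not in matches:
        raise ValueError(f"not a matched cluster: {choice!r}")
    return {choice: 1.0}
```

For example, with distances `{"A": 1, "B": 2, "C": 9}` and a threshold of 3, clusters `A` and `B` both match, the match is flagged as ambiguous, and the user may either pick one cluster or supply weights such as `{"A": 3, "B": 1}`.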
85. A computer program stored on a non-transitory computer-readable medium, the computer program including instructions for causing a computing system to:
receive data records, the received data records each including one or more values in one or more fields;
process the received data records to identify at least one matched data cluster to associate with each received data record, the processing including:
for at least one selected data record from the received data records, generating a query from the one or more values included in the selected data record and performing at least a first comparison, and a second comparison using the generated query;
identifying, in the first comparison, a plurality of candidate data records from the received data records using the query and a first distance measure;
determining, in the second comparison performed after the first comparison, whether or not the selected data record satisfies cluster membership criteria for a plurality of candidate data clusters of a plurality of existing data clusters containing the candidate records using the query and a threshold associated with a second distance measure different from the first distance measure; and
determining an ambiguous match to at least two matched data clusters for the selected data record, based on the determination of cluster membership for the plurality of candidate data clusters; and
receive, in a user interface displaying results of processing the received data records including displaying an indication of the ambiguous match, user input for resolving the ambiguous match to a single matched data cluster of the at least two matched data clusters for the selected data record or for resolving the ambiguous match to a plurality of matched data clusters with a weight associated with each matched data cluster.
Dependent claims: 86-92.
93. A computing system, including:
an input device or port configured to receive data records, the received data records each including one or more values in one or more fields;
at least one processor coupled to memory storing at least some data records, the processor configured to process the received data records to identify at least one matched data cluster to associate with each received data record, the processing including:
for at least one selected data record from the received data records, generating a query from the one or more values included in the selected data record and performing at least a first comparison, and a second comparison using the generated query;
identifying, in the first comparison, a plurality of candidate data records from the received data records using the query and a first distance measure;
determining, in the second comparison performed after the first comparison, whether or not the selected data record satisfies cluster membership criteria for a plurality of candidate data clusters of a plurality of existing data clusters containing the candidate records using the query and a threshold associated with a second distance measure different from the first distance measure; and
determining an ambiguous match to at least two matched data clusters for the selected data record, based on the determination of cluster membership for the plurality of candidate data clusters; and
a user interface displaying results of processing the received data records including displaying an indication of the ambiguous match, configured to receive user input for resolving the ambiguous match to a single matched data cluster of the at least two matched data clusters for the selected data record or for resolving the ambiguous match to a plurality of matched data clusters with a weight associated with each matched data cluster.
Dependent claims: 94-100.
101. A computing system, including:
means for receiving data records, the received data records each including one or more values in one or more fields;
means for processing the received data records to identify at least one matched data cluster to associate with each received data record, the processing including:
for at least one selected data record from the received data records, generating a query from the one or more values included in the selected data record and performing at least a first comparison, and a second comparison using the generated query;
identifying, in the first comparison, a plurality of candidate data records from the received data records using the query and a first distance measure;
determining, in the second comparison performed after the first comparison, whether or not the selected data record satisfies cluster membership criteria for a plurality of candidate data clusters of a plurality of existing data clusters containing the candidate records using the query and a threshold associated with a second distance measure different from the first distance measure; and
determining an ambiguous match to at least two matched data clusters for the selected data record, based on the determination of cluster membership for the plurality of candidate data clusters; and
means for receiving, in a user interface displaying results of processing the received data records including displaying an indication of the ambiguous match, user input for resolving the ambiguous match to a single matched data cluster of the at least two matched data clusters for the selected data record or for resolving the ambiguous match to a plurality of matched data clusters with a weight associated with each matched data cluster.
Specification