Data clustering based on candidate queries
First Claim
1. A method, including:
receiving data records, the received data records each including one or more values in one or more fields; and
processing the received data records to identify a matched data cluster to associate with each received data record, the processing including:
for at least one selected data record from the received data records, generating a first query from a first set of one or more values included in the selected data record including identifying tokens that each include a representation of at least one value or fragment of a value in a field or a combination of fields of the selected record and generating a second query from a second set of one or more values included in the selected data record, where the second set of one or more values is different from the first set of one or more values;
identifying a first set of one or more candidate data records from the received data records using the first query;
identifying a second set of one or more candidate data records from the received data records using the second query, the second set of one or more candidate data records partially overlapping the first set of one or more candidate data records;
determining a third set of one or more candidate data records as a Boolean combination of the first set of one or more candidate data records and the second set of one or more candidate data records;
determining whether or not the selected data record satisfies a cluster membership criterion for at least one candidate data cluster of one or more existing data clusters containing one or more candidate data records from at least one of the first set of one or more candidate data records or the second set of one or more candidate data records, the determining including applying the cluster membership criterion to the third set of one or more candidate data records; and
selecting the matched data cluster from among one or more candidate data clusters, or initializing the matched data cluster with the selected data record if the selected data record does not satisfy a cluster membership criterion for any of the existing data clusters.
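The steps recited in the claim can be illustrated with a minimal sketch. It is not the patented implementation. The record layout (`name` and `address` fields), the whitespace tokenizer, the Jaccard similarity used as the cluster membership criterion, the 0.4 threshold, and the choice of intersection (AND) as the Boolean combination are all assumptions made for illustration. The sketch also does not enforce the claim's requirement that the two candidate sets partially overlap.

```python
from collections import defaultdict


def tokens(value):
    """Lowercased whitespace tokens of a field value (an assumed tokenization)."""
    return set(value.lower().split())


def similarity(a, b):
    """Jaccard overlap of two token sets; stands in for the membership score."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_records(records, threshold=0.4):
    """Return one cluster id per record, processing records in order."""
    name_index = defaultdict(set)   # name token -> ids of earlier records
    addr_index = defaultdict(set)   # address token -> ids of earlier records
    all_tokens = []                 # record id -> every token of that record
    cluster_of = []                 # record id -> cluster id
    next_cluster = 0

    for rid, rec in enumerate(records):
        name_toks = tokens(rec["name"])
        addr_toks = tokens(rec["address"])
        rec_toks = name_toks | addr_toks

        # First query (name tokens) and second query (address tokens).
        first = set()
        for t in name_toks:
            first |= name_index.get(t, set())
        second = set()
        for t in addr_toks:
            second |= addr_index.get(t, set())

        # Third set: a Boolean combination, here AND (intersection).
        third = first & second

        # Membership criterion: best similarity to any candidate, per cluster.
        best = {}
        for other in third:
            score = similarity(rec_toks, all_tokens[other])
            cluster = cluster_of[other]
            best[cluster] = max(best.get(cluster, 0.0), score)
        qualifying = {c: s for c, s in best.items() if s >= threshold}

        if qualifying:
            matched = max(qualifying, key=qualifying.get)
        else:
            matched = next_cluster  # no cluster qualifies: initialize a new one
            next_cluster += 1

        cluster_of.append(matched)
        all_tokens.append(rec_toks)
        for t in name_toks:
            name_index[t].add(rid)
        for t in addr_toks:
            addr_index[t].add(rid)
    return cluster_of
```

Calling `cluster_records` on a list of dictionaries returns a list of cluster ids in input order. Using union (OR) instead of intersection for the third set would trade precision for recall; the claim language covers any Boolean combination.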
Abstract
Received data records, each including one or more values in one or more fields, are processed to identify a matched data cluster. The processing includes: for selected data records, generating a query from one or more values; identifying one or more candidate data records from the received data records using the query; determining whether or not the selected data record satisfies a cluster membership criterion for at least one candidate data cluster of one or more existing data clusters containing the candidate records; and selecting the matched data cluster from among one or more candidate data clusters based at least in part on a growth criterion for the candidate data clusters, or initializing the matched data cluster with the selected data record if the selected data record does not satisfy a cluster membership criterion for any of the existing data clusters or based on a result of the growth criterion.
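The abstract does not say what the growth criterion is. As a purely hypothetical illustration, the sketch below treats it as a cap on cluster size: a cluster is eligible only if it passes the membership threshold and is still below `max_size`. The function name, parameters, and default values are assumptions, not terms from the patent.

```python
def select_cluster(scores, sizes, threshold=0.5, max_size=3):
    """Pick the best-scoring cluster that passes membership and may still grow.

    scores: cluster id -> membership score for the selected record
    sizes:  cluster id -> current number of records in that cluster
    Returns None when a new cluster should be initialized instead.
    """
    eligible = [
        c for c, s in scores.items()
        if s >= threshold and sizes.get(c, 0) < max_size  # assumed growth rule
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda c: scores[c])
```

Under this assumed rule, a record that scores highest against a full cluster falls through to the next eligible cluster, or starts a new cluster when none remains.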
21 Claims
1. A method, including:
receiving data records, the received data records each including one or more values in one or more fields; and
processing the received data records to identify a matched data cluster to associate with each received data record, the processing including:
for at least one selected data record from the received data records, generating a first query from a first set of one or more values included in the selected data record including identifying tokens that each include a representation of at least one value or fragment of a value in a field or a combination of fields of the selected record and generating a second query from a second set of one or more values included in the selected data record, where the second set of one or more values is different from the first set of one or more values;
identifying a first set of one or more candidate data records from the received data records using the first query;
identifying a second set of one or more candidate data records from the received data records using the second query, the second set of one or more candidate data records partially overlapping the first set of one or more candidate data records;
determining a third set of one or more candidate data records as a Boolean combination of the first set of one or more candidate data records and the second set of one or more candidate data records;
determining whether or not the selected data record satisfies a cluster membership criterion for at least one candidate data cluster of one or more existing data clusters containing one or more candidate data records from at least one of the first set of one or more candidate data records or the second set of one or more candidate data records, the determining including applying the cluster membership criterion to the third set of one or more candidate data records; and
selecting the matched data cluster from among one or more candidate data clusters, or initializing the matched data cluster with the selected data record if the selected data record does not satisfy a cluster membership criterion for any of the existing data clusters.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18
19. A computer program stored on a non-transitory computer-readable storage medium, the computer program including instructions for causing a computing system to:
receive data records, the received data records each including one or more values in one or more fields; and
process the received data records to identify a matched data cluster to associate with each received data record, the processing including:
for at least one selected data record from the received data records, generating a first query from a first set of one or more values included in the selected data record including identifying tokens that each include a representation of at least one value or fragment of a value in a field or a combination of fields of the selected record and generating a second query from a second set of one or more values included in the selected data record, where the second set of one or more values is different from the first set of one or more values;
identifying a first set of one or more candidate data records from the received data records using the first query;
identifying a second set of one or more candidate data records from the received data records using the second query, the second set of one or more candidate data records partially overlapping the first set of one or more candidate data records;
determining a third set of one or more candidate data records as a Boolean combination of the first set of one or more candidate data records and the second set of one or more candidate data records;
determining whether or not the selected data record satisfies a cluster membership criterion for at least one candidate data cluster of one or more existing data clusters containing one or more candidate data records from at least one of the first set of one or more candidate data records or the second set of one or more candidate data records, the determining including applying the cluster membership criterion to the third set of one or more candidate data records; and
selecting the matched data cluster from among one or more candidate data clusters, or initializing the matched data cluster with the selected data record if the selected data record does not satisfy a cluster membership criterion for any of the existing data clusters.
20. A computing system, including:
an input device or port configured to receive data records, the received data records each including one or more values in one or more fields; and
at least one processor, coupled to a memory, and configured to process the received data records to identify a matched data cluster to associate with each received data record, the processing including:
for at least one selected data record from the received data records, generating a first query from a first set of one or more values included in the selected data record including identifying tokens that each include a representation of at least one value or fragment of a value in a field or a combination of fields of the selected record and generating a second query from a second set of one or more values included in the selected data record, where the second set of one or more values is different from the first set of one or more values;
identifying a first set of one or more candidate data records from the received data records using the first query;
identifying a second set of one or more candidate data records from the received data records using the second query, the second set of one or more candidate data records partially overlapping the first set of one or more candidate data records;
determining a third set of one or more candidate data records as a Boolean combination of the first set of one or more candidate data records and the second set of one or more candidate data records;
determining whether or not the selected data record satisfies a cluster membership criterion for at least one candidate data cluster of one or more existing data clusters containing one or more candidate data records from at least one of the first set of one or more candidate data records or the second set of one or more candidate data records, the determining including applying the cluster membership criterion to the third set of one or more candidate data records; and
selecting the matched data cluster from among one or more candidate data clusters, or initializing the matched data cluster with the selected data record if the selected data record does not satisfy a cluster membership criterion for any of the existing data clusters.
21. A computing system, including:
means for receiving data records, the received data records each including one or more values in one or more fields; and
means for processing the received data records to identify a matched data cluster to associate with each received data record, the processing including:
for at least one selected data record from the received data records, generating a first query from a first set of one or more values included in the selected data record including identifying tokens that each include a representation of at least one value or fragment of a value in a field or a combination of fields of the selected record and generating a second query from a second set of one or more values included in the selected data record, where the second set of one or more values is different from the first set of one or more values;
identifying a first set of one or more candidate data records from the received data records using the first query;
identifying a second set of one or more candidate data records from the received data records using the second query, the second set of one or more candidate data records partially overlapping the first set of one or more candidate data records;
determining a third set of one or more candidate data records as a Boolean combination of the first set of one or more candidate data records and the second set of one or more candidate data records;
determining whether or not the selected data record satisfies a cluster membership criterion for at least one candidate data cluster of one or more existing data clusters containing one or more candidate data records from at least one of the first set of one or more candidate data records or the second set of one or more candidate data records, the determining including applying the cluster membership criterion to the third set of one or more candidate data records; and
selecting the matched data cluster from among one or more candidate data clusters, or initializing the matched data cluster with the selected data record if the selected data record does not satisfy a cluster membership criterion for any of the existing data clusters.
Specification