SYSTEMS AND METHODS FOR AUTOMATIC CLUSTERING AND CANONICAL DESIGNATION OF RELATED DATA IN VARIOUS DATA STRUCTURES
Abstract
Computer-implemented systems and methods are disclosed for automatically clustering and canonically identifying related data in various data structures. Data structures may include a plurality of records, wherein each record is associated with a respective entity. In accordance with some embodiments, the systems and methods further comprise identifying clusters of records associated with a respective entity by grouping the records into pairs, analyzing the respective pairs to determine a probability that both members of the pair relate to a common entity, and identifying a cluster of overlapping pairs to generate a collection of records relating to a common entity. Clusters may further be analyzed to determine canonical names or other properties for the respective entities by analyzing record fields and identifying similarities.
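The canonical-designation step mentioned at the end of the abstract could, for example, be sketched as follows. This is only an illustration under stated assumptions: records are taken to be Python dictionaries, the helper `canonical_value` and the field name `name` are hypothetical, and "most frequent normalized value" is one possible similarity heuristic, not necessarily the one the specification describes.

```python
from collections import Counter


def canonical_value(cluster, field):
    """Return the most common value of `field` in a cluster of records.

    Values are compared after lowercasing and collapsing whitespace
    (an assumed normalization); the first original spelling seen for
    the winning value is returned. Returns None if no record has the field.
    """
    counts = Counter()
    originals = {}
    for record in cluster:
        raw = record.get(field)
        if not raw:
            continue
        key = " ".join(raw.lower().split())
        counts[key] += 1
        originals.setdefault(key, raw.strip())
    if not counts:
        return None
    best_key, _ = counts.most_common(1)[0]
    return originals[best_key]


# Hypothetical cluster of records believed to describe one entity.
cluster = [
    {"name": "Acme Corp"},
    {"name": "ACME  corp"},
    {"name": "Acme Corporation"},
]
print(canonical_value(cluster, "name"))
```

Here two of the three records normalize to the same string, so that spelling is chosen as the canonical name for the cluster.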
20 Claims
1. A system comprising:
a data store configured to store computer-executable instructions and a plurality of records, wherein each record of the plurality of records is associated with a respective entity and comprises one or more fields;
a computing device including a processor in communication with the data store, the processor configured to execute the computer-executable instructions to at least:
identify, based at least in part on a first field of the one or more fields, a first group of the plurality of records;
divide the first group into one or more record pairs, each of the one or more record pairs comprising a respective first record and second record;
determine, for each of the one or more record pairs, a respective match score, the respective match scores comprising probabilities that the respective first record and second record of the respective record pairs are associated with a respective same entity;
identify a cluster of record pairs, wherein each pair in the cluster has a record in common with at least one other pair in the cluster, and wherein each pair in the cluster has a respective match score above a threshold; and
output the cluster of record pairs to a client computing device.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
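The sequence recited in claim 1 (group by a field, form pairs, score each pair, keep pairs above a threshold, and collect pairs that share records) might be sketched as below. Everything here is an assumption for illustration: the record layout, the field names `id`, `zip`, and `name`, the function names, and in particular the use of `difflib.SequenceMatcher` as a stand-in for whatever probabilistic match model an actual embodiment would use.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block_records(records, field):
    """Group records that share the same value of `field`."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[field]].append(rec)
    return list(groups.values())


def match_score(a, b):
    """Stand-in similarity in [0, 1]; a real system would use a trained model."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def cluster_pairs(pairs):
    """Collect pairs into groups linked by shared record ids (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)
    clusters = defaultdict(list)
    for a, b in pairs:
        clusters[find(a)].append((a, b))
    return list(clusters.values())


def find_clusters(records, block_field, threshold):
    """Return clusters of id pairs whose score exceeds `threshold`."""
    result = []
    for group in block_records(records, block_field):
        kept = [
            (a["id"], b["id"])
            for a, b in combinations(group, 2)
            if match_score(a, b) > threshold
        ]
        result.extend(cluster_pairs(kept))
    return result


# Hypothetical records; "zip" plays the role of the first field.
records = [
    {"id": 1, "zip": "94301", "name": "Acme Corp"},
    {"id": 2, "zip": "94301", "name": "Acme Corp."},
    {"id": 3, "zip": "94301", "name": "ACME Corp"},
    {"id": 4, "zip": "94301", "name": "Zenith Labs"},
    {"id": 5, "zip": "10001", "name": "Acme Corp"},
]
clusters = find_clusters(records, "zip", 0.8)
print(clusters)
```

Note that this sketch also returns groups containing a single pair; a stricter reading of the claim language would drop any group with fewer than two pairs.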
11. A method comprising:
obtaining a first plurality of records, wherein each record of the first plurality of records is associated with a respective entity and comprises a first one or more fields;
obtaining a second plurality of records, wherein each record of the second plurality of records is associated with a respective entity and comprises a second one or more fields, and wherein no two records of the second plurality of records are associated with the same entity;
identifying, based at least in part on a first field of the first one or more fields, a first subset of the first plurality of records;
identifying, based at least in part on a second field of the second one or more fields, a second subset of the second plurality of records;
generating a plurality of record pairs, wherein each record pair in the plurality of record pairs comprises a respective first record from the first subset and a respective second record from the second subset;
determining a respective match score for each of the plurality of record pairs, the respective match scores comprising probabilities that the respective first record and second record of the respective record pairs are associated with a respective same entity;
identifying, for each record in the first subset, a respective cluster of record pairs, wherein each record pair in the cluster includes the record;
identifying, for each cluster of record pairs, a respective matching record pair based at least in part on the match scores of the record pairs in the cluster; and
outputting the matching record pairs to a client computing device.
Dependent claims: 12, 13, 14, 15, 16
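The method of claim 11 resembles linking incoming records against a deduplicated reference set and keeping, for each incoming record, its best-scoring pair. A minimal sketch under assumptions follows: the field names `state`, `region`, `name`, and `id`, the function names, the `min_score` cutoff, and the string-similarity score are all hypothetical choices, with the score standing in for a probabilistic match model.

```python
from difflib import SequenceMatcher
from itertools import product


def match_score(a, b):
    """Stand-in similarity in [0, 1]; not an actual match probability."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def best_matches(first_subset, second_subset, min_score=0.0):
    """Map each first-subset record id to (best second-subset id, score).

    All pairs containing a given first-subset record form that record's
    cluster; the highest-scoring pair in the cluster is kept.
    """
    best = {}
    for a, b in product(first_subset, second_subset):
        score = match_score(a, b)
        if score < min_score:
            continue
        current = best.get(a["id"])
        if current is None or score > current[1]:
            best[a["id"]] = (b["id"], score)
    return best


# Hypothetical incoming records (may contain duplicates of an entity).
incoming = [
    {"id": "r1", "state": "CA", "name": "Acme Corp."},
    {"id": "r2", "state": "CA", "name": "Zenith Laboratories"},
    {"id": "r3", "state": "NY", "name": "Acme Corp"},
]
# Hypothetical reference records (one record per entity).
reference = [
    {"id": "e1", "region": "CA", "name": "Acme Corp"},
    {"id": "e2", "region": "CA", "name": "Zenith Labs"},
    {"id": "e3", "region": "NY", "name": "Acme Holdings"},
]

# Subsets are selected on a different field in each record set.
first_subset = [r for r in incoming if r["state"] == "CA"]
second_subset = [e for e in reference if e["region"] == "CA"]

matches = best_matches(first_subset, second_subset, min_score=0.5)
for record_id, (entity_id, score) in matches.items():
    print(record_id, "->", entity_id, round(score, 3))
```

Restricting both sides to a subset before pairing keeps the number of scored pairs to the product of the subset sizes rather than the product of the full record sets.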
17. A non-transitory computer-readable storage medium including computer-executable instructions that, when executed by a processor, cause the processor to:
obtain a plurality of records, wherein each record of the plurality of records is associated with a respective entity and comprises one or more fields;
divide at least a portion of the plurality of records into one or more record pairs, each of the one or more record pairs comprising a respective first record and second record;
determine, for each of the one or more record pairs, a respective match score, the respective match scores comprising probabilities that the respective first record and second record of the respective record pairs are associated with a respective same entity; and
identify a first cluster of record pairs, wherein each pair in the first cluster has a record in common with at least one other pair in the first cluster, and wherein each pair in the first cluster has a respective match score above a first threshold.
Dependent claims: 18, 19, 20
Specification