SYSTEMS AND METHODS FOR AUTOMATIC CLUSTERING AND CANONICAL DESIGNATION OF RELATED DATA IN VARIOUS DATA STRUCTURES

US 20170052958A1
Filed: 08/10/2016
Published: 02/23/2017
Est. Priority Date: 08/19/2015
Status: Active Grant

Abstract

Computer-implemented systems and methods are disclosed for automatically clustering and canonically identifying related data in various data structures. Data structures may include a plurality of records, wherein each record is associated with a respective entity. In accordance with some embodiments, the systems and methods further comprise identifying clusters of records associated with a respective entity by grouping the records into pairs, analyzing the respective pairs to determine a probability that both members of the pair relate to a common entity, and identifying a cluster of overlapping pairs to generate a collection of records relating to a common entity. Clusters may further be analyzed to determine canonical names or other properties for the respective entities by analyzing record fields and identifying similarities.
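
The pair-then-cluster flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `match_score` function (a string-similarity ratio standing in for a learned probability), the `name` and `id` field names, and the 0.8 threshold are all assumptions made for illustration. Pairs scoring above the threshold are merged with a union-find structure, so records linked by a chain of overlapping pairs end up in one cluster.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def match_score(a, b):
    """Hypothetical stand-in for a probability that two records share an entity."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def cluster_records(records, threshold=0.8):
    """Group records connected by chains of pairs scoring above the threshold."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Every pair of records is scored; pairs above the threshold are merged.
    for i, j in combinations(range(len(records)), 2):
        if match_score(records[i], records[j]) > threshold:
            parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for i, record in enumerate(records):
        clusters[find(i)].append(record["id"])
    return sorted(sorted(ids) for ids in clusters.values())


records = [
    {"id": 1, "name": "Acme Corp"},
    {"id": 2, "name": "Acme Corp."},
    {"id": 3, "name": "ACME Corp"},
    {"id": 4, "name": "Globex Inc"},
]
clusters = cluster_records(records)
```

Unlike the claimed clusters of record pairs, this sketch also returns unmatched records as single-element clusters; a caller interested only in overlapping pairs could filter those out.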

20 Claims

1. A system comprising:
- a data store configured to store computer-executable instructions and a plurality of records, wherein each record of the plurality of records is associated with a respective entity and comprises one or more fields;
  
  a computing device including a processor in communication with the data store, the processor configured to execute the computer-executable instructions to at least:
  
  identify, based at least in part on a first field of the one or more fields, a first group of the plurality of records;
  
  divide the first group into one or more record pairs, each of the one or more record pairs comprising a respective first record and second record;
  
  determine, for each of the one or more record pairs, a respective match score, the respective match scores comprising probabilities that the respective first record and second record of the respective record pairs are associated with a respective same entity;
  
  identify a cluster of record pairs, wherein each pair in the cluster has a record in common with at least one other pair in the cluster, and wherein each pair in the cluster has a respective match score above a threshold; and
  
  output the cluster of record pairs to a client computing device.
- - 2. The system of claim 1, wherein the processor is further configured to execute the computer-executable instructions to at least:
    - determine, based at least in part on a first pair in the cluster of pairs, a first candidate name to associate with the cluster;
      
      determine, based at least in part on a second pair in the cluster of pairs, a second candidate name to associate with the cluster; and
      
      determine a name to associate with the cluster based at least in part on the first candidate name and the second candidate name.
  - 3. The system of claim 2, wherein determining the first candidate name is based at least in part on a first field of the first record and a corresponding second field of the second record.
  - 4. The system of claim 3, wherein determining the first candidate name comprises identifying a longest common substring of the first field and the second field.
  - 5. The system of claim 3, wherein determining the first candidate name is based at least in part on calculating a Levenshtein distance between a first field of the first record and a corresponding second field of the second record.
  - 6. The system of claim 1, wherein the processor is further configured to execute the computer-executable instructions to identify the first group of the plurality of records by at least:
    - accessing a first record, a second record, and a third record of the plurality of records;
      
      accessing a blocking model including information indicative of at least a first field and a second field to be compared between candidate pairs of records;
      
      comparing a value of the first field of the first record with a value of the first field of the second record to determine first matching fields;
      
      comparing a value of the second field of the first record with a value of the second field of the second record to determine second matching fields;
      
      in response to determining the first matching fields and the second matching fields, grouping the first record and the second record into the first group;
      
      comparing the value of the first field of the second record with a value of the first field of the third record to determine third matching fields;
      
      comparing the value of the second field of the second record with a value of the second field of the third record to determine fourth matching fields; and
      
      in response to determining the third matching fields and the fourth matching fields, adding the third record to the first group.
  - 7. The system of claim 6, wherein determining at least one of the first, second, third, or fourth matching fields is based on a soft or fuzzy match.
  - 8. The system of claim 6, wherein determining at least one of the first, second, third, or fourth matching fields is based on a weighting.
  - 9. The system of claim 1, wherein the processor is further configured to execute the computer-executable instructions to identify the first group of the plurality of records by at least:
    - accessing a first record, a second record, and a third record of the plurality of records;
      
      accessing a blocking model including information indicative of at least a first field to be compared between candidate pairs of records;
      
      comparing a value of the first field of the first record with a value of the first field of the second record to determine first matching fields;
      
      in response to determining the first matching fields, grouping the first record and the second record into the first group;
      
      comparing a value of the first field of the second record with a value of the first field of the third record to determine that the fields do not match;
      
      comparing the value of the first field of the second record with a value of the first field of the third record to determine second matching fields;
      
      in response to determining the second matching fields, adding the third record to the first group.
  - 10. The system of claim 1, wherein the processor is further configured to execute the computer-executable instructions to at least:
    - validate the first group of the plurality of records by at least one of:
      
      determining that a diameter of the first group satisfies a threshold,
      
      determining that a size of the first group satisfies a threshold,
      
      determining a distribution of sizes of groups including the first group satisfies a distribution rule, or
      
      determining an entropy of groups including the first group satisfies an entropy rule.
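
Claims 2 to 4 describe deriving a candidate name from each pair in a cluster, for instance as the longest common substring of corresponding fields, and then combining the candidates into one name. The Python sketch below is one possible reading under stated assumptions: the helper names `longest_common_substring` and `cluster_name` are hypothetical, and the majority vote with a longest-name tie-break is an illustrative choice, since the claims do not specify how candidates are combined.

```python
from collections import Counter
from difflib import SequenceMatcher


def longest_common_substring(a, b):
    """Return the longest contiguous run of characters shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size].strip()


def cluster_name(field_pairs):
    """Pick the most frequent candidate name; ties go to the longest candidate.

    field_pairs holds one (first_field, second_field) tuple per record pair.
    The voting rule is an assumption, not something the claims prescribe.
    """
    candidates = Counter(longest_common_substring(a, b) for a, b in field_pairs)
    return max(candidates, key=lambda name: (candidates[name], len(name)))


field_pairs = [
    ("Acme Corp", "Acme Corp."),
    ("Acme Corp.", "Acme Corporation"),
    ("Acme Corporation", "Acme Corp"),
]
name = cluster_name(field_pairs)
```

A Levenshtein-distance variant, as in claim 5, would replace `longest_common_substring` with an edit-distance comparison; that variant is not shown here.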

11. A method comprising:
- obtaining a first plurality of records, wherein each record of the first plurality of records is associated with a respective entity and comprises a first one or more fields;
  
  obtaining a second plurality of records, wherein each record of the second plurality of records is associated with a respective entity and comprises a second one or more fields, and wherein no two records of the second plurality of records are associated with the same entity;
  
  identifying, based at least in part on a first field of the first one or more fields, a first subset of the first plurality of records;
  
  identifying, based at least in part on a second field of the second one or more fields, a second subset of the second plurality of records;
  
  generating a plurality of record pairs, wherein each record pair in the plurality of record pairs comprises a respective first record from the first subset and a respective second record from the second subset;
  
  determining a respective match score for each of the plurality of record pairs, the respective match scores comprising probabilities that the respective first record and second record of the respective record pairs are associated with a respective same entity;
  
  identifying, for each record in the first subset, a respective cluster of record pairs, wherein each record pair in the cluster includes the record;
  
  identifying, for each cluster of record pairs, a respective matching record pair based at least in part on the match scores of the record pairs in the cluster; and
  
  outputting the matching record pairs to a client computing device.
- - 12. The method of claim 11, wherein identifying the respective matching record pair for each cluster comprises identifying a record pair having a highest match score.
  - 13. The method of claim 11, wherein determining a match score is based at least in part on one or more reference pairs.
  - 14. The method of claim 13, wherein the one or more reference pairs each comprise a first matched record associated with a first entity and a second matched record associated with the first entity.
  - 15. The method of claim 13, wherein the one or more reference pairs each comprise a first unmatched record associated with a first entity and a second unmatched record associated with a second entity.
  - 16. The method of claim 11 further comprising:
    - identifying an indeterminate record pair of the plurality of record pairs, the indeterminate record pair having a match score indicating a least certainty of whether the first record and second record of the indeterminate record pair are associated with the same entity;
      
      outputting the indeterminate record pair to a user;
      
      receiving, from the user, an indication that the first record and the second record of the indeterminate record pair are associated with the same entity;
      
      calculating, for each of the plurality of record pairs, a respective revised match score based at least in part on the indication;
      
      wherein identifying the respective matching record pair for each cluster of record pairs is further based at least in part on the revised match scores of the record pairs in the cluster.
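
Two selection steps in claims 12 and 16 lend themselves to a short Python sketch: choosing, for each record, the pair with the highest match score, and finding the pair whose score indicates the least certainty so it can be shown to a user. The tuple layout `(record_id, reference_id, score)` and the function names are assumptions, as is the reading of "least certainty" as the score closest to 0.5.

```python
def best_matches(scored_pairs):
    """Map each record to the reference record of its highest-scoring pair.

    scored_pairs is an iterable of (record_id, reference_id, score) tuples,
    a hypothetical layout chosen for this sketch.
    """
    best = {}
    for record_id, reference_id, score in scored_pairs:
        if record_id not in best or score > best[record_id][1]:
            best[record_id] = (reference_id, score)
    return {record_id: ref for record_id, (ref, _) in best.items()}


def most_uncertain(scored_pairs):
    """Return the pair whose score lies closest to 0.5 (assumed least certain)."""
    return min(scored_pairs, key=lambda pair: abs(pair[2] - 0.5))


scored = [
    ("a", "x", 0.9),
    ("a", "y", 0.4),
    ("b", "x", 0.2),
    ("b", "y", 0.55),
]
matches = best_matches(scored)
review_candidate = most_uncertain(scored)
```

In a feedback loop like the one in claim 16, the user's answer for `review_candidate` would feed back into the scoring model before `best_matches` is run again; the rescoring step depends on the model and is left out.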

17. A non-transitory computer-readable storage medium including computer-executable instructions that, when executed by a processor, cause the processor to:
- obtain a plurality of records, wherein each record of the plurality of records is associated with a respective entity and comprises one or more fields;
  
  divide at least a portion of the plurality of records into one or more record pairs, each of the one or more record pairs comprising a respective first record and second record;
  
  determine, for each of the one or more record pairs, a respective match score, the respective match scores comprising probabilities that the respective first record and second record of the respective record pairs are associated with a respective same entity; and
  
  identify a first cluster of record pairs, wherein each pair in the first cluster has a record in common with at least one other pair in the first cluster, and wherein each pair in the first cluster has a respective match score above a first threshold.
- - 18. The non-transitory computer-readable storage medium of claim 17, wherein the computer-executable instructions that cause the processor to determine a respective match score for each of the one or more record pairs comprise computer-executable instructions that cause the processor to:
    - obtain a plurality of reference pairs, each of the plurality of reference pairs comprising a respective first record and second record, wherein the respective first record and second record of a reference pair are associated with the same entity;
      
      determine a first plurality of match scores according to a first model, wherein the first plurality of match scores corresponds to the plurality of reference pairs;
      
      determine a second plurality of match scores according to a second model, wherein the second plurality of match scores corresponds to the plurality of reference pairs;
      
      determine, based at least in part on the first plurality of match scores, a first accuracy score for the first model;
      
      determine, based at least in part on the second plurality of match scores, a second accuracy score for the second model;
      
      wherein the computer-executable instructions that cause the processor to determine the respective match score for each of the plurality of record pairs cause the processor to determine the respective match score according to the model having the higher accuracy score.
  - 19. The non-transitory computer-readable storage medium of claim 17, wherein the computer-executable instructions further cause the processor to:
    - output the first cluster of record pairs to a client computing device;
      
      receive, from the client computing device, a second threshold;
      
      identify a second cluster of record pairs, wherein each pair of the second cluster has a record in common with at least one other pair in the second cluster, and wherein each pair in the second cluster has a respective match score above the second threshold; and
      
      output the second cluster to the client computing device.
  - 20. The non-transitory computer-readable storage medium of claim 17, wherein the computer-executable instructions further cause the processor to generate one or more normalized fields for the plurality of records, and wherein the respective match score for each of the one or more record pairs is based at least in part on the one or more normalized fields.
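
Claim 18 selects between two scoring models by comparing their accuracy on reference pairs known to describe the same entity. A minimal Python sketch follows, assuming that accuracy means the fraction of reference pairs a model scores above a 0.5 cutoff; that definition, the cutoff, and the two toy models `exact_model` and `fuzzy_model` are illustrative assumptions and do not come from the claims.

```python
from difflib import SequenceMatcher


def exact_model(a, b):
    """Toy model: full confidence only on an exact string match."""
    return 1.0 if a == b else 0.0


def fuzzy_model(a, b):
    """Toy model: case-insensitive similarity ratio used as a score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def accuracy(model, reference_pairs, cutoff=0.5):
    """Fraction of known-matching pairs the model scores above the cutoff."""
    hits = sum(1 for a, b in reference_pairs if model(a, b) > cutoff)
    return hits / len(reference_pairs)


def pick_model(models, reference_pairs):
    """Return the model with the highest accuracy on the reference pairs."""
    return max(models, key=lambda model: accuracy(model, reference_pairs))


reference_pairs = [
    ("Acme Corp", "ACME Corp."),
    ("Globex", "Globex"),
    ("Initech LLC", "Initech"),
]
chosen = pick_model([exact_model, fuzzy_model], reference_pairs)
```

Because every reference pair here is a true match, this accuracy measure rewards models that score everything highly; a fuller evaluation would also include known non-matching pairs, as the reference pairs of claim 15 suggest.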

Current Assignee: Palantir Technologies Incorporated
Original Assignee: Palantir Technologies Incorporated
Inventors: Manning, Lawrence; Hu, Roger; Falco, Xavier; Gilmore, Rowan; Prestinario, Jason; Huang, Yifei; Fernandez, Daniel; Sader, Clayton; Elkherj, Matthew; Latourette, Nicholas; Zamoshchin, Aleksandr; Mehta, Rahul; Erenrich, Daniel; Visa, Guillem Palou; Bingham, Eli; Elser, Jeremy; Agarwal, Rahul

Granted Patent: US 10,127,289 B2
US Class Current: 1/1
CPC Class Codes:
- G06F 16/24578: using ranking
- G06F 16/285: Clustering or classification
- G06F 16/35: Clustering; Classification
- G06F 16/9535: Search customisation based ...
- G06F 18/23: Clustering techniques
