Internal linking co-convergence using clustering with hierarchy

US 9,037,606 B2
Filed: 09/17/2013
Issued: 05/19/2015
Est. Priority Date: 02/04/2003
Status: Active Grant

Abstract

Certain implementations of the disclosed technology include systems and methods for internal co-convergence using clustering when there is hierarchy in the data structure. A method is included for clustering hierarchical database records into a first set of clusters having corresponding first cluster identifications (IDs), each hierarchical database record including one or more field values, the clustering based at least in part on determining similarity among corresponding field values of the hierarchical database records. The method includes receiving parent-child hierarchical relationship information for the hierarchical database records, re-clustering at least a portion of the hierarchical database records into a second set of clusters having corresponding second cluster IDs, the re-clustering based at least in part on the received parent-child hierarchical relationship information, and outputting hierarchical database record information, based at least in part on the re-clustering.

160 Citations

20 Claims

1. A computer-implemented method comprising:
- clustering hierarchical database records into a first set of clusters having corresponding first cluster identifications (IDs), each hierarchical database record comprising one or more field values, the clustering based at least in part on determining similarity among corresponding field values of the hierarchical database records;
  
  determining parent-child hierarchical relationships among the hierarchical database records;
  
  associating related hierarchical database records by:
  
  determining highest compelling linkages among the hierarchical database records, the determining comprising:
  
  identifying mutually preferred pairs of records from the hierarchical database records, each mutually preferred pair of records consisting of a first record and a second record, the first record consisting of a preferred record associated with the second record and the second record consisting of a preferred record associated with the first record, wherein the mutually preferred pairs of records each has a match score that meets pre-specified match criteria;
  
  assigning, for each record from the hierarchical database records, at least one associated preferred record, wherein a match value assigned to a given record together with its associated preferred record is at least as great as a match value assigned to the record together with any other record in the database records; and
  
  forming and storing a plurality of entity representations in the database, each entity representation of the plurality of entity representations comprising at least one linked pair of mutually preferred records;
  
  applying a hierarchal directional linking process, the hierarchal directional linking process comprising selecting and applying at least an upward process based on the determined parent-child hierarchical relationship wherein the upward process comprises:
  
  determining, from the parent-child hierarchical relationships, similarity among a plurality of child records having initial separate parent records;
  
  in response to determining a threshold similarity among the plurality of child records, inferring that the initial separate parent records correspond to the same entity; and
  
  linking, responsive to the inferring, the initial separate parent records as inferred common parent records;
  
  re-clustering at least a portion of the database records into a second set of clusters having corresponding second cluster IDs, the re-clustering based at least in part on the associating related hierarchical database records and on the determining similarity among corresponding field values of the database records; and
  
  outputting database record information, based at least in part on the re-clustering.
- Dependent Claims (2, 3, 4, 5, 6, 7)
  - 2. The method of claim 1, wherein the hierarchal directional linking process further comprises selecting and applying a downward process comprising linking two or more records on a given hierarchy level based at least in part on the two records sharing the inferred common parent records.
  - 3. The method of claim 1, wherein determining the similarity among the corresponding field values of the database records comprises:
    - assigning a hyperspace attribute to each database record, wherein the hyperspace attribute corresponding to two database records is correlated with a similarity of the corresponding field values of the two database records;
      
      determining membership of each database record in a plurality of hyperspace clusters based at least in part on the hyperspace attributes;
      
      assigning, to each record, a cluster ID and a match value reflecting a likelihood that the record is a member of a particular hyperspace cluster; and
      
      linking related records based at least in part on the cluster ID and the match value.
  - 4. The method of claim 3, further comprising merging database records having hyperspace attribute differences within a predefined criteria to eliminate similar exemplars that are likely to represent a same entity, the merging resulting in a reduced set of database records.
  - 5. The method of claim 4, further comprising:
    - recalculating the field value weights for the reduced set of database records; and
      
      re-clustering the reduced set of records based at least in part on the recalculated field value weights.
  - 6. The method of claim 3, wherein the determining membership of each database record in the plurality of hyperspace clusters further comprises creating a plurality of nodes at random locations in hyperspace, each node maintaining records in hyperspace based on the hyperspace attribute for which it is the closest node.
  - 7. The method of claim 1, wherein each hierarchical database record corresponds to an entity representation, each hierarchical database record comprising a plurality of fields, each field configured to contain a field value, and each field value assigned a field value weight corresponding to a specificity of the field value in relation to all field values in a corresponding field of the records.
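
As an illustration of the mutually preferred pair step recited in claim 1, the following Python sketch pairs records whose best match is each other. It is not taken from the patent: the function name `mutually_preferred_pairs`, the tuple-keyed `scores` dictionary, and the single numeric `threshold` standing in for the "pre-specified match criteria" are all assumptions made for the example.

```python
def mutually_preferred_pairs(scores, threshold):
    """Return the set of record pairs that prefer each other.

    scores: dict mapping (record_a, record_b) -> match score
            (hypothetical input shape, one entry per unordered pair).
    threshold: hypothetical stand-in for the pre-specified match criteria.
    """
    # For each record, remember its highest-scoring partner seen so far.
    best = {}  # record -> (score, preferred record)
    for (a, b), score in scores.items():
        for rec, other in ((a, b), (b, a)):
            if rec not in best or score > best[rec][0]:
                best[rec] = (score, other)

    # A pair is mutually preferred when each record is the other's
    # preferred record and the shared score meets the threshold.
    pairs = set()
    for rec, (score, other) in best.items():
        if score >= threshold and best[other][1] == rec:
            pairs.add(frozenset((rec, other)))
    return pairs


example_scores = {
    ("r1", "r2"): 0.9,
    ("r1", "r3"): 0.4,
    ("r2", "r3"): 0.7,
    ("r3", "r4"): 0.8,
}
linked = mutually_preferred_pairs(example_scores, threshold=0.75)
```

In this example `r3` prefers `r4` over `r2`, so only the pairs (`r1`, `r2`) and (`r3`, `r4`) are linked. Ties are broken by first occurrence here, which is a simplification; the claim only requires that the preferred record's match value be at least as great as that of any other record.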

8. A computer-implemented method comprising:
- clustering hierarchical database records into a first set of clusters having corresponding first cluster identifications (IDs), each hierarchical database record comprising one or more field values, the clustering based at least in part on determining similarity among corresponding field values of the hierarchical database records;
  
  determining highest compelling linkages among the hierarchical database records, the determining comprising:
  
  identifying mutually preferred pairs of records from the hierarchical database records, each mutually preferred pair of records consisting of a first record and a second record, the first record consisting of a preferred record associated with the second record and the second record consisting of a preferred record associated with the first record, wherein the mutually preferred pairs of records each has a match score that meets pre-specified match criteria;
  
  assigning, for each record from the database records, at least one associated preferred record, wherein a match value assigned to a given record together with its associated preferred record is at least as great as a match value assigned to the record together with any other record in the hierarchical database records; and
  
  forming and storing a plurality of entity representations in the database, each entity representation of the plurality of entity representations comprising at least one linked pair of mutually preferred records;
  
  receiving parent-child hierarchical relationship information for the hierarchical database records;
  
  re-clustering at least a portion of the hierarchical database records into a second set of clusters having corresponding second cluster IDs, the re-clustering based at least in part on the received parent-child hierarchical relationship information; and
  
  outputting hierarchical database record information, based at least in part on the re-clustering.
- Dependent Claims (9, 10, 11, 12)
  - 9. The method of claim 8, wherein determining the similarity among the corresponding field values of the hierarchical database records comprises:
    - assigning a hyperspace attribute to each hierarchical database record, wherein the hyperspace attribute corresponding to two hierarchical database records is correlated with a similarity of the corresponding field values of the two hierarchical database records;
      
      determining membership of each hierarchical database record in a plurality of hyperspace clusters based at least in part on the hyperspace attributes;
      
      assigning, to each record, a cluster ID and a match value reflecting a likelihood that the record is a member of a particular hyperspace cluster; and
      
      linking related records based at least in part on the cluster ID and the match value.
  - 10. The method of claim 9, further comprising merging hierarchical database records having hyperspace attribute differences within a predefined criteria to eliminate similar exemplars that are likely to represent a same entity, the merging resulting in a reduced set of hierarchical database records.
  - 11. The method of claim 10, further comprising:
    - recalculating the field value weights for the reduced set of hierarchical database records; and
      
      re-clustering the reduced set of records based at least in part on the recalculated field value weights.
  - 12. The method of claim 8, wherein each hierarchical database record corresponds to an entity representation, each database record comprising a plurality of fields, each field configured to contain a field value, and each field value assigned a field value weight corresponding to a specificity of the field value in relation to all field values in a corresponding field of the records.
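
Claim 12 assigns each field value a weight corresponding to its specificity relative to all values in the same field. The claims do not state a formula, so the Python sketch below assumes one common choice: the negative base-2 logarithm of the value's relative frequency, so that rare values weigh more than common ones. The function name `field_value_weights` and the dictionary-per-record layout are hypothetical.

```python
import math
from collections import Counter


def field_value_weights(records, field):
    """Weight each value of `field` by how rare it is (assumed formula).

    records: list of dicts, one per database record (hypothetical layout).
    Returns a dict mapping field value -> weight = -log2(relative frequency).
    """
    # Count populated values only; records with a missing field are skipped.
    counts = Counter(r[field] for r in records if r.get(field) is not None)
    total = sum(counts.values())
    return {value: -math.log2(n / total) for value, n in counts.items()}


example_records = [
    {"last_name": "Smith"},
    {"last_name": "Smith"},
    {"last_name": "Smith"},
    {"last_name": "Zyskowski"},
]
weights = field_value_weights(example_records, "last_name")
```

With these four records, the rare surname receives a weight of 2.0 while the common one receives roughly 0.415, so a match on the rare value would contribute more evidence that two records describe the same entity.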

13. A system comprising:
- at least one memory for storing data and computer-executable instructions; and
  
  at least one processor configured to access the at least one memory and further configured to execute the computer-executable instructions for:
  
  clustering hierarchical database records into a first set of clusters having corresponding first cluster identifications (IDs), each hierarchical database record comprising one or more field values, the clustering based at least in part on determining similarity among corresponding field values of the hierarchical database records;
  
  when a hierarchy structure of the hierarchical database records is unavailable:
  
  determining parent-child hierarchical relationships among the hierarchical database records;
  
  associating related hierarchical database records by:
  
  determining highest compelling linkages among the hierarchical database records, the determining comprising:
  
  identifying mutually preferred pairs of records from the hierarchical database records, each mutually preferred pair of records consisting of a first record and a second record, the first record consisting of a preferred record associated with the second record and the second record consisting of a preferred record associated with the first record, wherein the mutually preferred pairs of records each has a match score that meets pre-specified match criteria;
  
  assigning, for each record from the hierarchical database records, at least one associated preferred record, wherein a match value assigned to a given record together with its associated preferred record is at least as great as a match value assigned to the record together with any other record in the database records; and
  
  forming and storing a plurality of entity representations in the database, each entity representation of the plurality of entity representations comprising at least one linked pair of mutually preferred records;
  
  applying a hierarchal directional linking process, the hierarchal directional linking process comprising selecting and applying at least an upward process based on the determined parent-child hierarchical relationship wherein the upward process comprises:
  
  determining, from the parent-child hierarchical relationships, similarity among a plurality of child records having initial separate parent records;
  
  in response to determining a threshold similarity among that the plurality of child records, inferring that the initial separate parent records correspond to the same entity; and
  
  linking, responsive to the inferring, the initial separate parent records as inferred common parent records;
  
  re-clustering at least a portion of the hierarchical database records into a second set of clusters having corresponding second cluster IDs, the re-clustering based at least in part on the associating related hierarchical database records and on the determining similarity among corresponding field values of the database records; and
  
  when a hierarchy structure of the hierarchical database records is available:
  
  receiving parent-child hierarchical relationship information for the hierarchical database records;
  
  re-clustering at least a portion of the hierarchical database records into a second set of clusters having corresponding second cluster IDs, the re-clustering based at least in part on the received parent-child hierarchical relationship information; and
  
  outputting hierarchical database record information, based at least in part on the re-clustering.
- Dependent Claims (14, 15, 16, 17, 18, 19)
  - 14. The system of claim 13, wherein the hierarchal directional linking process further comprises selecting and applying a downward process comprising linking two or more records on a given hierarchy level based at least in part on the two records sharing the inferred common parent records.
  - 15. The system of claim 13, wherein determining the similarity among the corresponding field values of the hierarchal database records comprises:
    - assigning a hyperspace attribute to each hierarchal database record, wherein the hyperspace attribute corresponding to two hierarchal database records is correlated with a similarity of the corresponding field values of the two hierarchal database records;
      
      determining membership of each hierarchal database record in a plurality of hyperspace clusters based at least in part on the hyperspace attributes;
      
      assigning, to each record, a cluster ID and a match value reflecting a likelihood that the record is a member of a particular hyperspace cluster; and
      
      linking related records based at least in part on the cluster ID and the match value.
  - 16. The system of claim 15, further comprising merging hierarchal database records having hyperspace attribute differences within a predefined criteria to eliminate similar exemplars that are likely to represent a same entity, the merging resulting in a reduced set of hierarchal database records.
  - 17. The system of claim 16, further comprising:
    - recalculating the field value weights for the reduced set of hierarchal database records; and
      
      re-clustering the reduced set of records based at least in part on the recalculated field value weights.
  - 18. The system of claim 15, wherein the determining membership of each database record in the plurality of hyperspace clusters further comprises creating a plurality of nodes at random locations in hyperspace, each node maintaining records in hyperspace based on the hyperspace attribute for which it is the closest node.
  - 19. The system of claim 13, wherein each hierarchical database record corresponds to an entity representation, each hierarchical database record comprising a plurality of fields, each field configured to contain a field value, and each field value assigned a field value weight corresponding to a specificity of the field value in relation to all field values in a corresponding field of the records.
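
Claim 18 describes creating nodes at random locations in hyperspace, with each node maintaining the records for which it is the closest node. A minimal Python sketch of that assignment follows. It assumes the hyperspace attribute is a tuple of numbers and that closeness is Euclidean distance; neither is specified in the claims, and the name `assign_to_random_nodes` and the `seed` parameter are hypothetical.

```python
import math
import random


def assign_to_random_nodes(attributes, num_nodes, seed=0):
    """Place nodes at random and give each record to its closest node.

    attributes: dict mapping record ID -> tuple of floats, an assumed
                numeric form of the claimed hyperspace attribute.
    Returns (nodes, held), where held maps node index -> list of record IDs.
    """
    rng = random.Random(seed)
    vectors = list(attributes.values())
    dims = len(vectors[0])

    # Draw node positions inside the bounding box of the records.
    lows = [min(v[d] for v in vectors) for d in range(dims)]
    highs = [max(v[d] for v in vectors) for d in range(dims)]
    nodes = [
        tuple(rng.uniform(lows[d], highs[d]) for d in range(dims))
        for _ in range(num_nodes)
    ]

    # Each record is maintained by the node nearest to its attribute.
    held = {i: [] for i in range(num_nodes)}
    for rec_id, vec in attributes.items():
        closest = min(range(num_nodes), key=lambda i: math.dist(vec, nodes[i]))
        held[closest].append(rec_id)
    return nodes, held


example_attributes = {"a": (0.0, 0.0), "b": (0.1, 0.0), "c": (5.0, 5.0)}
nodes, held = assign_to_random_nodes(example_attributes, num_nodes=3, seed=1)
```

Every record ends up held by exactly one node. Which node holds which record depends on the random placement, so the sketch makes no claim about a particular grouping.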

20. A non-transitory computer readable media comprising computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform a method comprising:
- clustering hierarchical database records into a first set of clusters having corresponding first cluster identifications (IDs), each hierarchical database record comprising one or more field values, the clustering based at least in part on determining similarity among corresponding field values of the hierarchical database records;
  
  when a hierarchy structure of the hierarchical database records is unavailable:
  
  determining parent-child hierarchical relationships among the hierarchical database records;
  
  associating related hierarchical database records by:
  
  determining highest compelling linkages among the hierarchical database records, the determining comprising:
  
  identifying mutually preferred pairs of records from the hierarchical database records, each mutually preferred pair of records consisting of a first record and a second record, the first record consisting of a preferred record associated with the second record and the second record consisting of a preferred record associated with the first record, wherein the mutually preferred pairs of records each has a match score that meets pre-specified match criteria;
  
  assigning, for each record from the hierarchical database records, at least one associated preferred record, wherein a match value assigned to a given record together with its associated preferred record is at least as great as a match value assigned to the record together with any other record in the database records; and
  
  forming and storing a plurality of entity representations in the database, each entity representation of the plurality of entity representations comprising at least one linked pair of mutually preferred records;
  
  applying a hierarchal directional linking process, the hierarchal directional linking process comprising selecting and applying at least an upward process based on the determined parent-child hierarchical relationship wherein the upward process comprises:
  
  determining, from the parent-child hierarchical relationships, similarity among a plurality of child records having separate parent records; and
  
  in response to determining a threshold similarity among that the plurality of child records, inferring that the separate parent records correspond to the same entity;
  
  re-clustering at least a portion of the hierarchical database records into a second set of clusters having corresponding second cluster IDs, the re-clustering based at least in part on the associating related hierarchical database records and on the determining similarity among corresponding field values of the database records; and
  
  when a hierarchy structure of the hierarchical database records is available:
  
  receiving parent-child hierarchical relationship information for the hierarchical database records;
  
  re-clustering at least a portion of the hierarchical database records into a second set of clusters having corresponding second cluster IDs, the re-clustering based at least in part on the received parent-child hierarchical relationship information; and
  
  outputting hierarchical database record information, based at least in part on the re-clustering.
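
The upward process recited in the independent claims infers that separate parent records correspond to the same entity when their child records are sufficiently similar. The Python sketch below illustrates one way that inference could be carried out, using a union-find structure to link parents. The names `link_parents_upward`, `parent_of`, and `similarity`, and the use of a single numeric `threshold`, are assumptions for the example and do not come from the patent.

```python
from itertools import combinations


def link_parents_upward(parent_of, similarity, threshold):
    """Link parent records whose children are similar enough.

    parent_of: dict mapping child record ID -> parent record ID.
    similarity: hypothetical callable (child_a, child_b) -> score.
    Returns a dict mapping each parent ID -> representative parent ID;
    parents sharing a representative are inferred common parents.
    """
    root = {p: p for p in set(parent_of.values())}

    def find(p):
        # Follow links to the representative, compressing the path.
        while root[p] != p:
            root[p] = root[root[p]]
            p = root[p]
        return p

    # Compare every pair of children that currently sit under different
    # parents; a similar pair causes their parents to be linked.
    for a, b in combinations(sorted(parent_of), 2):
        pa, pb = find(parent_of[a]), find(parent_of[b])
        if pa != pb and similarity(a, b) >= threshold:
            root[max(pa, pb)] = min(pa, pb)
    return {p: find(p) for p in root}


example_parent_of = {"c1": "P1", "c2": "P2", "c3": "P3"}
example_sims = {("c1", "c2"): 0.95}
linked_parents = link_parents_upward(
    example_parent_of,
    lambda a, b: example_sims.get((a, b), 0.0),
    threshold=0.9,
)
```

Here children `c1` and `c2` are similar, so their parents `P1` and `P2` are linked under one representative while `P3` stays separate. A downward process, as in claims 2 and 14, could then treat records under the linked parents as sharing a common parent.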

Current Assignee
LexisNexis Risk Solutions, Inc. (RELX PLC)
Original Assignee
LexisNexis Risk Solutions, Inc. (RELX PLC)
Inventors
Bayliss, David Alan
Primary Examiner(s)
LEROUX, ETIENNE PIERRE

Application Number

US14/029,710
Publication Number

US 20140032557A1
Time in Patent Office

609 Days
Field of Search

707/1, 707/100, 707/770, 709/204-206
US Class Current

707/770
CPC Class Codes

G06F 16/215   Improving data quality; Dat...

G06F 16/2246   Trees, e.g. B+trees

G06F 16/24   Querying

G06F 16/245   Query processing

G06F 16/2455   Query execution

G06F 16/282   Hierarchical databases, e.g...

G06F 16/285   Clustering or classification

G06F 16/35   Clustering; Classification

G06F 16/951   Indexing; Web crawling tech...
