Data clustering based on candidate queries

US 10,572,511 B2
Filed: 06/02/2016
Issued: 02/25/2020
Est. Priority Date: 11/15/2011
Status: Active Grant

Abstract

Received data records, each including one or more values in one or more fields, are processed to identify a matched data cluster. The processing includes: for selected data records, generating a query from one or more values; identifying one or more candidate data records from the received data records using the query; determining whether or not the selected data record satisfies a cluster membership criterion for at least one candidate data cluster of one or more existing data clusters containing the candidate records; and selecting the matched data cluster from among one or more candidate data clusters based at least in part on a growth criterion for the candidate data clusters, or initializing the matched data cluster with the selected data record if the selected data record does not satisfy a cluster membership criterion for any of the existing data clusters or based on a result of the growth criterion.
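The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the whitespace tokenizer, the Jaccard score, the 0.5 threshold, and the names `tokens`, `jaccard`, and `cluster_records` are all assumptions made for the example, and the growth criterion mentioned in the abstract is left out because the text here does not define it.

```python
from collections import defaultdict

def tokens(record):
    """Lower-cased, whitespace-delimited words from every field value."""
    return {word for value in record.values() for word in str(value).lower().split()}

def jaccard(a, b):
    """Overlap of two token sets, in [0, 1]."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def cluster_records(records, threshold=0.5):
    """Return a list of cluster ids, one per record, in input order."""
    index = defaultdict(set)   # token -> ids of records already processed
    assignment = []            # position = record id, value = cluster id
    next_cluster = 0
    for rec_id, record in enumerate(records):
        query = tokens(record)  # the query generated from the record's values
        # Candidate records: earlier records sharing at least one query token.
        candidates = set()
        for tok in query:
            candidates |= index[tok]
        # Assumed membership criterion: the best Jaccard score against any
        # candidate record must reach the threshold.
        best_cluster, best_score = None, 0.0
        for cand in sorted(candidates):
            score = jaccard(query, tokens(records[cand]))
            if score > best_score:
                best_cluster, best_score = assignment[cand], score
        if best_cluster is not None and best_score >= threshold:
            assignment.append(best_cluster)   # select the matched cluster
        else:
            assignment.append(next_cluster)   # initialize a new cluster
            next_cluster += 1
        for tok in query:
            index[tok].add(rec_id)
    return assignment
```

For example, two records that differ only in letter case would share a cluster under these assumptions, while a record with no tokens in common would start a new one.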

21 Claims

1. A method, including:
- receiving data records, the received data records each including one or more values in one or more fields; and
  
  processing the received data records to identify a matched data cluster to associate with each received data record, the processing including:
  
  for at least one selected data record from the received data records, generating a first query from a first set of one or more values included in the selected data record including identifying tokens that each include a representation of at least one value or fragment of a value in a field or a combination of fields of the selected record and generating a second query from a second set of one or more values included in the selected data record, where the second set of one or more values is different from the first set of one or more values;
  
  identifying a first set of one or more candidate data records from the received data records using the first query;
  
  identifying a second set of one or more candidate data records from the received data records using the second query, the second set of one or more candidate data records partially overlapping the first set of one or more candidate data records;
  
  determining a third set of one or more candidate data records as a Boolean combination of the first set of one or more candidate data records and the second set of one or more candidate data records;
  
  determining whether or not the selected data record satisfies a cluster membership criterion for at least one candidate data cluster of one or more existing data clusters containing one or more candidate data records from at least one of the first set of one or more candidate data records or the second set of one or more candidate data records, the determining including applying the cluster membership criterion to the third set of one or more candidate data records; and
  
  selecting the matched data cluster from among one or more candidate data clusters, or initializing the matched data cluster with the selected data record if the selected data record does not satisfy a cluster membership criterion for any of the existing data clusters.
- Dependent claims (2 to 18):
  - 2. The method of claim 1, wherein the first query includes the tokens identified from the selected data record, and tokens that were identified from other received data records and that have a variant relationship to the tokens identified from the selected data record.
  - 3. The method of claim 2, wherein the variant relationship is based at least in part on an edit distance.
  - 4. The method of claim 1, wherein identifying candidate data records includes looking up the identified tokens in a data store, the data store mapping stored tokens to candidate data records or existing data clusters containing candidate data records.
  - 5. The method of claim 4, further including generating a set of stored tokens mapped to a candidate data record based on tokens identified from the candidate data record and tokens that were identified from other received data records and that have a variant relationship to the tokens identified from the candidate data record.
  - 6. The method of claim 1, wherein the processing further includes sorting at least an initial set of the received data records based on a distinguishability criterion that determines a degree to which one or more values included in a particular data record are able to distinguish that particular data record from other data records.
  - 7. The method of claim 6, wherein the selected data records from the received data records include selected data records from the sorted set of data records.
  - 8. The method of claim 6, wherein the distinguishability criterion is based on at least one of:
    - a number of fields that are populated with a value, or number of tokens in one or more fields.
  - 9. The method of claim 1, wherein selecting the matched data cluster includes:
    - calculating a comparison score by comparing the selected data record to at least one representative data record for an existing data cluster; and
      
      selecting the existing data cluster as the matched data cluster in response to determining that the comparison score exceeds a first threshold.
  - 10. The method of claim 9, further including:
    - comparing the comparison score to a second threshold; and
      
      initializing the matched data cluster with the selected data record in response to determining that the comparison score does not exceed the second threshold.
  - 11. The method of claim 1, wherein selecting the matched data cluster from among one or more existing data clusters includes selecting the matched data cluster from among multiple candidate data clusters for which the selected data record satisfies a cluster membership criterion.
  - 12. The method of claim 11, further including storing information identifying one or more candidate data clusters that were not selected as the matched data cluster for the selected data record.
  - 13. The method of claim 1, wherein identifying candidate data records includes comparing the first query to a data store mapping queries to candidate clusters including an entry mapping the first query to a first cluster.
  - 14. The method of claim 13, further including:
    - receiving a request to map the selected data record to a second cluster; and
      
      updating the data store to map the query to the second cluster.
  - 15. The method of claim 13, further including:
    - receiving a request to map the data record to a new cluster;
      
      updating the data store with a new cluster indicator;
      
      generating a new cluster; and
      
      assigning the selected data record to the new cluster.
  - 16. The method of claim 13, further including:
    - receiving a request to confirm membership of the selected data record in the first cluster; and
      
      storing information in the data store so that updates of the data store in response to requests associated with other data records do not change membership of the selected data record in the first membership cluster.
  - 17. The method of claim 13, further including:
    - receiving a request to exclude membership of the selected data record in the first cluster;
      
      updating the data store to change membership of the selected data record; and
      
      storing information in the data store so that updates of the data store in response to requests associated with other data records do not allow membership of the selected data record in the first membership cluster.
  - 18. The method of claim 13, further including receiving input from a user to approve or modify association of received data records to matched data clusters.
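Claims 2 to 4 describe expanding a query with variant tokens (for instance, tokens within some edit distance) and looking tokens up in a data store. One possible reading is sketched below. The Levenshtein distance, the `max_distance=1` cutoff, and the names `edit_distance`, `expand_query`, and `lookup` are assumptions for illustration; the claims only say the variant relationship is based "at least in part" on an edit distance.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def expand_query(record_tokens, known_tokens, max_distance=1):
    """Add tokens seen in other records that are close variants (claims 2-3)."""
    expanded = set(record_tokens)
    for tok in record_tokens:
        for other in known_tokens:
            if edit_distance(tok, other) <= max_distance:
                expanded.add(other)
    return expanded

def lookup(expanded_query, token_store):
    """token_store maps a stored token to a set of candidate record ids (claim 4)."""
    hits = set()
    for tok in expanded_query:
        hits |= token_store.get(tok, set())
    return hits
```

A query built from the token `smith` would, under these assumptions, also retrieve records indexed under `smyth`. The pairwise comparison against every known token is quadratic and is used here only to keep the sketch short.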

19. A computer program stored on a non-transitory computer-readable storage medium, the computer program including instructions for causing a computing system to:
- receive data records, the received data records each including one or more values in one or more fields; and
  
  process the received data records to identify a matched data cluster to associate with each received data record, the processing including:
  
  for at least one selected data record from the received data records, generating a first query from a first set of one or more values included in the selected data record including identifying tokens that each include a representation of at least one value or fragment of a value in a field or a combination of fields of the selected record and generating a second query from a second set of one or more values included in the selected data record, where the second set of one or more values is different from the first set of one or more values;
  
  identifying a first set of one or more candidate data records from the received data records using the first query;
  
  identifying a second set of one or more candidate data records from the received data records using the second query, the second set of one or more candidate data records partially overlapping the first set of one or more candidate data records;
  
  determining a third set of one or more candidate data records as a Boolean combination of the first set of one or more candidate data records and the second set of one or more candidate data records;
  
  determining whether or not the selected data record satisfies a cluster membership criterion for at least one candidate data cluster of one or more existing data clusters containing one or more candidate data records from at least one of the first set of one or more candidate data records or the second set of one or more candidate data records, the determining including applying the cluster membership criterion to the third set of one or more candidate data records; and
  
  selecting the matched data cluster from among one or more candidate data clusters, or initializing the matched data cluster with the selected data record if the selected data record does not satisfy a cluster membership criterion for any of the existing data clusters.
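The two-query step recited in the independent claims can be illustrated with a small sketch. Each query is built from a different set of field values of the selected record, each returns its own candidate set, and a third set is formed as a Boolean combination of the two. Exact-match field lookup and the names `run_query`, `candidate_sets`, and `combine` are assumptions for the example; the claims do not fix a query language or a particular Boolean operator.

```python
def run_query(records, query):
    """Ids of records whose fields equal every (field, value) pair in the query."""
    return {rec_id for rec_id, rec in records.items()
            if all(rec.get(field) == value for field, value in query.items())}

def candidate_sets(records, selected, first_fields, second_fields):
    """Build two queries from different value sets and run each one."""
    first_query = {f: selected[f] for f in first_fields}
    second_query = {f: selected[f] for f in second_fields}
    return run_query(records, first_query), run_query(records, second_query)

def combine(first, second, mode="and"):
    """Third candidate set as a Boolean combination of the first two."""
    if mode == "and":
        return first & second
    if mode == "or":
        return first | second
    if mode == "first_not_second":
        return first - second
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical data: records already received, keyed by record id.
records = {
    1: {"name": "smith", "zip": "02139"},
    2: {"name": "smith", "zip": "10001"},
    3: {"name": "jones", "zip": "02139"},
}
selected = {"name": "smith", "zip": "02139"}
first, second = candidate_sets(records, selected, ["name"], ["zip"])
third = combine(first, second, "and")
```

Here the name query and the postal-code query return partially overlapping sets, and the intersection narrows the records to which a membership criterion would then be applied.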

20. A computing system, including:
- an input device or port configured to receive data records, the received data records each including one or more values in one or more fields; and
  
  at least one processor, coupled to a memory, and configured to process the received data records to identify a matched data cluster to associate with each received data record, the processing including:
  
  for at least one selected data record from the received data records, generating a first query from a first set of one or more values included in the selected data record including identifying tokens that each include a representation of at least one value or fragment of a value in a field or a combination of fields of the selected record and generating a second query from a second set of one or more values included in the selected data record, where the second set of one or more values is different from the first set of one or more values;
  
  identifying a first set of one or more candidate data records from the received data records using the first query;
  
  identifying a second set of one or more candidate data records from the received data records using the second query, the second set of one or more candidate data records partially overlapping the first set of one or more candidate data records;
  
  determining a third set of one or more candidate data records as a Boolean combination of the first set of one or more candidate data records and the second set of one or more candidate data records;
  
  determining whether or not the selected data record satisfies a cluster membership criterion for at least one candidate data cluster of one or more existing data clusters containing one or more candidate data records from at least one of the first set of one or more candidate data records or the second set of one or more candidate data records, the determining including applying the cluster membership criterion to the third set of one or more candidate data records; and
  
  selecting the matched data cluster from among one or more candidate data clusters, or initializing the matched data cluster with the selected data record if the selected data record does not satisfy a cluster membership criterion for any of the existing data clusters.

21. A computing system, including:
- means for receiving data records, the received data records each including one or more values in one or more fields; and
  
  means for processing the received data records to identify a matched data cluster to associate with each received data record, the processing including:
  
  for at least one selected data record from the received data records, generating a first query from a first set of one or more values included in the selected data record including identifying tokens that each include a representation of at least one value or fragment of a value in a field or a combination of fields of the selected record and generating a second query from a second set of one or more values included in the selected data record, where the second set of one or more values is different from the first set of one or more values;
  
  identifying a first set of one or more candidate data records from the received data records using the first query;
  
  identifying a second set of one or more candidate data records from the received data records using the second query, the second set of one or more candidate data records partially overlapping the first set of one or more candidate data records;
  
  determining a third set of one or more candidate data records as a Boolean combination of the first set of one or more candidate data records and the second set of one or more candidate data records;
  
  determining whether or not the selected data record satisfies a cluster membership criterion for at least one candidate data cluster of one or more existing data clusters containing one or more candidate data records from at least one of the first set of one or more candidate data records or the second set of one or more candidate data records, the determining including applying the cluster membership criterion to the third set of one or more candidate data records; and
  
  selecting the matched data cluster from among one or more candidate data clusters, or initializing the matched data cluster with the selected data record if the selected data record does not satisfy a cluster membership criterion for any of the existing data clusters.
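One way to read the final select-or-initialize step is through the two-threshold variant of claims 9 and 10: join an existing cluster when a comparison score against its representative record exceeds a first threshold, and start a new cluster when the score does not exceed a second threshold. The sketch below is illustrative only. The field-agreement score, the threshold values, and the "undecided" outcome for scores between the thresholds are assumptions; the claims do not say what happens in that middle band.

```python
def comparison_score(record, representative):
    """Fraction of fields (from either record) whose values agree exactly."""
    fields = set(record) | set(representative)
    if not fields:
        return 0.0
    same = sum(1 for f in fields if record.get(f) == representative.get(f))
    return same / len(fields)

def decide(score, first_threshold=0.8, second_threshold=0.4):
    """Map a comparison score to an action on the candidate cluster."""
    if score > first_threshold:
        return "join"        # select the existing cluster (claim 9)
    if score <= second_threshold:
        return "new"         # initialize a new cluster (claim 10)
    return "undecided"       # assumed: neither threshold test resolves it
```

Keeping the two thresholds apart leaves room for a review band, which fits the user approval step of claim 18, though that pairing is an interpretation rather than something the claims state.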

Current Assignee: Ab Initio Technology LLC (Ab Initio Software Corporation)
Original Assignee: Ab Initio Technology LLC (Ab Initio Software Corporation)
Inventors: Anderson, Arlen; Trojan, Kamil
Primary Examiner(s): Gofman, Alex
Assistant Examiner(s): Mian, Umar
Application Number: US15/171,168
Publication Number: US 20160283574A1
Time in Patent Office: 1,363 Days
Field of Search: None

CPC Class Codes
- G06F 16/20   of structured data, e.g. re...
- G06F 16/24534   Query rewriting; Transforma...
- G06F 16/278   Data partitioning, e.g. hor...
- G06F 16/285   Clustering or classification
- G06F 16/3338   Query expansion