Data clustering based on candidate queries

US 9,361,355 B2
Filed: 11/15/2012
Issued: 06/07/2016
Est. Priority Date: 11/15/2011
Status: Active Grant

Abstract

Received data records, each including one or more values in one or more fields, are processed to identify a matched data cluster. The processing includes: for selected data records, generating a query from one or more values; identifying one or more candidate data records from the received data records using the query; determining whether or not the selected data record satisfies a cluster membership criterion for at least one candidate data cluster of one or more existing data clusters containing the candidate records; and selecting the matched data cluster from among one or more candidate data clusters based at least in part on a growth criterion for the candidate data clusters, or initializing the matched data cluster with the selected data record if the selected data record does not satisfy a cluster membership criterion for any of the existing data clusters or based on a result of the growth criterion.
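
The processing summarized above can be sketched roughly as follows. This is only an illustrative reading, not the patented implementation: the function names (`make_query`, `approx_distance`, `detailed_distance`, `cluster`), the choice of Jaccard distance as the approximate measure, the use of `difflib.SequenceMatcher` as the detailed measure, the treatment of a cluster's first record as its predetermined member, and all threshold values are assumptions made for the example.

```python
from difflib import SequenceMatcher


def make_query(record):
    """Query = lowercase word tokens from every field value (assumed form)."""
    return {tok for value in record.values() for tok in str(value).lower().split()}


def approx_distance(q1, q2):
    """Cheap approximate measure: Jaccard distance between token sets."""
    union = q1 | q2
    return 1.0 - len(q1 & q2) / len(union) if union else 0.0


def detailed_distance(q1, q2):
    """Costlier measure: 1 - similarity ratio of the sorted, joined tokens."""
    s1, s2 = " ".join(sorted(q1)), " ".join(sorted(q2))
    return 1.0 - SequenceMatcher(None, s1, s2).ratio()


def cluster(records, candidate_cutoff=0.8, growth_threshold=0.5,
            member_threshold=0.3):
    """Assign each record to a matched cluster or start a new one."""
    clusters = []  # each cluster is a list of records; index 0 is the
                   # assumed "predetermined member"
    for rec in records:
        query = make_query(rec)
        matched = None
        for members in clusters:
            # First comparison: approximate measure finds candidate records.
            if not any(approx_distance(query, make_query(m)) <= candidate_cutoff
                       for m in members):
                continue
            # Second comparison: growth criterion, a boundary (first
            # threshold) around the cluster's predetermined member.
            if detailed_distance(query, make_query(members[0])) > growth_threshold:
                continue
            # Third comparison: membership criterion, detailed measure
            # against any previously added member (second threshold).
            if any(detailed_distance(query, make_query(m)) <= member_threshold
                   for m in members):
                matched = members
                break
        if matched is None:
            clusters.append([rec])  # initialize a new cluster with the record
        else:
            matched.append(rec)
    return clusters
```

In this sketch a record joins an existing cluster only when both the growth test and the membership test pass; failing either one for every candidate cluster starts a new cluster, which mirrors the two "initializing" branches in claim 1.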

90 Citations

101 Claims

1. A method, including:
- receiving data records, the received data records each including one or more values in one or more fields; and
  
  processing the received data records to identify at least one matched data cluster to associate with each received data record, the processing including:
  
  for at least one selected data record from the received data records, generating a query from the one or more values included in the selected data record and performing at least a first comparison, a second comparison, and a third comparison using the generated query;
  
  identifying, in the first comparison, one or more candidate data records from the received data records using the query and an approximate distance measure;
  
  determining, in the second comparison performed after the first comparison, whether or not the selected data record satisfies a growth criterion for at least one candidate data cluster of one or more existing data clusters containing the candidate records, wherein the growth criterion is different from any cluster membership criterion for any candidate data cluster and uses the query and a first threshold associated with a boundary around a respective predetermined member of a candidate data cluster;
  
  determining, in the third comparison performed after the second comparison, whether or not the selected data record satisfies a cluster membership criterion for at least one candidate data cluster of one or more existing data clusters containing the candidate records using the query and a second threshold associated with a detailed distance measure more accurate than the approximate distance measure; and
  
  selecting the matched data cluster from among one or more candidate data clusters if the selected data record satisfies both the cluster membership criterion and the growth criterion for the matched data cluster, or initializing the matched data cluster with the selected data record if the selected data record does not satisfy the growth criterion for any of the existing data clusters or if the selected data record does satisfy the growth criterion for at least one of the existing data clusters but does not satisfy a cluster membership criterion for any of the existing data clusters.
- - 2. The method of claim 1, wherein generating the query includes identifying tokens that each include at least one value or fragment of a value in a field or a combination of fields of the selected data record.
  - 3. The method of claim 2, wherein the query includes the tokens identified from the selected data record, and tokens that were identified from other received data records and that have a variant relationship to the tokens identified from the selected data record.
  - 4. The method of claim 3, wherein the variant relationship is based at least in part on an edit distance.
  - 5. The method of claim 2, wherein identifying candidate data records includes looking up the identified tokens in a data store, the data store mapping stored tokens to candidate data records or existing data clusters containing candidate data records.
  - 6. The method of claim 5, further including generating a set of stored tokens mapped to a candidate data record based on tokens identified from the candidate data record and tokens that were identified from other received data records and that have a variant relationship to the tokens identified from the candidate data record.
  - 7. The method of claim 1, wherein the processing further includes sorting at least an initial set of the received data records based on a distinguishability criterion that determines a degree to which one or more values included in a particular data record are able to distinguish that particular data record from other data records.
  - 8. The method of claim 7, wherein the at least one selected data record from the received data records includes a plurality of selected data records from the sorted set of data records.
  - 9. The method of claim 7, wherein the distinguishability criterion is based on at least one of:
    - a number of fields that are populated with a value, or number of tokens in one or more fields.
  - 10. The method of claim 1, wherein selecting the matched data cluster from among one or more candidate data clusters includes:
    - calculating a comparison score by comparing the selected data record to at least one data record that is a previously added member of a candidate data cluster; and
      
      selecting the candidate data cluster as the matched data cluster in response to determining that the comparison score indicates that the selected data record is within the second threshold of the previously added member of the candidate data cluster, and the growth criterion indicates that the selected data record is within the first threshold of a predetermined member of the candidate data cluster.
  - 11. The method of claim 10, wherein initializing the matched data cluster with the selected data record includes:
    - determining that the growth criterion indicates that the selected data record is not within the first threshold of the predetermined member of the candidate data cluster.
  - 12. The method of claim 1, wherein selecting the matched data cluster from among one or more existing data clusters includes selecting the matched data cluster from among multiple candidate data clusters for which the selected data record satisfies a cluster membership criterion.
  - 13. The method of claim 12, further including storing information identifying one or more candidate data clusters that were not selected as the matched data cluster for the selected data record.
  - 14. The method of claim 1, wherein identifying candidate data records includes comparing the query to a data store mapping queries to candidate clusters including an entry mapping the query to a first cluster.
  - 15. The method of claim 14, further including:
    - receiving a request to map the selected data record to a second cluster; and
      
      updating the data store to map the query to the second cluster.
  - 16. The method of claim 14, further including:
    - receiving a request to map the data record to a new cluster;
      
      updating the data store with a new cluster indicator;
      
      generating a new cluster; and
      
      assigning the selected data record to the new cluster.
  - 17. The method of claim 14, further including:
    - receiving a request to confirm membership of the selected data record in the first cluster; and
      
      storing information in the data store so that updates of the data store in response to requests associated with other data records do not change membership of the selected data record in the first membership cluster.
  - 18. The method of claim 14, further including:
    - receiving a request to exclude membership of the selected data record in the first cluster;
      
      updating the data store to change membership of the selected data record; and
      
      storing information in the data store so that updates of the data store in response to requests associated with other data records do not allow membership of the selected data record in the first membership cluster.
  - 19. The method of claim 14, further including receiving input from a user to approve or modify association of received data records to matched data clusters.
  - 23. The method of claim 1, wherein the growth criterion limits growth of the clusters such that data records that are members of a first candidate data cluster are within the first threshold of a predetermined member of the first candidate data cluster.
  - 24. The method of claim 23, wherein the cluster membership criterion indicates that data records that are members of the first candidate data cluster are within the second threshold of at least one previously added member of the first candidate data cluster.
  - 25. The method of claim 24, wherein the first threshold is different from the second threshold.
  - 26. The method of claim 1, wherein the processing further includes:
    - for a plurality of tokens that each include at least one value or fragment of a value in a field or a combination of fields of the received data records, storing, within entries in a search store each associated with at least one respective token of the plurality of tokens, location information identifying at least some of the received data records that correspond to said at least one respective token.
  - 27. The method of claim 26, wherein the processing further includes:
    - forming one or more search codes, each search code encoding a result of a search for a combination of tokens from multiple entries in the search store.
  - 28. The method of claim 27, wherein identifying, in the first comparison, one or more candidate data records using the query and an approximate distance measure further includes:
    - retrieving the one or more candidate data records from the received data records using a final location information result determined from the location information stored in multiple entries in the search store corresponding to at least one of the search codes corresponding to the query.
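
The token-based lookup recited in claims 2 through 6 and 26 might look something like the sketch below. It is a simplified assumption, not the claimed search store: the helper names (`tokenize`, `build_search_store`, `expand_query`, `candidates`), whitespace tokenization, a maximum edit distance of 1 for variants, and the `min_hits` cutoff are all hypothetical, and the search codes of claims 27 and 28 are not modeled.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def tokenize(record):
    """Tokens = lowercase words from every field value (assumed form)."""
    return {tok for value in record.values() for tok in str(value).lower().split()}


def build_search_store(records):
    """Map each token to the positions of the records containing it."""
    store = defaultdict(set)
    for pos, rec in enumerate(records):
        for tok in tokenize(rec):
            store[tok].add(pos)
    return store


def expand_query(query_tokens, store, max_edits=1):
    """Add stored tokens within max_edits of a query token (variants)."""
    expanded = set(query_tokens)
    for tok in query_tokens:
        for stored in store:
            # Length check is a cheap filter before the full comparison.
            if (abs(len(stored) - len(tok)) <= max_edits
                    and edit_distance(tok, stored) <= max_edits):
                expanded.add(stored)
    return expanded


def candidates(record, store, min_hits=2):
    """Positions of records sharing at least min_hits tokens or variants."""
    hits = defaultdict(int)
    for tok in expand_query(tokenize(record), store):
        for pos in store.get(tok, ()):
            hits[pos] += 1
    return {pos for pos, n in hits.items() if n >= min_hits}
```

Expanding the query with variant tokens lets a misspelled value such as "Jonn" still reach records indexed under "John" or "Jon", at the cost of scanning the stored tokens; a production index would likely precompute the variant pairs instead.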

20. A computer program stored on a non-transitory computer-readable medium, the computer program including instructions for causing a computing system to:
- receive data records, the received data records each including one or more values in one or more fields; and
  
  process the received data records to identify at least one matched data cluster to associate with each received data record, the processing including:
  
  for at least one selected data record from the received data records, generating a query from the one or more values included in the selected data record and performing at least a first comparison, a second comparison, and a third comparison using the generated query;
  
  identifying, in the first comparison, one or more candidate data records from the received data records using the query and an approximate distance measure;
  
  determining, in the second comparison performed after the first comparison, whether or not the selected data record satisfies a growth criterion for at least one candidate data cluster of one or more existing data clusters containing the candidate records, wherein the growth criterion is different from any cluster membership criterion for any candidate data cluster and uses the query and a first threshold associated with a boundary around a respective predetermined member of a candidate data cluster;
  
  determining, in the third comparison performed after the second comparison, whether or not the selected data record satisfies a cluster membership criterion for at least one candidate data cluster of one or more existing data clusters containing the candidate records using the query and a second threshold associated with a detailed distance measure more accurate than the approximate distance measure; and
  
  selecting the matched data cluster from among one or more candidate data clusters if the selected data record satisfies both the cluster membership criterion and the growth criterion for the matched data cluster, or initializing the matched data cluster with the selected data record if the selected data record does not satisfy the growth criterion for any of the existing data clusters or if the selected data record does satisfy the growth criterion for at least one of the existing data clusters but does not satisfy a cluster membership criterion for any of the existing data clusters.
- - 37. The computer program of claim 20, wherein generating the query includes identifying tokens that each include at least one value or fragment of a value in a field or a combination of fields of the selected data record.
  - 38. The computer program of claim 37, wherein the query includes the tokens identified from the selected data record, and tokens that were identified from other received data records and that have a variant relationship to the tokens identified from the selected data record.
  - 39. The computer program of claim 38, wherein the variant relationship is based at least in part on an edit distance.
  - 40. The computer program of claim 37, wherein identifying candidate data records includes looking up the identified tokens in a data store, the data store mapping stored tokens to candidate data records or existing data clusters containing candidate data records.
  - 41. The computer program of claim 40, further including instructions for causing a computing system to generate a set of stored tokens mapped to a candidate data record based on tokens identified from the candidate data record and tokens that were identified from other received data records and that have a variant relationship to the tokens identified from the candidate data record.
  - 42. The computer program of claim 20, wherein the processing further includes sorting at least an initial set of the received data records based on a distinguishability criterion that determines a degree to which one or more values included in a particular data record are able to distinguish that particular data record from other data records.
  - 43. The computer program of claim 42, wherein the at least one selected data record from the received data records includes a plurality of selected data records from the sorted set of data records.
  - 44. The computer program of claim 42, wherein the distinguishability criterion is based on at least one of:
    - a number of fields that are populated with a value, or number of tokens in one or more fields.
  - 45. The computer program of claim 20, wherein selecting the matched data cluster from among one or more candidate data clusters includes:
    - calculating a comparison score by comparing the selected data record to at least one data record that is a previously added member of a candidate data cluster; and
      
      selecting the candidate data cluster as the matched data cluster in response to determining that the comparison score indicates that the selected data record is within the second threshold of the previously added member of the candidate data cluster, and the growth criterion indicates that the selected data record is within the first threshold of a predetermined member of the candidate data cluster.
  - 46. The computer program of claim 45, wherein initializing the matched data cluster with the selected data record includes:
    - determining that the growth criterion indicates that the selected data record is not within the first threshold of the predetermined member of the candidate data cluster.
  - 47. The computer program of claim 20, wherein selecting the matched data cluster from among one or more existing data clusters includes selecting the matched data cluster from among multiple candidate data clusters for which the selected data record satisfies a cluster membership criterion.
  - 48. The computer program of claim 47, further including instructions for causing a computing system to store information identifying one or more candidate data clusters that were not selected as the matched data cluster for the selected data record.
  - 49. The computer program of claim 20, wherein identifying candidate data records includes comparing the query to a data store mapping queries to candidate clusters including an entry mapping the query to a first cluster.
  - 50. The computer program of claim 49, further including instructions for causing a computing system to:
    - receive a request to map the selected data record to a second cluster; and
      
      update the data store to map the query to the second cluster.
  - 51. The computer program of claim 49, further including instructions for causing a computing system to:
    - receive a request to map the data record to a new cluster;
      
      update the data store with a new cluster indicator;
      
      generate a new cluster; and
      
      assign the selected data record to the new cluster.
  - 52. The computer program of claim 49, further including instructions for causing a computing system to:
    - receive a request to confirm membership of the selected data record in the first cluster; and
      
      store information in the data store so that updates of the data store in response to requests associated with other data records do not change membership of the selected data record in the first membership cluster.
  - 53. The computer program of claim 49, further including instructions for causing a computing system to:
    - receive a request to exclude membership of the selected data record in the first cluster;
      
      update the data store to change membership of the selected data record; and
      
      store information in the data store so that updates of the data store in response to requests associated with other data records do not allow membership of the selected data record in the first membership cluster.
  - 54. The computer program of claim 49, further including instructions for causing a computing system to receive input from a user to approve or modify association of received data records to matched data clusters.
  - 55. The computer program of claim 20, wherein the growth criterion limits growth of the clusters such that data records that are members of a first candidate data cluster are within the first threshold of a predetermined member of the first candidate data cluster.
  - 56. The computer program of claim 55, wherein the cluster membership criterion indicates that data records that are members of the first candidate data cluster are within the second threshold of at least one previously added member of the first candidate data cluster.
  - 57. The computer program of claim 56, wherein the first threshold is different from the second threshold.
  - 58. The computer program of claim 20, wherein the processing further includes:
    - for a plurality of tokens that each include at least one value or fragment of a value in a field or a combination of fields of the received data records, storing, within entries in a search store each associated with at least one respective token of the plurality of tokens, location information identifying at least some of the received data records that correspond to said at least one respective token.
  - 59. The computer program of claim 58, wherein the processing further includes:
    - forming one or more search codes, each search code encoding a result of a search for a combination of tokens from multiple entries in the search store.
  - 60. The computer program of claim 59, wherein identifying, in the first comparison, one or more candidate data records using the query and an approximate distance measure further includes:
    - retrieving the one or more candidate data records from the received data records using a final location information result determined from the location information stored in multiple entries in the search store corresponding to at least one of the search codes corresponding to the query.
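
The distinguishability sort of claims 42 through 44 could be approximated as below. This is a sketch under assumptions: the function names are hypothetical, the score combines both recited signals (populated field count, then token count) as a tuple, and placing the most distinguishable records first is a guess at the intended ordering rather than something the claims state.

```python
def distinguishability(record):
    """Score a record by (populated field count, total token count)."""
    populated = [str(v).strip() for v in record.values()
                 if v is not None and str(v).strip()]
    n_tokens = sum(len(v.split()) for v in populated)
    return (len(populated), n_tokens)


def sort_by_distinguishability(records):
    """Order records so the most distinguishable come first (assumed)."""
    return sorted(records, key=distinguishability, reverse=True)
```

Processing well-populated records first means early clusters are seeded by records with more evidence, so sparser records processed later are compared against fuller ones.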

21. A computing system, including:
- an input device or port configured to receive data records, the received data records each including one or more values in one or more fields; and
  
  at least one processor coupled to memory storing at least some data records, the processor configured to process the received data records to identify at least one matched data cluster to associate with each received data record, the processing including:
  
  for at least one selected data record from the received data records, generating a query from the one or more values included in the selected data record and performing at least a first comparison, a second comparison, and a third comparison using the generated query;
  
  identifying, in the first comparison, one or more candidate data records from the received data records using the query and an approximate distance measure;
  
  determining, in the second comparison performed after the first comparison, whether or not the selected data record satisfies a growth criterion for at least one candidate data cluster of one or more existing data clusters containing the candidate records, wherein the growth criterion is different from any cluster membership criterion for any candidate data cluster and uses the query and a first threshold associated with a boundary around a respective predetermined member of a candidate data cluster;
  
  determining, in the third comparison performed after the second comparison, whether or not the selected data record satisfies a cluster membership criterion for at least one candidate data cluster of one or more existing data clusters containing the candidate records using the query and a second threshold associated with a detailed distance measure more accurate than the approximate distance measure; and
  
  selecting the matched data cluster from among one or more candidate data clusters if the selected data record satisfies both the cluster membership criterion and the growth criterion for the matched data cluster, or initializing the matched data cluster with the selected data record if the selected data record does not satisfy the growth criterion for any of the existing data clusters or if the selected data record does satisfy the growth criterion for at least one of the existing data clusters but does not satisfy a cluster membership criterion for any of the existing data clusters.
- - 61. The computing system of claim 21, wherein generating the query includes identifying tokens that each include at least one value or fragment of a value in a field or a combination of fields of the selected data record.
  - 62. The computing system of claim 61, wherein the query includes the tokens identified from the selected data record, and tokens that were identified from other received data records and that have a variant relationship to the tokens identified from the selected data record.
  - 63. The computing system of claim 62, wherein the variant relationship is based at least in part on an edit distance.
  - 64. The computing system of claim 61, wherein identifying candidate data records includes looking up the identified tokens in a data store, the data store mapping stored tokens to candidate data records or existing data clusters containing candidate data records.
  - 65. The computing system of claim 64, wherein the processor is further configured to generate a set of stored tokens mapped to a candidate data record based on tokens identified from the candidate data record and tokens that were identified from other received data records and that have a variant relationship to the tokens identified from the candidate data record.
  - 66. The computing system of claim 21, wherein the processing further includes sorting at least an initial set of the received data records based on a distinguishability criterion that determines a degree to which one or more values included in a particular data record are able to distinguish that particular data record from other data records.
  - 67. The computing system of claim 66, wherein the at least one selected data record from the received data records includes a plurality of selected data records from the sorted set of data records.
  - 68. The computing system of claim 66, wherein the distinguishability criterion is based on at least one of:
    - a number of fields that are populated with a value, or number of tokens in one or more fields.
  - 69. The computing system of claim 21, wherein selecting the matched data cluster from among one or more candidate data clusters includes:
    - calculating a comparison score by comparing the selected data record to at least one data record that is a previously added member of a candidate data cluster; and
      
      selecting the candidate data cluster as the matched data cluster in response to determining that the comparison score indicates that the selected data record is within the second threshold of the previously added member of the candidate data cluster, and the growth criterion indicates that the selected data record is within the first threshold of a predetermined member of the candidate data cluster.
  - 70. The computing system of claim 69, wherein initializing the matched data cluster with the selected data record includes:
    - determining that the growth criterion indicates that the selected data record is not within the first threshold of the predetermined member of the candidate data cluster.
  - 71. The computing system of claim 21, wherein selecting the matched data cluster from among one or more existing data clusters includes selecting the matched data cluster from among multiple candidate data clusters for which the selected data record satisfies a cluster membership criterion.
  - 72. The computing system of claim 71, wherein the processor is further configured to store information identifying one or more candidate data clusters that were not selected as the matched data cluster for the selected data record.
  - 73. The computing system of claim 21, wherein identifying candidate data records includes comparing the query to a data store mapping queries to candidate clusters including an entry mapping the query to a first cluster.
  - 74. The computing system of claim 73, wherein the processor is further configured to:
    - receive a request to map the selected data record to a second cluster; and
      
      update the data store to map the query to the second cluster.
  - 75. The computing system of claim 73, wherein the processor is further configured to:
    - receive a request to map the data record to a new cluster;
      
      update the data store with a new cluster indicator;
      
      generate a new cluster; and
      
      assign the selected data record to the new cluster.
  - 76. The computing system of claim 73, wherein the processor is further configured to:
    - receive a request to confirm membership of the selected data record in the first cluster; and
      
      store information in the data store so that updates of the data store in response to requests associated with other data records do not change membership of the selected data record in the first membership cluster.
  - 77. The computing system of claim 73, wherein the processor is further configured to:
    - receive a request to exclude membership of the selected data record in the first cluster;
      
      update the data store to change membership of the selected data record; and
      
      store information in the data store so that updates of the data store in response to requests associated with other data records do not allow membership of the selected data record in the first membership cluster.
  - 78. The computing system of claim 73, wherein the processor is further configured to receive input from a user to approve or modify association of received data records to matched data clusters.
  - 79. The computing system of claim 21, wherein the growth criterion limits growth of the clusters such that data records that are members of a first candidate data cluster are within the first threshold of a predetermined member of the first candidate data cluster.
  - 80. The computing system of claim 79, wherein the cluster membership criterion indicates that data records that are members of the first candidate data cluster are within the second threshold of at least one previously added member of the first candidate data cluster.
  - 81. The computing system of claim 80, wherein the first threshold is different from the second threshold.
  - 82. The computing system of claim 21, wherein the processing further includes:
    - for a plurality of tokens that each include at least one value or fragment of a value in a field or a combination of fields of the received data records, storing, within entries in a search store each associated with at least one respective token of the plurality of tokens, location information identifying at least some of the received data records that correspond to said at least one respective token.
  - 83. The computing system of claim 82, wherein the processing further includes:
    - forming one or more search codes, each search code encoding a result of a search for a combination of tokens from multiple entries in the search store.
  - 84. The computing system of claim 83, wherein identifying, in the first comparison, one or more candidate data records using the query and an approximate distance measure further includes:
    - retrieving the one or more candidate data records from the received data records using a final location information result determined from the location information stored in multiple entries in the search store corresponding to at least one of the search codes corresponding to the query.
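
Claims 82 to 84 describe a search store keyed by token, with candidate retrieval driven by combinations of tokens. A minimal sketch of that idea follows, assuming a whitespace tokenizer, record ids as the "location information", and set intersection as the way entries are combined. The names `tokenize`, `build_search_store`, `find_candidates`, and `min_shared` are hypothetical and do not come from the patent, and the sketch does not attempt to model the encoded search codes themselves.

```python
from collections import defaultdict
from itertools import combinations

def tokenize(record):
    """Split every field value into lowercase word tokens (an assumed tokenization)."""
    return {word for value in record.values() for word in str(value).lower().split()}

def build_search_store(records):
    """Map each token to the set of record ids (location information) containing it."""
    store = defaultdict(set)
    for record_id, record in records.items():
        for token in tokenize(record):
            store[token].add(record_id)
    return store

def find_candidates(store, query_record, min_shared=2):
    """Return ids of records sharing at least `min_shared` tokens with the query."""
    tokens = sorted(t for t in tokenize(query_record) if t in store)
    candidates = set()
    for combo in combinations(tokens, min_shared):
        # Intersect the location sets of the combined tokens (multiple entries).
        candidates |= set.intersection(*(store[t] for t in combo))
    return candidates

records = {
    1: {"name": "John Smith", "city": "Boston"},
    2: {"name": "Jon Smith", "city": "Boston"},
    3: {"name": "Mary Jones", "city": "Austin"},
}
store = build_search_store(records)
candidates = find_candidates(store, {"name": "John Smith", "city": "Boston"})
```

Here records 1 and 2 are returned as candidates because each shares at least two tokens with the query, while record 3 shares none.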

22. A computing system, including:
- means for receiving data records, the received data records each including one or more values in one or more fields; and
  
  means for processing the received data records to identify at least one matched data cluster to associate with each received data record, the processing including;
  
  for at least one selected data record from the received data records, generating a query from the one or more values included in the selected data record and performing at least a first comparison, a second comparison, and a third comparison using the generated query;
  
  identifying, in the first comparison, one or more candidate data records from the received data records using the query and an approximate distance measure;
  
  determining, in the second comparison performed after the first comparison, whether or not the selected data record satisfies a growth criterion for at least one candidate data cluster of one or more existing data clusters containing the candidate records, wherein the growth criterion is different from any cluster membership criterion for any candidate data cluster and uses the query and a first threshold associated with a boundary around a respective predetermined member of a candidate data cluster;
  
  determining, in the third comparison performed after the second comparison, whether or not the selected data record satisfies a cluster membership criterion for at least one candidate data cluster of one or more existing data clusters containing the candidate records using the query and a second threshold associated with a detailed distance measure more accurate than the approximate distance measure; and
  
  selecting the matched data cluster from among one or more candidate data clusters if the selected data record satisfies both the cluster membership criterion and the growth criterion for the matched data cluster, or initializing the matched data cluster with the selected data record if the selected data record does not satisfy the growth criterion for any of the existing data clusters or if the selected data record does satisfy the growth criterion for at least one of the existing data clusters but does not satisfy a cluster membership criterion for any of the existing data clusters.
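
The three comparisons recited above can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: records are plain strings, the approximate measure is a token-overlap (Jaccard) distance, the detailed measure is a character-level ratio from Python's `difflib`, the first member of each cluster is treated as the predetermined member, and the growth check reuses the detailed measure. The function names and all three thresholds are hypothetical.

```python
from difflib import SequenceMatcher

def approx_distance(a, b):
    """Cheap distance: 1 - Jaccard overlap of word tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not (ta | tb):
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)

def detailed_distance(a, b):
    """Costlier character-level distance (a stand-in for the detailed measure)."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()

def assign(record, clusters, approx_max=0.8, growth_max=0.5, member_max=0.3):
    """Return the index of the matched cluster, creating a new one if needed.

    Each cluster is a list of records; clusters[i][0] is treated as the
    predetermined member that anchors the growth boundary (an assumption).
    """
    # First comparison: candidate clusters via the approximate measure.
    candidates = [i for i, members in enumerate(clusters)
                  if any(approx_distance(record, m) <= approx_max for m in members)]
    best, best_dist = None, None
    for i in candidates:
        # Second comparison: growth criterion against the anchor member.
        if detailed_distance(record, clusters[i][0]) > growth_max:
            continue
        # Third comparison: membership against the closest existing member.
        dist = min(detailed_distance(record, m) for m in clusters[i])
        if dist <= member_max and (best_dist is None or dist < best_dist):
            best, best_dist = i, dist
    if best is None:
        # No cluster passed both criteria: initialize a new cluster.
        clusters.append([record])
        return len(clusters) - 1
    clusters[best].append(record)
    return best

clusters = []
for rec in ["acme corp boston", "acme corp. boston",
            "acme corporation boston", "zenith ltd austin"]:
    assign(rec, clusters)
```

Keeping the growth threshold separate from the membership threshold means a cluster can accept records that chain from member to member while still being bounded by distance from its anchor.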

29. A method, including:
- receiving data records, the received data records each including one or more values in one or more fields;
  
  processing the received data records to identify at least one matched data cluster to associate with each received data record, the processing including;
  
  for at least one selected data record from the received data records, generating a query from the one or more values included in the selected data record and performing at least a first comparison, and a second comparison using the generated query;
  
  identifying, in the first comparison, a plurality of candidate data records from the received data records using the query and a first distance measure;
  
  determining, in the second comparison performed after the first comparison, whether or not the selected data record satisfies cluster membership criteria for a plurality of candidate data clusters of a plurality of existing data clusters containing the candidate records using the query and a threshold associated with a second distance measure different from the first distance measure; and
  
  determining an ambiguous match to at least two matched data clusters for the selected data record, based on the determination of cluster membership for the plurality of candidate data clusters; and
  
  receiving, in a user interface displaying results of processing the received data records including displaying an indication of the ambiguous match, user input for resolving the ambiguous match to a single matched data cluster of the at least two matched data clusters for the selected data record or for resolving the ambiguous match to a plurality of matched data clusters with a weight associated with each matched data cluster.
- View Dependent Claims (30, 31, 32, 33, 34, 35, 36)
  - 30. The method of claim 29, wherein the at least two matched data clusters are equal matches to the query according to the second distance measure.
  - 31. The method of claim 30, wherein the second distance measure is more accurate than the first distance measure.
  - 32. The method of claim 29, wherein generating the query includes identifying tokens that each include at least one value or fragment of a value in a field or a combination of fields of the selected data record.
  - 33. The method of claim 32, wherein the query includes the tokens identified from the selected data record, and tokens that were identified from other received data records and that have a variant relationship to the tokens identified from the selected data record.
  - 34. The method of claim 33, wherein the variant relationship is based at least in part on an edit distance.
  - 35. The method of claim 32, wherein identifying candidate data records includes looking up the identified tokens in a data store, the data store mapping stored tokens to candidate data records or existing data clusters containing candidate data records.
  - 36. The method of claim 35, further including generating a set of stored tokens mapped to a candidate data record based on tokens identified from the candidate data record and tokens that were identified from other received data records and that have a variant relationship to the tokens identified from the candidate data record.
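
Claims 32 to 34 describe a query made of the selected record's tokens plus variant tokens from other records, where the variant relationship depends at least in part on an edit distance. A small sketch of that expansion is shown below, assuming Levenshtein distance and a cutoff of one edit. The names `edit_distance`, `expand_query`, and `max_edits` are illustrative only.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def expand_query(record_tokens, known_tokens, max_edits=1):
    """Add tokens seen in other records that lie within `max_edits` of a query token."""
    query = set(record_tokens)
    for token in record_tokens:
        query |= {k for k in known_tokens if edit_distance(token, k) <= max_edits}
    return query

# Tokens previously identified from other received records (made-up sample data).
known = {"smith", "smyth", "smithe", "jones", "john", "jon"}
query = expand_query({"john", "smith"}, known)
```

With this sample data the expanded query picks up "jon", "smyth", and "smithe" but not "jones", which is three edits away from "john".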

85. A computer program stored on a non-transitory computer-readable medium, the computer program including instructions for causing a computing system to:
- receive data records, the received data records each including one or more values in one or more fields;
  
  process the received data records to identify at least one matched data cluster to associate with each received data record, the processing including;
  
  for at least one selected data record from the received data records, generating a query from the one or more values included in the selected data record and performing at least a first comparison, and a second comparison using the generated query;
  
  identifying, in the first comparison, a plurality of candidate data records from the received data records using the query and a first distance measure;
  
  determining, in the second comparison performed after the first comparison, whether or not the selected data record satisfies cluster membership criteria for a plurality of candidate data clusters of a plurality of existing data clusters containing the candidate records using the query and a threshold associated with a second distance measure different from the first distance measure; and
  
  determining an ambiguous match to at least two matched data clusters for the selected data record, based on the determination of cluster membership for the plurality of candidate data clusters; and
  
  receive, in a user interface displaying results of processing the received data records including displaying an indication of the ambiguous match, user input for resolving the ambiguous match to a single matched data cluster of the at least two matched data clusters for the selected data record or for resolving the ambiguous match to a plurality of matched data clusters with a weight associated with each matched data cluster.
- View Dependent Claims (86, 87, 88, 89, 90, 91, 92)
  - 86. The computer program of claim 85, wherein the second distance measure is more accurate than the first distance measure.
  - 87. The computer program of claim 86, wherein the at least two matched data clusters are equal matches to the query according to the second distance measure.
  - 88. The computer program of claim 85, wherein generating the query includes identifying tokens that each include at least one value or fragment of a value in a field or a combination of fields of the selected data record.
  - 89. The computer program of claim 88, wherein the query includes the tokens identified from the selected data record, and tokens that were identified from other received data records and that have a variant relationship to the tokens identified from the selected data record.
  - 90. The computer program of claim 89, wherein the variant relationship is based at least in part on an edit distance.
  - 91. The computer program of claim 88, wherein identifying candidate data records includes looking up the identified tokens in a data store, the data store mapping stored tokens to candidate data records or existing data clusters containing candidate data records.
  - 92. The computer program of claim 91, further including instructions for causing a computing system to generate a set of stored tokens mapped to a candidate data record based on tokens identified from the candidate data record and tokens that were identified from other received data records and that have a variant relationship to the tokens identified from the candidate data record.
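
The ambiguous-match handling in these claims can be illustrated with a short sketch. It assumes the second comparison has already produced one distance per candidate cluster, treats tied best distances as the ambiguous case (as in the dependent claims on equal matches), and models the user's decision as a mapping from cluster id to weight. The names `find_matches` and `resolve`, the tolerance, and the sample values are hypothetical.

```python
def find_matches(distances, threshold, tolerance=1e-9):
    """Return cluster ids within `threshold` that tie for the smallest distance."""
    passing = {cid: d for cid, d in distances.items() if d <= threshold}
    if not passing:
        return []
    best = min(passing.values())
    return sorted(cid for cid, d in passing.items() if d - best <= tolerance)

def resolve(matches, user_weights=None):
    """Apply a reviewer's decision: a weight per cluster, normalized to sum to 1.

    With no input, the ambiguity is kept as equal partial memberships.
    """
    weights = user_weights or {cid: 1.0 for cid in matches}
    total = sum(weights.values())
    return {cid: w / total for cid, w in weights.items()}

distances = {"c1": 0.12, "c2": 0.12, "c3": 0.40}
matches = find_matches(distances, threshold=0.25)
is_ambiguous = len(matches) > 1          # would be flagged in the user interface

single = resolve(matches, {"c1": 1.0})             # resolve to one cluster
split = resolve(matches, {"c1": 3.0, "c2": 1.0})   # weighted partial membership
```

Resolving to a single cluster is just the special case where one cluster carries all of the weight.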

93. A computing system, including:
- an input device or port configured to receive data records, the received data records each including one or more values in one or more fields;
  
  at least one processor coupled to memory storing at least some data records, the processor configured to process the received data records to identify at least one matched data cluster to associate with each received data record, the processing including;
  
  for at least one selected data record from the received data records, generating a query from the one or more values included in the selected data record and performing at least a first comparison, and a second comparison using the generated query;
  
  identifying, in the first comparison, a plurality of candidate data records from the received data records using the query and a first distance measure;
  
  determining, in the second comparison performed after the first comparison, whether or not the selected data record satisfies cluster membership criteria for a plurality of candidate data clusters of a plurality of existing data clusters containing the candidate records using the query and a threshold associated with a second distance measure different from the first distance measure; and
  
  determining an ambiguous match to at least two matched data clusters for the selected data record, based on the determination of cluster membership for the plurality of candidate data clusters; and
  
  a user interface displaying results of processing the received data records including displaying an indication of the ambiguous match, configured to receive user input for resolving the ambiguous match to a single matched data cluster of the at least two matched data clusters for the selected data record or for resolving the ambiguous match to a plurality of matched data clusters with a weight associated with each matched data cluster.
- View Dependent Claims (94, 95, 96, 97, 98, 99, 100)
  - 94. The computing system of claim 93, wherein the second distance measure is more accurate than the first distance measure.
  - 95. The computing system of claim 94, wherein the at least two matched data clusters are equal matches to the query according to the second distance measure.
  - 96. The computing system of claim 93, wherein generating the query includes identifying tokens that each include at least one value or fragment of a value in a field or a combination of fields of the selected data record.
  - 97. The computing system of claim 96, wherein the query includes the tokens identified from the selected data record, and tokens that were identified from other received data records and that have a variant relationship to the tokens identified from the selected data record.
  - 98. The computing system of claim 97, wherein the variant relationship is based at least in part on an edit distance.
  - 99. The computing system of claim 96, wherein identifying candidate data records includes looking up the identified tokens in a data store, the data store mapping stored tokens to candidate data records or existing data clusters containing candidate data records.
  - 100. The computing system of claim 99, wherein the processor is further configured to generate a set of stored tokens mapped to a candidate data record based on tokens identified from the candidate data record and tokens that were identified from other received data records and that have a variant relationship to the tokens identified from the candidate data record.

101. A computing system, including:
- means for receiving data records, the received data records each including one or more values in one or more fields;
  
  means for processing the received data records to identify at least one matched data cluster to associate with each received data record, the processing including;
  
  for at least one selected data record from the received data records, generating a query from the one or more values included in the selected data record and performing at least a first comparison, and a second comparison using the generated query;
  
  identifying, in the first comparison, a plurality of candidate data records from the received data records using the query and a first distance measure;
  
  determining, in the second comparison performed after the first comparison, whether or not the selected data record satisfies cluster membership criteria for a plurality of candidate data clusters of a plurality of existing data clusters containing the candidate records using the query and a threshold associated with a second distance measure different from the first distance measure; and
  
  determining an ambiguous match to at least two matched data clusters for the selected data record, based on the determination of cluster membership for the plurality of candidate data clusters; and
  
  means for receiving, in a user interface displaying results of processing the received data records including displaying an indication of the ambiguous match, user input for resolving the ambiguous match to a single matched data cluster of the at least two matched data clusters for the selected data record or for resolving the ambiguous match to a plurality of matched data clusters with a weight associated with each matched data cluster.

Current Assignee: Ab Initio Technology LLC (Ab Initio Software Corporation)
Original Assignee: Ab Initio Technology LLC (Ab Initio Software Corporation)
Inventors: Trojan, Kamil; Anderson, Arlen
Primary Examiner(s): Gofman, Alex
Application Number: US13/678,078
Publication Number: US 20130124525A1
Time in Patent Office: 1,300 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
G06F 16/20   of structured data, e.g. re...
G06F 16/24534   Query rewriting; Transforma...
G06F 16/278   Data partitioning, e.g. hor...
G06F 16/285   Clustering or classification
G06F 16/3338   Query expansion