Automated database blocking and record matching

US 7,152,060 B2
Filed: 04/11/2003
Issued: 12/19/2006
Est. Priority Date: 04/11/2002
Status: Active Grant

Abstract

An automated blocking technique is used as a first step to find approximate matches in a database. The technique builds a blocking set to be as liberal as possible in retrieving records that match on individual fields or sets of fields while avoiding selection criteria that are predicted to return more than the maximum number of records defining a particular special requirement. The ability to do blocking without extensive manual setup at low cost is highly advantageous especially when using a machine learning based second-stage matching algorithm.
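As a rough illustration of the blocking idea described in the abstract, the Python sketch below picks the smallest field combinations whose estimated match count stays under a record limit. The function names, the layout of the `freq` table, the `max_size` cap, and the independence assumption used to estimate joint counts are all illustrative assumptions, not details taken from the patent.

```python
from itertools import combinations


def estimate_count(fields, record, freq, total):
    """Estimate how many database records share the record's values on all
    `fields`, assuming field values are statistically independent
    (an illustrative simplification)."""
    est = float(total)
    for f in fields:
        est *= freq.get((f, record[f]), 0) / total
    return est


def build_blocking_sets(record, freq, total, limit, max_size=2):
    """Return field sets whose estimated match count is below `limit`,
    preferring smaller (more liberal) sets."""
    chosen = []
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(record), size):
            # A superset of an accepted set can only retrieve fewer records,
            # so it adds nothing to the blocking set.
            if any(set(c) <= set(combo) for c in chosen):
                continue
            if estimate_count(combo, record, freq, total) < limit:
                chosen.append(combo)
    return chosen


# Hypothetical record and frequency counts for a database of 1000 records.
record = {"first": "john", "last": "smith", "zip": "10001"}
freq = {("first", "john"): 100, ("last", "smith"): 50, ("zip", "10001"): 5}
blocking_sets = build_blocking_sets(record, freq, total=1000, limit=10)
print(blocking_sets)
```

In this example the zip code alone is rare enough to be used on its own, while first and last name are each too common and are only accepted in combination.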

14 Claims

1. In a system including a database stored in at least one computer's data storage and a data record stored in computer memory, said database comprising plural records, said plural database records comprising data fields, said data record stored in computer memory including data fields, there being a correspondence between at least a subset of data fields of said data record stored in computer memory and at least a subset of the data fields of said records of said database, a method for identifying records in said database which are similar enough to said data record stored in computer memory that they might describe the same person or thing as that described by said data record stored in computer memory, said method comprising:
- a. inputting a value which will be used to limit the number of records similar to said data record stored in computer memory to be identified within said database;
  
  b. creating a set of sets of fields in said data record stored in computer memory, where the i-th set of said set of sets of fields is obtained by
  
     i. selecting at least one field in said data record, such that the number of records in said database that store the same values as said data record, in all of the corresponding fields of said database is estimated to be fewer than said inputted value, and
  
     ii. setting the i-th set of said set of sets of fields equal to the set of said selected field(s) in said data record;
  
  c. selecting or constructing a database query to retrieve from said database the records which store the same values as said data record stored in computer memory in all the corresponding fields in at least one set of said set of sets of fields, and
  
  d. executing said database query to retrieve said retrieved records from said database into computer memory.
- Dependent claims (2, 3, 4, 5, 6, 14)
  - 2. The method of claim 1 wherein said database comprises data records, and wherein each record of said database is first augmented with at least one field-value pair which is a function of said record, to form an augmented database and where the data record stored in computer memory is first augmented with at least one field-value pair which is a function of that data record to form an augmented data record and wherein each of the said set of fields is selected from said augmented data record and the above-mentioned steps are performed using the augmented data record and the augmented database.
  - 3. The method of claim 2 further including eliminating at least some parts of said query which are guaranteed to retrieve a subset of the records in the augmented database that would be retrieved by another part of said query.
  - 4. The method of claim 3 further including retrieving information from a data structure containing counts of the number of occurrences of a subset of the distinct field-value pairs in the augmented database.
  - 5. The method of claim 1 wherein said retrieved records are passed to a matching algorithm that assigns to each retrieved record a decision on whether said retrieved record matches said data record stored in computer memory, whether said retrieved record does not match said data record stored in computer memory, or whether said matching algorithm determines that it is ambiguous whether said retrieved record matches said data record stored in computer memory.
  - 6. The method of claim 2 further including forming said query by searching at least some possible subsets of the augmented data record.
  - 14. The method of claim 1 where said estimate in step b.i. is obtained by accessing a prestored frequency table giving frequencies for a subset of the values found in said database, said frequency table having been collected in a computer's data store prior to said data record stored in computer memory being known.
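Steps c and d of claim 1 can be sketched as building one query that ORs together a conjunction per field set and then running it. The example below is a minimal sketch using Python's standard `sqlite3` module; the `people` table, its columns, the sample rows, and the helper name `build_blocking_query` are hypothetical and chosen only for illustration.

```python
import sqlite3


def build_blocking_query(record, blocking_sets, table="people"):
    """Build a parameterized query: one AND-clause per field set, joined by OR.
    Field and table names are assumed to be trusted identifiers."""
    if not blocking_sets:
        raise ValueError("at least one field set is required")
    clauses, params = [], []
    for fields in blocking_sets:
        clauses.append("(" + " AND ".join(f'"{f}" = ?' for f in fields) + ")")
        params.extend(record[f] for f in fields)
    sql = f'SELECT * FROM "{table}" WHERE ' + " OR ".join(clauses)
    return sql, params


# Hypothetical in-memory database with three records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first TEXT, last TEXT, zip TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [
        ("john", "smith", "94110"),
        ("jon", "smyth", "10001"),
        ("mary", "jones", "60601"),
    ],
)

record = {"first": "john", "last": "smith", "zip": "10001"}
sql, params = build_blocking_query(record, [("zip",), ("first", "last")])
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Here the first stored record is retrieved through the name pair and the second through the zip code, so both become candidates for a later matching stage, while the third record is never loaded.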

7. A method for identifying records in a database which are likely matches to a set of at least one field-value pairs, comprising the following steps:
- a. constructing a query from said set of field-value pairs such that said query will return the maximum number of records from said database while satisfying a user-defined speed constraint, wherein said query is constructed by evaluating the expected record count associated with said query against said predetermined speed constraint;
  
  b. executing said query against said database to retrieve a set of records that satisfy said query; and
  
  c. passing said retrieved set of records to a matching algorithm which determines, for each record in said retrieved set of records, whether said retrieved record matches said set of at least one field-value pairs, whether said retrieved record does not match said set of at least one field-value pairs, or whether said matching algorithm can not determine whether said retrieved record matches said set of at least one field-value pairs.
- Dependent claims (8, 9, 10, 11, 12, 13)
  - 8. The method of claim 7 where said query is composed of a set of subqueries.
  - 9. The method of claim 8, where each subquery in said set of subqueries is constructed so that it is estimated to return less than a user-defined maximum number of records.
  - 10. The method of claim 9, wherein said database comprises records, and further including:
    - augmenting each record of said database with at least one field-value pairs which are functions of said record to form an augmented database,
    - augmenting said set of field-value pairs with at least one field-value pair which is a function of said set of field-value pairs to form an augmented set of field-value pairs, and
    - executing each subquery to select records in said augmented database that match said set of augmented field-record pairs.
  - 11. The method of claim 10 further including eliminating at least one part of said query which is guaranteed to retrieve a subset of the records in the augmented database that would be retrieved by another part of said query.
  - 12. The method of claim 11 further including retrieving information from a data structure containing counts of the number of occurrences of a subset of the distinct field-value pairs present in the augmented database.
  - 13. The method of claim 10 further including forming said query by searching at least some possible subset of said augmented set of field-value pairs.
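The augmentation in claim 10 and the three-way decision in step c of claim 7 can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the derived `last_initial` field, the agreement-ratio score, and the two thresholds stand in for whatever derived fields and matching algorithm a real system would use (the abstract mentions a machine-learning-based second stage, which this simple rule does not attempt to reproduce).

```python
def augment(pairs):
    """Add a derived field-value pair that is a function of the record.
    The choice of `last_initial` is hypothetical."""
    out = dict(pairs)
    if out.get("last"):
        out["last_initial"] = out["last"][0]
    return out


def classify(query_pairs, candidate, lower=0.4, upper=0.8):
    """Return 'match', 'no_match', or 'ambiguous' from the share of
    agreeing fields. Thresholds are illustrative, not from the patent."""
    shared = [f for f in query_pairs if f in candidate]
    if not shared:
        return "ambiguous"
    score = sum(query_pairs[f] == candidate[f] for f in shared) / len(shared)
    if score >= upper:
        return "match"
    if score <= lower:
        return "no_match"
    return "ambiguous"


query = augment({"first": "john", "last": "smith", "zip": "10001"})
candidates = [
    augment({"first": "john", "last": "smith", "zip": "10001"}),
    augment({"first": "jon", "last": "smyth", "zip": "10001"}),
    augment({"first": "mary", "last": "jones", "zip": "60601"}),
]
decisions = [classify(query, c) for c in candidates]
print(decisions)
```

The second candidate agrees on the zip code and the derived initial but not on the spelled-out names, so it lands between the thresholds and is reported as ambiguous rather than forced into either class.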

Current Assignee: International Business Machines Corporation
Original Assignee: Choicemaker Technologies, Inc.
Inventors: Buechi, Martin; Borthwick, Andrew E.; Goldberg, Arthur
Primary Examiner(s): Wong, Leslie
Assistant Examiner(s): Dwivedi, Mahesh H
Application Number: US10/411,388
Publication Number: US 20040019593A1
Time in Patent Office: 1,348 Days
Field of Search: 707/200, 707/202, 707/203, 707/3, 707/4, 707/6
US Class Current: 707/770
CPC Class Codes:
G06F 16/24558   Binary matching operations
G06F 16/24578   using ranking
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99934   Query formulation, input pr...
Y10S 707/99936   Pattern matching access
