Method of merging large databases in parallel

US 5,717,915 A
Filed: 03/04/1996
Issued: 02/10/1998
Est. Priority Date: 03/15/1994
Status: Expired due to Term

First Claim

1. A method for identifying duplicate records in a database, each record having at least one field and a plurality of keys, comprising the steps of pre-processing the records in the database using a thesaurus database to indicate relatedness, and:

(i)(a) sorting the records according to a criteria applied to a first key;

(b) comparing a number of consecutive sorted records to each other, wherein said number is less than a number of records in said database and identifying a first group of duplicate records;

(c) storing the identity of said first group;

(ii)(a) sorting the records according to a criteria applied to a second key;

(b) comparing a number of consecutive sorted records to each other, wherein said number is less than a number of records in said database and identifying a second group of duplicate records;

(c) storing the identity of said second group; and

(iii) subjecting the union of said first and second groups to transitive closure.
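
The steps of the claim can be sketched as follows. This is a minimal illustration, not the patentees' implementation: the function names, the sample records, the window size of 2, and the "at least two matching fields" duplicate rule are all assumptions made for the example.

```python
def window_pairs(records, key, window, is_duplicate):
    """One pass: sort record indices by `key`, then compare each record only
    with the (window - 1) records that precede it in sorted order."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if is_duplicate(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def transitive_closure(n, pairs):
    """Union-find over record indices; returns groups with more than one member."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(g for g in groups.values() if len(g) > 1)


def merge_purge(records, keys, window, is_duplicate):
    """Run one windowed pass per key, union the pairs, then close transitively."""
    pairs = set()
    for key in keys:
        pairs |= window_pairs(records, key, window, is_duplicate)
    return transitive_closure(len(records), pairs)


# Hypothetical sample data and matching rule (not from the patent).
RECORDS = [
    {"last": "Smith", "first": "John", "addr": "12 Oak St"},
    {"last": "Smyth", "first": "John", "addr": "12 Oak St"},
    {"last": "Smith", "first": "Jon", "addr": "12 Oak St"},
    {"last": "Jones", "first": "Mary", "addr": "5 Elm Ave"},
]


def is_duplicate(a, b):
    # Assumed rule: at least two of the three fields agree exactly.
    return sum(a[f] == b[f] for f in ("last", "first", "addr")) >= 2


def by_last(rec):
    return rec["last"]


def by_first(rec):
    return rec["first"]


print(merge_purge(RECORDS, [by_last, by_first], 2, is_duplicate))
```

With this data the pass sorted on last name pairs records 0 and 2, the pass sorted on first name pairs records 0 and 1, and only the transitive closure over the union places all three in one group.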

Abstract

The semantic integration problem for merging multiple databases of very large size, the merge/purge problem, can be solved by multiple runs of the sorted neighborhood method or the clustering method with small windows, followed by the computation of the transitive closure over the results of each run. The sorted neighborhood method works well under this scheme but is computationally expensive due to the sorting phase. An alternative method based on data clustering reduces the complexity to linear time, making multiple runs followed by transitive closure feasible and efficient. A method is provided for identifying duplicate records in a database, each record having at least one field and a plurality of keys, including the steps of: sorting the records according to a criteria applied to a first key; comparing a number of consecutive sorted records to each other, wherein the number is less than a number of records in said database, and identifying a first group of duplicate records; storing the identity of the first group; sorting the records according to a criteria applied to a second key; comparing a number of consecutive sorted records to each other, wherein the number is less than a number of records in said database, and identifying a second group of duplicate records; storing the identity of the second group; and subjecting the union of the first and second groups to transitive closure.
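
The abstract does not say how the clustering method forms its clusters. The sketch below shows one plausible reading, under stated assumptions: records are bucketed by a fixed-length prefix of their key in a single linear scan, and the small-window comparison is then applied inside each bucket. The function names, the prefix length, and the `near_equal` rule are hypothetical.

```python
from collections import defaultdict


def cluster_by_prefix(records, key, prefix_len):
    """Single linear scan: bucket record indices by a fixed-length key prefix."""
    clusters = defaultdict(list)
    for idx, rec in enumerate(records):
        clusters[key(rec)[:prefix_len]].append(idx)
    return clusters


def clustered_pairs(records, key, prefix_len, window, is_duplicate):
    """Apply the small-window comparison independently inside each cluster."""
    pairs = set()
    for members in cluster_by_prefix(records, key, prefix_len).values():
        # Only the (small) cluster is sorted, never the whole database.
        members.sort(key=lambda i: key(records[i]))
        for pos, i in enumerate(members):
            for j in members[max(0, pos - window + 1):pos]:
                if is_duplicate(records[i], records[j]):
                    pairs.add((min(i, j), max(i, j)))
    return pairs


def near_equal(a, b):
    # Assumed rule: same length and at most one differing character.
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1


NAMES = ["hernandez", "stolfo", "hernandes", "black", "stolfo"]

print(clustered_pairs(NAMES, key=lambda s: s, prefix_len=3, window=2,
                      is_duplicate=near_equal))
```

The bucketing step is linear in the number of records; the cost of the whole pass depends on how evenly the chosen prefix spreads records across clusters. The pairs from each such run would then be unioned and closed transitively, as the abstract describes.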

4 Claims

1. A method for identifying duplicate records in a database, each record having at least one field and a plurality of keys, comprising the steps of pre-processing the records in the database using a thesaurus database to indicate relatedness, and:
  (i)(a) sorting the records according to a criteria applied to a first key;
  
  (b) comparing a number of consecutive sorted records to each other, wherein said number is less than a number of records in said database and identifying a first group of duplicate records;
  
  (c) storing the identity of said first group;
  
  (ii)(a) sorting the records according to a criteria applied to a second key;
  
  (b) comparing a number of consecutive sorted records to each other, wherein said number is less than a number of records in said database and identifying a second group of duplicate records;
  
  (c) storing the identity of said second group; and
  
  (iii) subjecting the union of said first and second groups to transitive closure.

2. The method according to claim 1, wherein said thesaurus database comprises linked records indicating related names and nicknames in a plurality of languages.
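
As an illustration of the thesaurus pre-processing step, the sketch below maps name variants and nicknames to a single canonical form before any sorting takes place. The table entries, the function names, and the use of a plain dictionary in place of a thesaurus database of linked records are assumptions for the example.

```python
# Hypothetical thesaurus: each variant or nickname points to one canonical name.
NAME_THESAURUS = {
    "bill": "william",
    "will": "william",
    "guillermo": "william",   # Spanish cognate
    "guillaume": "william",   # French cognate
    "bob": "robert",
    "roberto": "robert",
}


def canonical_name(name, thesaurus=NAME_THESAURUS):
    """Return the canonical form of `name`, or the normalized name if unknown."""
    token = name.strip().lower()
    return thesaurus.get(token, token)


def preprocess(records, field="first"):
    """Return copies of the records with `field` replaced by its canonical form."""
    return [{**rec, field: canonical_name(rec[field])} for rec in records]


print(preprocess([{"first": "Guillermo", "last": "Diaz"},
                  {"first": "Bill", "last": "Diaz"}]))
```

After this step, records that differ only by a nickname or a language variant carry the same value in the name field, so they sort next to each other and compare as related.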

3. A method for identifying duplicate records in a database, each record having at least one field and a plurality of keys, comprising the steps of pre-processing the records of the database with a spelling checker, and:
  (i)(a) sorting the records according to a criteria applied to a first key;
  
  (b) comparing a number of consecutive sorted records to each other, wherein said number is less than a number of records in said database and identifying a first group of duplicate records;
  
  (c) storing the identity of said first group;
  
  (ii)(a) sorting the records according to a criteria applied to a second key;
  
  (b) comparing a number of consecutive sorted records to each other, wherein said number is less than a number of records in said database and identifying a second group of duplicate records;
  
  (c) storing the identity of said second group; and
  
  (iii) subjecting the union of said first and second groups to transitive closure.

4. The method according to claim 3, wherein said spelling checker compares a city field of each record with a list of correctly spelled city names.
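
A city-field spelling check of this kind might look like the sketch below. It uses `difflib.get_close_matches` from the Python standard library as a stand-in for the spelling checker; the city list, the 0.8 similarity cutoff, and the function name are assumptions, not details taken from the patent.

```python
import difflib

# Hypothetical list of correctly spelled city names.
CITY_NAMES = ["New York", "Newark", "Chicago", "Los Angeles"]


def correct_city(city, known=CITY_NAMES, cutoff=0.8):
    """Replace `city` with its closest known spelling, if one is close enough."""
    if city in known:
        return city
    matches = difflib.get_close_matches(city, known, n=1, cutoff=cutoff)
    # Leave the value untouched when nothing in the list is similar enough.
    return matches[0] if matches else city


print(correct_city("Chicgo"))
print(correct_city("Springfield"))
```

Leaving unmatched values unchanged is a deliberate choice in this sketch: a city missing from the reference list is kept as entered rather than forced onto the nearest known name.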

Current Assignee: Lot 19 Acquisition Foundation LLC (Intellectual Ventures LLC)
Original Assignee: Mauricio A. Hernandez, Salvatore J. Stolfo
Inventors: Stolfo, Salvatore J.; Hernandez, Mauricio A.
Primary Examiner(s): Black, Thomas G.
Assistant Examiner(s): Corrielus, Jean M.
Application Number: US08/610,639
Time in Patent Office: 708 Days
Field of Search: 395/600, 395/605, 395/607, 395/795, 395/761
US Class Current: 1/1

CPC Class Codes:
G06F 16/24556   Aggregation; Duplicate elim...
G06F 16/5838   using colour
G06F 7/14   Merging, i.e. combining at ...
G06F 7/32   Merging, i.e. combining dat...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99937   Sorting
