Deduplication system
Abstract
A system to load data in a data warehouse includes reception of a plurality of records, determination, for each of the plurality of records, of values representing differences between a record and each other of the plurality of records, identification of at least two of the plurality of records as duplicates based on a determined value representing a difference between the two records, and storage of the two records in the data warehouse in association with a same identifier. Determination of the values may include determination, for each of a first plurality of data fields of the record, of a first value representing a difference between data specified in the data field and data specified in a respective one of a second plurality of data fields of one of the other of the plurality of records, determination, for each of the second plurality of data fields, of a second value representing a difference between data specified in the data field and data specified in a respective one of the first plurality of data fields, and determination of a third value representing a difference between the record and the one of the other of the plurality of records based on the determined first and second values.
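The loading flow described in the abstract (compare every record with every other record, flag pairs whose difference value indicates a duplicate, store duplicates under a same identifier) can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the names `assign_duplicate_ids` and `fraction_of_differing_fields`, the threshold test, and the transitive grouping via union-find are all assumptions that do not appear in the source.

```python
from itertools import combinations
from typing import Callable, List, Sequence

Record = Sequence[str]


def assign_duplicate_ids(
    records: Sequence[Record],
    difference: Callable[[Record, Record], float],
    threshold: float,
) -> List[int]:
    """Return one identifier per record; duplicates share an identifier.

    Assumption: two records count as duplicates when their difference
    value is at or below `threshold`, and duplicate groups are merged
    transitively. The abstract only says duplicates are stored "in
    association with a same identifier".
    """
    parent = list(range(len(records)))

    def find(i: int) -> int:
        # Follow parent links to the group representative (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare each record with each other record exactly once.
    for i, j in combinations(range(len(records)), 2):
        if difference(records[i], records[j]) <= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                # Keep the smaller index as the shared identifier.
                parent[max(root_i, root_j)] = min(root_i, root_j)

    return [find(i) for i in range(len(records))]


def fraction_of_differing_fields(a: Record, b: Record) -> float:
    """Toy stand-in for the record difference: share of unequal fields."""
    return sum(x != y for x, y in zip(a, b)) / max(len(a), 1)
```

Calling `assign_duplicate_ids` with three records, two of which differ in a single field, and a threshold of 0.4 would give the first two records the same identifier and leave the third on its own. In a real pipeline the toy difference function would be replaced by the record-level value described in the claims.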
120 Citations
13 Claims
1. A method for determining a value representing a difference between a first record comprising a first plurality of data fields and a second record comprising a second plurality of data fields, each of the first plurality of data fields corresponding to a respective one of the second plurality of data fields, the method comprising:
for each of the first plurality of data fields, determining a first value representing a difference between data specified in the data field and data specified in a respective one of the second plurality of data fields;
for each of the second plurality of data fields, determining a second value representing a difference between data specified in the data field and data specified in a respective one of the first plurality of data fields;
determining a third value representing a difference between the first record and the second record based on the determined first and second values; and
identifying whether the first and second records are duplicates based on the determined third value, wherein the determining and identifying is provided by a processor;
wherein the determining of the third value comprises:
determining a sum of the determined first values and the determined second values; and
dividing the sum by two.
Dependent claims: 2, 3
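As a rough illustration of the arithmetic in claim 1, the sketch below computes a first value per field in one direction, a second value per field in the other direction, sums all of them, and divides the sum by two. The field-level measure `positional_mismatch`, the function names, and the duplicate threshold are hypothetical choices made for this example; the claim does not prescribe them.

```python
from typing import Callable, Sequence

FieldDistance = Callable[[str, str], float]


def positional_mismatch(source: str, target: str) -> float:
    """Hypothetical asymmetric field difference (an assumption).

    Fraction of characters in `source` that do not match the character
    at the same position in `target`. Swapping the arguments can change
    the result, which is what makes the two directions differ.
    """
    if not source:
        return 0.0 if not target else 1.0
    mismatches = sum(
        1
        for i, ch in enumerate(source)
        if i >= len(target) or target[i] != ch
    )
    return mismatches / len(source)


def record_difference(
    first: Sequence[str],
    second: Sequence[str],
    field_distance: FieldDistance = positional_mismatch,
) -> float:
    """Third value: (sum of first values + sum of second values) / 2."""
    if len(first) != len(second):
        raise ValueError("records must have corresponding data fields")
    # First values: each field of the first record against its counterpart.
    first_values = [field_distance(a, b) for a, b in zip(first, second)]
    # Second values: each field of the second record against its counterpart.
    second_values = [field_distance(b, a) for a, b in zip(first, second)]
    return (sum(first_values) + sum(second_values)) / 2


def are_duplicates(
    first: Sequence[str], second: Sequence[str], threshold: float = 0.5
) -> bool:
    """Assumed decision rule: duplicates when the third value is small."""
    return record_difference(first, second) <= threshold
```

Because both directions enter the sum, the resulting record-level value is the same whichever record is treated as "first", even though the field-level measure used here is not symmetric.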
4. A method for determining a value representing a difference between a first record comprising a first plurality of data fields and a second record comprising a second plurality of data fields, each of the first plurality of data fields corresponding to a respective one of the second plurality of data fields, the method comprising:
for each of the first plurality of data fields, determining a first value representing a difference between data specified in the data field and data specified in a respective one of the second plurality of data fields;
for each of the second plurality of data fields, determining a second value representing a difference between data specified in the data field and data specified in a respective one of the first plurality of data fields;
determining a third value representing a difference between the first record and the second record based on the determined first and second values; and
identifying whether the first and second records are duplicates based on the determined third value, wherein the determining and identifying is provided by a processor,
wherein the determining of the first value comprises:
determining an asymmetric spelling distance as a normalized cost for converting first input data to second input data via a sequence of operations; and
wherein the step of determining the second value comprises:
determining an asymmetric spelling distance as a normalized cost for converting second input data to first input data via the sequence of operations.
Dependent claims: 5, 6, 7, 8, 9
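One way to picture the "asymmetric spelling distance as a normalized cost" recited in claim 4 is a weighted edit distance in which insertions and deletions carry different costs, so that converting A to B costs something different from converting B to A. The sketch below is an assumption-laden illustration: the operation set (insert, delete, substitute), the cost values, and the normalization by the cost of the trivial conversion are not taken from the claim, which leaves the sequence of operations and the normalization unspecified.

```python
def conversion_cost(
    source: str,
    target: str,
    insert_cost: float = 1.0,
    delete_cost: float = 0.5,
    substitute_cost: float = 1.0,
) -> float:
    """Minimum total cost of turning `source` into `target`.

    Standard dynamic programming over prefixes; the cost values are
    illustrative assumptions. Unequal insert and delete costs make the
    result depend on the direction of the conversion.
    """
    # Cost of building each prefix of `target` from an empty source.
    prev = [j * insert_cost for j in range(len(target) + 1)]
    for i, s_ch in enumerate(source, start=1):
        # Cost of reducing the first i source characters to nothing.
        curr = [i * delete_cost]
        for j, t_ch in enumerate(target, start=1):
            keep_or_substitute = prev[j - 1] + (
                0.0 if s_ch == t_ch else substitute_cost
            )
            delete = prev[j] + delete_cost
            insert = curr[j - 1] + insert_cost
            curr.append(min(keep_or_substitute, delete, insert))
        prev = curr
    return prev[-1]


def asymmetric_spelling_distance(
    source: str,
    target: str,
    insert_cost: float = 1.0,
    delete_cost: float = 0.5,
    substitute_cost: float = 1.0,
) -> float:
    """Conversion cost divided by the cost of the trivial conversion.

    Assumed normalization: deleting all of `source` and inserting all of
    `target` is the worst case, which keeps the value within [0, 1] as
    long as a substitution costs no more than a delete plus an insert.
    """
    worst_case = delete_cost * len(source) + insert_cost * len(target)
    if worst_case == 0:
        return 0.0  # both inputs are empty
    cost = conversion_cost(
        source, target, insert_cost, delete_cost, substitute_cost
    )
    return cost / worst_case
```

With these assumed costs, converting "Jon" to "John" requires one insertion and converting "John" to "Jon" requires one deletion, so the two directions yield different normalized values. That difference is the asymmetry the first and second values of the claim would capture.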
10. An apparatus storing processor-executable instructions thereon to determine a value representing a difference between a first record comprising a first plurality of data fields and a second record comprising a second plurality of data fields, each of the first plurality of data fields corresponding to a respective one of the second plurality of data fields, the instructions comprising:
instructions to determine, for each of the first plurality of data fields, a first value representing a difference between data specified in the data field and data specified in a respective one of the second plurality of data fields;
instructions to determine, for each of the second plurality of data fields, a second value representing a difference between data specified in the data field and data specified in a respective one of the first plurality of data fields;
instructions to determine a third value representing a difference between the first record and the second record based on the determined first and second values; and
instructions to identify whether the first and second records are duplicates based on the determined third value,
wherein the instructions to determine the first value comprise:
instructions to determine an asymmetric spelling distance as a normalized cost for converting first input data to second input data via a sequence of operations; and
wherein the instructions to determine the second value comprise:
instructions to determine an asymmetric spelling distance as a normalized cost for converting second input data to first input data via the sequence of operations.
Dependent claims: 11
12. A data warehouse comprising:
a processor; and
a storage device in communication with the processor and storing instructions adapted to be executed by the processor to:
determine, for each of a first plurality of data fields of a first record, a first value representing a difference between data specified in the data field and data specified in a respective one of a second plurality of data fields of a second record;
determine, for each of the second plurality of data fields, a second value representing a difference between data specified in the data field and data specified in a respective one of the first plurality of data fields;
determine a third value representing a difference between the first record and the second record based on the determined first and second values; and
identify whether the first and second records are duplicates based on the determined third value,
wherein the instructions adapted to be executed by the processor to determine the first value comprise instructions adapted to be executed by the processor to:
determine an asymmetric spelling distance as a normalized cost for converting first input data to second input data via a sequence of operations; and
wherein the instructions adapted to be executed by the processor to determine the second value comprise instructions adapted to be executed by the processor to:
determine an asymmetric spelling distance as a normalized cost for converting the second input data to the first input data via the sequence of operations.
Dependent claims: 13
Specification