Deduplication system
Abstract
A system to load data in a data warehouse includes reception of a plurality of records, determination, for each of the plurality of records, of values representing differences between a record and each other of the plurality of records, identification of at least two of the plurality of records as duplicates based on a determined value representing a difference between the two records, and storage of the two records in the data warehouse in association with a same identifier. Determination of the values may include determination, for each of a first plurality of data fields of the record, of a first value representing a difference between data specified in the data field and data specified in a respective one of a second plurality of data fields of one of the other of the plurality of records, determination, for each of the second plurality of data fields, of a second value representing a difference between data specified in the data field and data specified in a respective one of the first plurality of data fields, and determination of a third value representing a difference between the record and the one of the other of the plurality of records based on the determined first and second values.
82 Citations
40 Claims
1. A method for determining a value representing a difference between a first record comprising a first plurality of data fields and a second record comprising a second plurality of data fields, each of the first plurality of data fields corresponding to a respective one of the second plurality of data fields, the method comprising:
for each of the first plurality of data fields, determining a first value representing a difference between data specified in the data field and data specified in a respective one of the second plurality of data fields;
for each of the second plurality of data fields, determining a second value representing a difference between data specified in the data field and data specified in a respective one of the first plurality of data fields; and
determining a third value representing a difference between the first record and the second record based on the determined first and second values.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
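The three steps of claim 1 can be sketched as follows. This is only an illustration under stated assumptions: the claim does not say how a field-level difference is measured or how the first and second values are combined. The token-based measure in `field_difference`, the averaging in `record_difference`, and both function names are hypothetical choices made here, not taken from the patent.

```python
from typing import Sequence


def field_difference(source: str, target: str) -> float:
    """Hypothetical asymmetric measure: share of tokens in `source` missing from `target`."""
    source_tokens = source.lower().split()
    if not source_tokens:
        return 0.0
    target_tokens = set(target.lower().split())
    missing = sum(1 for token in source_tokens if token not in target_tokens)
    return missing / len(source_tokens)


def record_difference(first: Sequence[str], second: Sequence[str]) -> float:
    """Combine per-field differences taken in both directions into one record-level value."""
    if len(first) != len(second):
        raise ValueError("each field of the first record needs a corresponding field in the second")
    # First values: each field of the first record compared against its counterpart.
    first_values = [field_difference(a, b) for a, b in zip(first, second)]
    # Second values: each field of the second record compared against its counterpart.
    second_values = [field_difference(b, a) for a, b in zip(first, second)]
    # Third value: assumed here to be the plain average of all first and second values.
    combined = first_values + second_values
    return sum(combined) / len(combined) if combined else 0.0
```

Because the assumed field measure is asymmetric, comparing in both directions matters: "John A Smith" has a token missing from "John Smith", but not the other way round, so the two directions contribute different values.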
9. A method for use in loading data in a data warehouse, comprising:
receiving a plurality of records, each of the plurality of records including a plurality of data fields;
identifying a plurality of groups of records, wherein data specified in one or more of the plurality of data fields included in a record of a group is identical to data specified in one or more corresponding data fields included in each other record of the group;
determining, for each group, values representing differences between each record of a group and each other record of the group; and
identifying at least two of the plurality of records as duplicates based on a determined value representing a difference between the two records.
Dependent claims: 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21
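The grouping step of claim 9 limits comparisons to records that already agree on one or more fields. A minimal sketch, assuming records are dictionaries, that a hypothetical `key_fields` list selects the fields that must be identical, and that a pair counts as duplicates when its difference is at or below a caller-supplied `threshold` (the claim itself fixes none of these details):

```python
from collections import defaultdict
from itertools import combinations


def record_difference(a: dict, b: dict) -> float:
    """Placeholder measure: fraction of shared field names whose values differ."""
    fields = sorted(set(a) & set(b))
    if not fields:
        return 1.0
    return sum(a[f] != b[f] for f in fields) / len(fields)


def find_duplicates(records: list, key_fields: list, threshold: float) -> list:
    """Return index pairs of records judged duplicates within each group."""
    # Group records whose key fields hold identical data.
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[tuple(record[f] for f in key_fields)].append(index)

    duplicates = []
    for members in groups.values():
        # Compare each record of a group with each other record of the same group only.
        for i, j in combinations(members, 2):
            if record_difference(records[i], records[j]) <= threshold:
                duplicates.append((i, j))
    return duplicates
```

Records that fall into different groups are never compared, which is where the saving over an all-pairs comparison comes from.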
22. A method for loading data in a data warehouse storing existing records, comprising:
receiving a plurality of new records;
for each of the plurality of new records, determining values representing differences between a new record and one or more of the existing records;
identifying at least one of the plurality of new records and one of the existing records as duplicates based on a determined value representing a difference between the two records; and
storing the at least one of the plurality of new records in the data warehouse in association with an identifier identical to an identifier associated with the one of the existing records.
Dependent claims: 23, 24, 25, 26
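Claim 22 describes an incremental load in which a new record that duplicates an existing record is stored under the existing record's identifier. The sketch below assumes the warehouse is an in-memory list of `(identifier, record)` tuples with integer identifiers, that non-duplicates receive a fresh identifier, and that new records are compared only against records present before the load. All of these, along with the function names and the placeholder difference measure, are illustrative assumptions rather than details from the patent.

```python
def record_difference(a: dict, b: dict) -> float:
    """Placeholder measure: fraction of shared field names whose values differ."""
    fields = sorted(set(a) & set(b))
    if not fields:
        return 1.0
    return sum(a[f] != b[f] for f in fields) / len(fields)


def load_new_records(warehouse: list, new_records: list, threshold: float) -> list:
    """Append new records to `warehouse`, reusing identifiers of matching existing records."""
    existing = list(warehouse)  # snapshot: compare only against pre-load records
    next_id = max((identifier for identifier, _ in warehouse), default=0) + 1
    for record in new_records:
        # Find the existing record with the smallest difference to the new one.
        best = min(existing, key=lambda row: record_difference(record, row[1]), default=None)
        if best is not None and record_difference(record, best[1]) <= threshold:
            identifier = best[0]  # duplicate: same identifier as the existing record
        else:
            identifier = next_id  # assumed behaviour for non-duplicates
            next_id += 1
        warehouse.append((identifier, record))
    return warehouse
```

Keeping both records under one identifier, instead of discarding the newcomer, preserves the incoming data while still letting queries treat the pair as one entity.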
27. A method for loading data in a data warehouse, comprising:
receiving a plurality of records;
for each of the plurality of records, determining values representing differences between a record and each other of the plurality of records;
identifying at least two of the plurality of records as duplicates based on a determined value representing a difference between the two records; and
storing the two records in the data warehouse in association with a same identifier.
Dependent claims: 28, 29
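For the batch case in claim 27, every record is compared with every other record and duplicates end up under the same identifier. One possible sketch uses a small union-find structure so that chains of duplicates (A matches B, B matches C) also share an identifier. The transitive merging, the `threshold` parameter, the use of list indices as identifiers, and the placeholder difference measure are assumptions introduced here, not requirements of the claim.

```python
from itertools import combinations


def record_difference(a: dict, b: dict) -> float:
    """Placeholder measure: fraction of shared field names whose values differ."""
    fields = sorted(set(a) & set(b))
    if not fields:
        return 1.0
    return sum(a[f] != b[f] for f in fields) / len(fields)


def assign_identifiers(records: list, threshold: float) -> list:
    """Return one identifier per record; duplicates receive the same identifier."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        # Follow parent links to the group's representative, halving paths on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare each record with each other record of the batch.
    for i, j in combinations(range(len(records)), 2):
        if record_difference(records[i], records[j]) <= threshold:
            parent[find(j)] = find(i)  # merge the two groups

    return [find(i) for i in range(len(records))]
```

The returned list can be zipped with the records before storage, so that each stored row carries the identifier of its duplicate group.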
30. A system for storing data, comprising:
a device for transmitting a plurality of new records; and
a data warehouse for storing existing records, for receiving the transmitted plurality of records, for determining values representing differences between a new record and one or more of the existing records for each of the plurality of new records, for identifying at least one of the plurality of new records and one of the existing records as duplicates based on a determined value representing a difference between the two records, and for storing the at least one of the plurality of new records in association with an identifier identical to an identifier associated with the one of the existing records.
Dependent claims: 31
32. A computer-readable medium storing processor-executable process steps to determine a value representing a difference between a first record comprising a first plurality of data fields and a second record comprising a second plurality of data fields, each of the first plurality of data fields corresponding to a respective one of the second plurality of data fields, the steps comprising:
a step to determine, for each of the first plurality of data fields, a first value representing a difference between data specified in the data field and data specified in a respective one of the second plurality of data fields;
a step to determine, for each of the second plurality of data fields, a second value representing a difference between data specified in the data field and data specified in a respective one of the first plurality of data fields; and
a step to determine a third value representing a difference between the first record and the second record based on the determined first and second values.
Dependent claims: 33, 34, 35, 36
37. A data warehouse, comprising:
a processor; and
a storage device in communication with the processor and storing instructions adapted to be executed by the processor to:
receive a plurality of records;
determine values representing differences between a new record and one or more of the existing records for each of the plurality of new records;
identify at least one of the plurality of new records and one of the existing records as duplicates based on a determined value representing a difference between the two records; and
store the at least one of the plurality of new records in association with an identifier identical to an identifier associated with the one of the existing records.
38. A data warehouse, comprising:
a processor; and
a storage device in communication with the processor and storing instructions adapted to be executed by the processor to:
determine, for each of a first plurality of data fields of a first record, a first value representing a difference between data specified in the data field and data specified in a respective one of a second plurality of data fields of a second record,
determine, for each of the second plurality of data fields, a second value representing a difference between data specified in the data field and data specified in a respective one of the first plurality of data fields, and
determine a third value representing a difference between the first record and the second record based on the determined first and second values.
Dependent claims: 39, 40
Specification