Bulk deduplication detection

US 10,152,497 B2
Filed: 02/24/2016
Issued: 12/11/2018
Est. Priority Date: 02/24/2016
Status: Active Grant

Abstract

Some embodiments of the present invention include a system and method for removing duplicate records from a group of records in a database system. The method includes generating a first cluster of records from the group of records, generating a second cluster of records from the group of records, identifying sets of duplicate records in the first cluster of records, and identifying sets of duplicate records in the second cluster of records. The method also includes merging at least two sets of duplicate records associated with both the first cluster and the second cluster of records to form a merged set of duplicate records. The merging is performed based on the at least two sets of duplicate records having a common record. Duplicate records in the group of records may then be removed by removing duplicate records from the merged set of duplicate records.
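
The flow in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the record fields, the `is_duplicate` matcher, the rule of keeping the lowest record id, and every function name are assumptions made for illustration. Each clustering key yields its own clusters, duplicate sets are found inside each cluster, sets that share a record are merged, and all but one record of each merged set is removed.

```python
from collections import defaultdict


def clusters_for_key(records, key_fn):
    """Group record ids by a clustering key (one cluster per key value)."""
    groups = defaultdict(list)
    for rec_id, rec in records.items():
        groups[key_fn(rec)].append(rec_id)
    return list(groups.values())


def duplicate_sets(cluster, records, is_duplicate):
    """Greedily split one cluster into sets of matching records."""
    sets = []
    for rec_id in cluster:
        for s in sets:
            if any(is_duplicate(records[rec_id], records[o]) for o in s):
                s.add(rec_id)
                break
        else:
            sets.append({rec_id})
    return [s for s in sets if len(s) > 1]  # singletons are not duplicates


def merge_on_common_record(dup_sets):
    """Merge duplicate sets that have at least one record in common."""
    merged = []
    for current in dup_sets:
        current = set(current)
        for other in [m for m in merged if m & current]:
            current |= other
            merged.remove(other)
        merged.append(current)
    return merged


def deduplicate(records, key_fns, is_duplicate):
    """Return the records left after removing duplicates."""
    dup_sets = []
    for key_fn in key_fns:
        for cluster in clusters_for_key(records, key_fn):
            dup_sets.extend(duplicate_sets(cluster, records, is_duplicate))
    survivors = dict(records)
    for merged in merge_on_common_record(dup_sets):
        keep = min(merged)  # assumption: keep the lowest id
        for rec_id in merged - {keep}:
            del survivors[rec_id]
    return survivors


# Hypothetical sample data: records 1 and 2 share an email, 2 and 3 a phone.
records = {
    1: {"email": "ann@example.com", "phone": "555-0100"},
    2: {"email": "ann@example.com", "phone": "555-0199"},
    3: {"email": "a.lee@example.com", "phone": "555-0199"},
    4: {"email": "bob@example.com", "phone": "555-0111"},
}
same = lambda a, b: a["email"] == b["email"] or a["phone"] == b["phone"]
result = deduplicate(
    records, [lambda r: r["email"], lambda r: r["phone"]], same
)
print(sorted(result))  # [1, 4]
```

In this sample, the email clusters produce the duplicate set {1, 2} and the phone clusters produce {2, 3}. Record 2 is common to both, so the sets merge into {1, 2, 3} even though records 1 and 3 never appear in the same cluster.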

20 Claims

1. A computer-implemented method comprising:
- generating, by a database system, a first cluster of records from a group of records;
  
  generating, by the database system, a second cluster of records from the group of records;
  
  causing, by the database system, sets of duplicate records in the first cluster of records to be identified;
  
  causing, by the database system, sets of duplicate records in the second cluster of records to be identified;
  
  merging, by the database system, at least two sets of duplicate records associated with both the first cluster and the second cluster of records to form a merged set of duplicate records, wherein a set of duplicate records is implemented using a linked list having a head node and a body node for each record in the set of duplicate records and wherein the merging is performed based on the at least two sets of duplicate records having a common record and comprises merging a linked list associated with each set of duplicate records; and
  
  removing, by the database system, one or more duplicate records from the merged set of duplicate records.
- - 2. The method of claim 1, wherein removing the one or more duplicate records from the merged set of duplicate records comprises removing the one or more duplicate records from the group of records.
  - 3. The method of claim 2, wherein the first cluster of records and the second cluster of records are generated based on one or more keys.
  - 4. The method of claim 3, wherein the first cluster of records and the second cluster of records are subsets of the group of records, and wherein the records in the first cluster and the records in the second cluster are not mutually exclusive.
  - 5. The method of claim 4, wherein the merging of the at least two sets of duplicate records comprises:
    - selecting, by the database system, a record from a first set of duplicate records;
      
      comparing, by the database system, the selected record with records in a second set of duplicate records; and
      
      merging, by the database system, the first set of duplicate records with the second set of duplicate records based on matching the selected record with any one record in the second set of duplicate records.
  - 6. The method of claim 5, wherein the merging the first set of duplicate records with the second set of duplicate records comprises merging a set of duplicate records with few records to a set of duplicate records with more records.
  - 7. The method of claim 6, wherein size information of the set of duplicate records and an identification information of the set of duplicate records are stored in the head node, and wherein the merging the first set of duplicate records with the second set of duplicate records comprises merging a linked list associated with the first set of duplicate records with a linked list associated with the second set of duplicate records.
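
The linked-list representation recited in claims 1 and 5 to 7 can be sketched as follows. This is one possible reading, not the patent's own code: the class names, the tail pointer kept in the head node, and the choice to skip records already present in the larger list are assumptions. The head node stores the set's identifier and size, each record has a body node, and the smaller list is merged into the larger one once a common record is found.

```python
class BodyNode:
    """One record in a duplicate set."""

    def __init__(self, record_id):
        self.record_id = record_id
        self.next = None


class HeadNode:
    """Head of a duplicate set: holds the set's id and size (claim 7)."""

    def __init__(self, set_id):
        self.set_id = set_id
        self.size = 0
        self.first = None
        self.last = None  # assumption: tail pointer for O(1) appends

    def append(self, record_id):
        node = BodyNode(record_id)
        if self.last is None:
            self.first = node
        else:
            self.last.next = node
        self.last = node
        self.size += 1

    def record_ids(self):
        node = self.first
        while node is not None:
            yield node.record_id
            node = node.next


def has_common_record(first, second):
    """Select each record of `first` and compare it with `second` (claim 5)."""
    second_ids = set(second.record_ids())
    return any(rid in second_ids for rid in first.record_ids())


def merge_lists(first, second):
    """Merge the set with fewer records into the set with more (claim 6)."""
    small, large = (first, second) if first.size <= second.size else (second, first)
    present = set(large.record_ids())
    for rid in small.record_ids():
        if rid not in present:  # do not duplicate the common record
            large.append(rid)
            present.add(rid)
    return large


# Hypothetical duplicate sets that share record 2.
set_a = HeadNode("A")
for rid in (1, 2):
    set_a.append(rid)
set_b = HeadNode("B")
for rid in (2, 3, 5):
    set_b.append(rid)

if has_common_record(set_a, set_b):
    merged = merge_lists(set_a, set_b)
    print(merged.set_id, merged.size, list(merged.record_ids()))
    # B 4 [2, 3, 5, 1]
```

Keeping the size in the head node lets the merge pick the smaller list without traversing either one, so the number of nodes moved is bounded by the smaller set.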

8. An apparatus for identifying duplicate records in a database object, the apparatus comprising:
- one or more processors; and
  
  a non-transitory computer readable medium storing a plurality of instructions, which when executed, cause the one or more processors to:
  
  generate a first cluster of records from a group of records;
  
  generate a second cluster of records from the group of records;
  
  cause sets of duplicate records in the first cluster of records to be identified;
  
  cause sets of duplicate records in the second cluster of records to be identified;
  
  merge at least two sets of duplicate records associated with both the first cluster and the second cluster of records to form a merged set of duplicate records, wherein a set of duplicate records is implemented using a linked list having a head node and a body node for each record in the set of duplicate records and wherein the merging is performed based on the at least two sets of duplicate records having a common record and comprises merging a linked list associated with each set of duplicate records; and
  
  remove one or more duplicate records from the merged set of duplicate records.
- - 9. The apparatus of claim 8, wherein removing the one or more duplicate records from the merged set of duplicate records comprises removing the one or more duplicate records from the group of records.
  - 10. The apparatus of claim 9, wherein the first cluster of records and the second cluster of records are generated based on one or more keys.
  - 11. The apparatus of claim 10, wherein the first cluster of records and the second cluster of records are subsets of the group of records, and wherein the records in the first cluster and the records in the second cluster are not mutually exclusive.
  - 12. The apparatus of claim 11, wherein the merging of the at least two sets of duplicate records comprises:
    - selecting a record from a first set of duplicate records;
      
      comparing the selected record with records in a second set of duplicate records; and
      
      merging the first set of duplicate records with the second set of duplicate records based on matching the selected record with any one record in the second set of duplicate records.
  - 13. The apparatus of claim 12, wherein the merging the first set of duplicate records with the second set of duplicate records comprises merging a set of duplicate records with few records to a set of duplicate records with more records.
  - 14. The apparatus of claim 13, wherein size information of the set of duplicate records and an identification information of the set of duplicate records are stored in the head node, and wherein the merging the first set of duplicate records with the second set of duplicate records comprises merging a linked list associated with the first set of duplicate records with a linked list associated with the second set of duplicate records.

15. A computer program product comprising computer-readable program code to be executed by one or more processors when retrieved from a non-transitory computer-readable medium, the program code including instructions to:
- generate a first cluster of records from the group of records;
  
  generate a second cluster of records from the group of records;
  
  cause sets of duplicate records in the first cluster of records to be identified;
  
  cause sets of duplicate records in the second cluster of records to be identified;
  
  merge at least two sets of duplicate records associated with both the first cluster and the second cluster of records to form a merged set of duplicate records, wherein a set of duplicate records is implemented using a linked list having a head node and a body node for each record in the set of duplicate records and wherein the merging is performed based on the at least two sets of duplicate records having a common record and comprises merging a linked list associated with each set of duplicate records; and
  
  remove one or more duplicate records from the merged set of duplicate records.
- - 16. The computer program product of claim 15, wherein removing the one or more duplicate records from the merged set of duplicate records comprises removing the one or more duplicate records from the group of records.
  - 17. The computer program product of claim 16, wherein the first cluster of records and the second cluster of records are generated based on one or more keys.
  - 18. The computer program product of claim 17, wherein the first cluster of records and the second cluster of records are subsets of the group of records, and wherein the records in the first cluster and the records in the second cluster are not mutually exclusive.
  - 19. The computer program product of claim 18, wherein the merging of the at least two sets of duplicate records comprises:
    - selecting a record from a first set of duplicate records;
      
      comparing the selected record with records in a second set of duplicate records; and
      
      merging the first set of duplicate records with the second set of duplicate records based on matching the selected record with any one record in the second set of duplicate records.
  - 20. The computer program product of claim 19, wherein the merging the first set of duplicate records with the second set of duplicate records comprises merging a set of duplicate records with few records to a set of duplicate records with more records, wherein size information of the set of duplicate records and an identification information of the set of duplicate records are stored in the head node, and wherein the merging the first set of duplicate records with the second set of duplicate records comprises merging a linked list associated with the first set of duplicate records with a linked list associated with the second set of duplicate records.

Current Assignee
Salesforce.com, Inc.
Original Assignee
Salesforce.com, Inc.
Inventors
Doan, Dai Duong; Jagota, Arun Kumar; Ker, Chenghung; Vaishnav, Parth; Dvinov, Danil; Kudriavtsev, Dmytro
Primary Examiner(s)
Nguyen, Merilyn P

Application Number
US15/052,382
Publication Number
US 20170242868A1
Time in Patent Office
1,021 Days
CPC Class Codes

G06F 16/215   Improving data quality; Dat...

G06F 16/24556   Aggregation; Duplicate elim...

G06F 16/285   Clustering or classification

G06F 7/32   Merging, i.e. combining dat...
