Robust detector of fuzzy duplicates
Abstract
At least one implementation, described herein, detects fuzzy duplicates and eliminates such duplicates. Fuzzy duplicates are multiple, seemingly distinct tuples (i.e., records) in a database that represent the same real-world entity or phenomenon.
21 Claims
1. One or more processor-readable program media having processor-executable instructions that, when executed by a processor, perform acts comprising:
obtaining a dataset comprising multiple tuples from a database;
for each of the multiple tuples of the dataset, computing one or more nearest neighbor tuples in the dataset;
defining multiple disjoint partitions of multiple tuples, wherein tuples in each partition comprise fuzzy duplicates of one another, such that each fuzzy duplicate tuple in a partition represents a common real world entity or phenomenon.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A fuzzy duplicate elimination system comprising:
a dataset obtaining means for obtaining a dataset comprising multiple tuples from a database;
a computing means for computing one or more nearest neighbor tuples in the dataset for each of the multiple tuples of the dataset;
a partitioning means for defining multiple disjoint partitions of multiple tuples, wherein seemingly distinct tuples in each partition comprise fuzzy duplicates of one another, such that each fuzzy duplicate tuple in a partition represents a common real world entity or phenomenon;
a duplicate-elimination means for eliminating duplicates of multiple fuzzy duplicate tuples within the multiple partitions so that each of such partitions is left with one, unduplicated tuple.
Dependent claims: 10, 11, 12, 13, 14, 15
16. A method for eliminating fuzzy duplicate tuples in a dataset, the method comprising:
for each of multiple tuples in a dataset, computing one or more nearest neighbor tuples;
delimiting multiple disjoint partitions of multiple tuples, wherein seemingly distinct tuples in each partition comprise fuzzy duplicates of one another, such that each fuzzy duplicate tuple in a partition represents a common real world entity or phenomenon;
eliminating duplicates within the multiple partitions of multiple fuzzy duplicate tuples so that each of such partitions is left with one, unduplicated tuple.
Dependent claims: 17, 18, 19, 20, 21
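
The three steps recited in claim 16 (nearest-neighbor computation, partitioning, elimination) can be illustrated with a minimal Python sketch. It is not the patented procedure: the claims do not state a distance function or a partitioning criterion, so everything specific here is an assumption made for illustration. Records are modeled as plain strings, similarity comes from the standard library's difflib.SequenceMatcher, and two records are linked only when each is the other's nearest neighbor and their similarity meets a hypothetical threshold. The function names are likewise hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1]; a stand-in for any tuple distance function."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def nearest_neighbor(records, i):
    """Index of the record most similar to records[i], excluding i itself."""
    others = [j for j in range(len(records)) if j != i]
    return max(others, key=lambda j: similarity(records[i], records[j]))


def partition_fuzzy_duplicates(records, threshold=0.7):
    """Group records into disjoint partitions of presumed fuzzy duplicates.

    Assumption for illustration: two records are linked when each is the
    other's nearest neighbor and their similarity meets the threshold.
    """
    n = len(records)
    parent = list(range(n))  # union-find forest, one node per record

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    if n > 1:
        nn = [nearest_neighbor(records, i) for i in range(n)]
        for i, j in enumerate(nn):
            if nn[j] == i and similarity(records[i], records[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(records[i])
    return list(groups.values())


def eliminate_duplicates(partitions):
    """Keep one representative (here simply the first record) per partition."""
    return [group[0] for group in partitions]


records = [
    "Microsoft Corp",
    "Microsoft Corporation",
    "Acme Widgets",
    "Boeing Company",
]
partitions = partition_fuzzy_duplicates(records)
survivors = eliminate_duplicates(partitions)
```

In this sketch the two Microsoft records fall into one partition and the other two records remain singletons, so one record per partition survives. Keeping the first record of each partition is an arbitrary choice; a real system might instead merge fields or pick the most complete record.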
Specification