Method and system for large scale data curation
Abstract
An end-to-end data curation system and the various methods used in linking, matching, and cleaning large-scale data sources. The goal of this system is to provide scalable and efficient record deduplication. The system uses a crowd of experts to train the system. The system operator can optionally provide a set of hints to reduce the number of questions sent to the experts. The system solves the problems of schema mapping and record deduplication in a holistic way by unifying them into a single linkage problem.
6 Claims
1. A computer implemented method for performing object linkage in computer memory on object pairs from one or more database storage sources, in order to separate said object pairs into linked object pairs and non-linked object pairs, comprising:
- applying rules represented as a Boolean formula in disjunctive normal form (DNF) as shown in FORMULA 1 to said object pairs,
  wherein said FORMULA 1 is constructed with attribute similarity predicates, and
  wherein said FORMULA 1 is constructed such that most of said linked object pairs satisfy said FORMULA 1 while a minimal number of said non-linked object pairs satisfy said FORMULA 1; and
- generating initial rules in disjunctive normal form (DNF) based on collected statistics from said database storage sources, and based on hints from data experts,
  wherein said rules guarantee high recall and moderate precision, and
  wherein said hints consist of keys and anti-keys.
Dependent claims: 2, 3, 4
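As an illustration of claim 1, the sketch below applies a DNF rule built from attribute similarity predicates to object pairs and separates them into linked and non-linked pairs. FORMULA 1 is not reproduced in this listing, so the rule shape (an OR of clauses, each clause an AND of "similarity on attribute is at least a threshold" predicates), the attribute names, the thresholds, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for the example, not the patented formula.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Illustrative string similarity in [0, 1] (an assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical DNF rule: a list of clauses combined with OR; each clause is a
# list of (attribute, threshold) predicates combined with AND.
DNF_RULE = [
    [("name", 0.9)],                 # clause 1: names nearly identical
    [("name", 0.6), ("city", 0.8)],  # clause 2: similar name AND similar city
]


def satisfies(rule, left: dict, right: dict) -> bool:
    """True if at least one clause has all of its predicates satisfied."""
    return any(
        all(similarity(left[attr], right[attr]) >= threshold
            for attr, threshold in clause)
        for clause in rule
    )


def link(pairs, rule):
    """Separate object pairs into (linked, non_linked) lists using the rule."""
    linked, non_linked = [], []
    for left, right in pairs:
        target = linked if satisfies(rule, left, right) else non_linked
        target.append((left, right))
    return linked, non_linked
```

A clause with a single high threshold acts like a key, while multi-predicate clauses trade precision for recall; how the claimed method derives clauses from statistics and from key and anti-key hints is not shown here.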
5. A computer implemented method to reduce training data required for record deduplication comprising:
- selecting a small sample of records from one or more database storage sources;
- linking, in computer memory, said records into record pairs based on a linkage model;
- selecting a small subset of said record pairs from said database storage sources using stratified sampling, wherein said stratified sampling includes:
  - partitioning said selected record pairs into bins such that each of said bins has a specified similarity range,
  - creating questions about the similarity of said record pairs,
  - selecting a number of said questions from each of said bins proportional to the square root of the bin size multiplied by the variance of the labels in the bin,
  - sending said questions to data experts for labeling into labeled questions, and
  - generating an enhanced linkage model from said labeled questions; and
- continuing said stratified sampling iteratively until
  - the precision and recall of said enhanced linkage model are above the minimum precision and recall that are required by a system operator; or
  - the precision and recall of said enhanced linkage model did not significantly change from the previous version of said enhanced linkage model.
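The stratified sampling steps of claim 5 can be sketched as follows. The claim wording "the square root of the bin size multiplied by the variance" admits two readings; this sketch assumes sqrt(size × variance), and the other reading, sqrt(size) × variance, would change only the weight line. The equal-width bins, the fallback when no label variance is known yet, and the `tol` parameter used for "did not significantly change" are hypothetical choices made for the example.

```python
import math


def bin_pairs(scored_pairs, num_bins=4):
    """Partition (pair_id, similarity) tuples into equal-width bins over [0, 1]."""
    bins = [[] for _ in range(num_bins)]
    for pair_id, sim in scored_pairs:
        idx = min(int(sim * num_bins), num_bins - 1)  # sim == 1.0 goes in the last bin
        bins[idx].append(pair_id)
    return bins


def allocate(bin_sizes, label_variances, budget):
    """Split a question budget across bins, proportional to sqrt(size * variance).

    Rounding means the allocations may not sum exactly to the budget.
    """
    weights = [math.sqrt(n * v) for n, v in zip(bin_sizes, label_variances)]
    if sum(weights) == 0:
        # Assumed fallback: no variance information yet, weight by sqrt(size).
        weights = [math.sqrt(n) for n in bin_sizes]
    total = sum(weights)
    if total == 0:
        return [0] * len(bin_sizes)
    return [min(n, round(budget * w / total)) for n, w in zip(bin_sizes, weights)]


def should_stop(precision, recall, min_precision, min_recall, previous=None, tol=0.01):
    """Stop when targets are met, or when neither metric moved by more than tol."""
    if precision >= min_precision and recall >= min_recall:
        return True
    if previous is not None:
        prev_precision, prev_recall = previous
        return abs(precision - prev_precision) < tol and abs(recall - prev_recall) < tol
    return False
```

Under this weighting, bins whose labels are all alike (variance 0) receive no questions, so the expert budget concentrates on similarity ranges where the current model is least certain.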
6. An end-to-end data curation system comprising:
- an expert subsystem having:
  - data experts; and
  - a question generation and question selector subsystem implemented on one or more question generation and question selector computers operable to generate questions for said data experts to turn said questions into answered questions;
- a database source subsystem implemented as software code on one or more database source computers operable to store one or more database sources;
- a data cleaning subsystem implemented as software code on one or more data cleaning computers operable to normalize and transform raw data from said database sources into cleaned data;
- a model generator subsystem implemented as software code on one or more data model generator computers operable to:
  - receive collected statistics about said cleaned data;
  - receive high level linkage criteria from said data experts; and
  - generate an initial linkage model from said collected statistics and said high level linkage criteria for said cleaned data, wherein said high level linkage criteria includes hints from said data experts, wherein said hints consist of keys and anti-keys, wherein said initial linkage model is based on initial rules represented as a Boolean formula in disjunctive normal form (DNF) as shown in FORMULA 1, wherein said initial rules guarantee high recall and moderate precision; and
- an object linkage subsystem implemented as software code on one or more object linkage computers operable to:
  - perform object linkage on object pairs from said cleaned data in order to separate said object pairs into linked object pairs and non-linked object pairs, wherein said object linkage includes selecting an optimal set of features from said object pairs to identify possibly linked object pairs; and
  - improve said initial linkage model into an enhanced linkage model by:
    - abstracting rows/records from said cleaned data into a first set of objects,
    - abstracting columns/fields/attributes from said cleaned data into a second set of objects, and
    - iteratively performing object linkage on said first set of objects and said second set of objects, wherein said object linkage performed on said first set of objects performs the task of said record deduplication, and wherein said object linkage performed on said second set of objects performs the task of said schema mapping.
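The last element of claim 6 treats record deduplication and schema mapping as the same linkage task over two kinds of objects. A minimal sketch of that idea follows: rows and columns are each abstracted into sets of values, and one generic linkage routine is run over both. Jaccard similarity with a fixed threshold stands in for the claimed DNF linkage model, the function names and object identifiers are hypothetical, and only a single pass is shown rather than the iterative refinement the claim describes.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (an assumed stand-in similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rows_as_objects(table):
    """Abstract each row/record into an object: the set of its lower-cased values."""
    return {f"row{i}": {str(v).lower() for v in row.values()}
            for i, row in enumerate(table)}


def columns_as_objects(table, source):
    """Abstract each column/attribute into an object: the set of values it holds."""
    columns = {}
    for row in table:
        for name, value in row.items():
            columns.setdefault(f"{source}.{name}", set()).add(str(value).lower())
    return columns


def link_objects(objects, threshold):
    """Generic linkage: return pairs of object ids whose similarity meets the threshold."""
    return [(a, b) for a, b in combinations(sorted(objects), 2)
            if jaccard(objects[a], objects[b]) >= threshold]
```

For example, given two small sources whose columns are named differently, the same `link_objects` call yields duplicate records when fed row objects and a column mapping when fed column objects:

```python
table_a = [{"name": "Ann", "city": "Oslo"}, {"name": "Bob", "city": "Rome"}]
table_b = [{"full_name": "ann", "town": "oslo"}, {"full_name": "Bob", "town": "Rome"}]

duplicates = link_objects(rows_as_objects(table_a + table_b), 0.8)
mapping = link_objects(
    {**columns_as_objects(table_a, "A"), **columns_as_objects(table_b, "B")}, 0.8)
```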
Specification