Efficient top-K query evaluation on probabilistic data
First Claim
1. A computer-implemented method for efficiently automatically determining a number of top-rated probabilistic entities selected from a group of probabilistic entities to satisfy a condition, wherein the top-rated probabilistic entities are rated on a criteria that is computed for a set of probabilistic entities that satisfy the condition, the method comprising the steps of:
- (a) determining an initial range of criteria for each probabilistic entity in the set of probabilistic entities, wherein determining the initial range includes performing a probabilistic, simulation-based computation on each probabilistic entity in the set of entities;
(b) computing a current critical range of criteria, based upon the ranges of criteria that were determined for each probabilistic entity;
(c) selecting a subset of probabilistic entities from the set on which to run further iterative computations to determine a refined range of criteria for each probabilistic entity of the subset of probabilistic entities, wherein selection of probabilistic entities to be included in the subset is based upon the range of criteria previously determined for the probabilistic entities, and wherein the subset of probabilistic entities includes an entity with first data and another entity with second data that has a lower computed probability of being accurate than a computed probability of the first data being accurate;
(d) repeating steps (b) and (c) until a current critical range does not include any portion of a refined range of criteria for any of the probabilistic entities in the subset, the number of probabilistic entities that are above the current critical range then comprising the number of top-rated probabilistic entities; and
(e) presenting the number of top-rated probabilistic entities to a user, wherein the step of computing the current critical range of criteria comprises the steps of;
setting a lower critical bound for the current critical range of criteria based upon a topk refined lower bound, determined by running the computations on the probabilistic entities, where the topk refined lower bound is a kth largest refined lower bound of the probabilistic entities; and
setting an upper critical bound for the current critical range based upon a topk+1 refined upper bound for the probabilistic entities, determined by running the computations on the probabilistic entities, where the topk+1 refined upper bound is a k +1th largest refined upper bound of the probabilistic entities.
3 Assignments
0 Petitions
Accused Products
Abstract
A novel approach that computes and efficiently ranks the top-k answers to a query on a probabilistic database. The approach identifies the top-k answers, since imprecisions in the data often lead to a large number of answers of low quality. The algorithm is used to run several Monte Carlo simulations in parallel, one for each candidate answer, and approximates the probability of each only to the extent needed to correctly determine the top-k answers. The algorithm is provably optimal and scales to large databases. A more general application can identify a number of top-rated entities of a group that satisfy a condition, based on a criteria or score computed for the entities. Also disclosed are several optimization techniques. One option is to rank the top-rated results; another option provides for interrupting the iteration to return the number of top-rated entities that have thus far been identified.
16 Citations
21 Claims
-
1. A computer-implemented method for efficiently automatically determining a number of top-rated probabilistic entities selected from a group of probabilistic entities to satisfy a condition, wherein the top-rated probabilistic entities are rated on a criteria that is computed for a set of probabilistic entities that satisfy the condition, the method comprising the steps of:
-
(a) determining an initial range of criteria for each probabilistic entity in the set of probabilistic entities, wherein determining the initial range includes performing a probabilistic, simulation-based computation on each probabilistic entity in the set of entities; (b) computing a current critical range of criteria, based upon the ranges of criteria that were determined for each probabilistic entity; (c) selecting a subset of probabilistic entities from the set on which to run further iterative computations to determine a refined range of criteria for each probabilistic entity of the subset of probabilistic entities, wherein selection of probabilistic entities to be included in the subset is based upon the range of criteria previously determined for the probabilistic entities, and wherein the subset of probabilistic entities includes an entity with first data and another entity with second data that has a lower computed probability of being accurate than a computed probability of the first data being accurate; (d) repeating steps (b) and (c) until a current critical range does not include any portion of a refined range of criteria for any of the probabilistic entities in the subset, the number of probabilistic entities that are above the current critical range then comprising the number of top-rated probabilistic entities; and (e) presenting the number of top-rated probabilistic entities to a user, wherein the step of computing the current critical range of criteria comprises the steps of; setting a lower critical bound for the current critical range of criteria based upon a topk refined lower bound, determined by running the computations on the probabilistic entities, where the topk refined lower bound is a kth largest refined lower bound of the probabilistic entities; and setting an upper critical bound for the current critical range based upon a topk+1 refined upper bound for the probabilistic entities, determined by running the computations on the probabilistic entities, where the topk+1 refined upper bound is a k +1th largest refined upper bound of the probabilistic entities. - View Dependent Claims (2, 3, 4, 5)
-
-
6. A system for efficiently automatically determining a number of top-rated probabilistic entities selected from a group of probabilistic entities to satisfy a condition, wherein the top-rated probabilistic entities are rated on a criteria that is computed for a set of probabilistic entities that may satisfy the condition, comprising:
-
(a) a memory in which the group of probabilistic entities are stored and in which a plurality of machine executable instructions are stored; (b) a user input for enabling a user to control the system and provide input data; (c) an output device for presenting information to a user; and (d) a processor that is coupled to the memory, the user input, and the output device, the processor executing the machine executable instructions in the memory to carry out a plurality of functions, including; (i) determining an initial range of criteria for each probabilistic entity in the set of probabilistic entities, wherein determining the initial range includes performing a probabilistic, simulation-based computation on each probabilistic entity in the set of entities; (ii) computing a current critical range of criteria, based upon the ranges of criteria that were determined for each probabilistic entity; (iii) selecting a subset of probabilistic entities from the set on which to run further iterative computations to determine a refined range of criteria for each probabilistic entity of the subset of probabilistic entities, wherein selection of probabilistic entities to be included in the subset is based upon the range of criteria previously determined for the probabilistic entities, and wherein the subset of probabilistic entities includes an entity with first data and another entity with second data that has a lower computed probability of being accurate than a computed probability of the first data being accurate; (iv) repeating functions (ii) and (iii) until a current critical range does not include any portion of a refined range of criteria for any of the probabilistic entities in the subset, the number of probabilistic entities that are above the current critical range then comprising the number of top-rated probabilistic entities; and (v) presenting the number of top-rated probabilistic entities to a user with the output device, wherein the machine executable instructions further cause the processor to; set a lower critical bound for the current critical range of criteria based upon a topk refined lower bound, determined by running the computations on the probabilistic entities, where the topk refined lower bound is a kth largest refined lower bound of the probabilistic entities; and set an upper critical bound for the current critical range based upon a topk+1 refined upper bound for the probabilistic entities, determined by running the computations on the probabilistic entities, where the topk+1 refined upper bound is a k+1th largest refined upper bound of the probabilistic entities. - View Dependent Claims (7, 8, 9, 10)
-
-
11. A computer-implemented method for efficiently determining a number k of top-rated probabilistic answers in response to a query of a database that includes imprecise data, so that each top-rated probabilistic answer is associated with a probability that the probabilistic answer is correct that is greater than that of all other probabilistic answers in a set of possible probabilistic answers to the query, and wherein determining the probability that an probabilistic answer is correct requires an unknown number of iterative computations, the method comprising the steps of:
-
(a) repetitively running a computation on each possible probabilistic answer in the set for a predefined number of times, to compute an approximation of a lower bound and an upper bound for the probability that the possible probabilistic answer is correct, wherein running each computation includes performing a probabilistic, simulation-based computation on each possible probabilistic answer in the set; (b) selecting a current critical region between a critical lower bound and a critical upper bound of probability; (c) based upon relative values of the approximations of the lower and upper bounds of probability computed for the possible probabilistic answers and the critical lower bound and critical upper bound of the critical region, selecting possible probabilistic answers for repetitively running further computations to determine a further refined lower bound and a further refined upper bound of probability for each possible probabilistic answer selected, wherein the imprecise data includes an probabilistic answer with first data and another probabilistic answer with second data that has a lower computed probability of being accurate than a computed probability of the first data being accurate; (d) iteratively repeating steps (b) and (c) until refined approximated lower bounds of each of k possible probabilistic answers are greater than or equal to the upper bound of a current critical region, indicating that said k possible probabilistic answers are the k top-rated probabilistic answers to the query; and (e) presenting the k top-rated probabilistic answers to a user, wherein the step of selecting the current critical region comprises the steps of; setting the lower critical bound for the current critical region based upon a topk refined lower bound determined by running the computations on the possible probabilistic answers, where the topk refined lower bound is a kth largest refined lower bound of the probabilistic answers; and setting the upper critical bound for the current critical region based upon a topk+1 refined upper bound for the possible probabilistic answers, determined by running the computations on the possible probabilistic answers, where the topk+1 refined upper bound is a k+1th largest refined upper bound of the probabilistic answers. - View Dependent Claims (12, 13, 14, 15, 16, 17)
-
-
18. A system for efficiently determining a number k of top-rated probabilistic answers in response to a query of a database that includes imprecise data, so that each top-rated probabilistic answer is associated with a probability that the answer is correct that is greater than that of all other answers in a set of possible probabilistic answers to the query, and wherein determining the probability that an probabilistic answer is correct requires an unknown number of iterative computations, comprising:
-
(a) a memory in which the imprecise data are stored and in which a plurality of machine executable instructions are stored, wherein running each computation includes performing a probabilistic, simulation-based computation on each possible probabilistic answer in the set; (b) a user input for enabling a user to control the system and provide input data; (c) an output device for presenting information to a user; and (d) a processor that is coupled to the memory, the user input, and the output device, the processor executing the machine executable instructions in the memory to carry out a plurality of functions, including; (i) repetitively running a computation on each possible probabilistic answer in the set for a predefined number of times, to compute an approximation of a lower bound and an upper bound for the probability that the possible probabilistic answer is correct; (ii) selecting a current critical region between a critical lower bound and a critical upper bound of probability; (iii) based upon relative values of the approximations of the lower and upper bounds of probability computed for the possible probabilistic answers and the critical lower bound and critical upper bound of the critical region, selecting possible probabilistic answers for repetitively running further computations to determine a further refined lower bound and a further refined upper bound of probability for each possible probabilistic answer selected, wherein the imprecise data includes a probabilistic answer with first data and another probabilistic answer with second data that has a lower computed probability of being accurate than a computed probability of the first data being accurate; (iv) iteratively repeating functions (ii) and (iii) until refined approximated lower bounds of each of k possible answers are greater than or equal to the upper bound of a current critical region, indicating that said k possible probabilistic answers are the k top-rated probabilistic answers to the query; and (v) presenting the k top-rated probabilistic answers to a user, wherein the machine executable instructions further cause the processor to; set the lower critical bound for the current critical region based upon a topk refined lower bound determined by running the computations on the possible probabilistic answers, where the topk refined lower bound is a kth largest refined lower bound of the probabilistic answers; and set the upper critical bound for the current critical region based upon a topk+1 refined upper bound for the possible probabilistic answers, determined by running the computations on the possible probabilistic answers, where the topk+1 refined upper bound is a k+1th largest refined upper bound of the probabilistic answers. - View Dependent Claims (19, 20, 21)
-
Specification