EFFICIENT TOP-K QUERY EVALUATION ON PROBABILISTIC DATA
First Claim
1. A method for efficiently automatically determining a number of top-rated entities selected from a group of entities to satisfy a condition, wherein the top-rated entities are rated on a criteria that is computed for a set of entities that may satisfy the condition, comprising the steps of:
- (a) determining an initial range of criteria for each entity in the set of entities;
(b) computing a current critical range of criteria, based upon the ranges of criteria that were determined for each entity;
(c) selecting a subset of entities from the set on which to run further iterative computations to determine a refined range of criteria for each entity of the subset of entities, wherein selection of entities to be included in the subset is based upon the range of criteria previously determined for the entities;
(d) repeating steps (b) and (c) until a current critical range does not include any portion of a refined range of criteria for any of the entities in the subset, the number of entities that are above the current critical range then comprising the number of top-rated entities; and
(e) presenting the number of top-rated entities to a user.
3 Assignments
0 Petitions
Accused Products
Abstract
A novel approach that computes and efficiently ranks the top-k answers to a query on a probabilistic database. The approach identifies the top-k answers, since imprecisions in the data often lead to a large number of answers of low quality. The algorithm is used to run several Monte Carlo simulations in parallel, one for each candidate answer, and approximates the probability of each only to the extent needed to correctly determine the top-k answers. The algorithm is provably optimal and scales to large databases. A more general application can identify a number of top-rated entities of a group that satisfy a condition, based on a criteria or score computed for the entities. Also disclosed are several optimization techniques. One option is to rank the top-rated results; another option provides for interrupting the iteration to return the number of top-rated entities that have thus far been identified.
32 Citations
25 Claims
-
1. A method for efficiently automatically determining a number of top-rated entities selected from a group of entities to satisfy a condition, wherein the top-rated entities are rated on a criteria that is computed for a set of entities that may satisfy the condition, comprising the steps of:
-
(a) determining an initial range of criteria for each entity in the set of entities; (b) computing a current critical range of criteria, based upon the ranges of criteria that were determined for each entity; (c) selecting a subset of entities from the set on which to run further iterative computations to determine a refined range of criteria for each entity of the subset of entities, wherein selection of entities to be included in the subset is based upon the range of criteria previously determined for the entities; (d) repeating steps (b) and (c) until a current critical range does not include any portion of a refined range of criteria for any of the entities in the subset, the number of entities that are above the current critical range then comprising the number of top-rated entities; and (e) presenting the number of top-rated entities to a user. - View Dependent Claims (2, 3, 4, 5, 6)
-
-
7. A system for efficiently automatically determining a number of top-rated entities selected from a group of entities to satisfy a condition, wherein the top-rated entities are rated on a criteria that is computed for a set of entities that may satisfy the condition, comprising:
-
(a) a memory in which the group of entities are stored and in which a plurality of machine executable instructions are stored; (b) a user input for enabling a user to control the system and provide input data; (c) an output device for presenting information to a user; and (d) a processor that is coupled to the memory, the user input, and the output device, the processor executing the machine executable instructions in the memory to carry out a plurality of functions, including; (i) determining an initial range of criteria for each entity in the set of entities; (ii) computing a current critical range of criteria, based upon the ranges of criteria that were determined for each entity; (iii) selecting a subset of entities from the set on which to run further iterative computations to determine a refined range of criteria for each entity of the subset of entities, wherein selection of entities to be included in the subset is based upon the range of criteria previously determined for the entities; (iv) repeating functions (ii) and (iii) until a current critical range does not include any portion of a refined range of criteria for any of the entities in the subset, the number of entities that are above the current critical range then comprising the number of top-rated entities; and (v) presenting the number of top-rated entities to a user with the output device. - View Dependent Claims (8, 9, 10, 11, 12)
-
-
13. A method for efficiently determining a number k of top-rated answers in response to a query of a database that includes imprecise data, so that each top-rated answer is associated with a probability that the answer is correct that is greater than that of all other answers in a set of possible answers to the query, and wherein determining the probability that an answer is correct requires an unknown number of iterative computations, comprising the steps of:
-
(a) repetitively running a computation on each possible answer in the set for a predefined number of times, to compute an approximation of a lower bound and an upper bound for the probability that the possible answer is correct; (b) selecting a current critical region between a critical lower bound and a critical upper bound of probability; (c) based upon relative values of the approximations of the lower and upper bounds of probability computed for the possible answers and the critical lower bound and critical upper bound of the critical region, selecting possible answers for repetitively running further computations to determine a further refined lower bound and a further refined upper bound of probability for each possible answer selected; (d) iteratively repeating steps (b) and (c) until refined approximated lower bounds of each of k possible answers are greater than or equal to the upper bound of a current critical region, indicating that said k possible answers are the k top-rated answers to the query; and (e) presenting the k top-rated answers to a user. - View Dependent Claims (14, 15, 16, 17, 18, 19, 20)
-
-
21. A system for efficiently determining a number k of top-rated answers in response to a query of a database that includes imprecise data, so that each top-rated answer is associated with a probability that the answer is correct that is greater than that of all other answers in a set of possible answers to the query, and wherein determining the probability that an answer is correct requires an unknown number of iterative computations, comprising:
-
(a) a memory in which the imprecise data are stored and in which a plurality of machine executable instructions are stored; (b) a user input for enabling a user to control the system and provide input data; (c) an output device for presenting information to a user; and (d) a processor that is coupled to the memory, the user input, and the output device, the processor executing the machine executable instructions in the memory to carry out a plurality of functions, including; (i) repetitively running a computation on each possible answer in the set for a predefined number of times, to compute an approximation of a lower bound and an upper bound for the probability that the possible answer is correct; (ii) selecting a current critical region between a critical lower bound and a critical upper bound of probability; (iii) based upon relative values of the approximations of the lower and upper bounds of probability computed for the possible answers and the critical lower bound and critical upper bound of the critical region, selecting possible answers for repetitively running further computations to determine a further refined lower bound and a further refined upper bound of probability for each possible answer selected; (iv) iteratively repeating functions (ii) and (iii) until refined approximated lower bounds of each of k possible answers are greater than or equal to the upper bound of a current critical region, indicating that said k possible answers are the k top-rated answers to the query; and (v) presenting the k top-rated answers to a user. - View Dependent Claims (22, 23, 24, 25)
-
Specification