Efficient top-K query evaluation on probabilistic data

US 7,814,113 B2
Filed: 11/05/2007
Issued: 10/12/2010
Est. Priority Date: 11/07/2006
Status: Expired due to Fees

Abstract

A novel approach that computes and efficiently ranks the top-k answers to a query on a probabilistic database. The approach identifies the top-k answers, since imprecisions in the data often lead to a large number of answers of low quality. The algorithm is used to run several Monte Carlo simulations in parallel, one for each candidate answer, and approximates the probability of each only to the extent needed to correctly determine the top-k answers. The algorithm is provably optimal and scales to large databases. A more general application can identify a number of top-rated entities of a group that satisfy a condition, based on a criteria or score computed for the entities. Also disclosed are several optimization techniques. One option is to rank the top-rated results; another option provides for interrupting the iteration to return the number of top-rated entities that have thus far been identified.
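
As a rough illustration of the idea in the abstract (one simulation per candidate, each refined only as far as needed to separate the top k), the following Python sketch may help. It is not the patented implementation. Everything named here is hypothetical: `multisimulate`, `true_probs`, `batch`, `delta`, and `max_rounds` are illustrative, each candidate is replaced by a coin flip with a hidden probability, the intervals use a Hoeffding-style half-width chosen for simplicity, and the rule "refine every interval that overlaps the critical range" is a simplification of the selection rules recited in the claims.

```python
import math
import random


def multisimulate(true_probs, k, batch=200, delta=0.01, max_rounds=10_000, seed=0):
    """Return (top-k candidate ids, simulation steps spent per candidate)."""
    rng = random.Random(seed)
    n = len(true_probs)
    hits = [0] * n   # number of simulation steps that came out true
    runs = [0] * n   # number of simulation steps executed so far

    def simulate(i, steps):
        # Stand-in for a Monte Carlo step: a coin flip with hidden bias.
        for _ in range(steps):
            hits[i] += rng.random() < true_probs[i]
        runs[i] += steps

    def interval(i):
        # Assumed Hoeffding-style half-width; shrinks as runs[i] grows.
        eps = math.sqrt(math.log(2.0 / delta) / (2.0 * runs[i]))
        p = hits[i] / runs[i]
        return max(0.0, p - eps), min(1.0, p + eps)

    for i in range(n):
        simulate(i, batch)              # initial range for every candidate

    for _ in range(max_rounds):         # max_rounds acts as a crude interrupt
        ivs = [interval(i) for i in range(n)]
        c = sorted((lo for lo, _ in ivs), reverse=True)[k - 1]  # k-th largest lower bound
        d = sorted((hi for _, hi in ivs), reverse=True)[k]      # (k+1)-th largest upper bound
        if c >= d:
            break                       # critical range is empty: top k separated
        for i, (lo, hi) in enumerate(ivs):
            if lo < d and hi > c:       # interval still overlaps the critical range
                simulate(i, batch)

    ranked = sorted(range(n), key=lambda i: hits[i] / runs[i], reverse=True)
    return ranked[:k], runs


top, runs = multisimulate([0.9, 0.6, 0.5, 0.1, 0.05], k=2)
print(top, runs)
```

In a run like this, the candidates near the k-th/(k+1)-th boundary would be expected to receive most of the simulation steps, while clearly high or clearly low candidates stop after the initial batch.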

21 Claims

1. A computer-implemented method for efficiently automatically determining a number of top-rated probabilistic entities selected from a group of probabilistic entities to satisfy a condition, wherein the top-rated probabilistic entities are rated on a criteria that is computed for a set of probabilistic entities that satisfy the condition, the method comprising the steps of:
- (a) determining an initial range of criteria for each probabilistic entity in the set of probabilistic entities, wherein determining the initial range includes performing a probabilistic, simulation-based computation on each probabilistic entity in the set of entities;
  
  (b) computing a current critical range of criteria, based upon the ranges of criteria that were determined for each probabilistic entity;
  
  (c) selecting a subset of probabilistic entities from the set on which to run further iterative computations to determine a refined range of criteria for each probabilistic entity of the subset of probabilistic entities, wherein selection of probabilistic entities to be included in the subset is based upon the range of criteria previously determined for the probabilistic entities, and wherein the subset of probabilistic entities includes an entity with first data and another entity with second data that has a lower computed probability of being accurate than a computed probability of the first data being accurate;
  
  (d) repeating steps (b) and (c) until a current critical range does not include any portion of a refined range of criteria for any of the probabilistic entities in the subset, the number of probabilistic entities that are above the current critical range then comprising the number of top-rated probabilistic entities; and
  
  (e) presenting the number of top-rated probabilistic entities to a user, wherein the step of computing the current critical range of criteria comprises the steps of:
  
  setting a lower critical bound for the current critical range of criteria based upon a top_k refined lower bound, determined by running the computations on the probabilistic entities, where the top_k refined lower bound is a k-th largest refined lower bound of the probabilistic entities; and
  
  setting an upper critical bound for the current critical range based upon a top_(k+1) refined upper bound for the probabilistic entities, determined by running the computations on the probabilistic entities, where the top_(k+1) refined upper bound is a (k+1)-th largest refined upper bound of the probabilistic entities.
- - 2. The method of claim 1, further comprising the step of ranking the number of top-rated probabilistic entities by the range of criteria computed for each probabilistic entity.
  - 3. The method of claim 1, further comprising the step of enabling a user to terminate iterative repetition of steps (b) and (c) at any time, returning an ordered set of top-rated probabilistic entities determined up to that time, without regard to any specified number of probabilistic entities.
  - 4. The method of claim 1, wherein the step of selecting the subset of probabilistic entities for repetitively running the computations comprises the steps of:
    - (a) selecting each probabilistic entity for which a lower bound of the refined criteria is less than a critical lower bound of the current critical range of criteria and an upper bound of the refined criteria is greater than a critical upper bound of the current critical range of criteria; and
      
      if no probabilistic entity is selected, then
      
      (b) selecting each pair of probabilistic entities, wherein for a first probabilistic entity of the pair, the lower bound of the refined criteria is less than the critical lower bound, and for the second probabilistic entity of the pair, the upper bound of the refined criteria is greater than the critical upper bound of the current critical range of criteria; and
      
      if no pair of probabilistic entities is thus selected, then
      
      (c) selecting each probabilistic entity for which a range between the lower bound of the refined criteria and the upper bound of the refined criteria includes corresponding ranges of all other probabilistic entities.
  - 5. The method of claim 1, further comprising the step of initially reducing an extent of the critical range of criteria before iteratively running the computations repetitively on each probabilistic entity in the subset, by statically evaluating groups of the probabilistic entities.
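
The critical-range computation of claim 1 and the three-way selection of claim 4 can be sketched on plain `(lower, upper)` pairs. This is a minimal Python sketch under stated assumptions, not the claimed method itself: the function names `critical_range`, `top_k`, and `select_for_refinement` are hypothetical, the stopping test "lower critical bound >= upper critical bound" is one reading of step (d), and rule (b) is narrowed (an assumption not found in the claim text) to intervals that actually straddle the bound they are compared with, so that intervals lying wholly outside the critical range are not refined further.

```python
def critical_range(intervals, k):
    """Return (c, d): the k-th largest lower bound and (k+1)-th largest upper bound."""
    lows = sorted((lo for lo, _ in intervals), reverse=True)
    highs = sorted((hi for _, hi in intervals), reverse=True)
    return lows[k - 1], highs[k]        # requires len(intervals) > k


def top_k(intervals, k):
    """Ids of the k intervals with the largest lower bounds."""
    order = sorted(range(len(intervals)), key=lambda i: intervals[i][0], reverse=True)
    return order[:k]


def select_for_refinement(intervals, k):
    """Ids of the intervals to simulate further; empty when the top k are separated."""
    c, d = critical_range(intervals, k)
    if c >= d:
        return []                       # critical range is empty
    ids = range(len(intervals))
    # Rule (a): intervals extending past both ends of the critical range.
    both = [i for i in ids if intervals[i][0] < c and intervals[i][1] > d]
    if both:
        return both
    # Rule (b), narrowed by assumption: one group straddles c, the other straddles d.
    lower = [i for i in ids if intervals[i][0] < c < intervals[i][1]]
    upper = [i for i in ids if intervals[i][0] < d < intervals[i][1]]
    if lower and upper:
        return sorted(set(lower) | set(upper))
    # Rule (c): intervals whose range includes the ranges of all others.
    return [i for i in ids
            if all(intervals[i][0] <= lo and hi <= intervals[i][1]
                   for lo, hi in intervals)]


first = [(0.70, 0.90), (0.55, 0.80), (0.30, 0.60), (0.10, 0.20)]
print(critical_range(first, 2))         # (0.55, 0.6)
print(select_for_refinement(first, 2))  # [1, 2]

refined = [(0.70, 0.90), (0.62, 0.75), (0.35, 0.50), (0.10, 0.20)]
print(select_for_refinement(refined, 2))  # []
print(top_k(refined, 2))                  # [0, 1]
```

In the first call the critical range is (0.55, 0.60); no interval extends past both ends, so rule (b) picks entity 2 (which straddles 0.55) together with entity 1 (which straddles 0.60). After those two are refined, the k-th largest lower bound (0.62) exceeds the (k+1)-th largest upper bound (0.50), so nothing remains to refine.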

6. A system for efficiently automatically determining a number of top-rated probabilistic entities selected from a group of probabilistic entities to satisfy a condition, wherein the top-rated probabilistic entities are rated on a criteria that is computed for a set of probabilistic entities that may satisfy the condition, comprising:
- (a) a memory in which the group of probabilistic entities are stored and in which a plurality of machine executable instructions are stored;
  
  (b) a user input for enabling a user to control the system and provide input data;
  
  (c) an output device for presenting information to a user; and
  
  (d) a processor that is coupled to the memory, the user input, and the output device, the processor executing the machine executable instructions in the memory to carry out a plurality of functions, including:
  
  (i) determining an initial range of criteria for each probabilistic entity in the set of probabilistic entities, wherein determining the initial range includes performing a probabilistic, simulation-based computation on each probabilistic entity in the set of entities;
  
  (ii) computing a current critical range of criteria, based upon the ranges of criteria that were determined for each probabilistic entity;
  
  (iii) selecting a subset of probabilistic entities from the set on which to run further iterative computations to determine a refined range of criteria for each probabilistic entity of the subset of probabilistic entities, wherein selection of probabilistic entities to be included in the subset is based upon the range of criteria previously determined for the probabilistic entities, and wherein the subset of probabilistic entities includes an entity with first data and another entity with second data that has a lower computed probability of being accurate than a computed probability of the first data being accurate;
  
  (iv) repeating functions (ii) and (iii) until a current critical range does not include any portion of a refined range of criteria for any of the probabilistic entities in the subset, the number of probabilistic entities that are above the current critical range then comprising the number of top-rated probabilistic entities; and
  
  (v) presenting the number of top-rated probabilistic entities to a user with the output device, wherein the machine executable instructions further cause the processor to:
  
  set a lower critical bound for the current critical range of criteria based upon a top_k refined lower bound, determined by running the computations on the probabilistic entities, where the top_k refined lower bound is a k-th largest refined lower bound of the probabilistic entities; and
  
  set an upper critical bound for the current critical range based upon a top_(k+1) refined upper bound for the probabilistic entities, determined by running the computations on the probabilistic entities, where the top_(k+1) refined upper bound is a (k+1)-th largest refined upper bound of the probabilistic entities.
- - 7. The system of claim 6, wherein the machine executable instructions further cause the processor to rank the number of top-rated probabilistic entities by the range of criteria computed for each probabilistic entity.
  - 8. The system of claim 6, wherein the machine executable instructions further cause the processor to enable a user to terminate iterative repetition of steps (b) and (c) at any time, returning an ordered set of top-rated probabilistic entities determined up to that time, without regard to any specified number of probabilistic entities.
  - 9. The system of claim 6, wherein the machine executable instructions further cause the processor to select the subset of probabilistic entities by:
    - (a) selecting each probabilistic entity for which a lower bound of the refined criteria is less than a critical lower bound of the current critical range of criteria and an upper bound of the refined criteria is greater than a critical upper bound of the current critical range of criteria; and
      
      if no probabilistic entity is selected, then
      
      (b) selecting each pair of probabilistic entities, wherein for a first probabilistic entity of the pair, the lower bound of the refined criteria is less than the critical lower bound, and for the second probabilistic entity of the pair, the upper bound of the refined criteria is greater than the critical upper bound of the current critical range of criteria; and
      
      if no pair of probabilistic entities is thus selected, then
      
      (c) selecting each probabilistic entity for which a range between the lower bound of the refined criteria and the upper bound of the refined criteria includes corresponding ranges of all other probabilistic entities.
  - 10. The system of claim 6, wherein the machine executable instructions further cause the processor to initially reduce an extent of the critical range of criteria before iteratively running the computations repetitively on each probabilistic entity in the subset, by statically evaluating groups of the probabilistic entities.

11. A computer-implemented method for efficiently determining a number k of top-rated probabilistic answers in response to a query of a database that includes imprecise data, so that each top-rated probabilistic answer is associated with a probability that the probabilistic answer is correct that is greater than that of all other probabilistic answers in a set of possible probabilistic answers to the query, and wherein determining the probability that a probabilistic answer is correct requires an unknown number of iterative computations, the method comprising the steps of:
- (a) repetitively running a computation on each possible probabilistic answer in the set for a predefined number of times, to compute an approximation of a lower bound and an upper bound for the probability that the possible probabilistic answer is correct, wherein running each computation includes performing a probabilistic, simulation-based computation on each possible probabilistic answer in the set;
  
  (b) selecting a current critical region between a critical lower bound and a critical upper bound of probability;
  
  (c) based upon relative values of the approximations of the lower and upper bounds of probability computed for the possible probabilistic answers and the critical lower bound and critical upper bound of the critical region, selecting possible probabilistic answers for repetitively running further computations to determine a further refined lower bound and a further refined upper bound of probability for each possible probabilistic answer selected, wherein the imprecise data includes a probabilistic answer with first data and another probabilistic answer with second data that has a lower computed probability of being accurate than a computed probability of the first data being accurate;
  
  (d) iteratively repeating steps (b) and (c) until refined approximated lower bounds of each of k possible probabilistic answers are greater than or equal to the upper bound of a current critical region, indicating that said k possible probabilistic answers are the k top-rated probabilistic answers to the query; and
  
  (e) presenting the k top-rated probabilistic answers to a user, wherein the step of selecting the current critical region comprises the steps of:
  
  setting the lower critical bound for the current critical region based upon a top_k refined lower bound determined by running the computations on the possible probabilistic answers, where the top_k refined lower bound is a k-th largest refined lower bound of the probabilistic answers; and
  
  setting the upper critical bound for the current critical region based upon a top_(k+1) refined upper bound for the possible probabilistic answers, determined by running the computations on the possible probabilistic answers, where the top_(k+1) refined upper bound is a (k+1)-th largest refined upper bound of the probabilistic answers.
- - 12. The method of claim 11, further comprising the step of ranking the k top-rated answers by the probability computed for each probabilistic answer.
  - 13. The method of claim 11, further comprising the step of enabling a user to terminate iterative repetition of steps (b) and (c) at any time, returning an ordered set of top-rated probabilistic answers determined up to that time, without regard to any specified number of probabilistic answers.
  - 14. The method of claim 11, wherein the step of selecting possible probabilistic answers for repetitively running the computations comprises the steps of:
    - (a) selecting each possible probabilistic answer for which the refined approximated lower bound is less than the critical lower bound and the refined approximated upper bound is greater than the critical upper bound of the current critical region; and
      
      if no possible probabilistic answer is selected, then
      
      (b) selecting each pair of possible probabilistic answers wherein for a first possible probabilistic answer of the pair, the refined approximated lower bound is less than the critical lower bound, and for the second possible probabilistic answer of the pair, the refined approximated upper bound is greater than the critical upper bound of the current critical region; and
      
      if no pair of possible probabilistic answers is thus selected, then
      
      (c) selecting each possible probabilistic answer for which a range between the refined approximated lower bound and the refined approximated upper bound includes corresponding ranges of all other possible probabilistic answers.
  - 15. The method of claim 11, wherein the step of repetitively running the computation comprises the steps of:
    - (a) for each time the computation is run, randomly selecting a possible world for a possible probabilistic answer;
      
      (b) for each selected possible world, computing a truth value of a Boolean expression corresponding to the possible probabilistic answer;
      
      (c) determining a frequency with which the Boolean expression is true as a function of the number of times preceding steps (a)-(b) have been run;
      
      (d) determining a probability that each possible probabilistic answer is correct based upon the frequency; and
      
      (e) determining the approximated lower and upper bounds for the probability that each possible probabilistic answer is correct.
  - 16. The method of claim 11, further comprising the step of initially reducing a range between the critical lower bound and the critical upper bound of the critical region before running the computations repetitively on each possible probabilistic answer by a static evaluation of groups of possible probabilistic answers.
  - 17. The method of claim 11, wherein the step of determining the approximate lower bound and approximate upper bound are carried out by a query engine.
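
Steps (a) through (e) of claim 15 can be sketched as a naive possible-world sampler. This is a hedged Python illustration, not the estimator used in the patent: `estimate_interval`, `tuple_probs`, `holds`, and `delta` are hypothetical names, tuples are assumed to be independent, and the lower and upper bounds come from a Hoeffding inequality chosen here only because it is simple to state.

```python
import math
import random


def estimate_interval(tuple_probs, holds, runs, delta=0.05, rng=None):
    """Sample possible worlds and return (lower, upper) bounds on P(holds)."""
    rng = rng or random.Random(0)
    hits = 0
    for _ in range(runs):
        # (a) draw a possible world: each tuple is present independently
        world = {name for name, p in tuple_probs.items() if rng.random() < p}
        # (b) evaluate the answer's Boolean expression in that world
        if holds(world):
            hits += 1
    # (c)-(d) the frequency of true outcomes is the probability estimate
    p_hat = hits / runs
    # (e) Hoeffding half-width: P(|p_hat - p| >= eps) <= 2 * exp(-2 * runs * eps**2)
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * runs))
    return max(0.0, p_hat - eps), min(1.0, p_hat + eps)


# Hypothetical answer whose lineage is t1 AND (t2 OR t3).
probs = {"t1": 0.9, "t2": 0.5, "t3": 0.4}
lo, hi = estimate_interval(probs, lambda w: "t1" in w and ("t2" in w or "t3" in w), runs=20_000)
print(lo, hi)
```

For this example the exact probability is 0.9 × (1 − 0.5 × 0.6) = 0.63, so the printed interval should be narrow and centered close to that value. Running more samples shrinks the interval in proportion to one over the square root of `runs`.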

18. A system for efficiently determining a number k of top-rated probabilistic answers in response to a query of a database that includes imprecise data, so that each top-rated probabilistic answer is associated with a probability that the answer is correct that is greater than that of all other answers in a set of possible probabilistic answers to the query, and wherein determining the probability that a probabilistic answer is correct requires an unknown number of iterative computations, comprising:
- (a) a memory in which the imprecise data are stored and in which a plurality of machine executable instructions are stored, wherein running each computation includes performing a probabilistic, simulation-based computation on each possible probabilistic answer in the set;
  
  (b) a user input for enabling a user to control the system and provide input data;
  
  (c) an output device for presenting information to a user; and
  
  (d) a processor that is coupled to the memory, the user input, and the output device, the processor executing the machine executable instructions in the memory to carry out a plurality of functions, including:
  
  (i) repetitively running a computation on each possible probabilistic answer in the set for a predefined number of times, to compute an approximation of a lower bound and an upper bound for the probability that the possible probabilistic answer is correct;
  
  (ii) selecting a current critical region between a critical lower bound and a critical upper bound of probability;
  
  (iii) based upon relative values of the approximations of the lower and upper bounds of probability computed for the possible probabilistic answers and the critical lower bound and critical upper bound of the critical region, selecting possible probabilistic answers for repetitively running further computations to determine a further refined lower bound and a further refined upper bound of probability for each possible probabilistic answer selected, wherein the imprecise data includes a probabilistic answer with first data and another probabilistic answer with second data that has a lower computed probability of being accurate than a computed probability of the first data being accurate;
  
  (iv) iteratively repeating functions (ii) and (iii) until refined approximated lower bounds of each of k possible answers are greater than or equal to the upper bound of a current critical region, indicating that said k possible probabilistic answers are the k top-rated probabilistic answers to the query; and
  
  (v) presenting the k top-rated probabilistic answers to a user, wherein the machine executable instructions further cause the processor to:
  
  set the lower critical bound for the current critical region based upon a top_k refined lower bound determined by running the computations on the possible probabilistic answers, where the top_k refined lower bound is a k-th largest refined lower bound of the probabilistic answers; and
  
  set the upper critical bound for the current critical region based upon a top_(k+1) refined upper bound for the possible probabilistic answers, determined by running the computations on the possible probabilistic answers, where the top_(k+1) refined upper bound is a (k+1)-th largest refined upper bound of the probabilistic answers.
- - 19. The system of claim 18, wherein the machine executable instructions further cause the processor to rank the k top-rated probabilistic answers by the probability computed for each probabilistic answer.
  - 20. The system of claim 18, wherein the machine executable instructions further cause the processor to enable a user to terminate iterative repetition of steps (ii) and (iii) at any time, returning an ordered set of top-rated probabilistic answers determined up to that time, without regard to any specified number of probabilistic answers.
  - 21. The system of claim 18, wherein the machine executable instructions further cause the processor to:
    - (a) select each possible probabilistic answer for which the refined approximated lower bound is less than the critical lower bound and the refined approximated upper bound is greater than the critical upper bound of the current critical region; and
      
      if no possible probabilistic answer is selected, then
      
      (b) select each pair of possible probabilistic answers wherein for a first possible probabilistic answer of the pair, the refined approximated lower bound is less than the critical lower bound, and for the second possible probabilistic answer of the pair, the refined approximated upper bound is greater than the critical upper bound of the current critical region; and
      
      if no pair of possible probabilistic answers is thus selected, then
      
      (c) select each possible probabilistic answer for which a range between the refined approximated lower bound and the refined approximated upper bound includes corresponding ranges of all other possible probabilistic answers.

Current Assignee: Washington University In St Louis
Original Assignee: University of Washington
Inventors: Re, Christopher; Suciu, Dan
Primary Examiner(s): Mofiz, Apu M
Assistant Examiner(s): Le, Jessica N
Application Number: US11/935,230
Publication Number: US 20080109428A1
Time in Patent Office: 1,072 Days
Field of Search: None
US Class Current: 707/758
CPC Class Codes: G06F 16/3346 (using probabilistic model)
