EFFICIENT TOP-K QUERY EVALUATION ON PROBABILISTIC DATA

US 20080109428A1
Filed: 11/05/2007
Published: 05/08/2008
Est. Priority Date: 11/07/2006
Status: Active Grant

Abstract

A novel approach that computes and efficiently ranks the top-k answers to a query on a probabilistic database. The approach identifies the top-k answers, since imprecisions in the data often lead to a large number of answers of low quality. The algorithm is used to run several Monte Carlo simulations in parallel, one for each candidate answer, and approximates the probability of each only to the extent needed to correctly determine the top-k answers. The algorithm is provably optimal and scales to large databases. A more general application can identify a number of top-rated entities of a group that satisfy a condition, based on a criteria or score computed for the entities. Also disclosed are several optimization techniques. One option is to rank the top-rated results; another option provides for interrupting the iteration to return the number of top-rated entities that have thus far been identified.

25 Claims

1. A method for efficiently automatically determining a number of top-rated entities selected from a group of entities to satisfy a condition, wherein the top-rated entities are rated on a criteria that is computed for a set of entities that may satisfy the condition, comprising the steps of:
- (a) determining an initial range of criteria for each entity in the set of entities;
  
  (b) computing a current critical range of criteria, based upon the ranges of criteria that were determined for each entity;
  
  (c) selecting a subset of entities from the set on which to run further iterative computations to determine a refined range of criteria for each entity of the subset of entities, wherein selection of entities to be included in the subset is based upon the range of criteria previously determined for the entities;
  
  (d) repeating steps (b) and (c) until a current critical range does not include any portion of a refined range of criteria for any of the entities in the subset, the number of entities that are above the current critical range then comprising the number of top-rated entities; and
  
  (e) presenting the number of top-rated entities to a user.
- Dependent claims (2, 3, 4, 5, 6):
  - 2. The method of claim 1, wherein the step of computing the current critical range of criteria comprises the steps of:
    - (a) setting a lower critical bound for the current critical range of criteria based upon a top_k refined lower bound, determined by running the computations on the entities, where the top_k refined lower bound is a k-th largest refined lower bound of the entities; and
      
      (b) setting an upper critical bound for the current critical range based upon a top_{k+1} refined upper bound for the entities, determined by running the computations on the entities, where the top_{k+1} refined upper bound is a (k+1)-th largest refined upper bound of the entities.
  - 3. The method of claim 1, further comprising the step of ranking the number of top-rated entities by the range of criteria computed for each.
  - 4. The method of claim 1, further comprising the step of enabling a user to terminate iterative repetition of steps (b) and (c) at any time, returning an ordered set of top-rated entities determined up to that time, without regard to any specified number of entities.
  - 5. The method of claim 1, wherein the step of selecting the subset of entities for repetitively running the computations comprises the steps of:
    - (a) selecting each entity for which a lower bound of the refined criteria is less than a critical lower bound of the current critical range of criteria and an upper bound of the refined criteria is greater than a critical upper bound of the current critical range of criteria; and, if no entity is selected, then
      
      (b) selecting each pair of entities, wherein for a first entity of the pair, the lower bound of the refined criteria is less than the critical lower bound, and for the second entity of the pair, the upper bound of the refined criteria is greater than the critical upper bound of the current critical range of criteria; and, if no pair of entities is thus selected, then
      
      (c) selecting each entity for which a range between the lower bound of the refined criteria and the upper bound of the refined criteria includes corresponding ranges of all other entities.
  - 6. The method of claim 1, further comprising the step of initially reducing an extent of the critical range of criteria before iteratively running the computations repetitively on each entity in the subset, by statically evaluating groups of the entities.
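
As an illustration of the critical range recited in claim 2 and the stopping condition of step (d) of claim 1, the following Python sketch may help. It is not taken from the patent: the function names `critical_range` and `top_k_if_separated` are hypothetical, each entity's range is assumed to be a `(lower, upper)` pair, and the set is assumed to hold more than k entities. The stopping test used here (the k-th largest lower bound is at least the (k+1)-th largest upper bound) is one reading of the claim language, chosen because it means k entities have lower bounds at or above the critical upper bound.

```python
def critical_range(intervals, k):
    """Return (c, d): c is the k-th largest lower bound and
    d is the (k+1)-th largest upper bound.

    intervals: dict mapping an entity name to a (lower, upper) pair.
    Assumes len(intervals) > k (hypothetical precondition).
    """
    lows = sorted((lo for lo, _ in intervals.values()), reverse=True)
    highs = sorted((hi for _, hi in intervals.values()), reverse=True)
    return lows[k - 1], highs[k]


def top_k_if_separated(intervals, k):
    """Return the k entities with the largest lower bounds once the
    critical range has collapsed (c >= d); otherwise return None."""
    c, d = critical_range(intervals, k)
    if c < d:
        return None  # ranges still overlap the critical range; refine further
    ranked = sorted(intervals, key=lambda name: intervals[name][0], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    ranges = {"a": (0.70, 0.90), "b": (0.55, 0.65),
              "c": (0.20, 0.50), "d": (0.10, 0.30)}
    print(critical_range(ranges, 2))      # prints (0.55, 0.5)
    print(top_k_if_separated(ranges, 2))  # prints ['a', 'b']
```

In the example, the second-largest lower bound (0.55) already exceeds the third-largest upper bound (0.50), so no further refinement of any range is needed to name the top two entities.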

7. A system for efficiently automatically determining a number of top-rated entities selected from a group of entities to satisfy a condition, wherein the top-rated entities are rated on a criteria that is computed for a set of entities that may satisfy the condition, comprising:
- (a) a memory in which the group of entities are stored and in which a plurality of machine executable instructions are stored;
  
  (b) a user input for enabling a user to control the system and provide input data;
  
  (c) an output device for presenting information to a user; and
  
  (d) a processor that is coupled to the memory, the user input, and the output device, the processor executing the machine executable instructions in the memory to carry out a plurality of functions, including;
  
  (i) determining an initial range of criteria for each entity in the set of entities;
  
  (ii) computing a current critical range of criteria, based upon the ranges of criteria that were determined for each entity;
  
  (iii) selecting a subset of entities from the set on which to run further iterative computations to determine a refined range of criteria for each entity of the subset of entities, wherein selection of entities to be included in the subset is based upon the range of criteria previously determined for the entities;
  
  (iv) repeating functions (ii) and (iii) until a current critical range does not include any portion of a refined range of criteria for any of the entities in the subset, the number of entities that are above the current critical range then comprising the number of top-rated entities; and
  
  (v) presenting the number of top-rated entities to a user with the output device.
- Dependent claims (8, 9, 10, 11, 12):
  - 8. The system of claim 7, wherein the machine executable instructions further cause the processor to:
    - (a) set a lower critical bound for the current critical range of criteria based upon a top_k refined lower bound, determined by running the computations on the entities, where the top_k refined lower bound is a k-th largest refined lower bound of the entities; and
      
      (b) set an upper critical bound for the current critical range based upon a top_{k+1} refined upper bound for the entities, determined by running the computations on the entities, where the top_{k+1} refined upper bound is a (k+1)-th largest refined upper bound of the entities.
  - 9. The system of claim 7, wherein the machine executable instructions further cause the processor to rank the number of top-rated entities by the range of criteria computed for each.
  - 10. The system of claim 7, wherein the machine executable instructions further cause the processor to enable a user to terminate iterative repetition of steps (b) and (c) at any time, returning an ordered set of top-rated entities determined up to that time, without regard to any specified number of entities.
  - 11. The system of claim 7, wherein the machine executable instructions further cause the processor to select the subset of entities by:
    - (a) selecting each entity for which a lower bound of the refined criteria is less than a critical lower bound of the current critical range of criteria and an upper bound of the refined criteria is greater than a critical upper bound of the current critical range of criteria; and, if no entity is selected, then
      
      (b) selecting each pair of entities, wherein for a first entity of the pair, the lower bound of the refined criteria is less than the critical lower bound, and for the second entity of the pair, the upper bound of the refined criteria is greater than the critical upper bound of the current critical range of criteria; and, if no pair of entities is thus selected, then
      
      (c) selecting each entity for which a range between the lower bound of the refined criteria and the upper bound of the refined criteria includes corresponding ranges of all other entities.
  - 12. The system of claim 7, wherein the machine executable instructions further cause the processor to initially reduce an extent of the critical range of criteria before iteratively running the computations repetitively on each entity in the subset, by statically evaluating groups of the entities.
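
The three-tier selection recited in claim 11 (and in method claim 5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the name `select_for_refinement` is hypothetical, ranges are `(lower, upper)` pairs, and the critical bounds `c` and `d` are passed in by the caller. In tier (b), the extra requirement that a selected range actually overlaps the critical range is an assumption added here so that entities already resolved are not refined again; the claim language itself states only the `lower < c` and `upper > d` conditions.

```python
def select_for_refinement(intervals, c, d):
    """Pick the entities whose ranges should be refined next.

    intervals: dict mapping an entity name to a (lower, upper) pair.
    c, d: critical lower and upper bounds of the current critical range.
    """
    # Tier (a): ranges that extend past both ends of the critical range.
    both = [n for n, (lo, hi) in intervals.items() if lo < c and hi > d]
    if both:
        return both

    # Tier (b): pair ranges reaching below c with ranges reaching above d.
    # The overlap checks (hi > c, lo < d) are an added assumption.
    below = [n for n, (lo, hi) in intervals.items() if lo < c and hi > c]
    above = [n for n, (lo, hi) in intervals.items() if hi > d and lo < d]
    if below and above:
        return below + above

    # Tier (c): ranges that include the ranges of all other entities.
    min_lo = min(lo for lo, _ in intervals.values())
    max_hi = max(hi for _, hi in intervals.values())
    return [n for n, (lo, hi) in intervals.items()
            if lo <= min_lo and hi >= max_hi]


if __name__ == "__main__":
    # c and d below are the k = 1 critical bounds for these ranges.
    ranges = {"x": (0.30, 0.50), "y": (0.45, 0.80), "z": (0.05, 0.10)}
    print(select_for_refinement(ranges, 0.45, 0.50))  # prints ['x', 'y']
```

Falling through the tiers in this order means a single computation step is spent first on ranges that can shrink the critical range from both sides, and only on wider fallbacks when no such range exists.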

13. A method for efficiently determining a number k of top-rated answers in response to a query of a database that includes imprecise data, so that each top-rated answer is associated with a probability that the answer is correct that is greater than that of all other answers in a set of possible answers to the query, and wherein determining the probability that an answer is correct requires an unknown number of iterative computations, comprising the steps of:
- (a) repetitively running a computation on each possible answer in the set for a predefined number of times, to compute an approximation of a lower bound and an upper bound for the probability that the possible answer is correct;
  
  (b) selecting a current critical region between a critical lower bound and a critical upper bound of probability;
  
  (c) based upon relative values of the approximations of the lower and upper bounds of probability computed for the possible answers and the critical lower bound and critical upper bound of the critical region, selecting possible answers for repetitively running further computations to determine a further refined lower bound and a further refined upper bound of probability for each possible answer selected;
  
  (d) iteratively repeating steps (b) and (c) until refined approximated lower bounds of each of k possible answers are greater than or equal to the upper bound of a current critical region, indicating that said k possible answers are the k top-rated answers to the query; and
  
  (e) presenting the k top-rated answers to a user.
- Dependent claims (14, 15, 16, 17, 18, 19, 20):
  - 14. The method of claim 13, wherein the step of selecting the current critical region comprises the steps of:
    - (a) setting the lower critical bound for the current critical region based upon a top_k refined lower bound determined by running the computations on the possible answers, where the top_k refined lower bound is a k-th largest refined lower bound of the answers; and
      
      (b) setting the upper critical bound for the current critical region based upon a top_{k+1} refined upper bound for the possible answers, determined by running the computations on the possible answers, where the top_{k+1} refined upper bound is a (k+1)-th largest refined upper bound of the answers.
  - 15. The method of claim 13, further comprising the step of ranking the k top-rated answers by the probability computed for each.
  - 16. The method of claim 13, further comprising the step of enabling a user to terminate iterative repetition of steps (b) and (c) at any time, returning an ordered set of top-rated answers determined up to that time, without regard to any specified number of answers.
  - 17. The method of claim 13, wherein the step of selecting possible answers for repetitively running the computations comprises the steps of:
    - (a) selecting each possible answer for which the refined approximated lower bound is less than the critical lower bound and the refined approximated upper bound is greater than the critical upper bound of the current critical region; and, if no possible answer is selected, then
      
      (b) selecting each pair of possible answers wherein for a first possible answer of the pair, the refined approximated lower bound is less than the critical lower bound, and for the second possible answer of the pair, the refined approximated upper bound is greater than the critical upper bound of the current critical region; and, if no pair of possible answers is thus selected, then
      
      (c) selecting each possible answer for which a range between the refined approximated lower bound and the refined approximated upper bound includes corresponding ranges of all other possible answers.
  - 18. The method of claim 13, wherein the step of repetitively running the computation comprises the steps of:
    - (a) for each time the computation is run, randomly selecting a possible world for a possible answer;
      
      (b) for each selected possible world, computing a truth value of a Boolean expression corresponding to the possible answer;
      
      (c) determining a frequency with which the Boolean expression is true as a function of the number of times preceding steps (a)-(b) have been run;
      
      (d) determining a probability that each possible answer is correct based upon the frequency; and
      
      (e) determining the approximated lower and upper bounds for the probability that each possible answer is correct.
  - 19. The method of claim 13, further comprising the step of initially reducing a range between the critical lower bound and the critical upper bound of the critical region before running the computations repetitively on each possible answer by a static evaluation of groups of possible answers.
  - 20. The method of claim 13, wherein the step of determining the approximate lower bound and approximate upper bound are carried out by a query engine.
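
The sampling steps of claim 18 can be illustrated with a short Python sketch. Several things here are assumptions rather than statements from the claims: the names `sample_world` and `estimate_bounds` are hypothetical, events are treated as independent with known marginal probabilities, the Boolean expression is supplied as a Python callable over a sampled world, and the lower and upper bounds are derived from a Hoeffding-style confidence interval with failure probability `delta`. The claims do not say how the bounds are computed, so this is a plain frequency estimator swapped in for illustration.

```python
import math
import random


def sample_world(event_probs, rng):
    """Step (a): draw one possible world, i.e. a truth value per event."""
    return {event: rng.random() < p for event, p in event_probs.items()}


def estimate_bounds(expr, event_probs, n_runs, delta=0.05, seed=0):
    """Steps (b) to (e): evaluate expr on n_runs sampled worlds and
    return (lower, frequency, upper) for the probability that expr is true.

    The interval half-width uses the Hoeffding bound (an assumption).
    """
    rng = random.Random(seed)
    hits = sum(bool(expr(sample_world(event_probs, rng))) for _ in range(n_runs))
    freq = hits / n_runs
    eps = math.sqrt(math.log(2 / delta) / (2 * n_runs))
    return max(0.0, freq - eps), freq, min(1.0, freq + eps)


if __name__ == "__main__":
    probs = {"x": 0.5, "y": 0.4}              # hypothetical event probabilities
    answer = lambda world: world["x"] or world["y"]
    print(estimate_bounds(answer, probs, n_runs=20000))
```

For the example expression the exact probability is 1 - 0.5 * 0.6 = 0.7, and the interval narrows in proportion to the inverse square root of `n_runs`, which is why refining only the answers near the critical region saves work.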

21. A system for efficiently determining a number k of top-rated answers in response to a query of a database that includes imprecise data, so that each top-rated answer is associated with a probability that the answer is correct that is greater than that of all other answers in a set of possible answers to the query, and wherein determining the probability that an answer is correct requires an unknown number of iterative computations, comprising:
- (a) a memory in which the imprecise data are stored and in which a plurality of machine executable instructions are stored;
  
  (b) a user input for enabling a user to control the system and provide input data;
  
  (c) an output device for presenting information to a user; and
  
  (d) a processor that is coupled to the memory, the user input, and the output device, the processor executing the machine executable instructions in the memory to carry out a plurality of functions, including;
  
  (i) repetitively running a computation on each possible answer in the set for a predefined number of times, to compute an approximation of a lower bound and an upper bound for the probability that the possible answer is correct;
  
  (ii) selecting a current critical region between a critical lower bound and a critical upper bound of probability;
  
  (iii) based upon relative values of the approximations of the lower and upper bounds of probability computed for the possible answers and the critical lower bound and critical upper bound of the critical region, selecting possible answers for repetitively running further computations to determine a further refined lower bound and a further refined upper bound of probability for each possible answer selected;
  
  (iv) iteratively repeating functions (ii) and (iii) until refined approximated lower bounds of each of k possible answers are greater than or equal to the upper bound of a current critical region, indicating that said k possible answers are the k top-rated answers to the query; and
  
  (v) presenting the k top-rated answers to a user.
- Dependent claims (22, 23, 24, 25):
  - 22. The system of claim 21, wherein the machine executable instructions further cause the processor to:
    - (a) set the lower critical bound for the current critical region based upon a top_k refined lower bound determined by running the computations on the possible answers, where the top_k refined lower bound is a k-th largest refined lower bound of the answers; and
      
      (b) setting the upper critical bound for the current critical region based upon a top_{k+1} refined upper bound for the possible answers, determined by running the computations on the possible answers, where the top_{k+1} refined upper bound is a (k+1)-th largest refined upper bound of the answers.
  - 23. The system of claim 21, wherein the machine executable instructions further cause the processor to rank the k top-rated answers by the probability computed for each.
  - 24. The system of claim 21, wherein the machine executable instructions further cause the processor to enable a user to terminate iterative repetition of steps (ii) and (iii) at any time, returning an ordered set of top-rated answers determined up to that time, without regard to any specified number of answers.
  - 25. The system of claim 21, wherein the machine executable instructions further cause the processor to:
    - (a) select each possible answer for which the refined approximated lower bound is less than the critical lower bound and the refined approximated upper bound is greater than the critical upper bound of the current critical region; and, if no possible answer is selected, then
      
      (b) select each pair of possible answers wherein for a first possible answer of the pair, the refined approximated lower bound is less than the critical lower bound, and for the second possible answer of the pair, the refined approximated upper bound is greater than the critical upper bound of the current critical region; and, if no pair of possible answers is thus selected, then
      
      (c) select each possible answer for which a range between the refined approximated lower bound and the refined approximated upper bound includes corresponding ranges of all other possible answers.

Current Assignee: Washington University In St Louis
Original Assignee: University of Washington
Inventors: Re, Christopher; Suciu, Dan
Granted Patent: US 7,814,113 B2
US Class Current: 707/5
CPC Class Codes: G06F 16/3346 (using probabilistic model)
