Methods and apparatus for ranking uncertain data in a probabilistic database

US 8,825,640 B2
Filed: 03/16/2009
Issued: 09/02/2014
Est. Priority Date: 03/16/2009
Status: Active Grant

Abstract

Methods and apparatus for ranking uncertain data in a probabilistic database are disclosed. An example method disclosed herein comprises using a set of data tuples representing a plurality of possible data set instantiations associated with a respective plurality of instantiation probabilities to store non-deterministic data in a database, each data tuple corresponding to a set of possible data tuple instantiations, each data set instantiation realizable by selecting a respective data tuple instantiation for at least some of the data tuples, the method further comprising determining an expected rank for each data tuple included in at least a subset of the set of data tuples, the expected rank for a particular data tuple representing a combination of weighted component ranks of the particular data tuple, each component rank representing a ranking of the data tuple in a corresponding data set instantiation, each component ranking weighted by a respective instantiation probability.
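
The expected rank described in the abstract can be written compactly. The notation is an assumption introduced for illustration and does not appear in the source: $\mathcal{W}$ is the set of possible data set instantiations, $\Pr[W]$ is the instantiation probability of $W$, and $\operatorname{rank}_W(t)$ is the component rank of data tuple $t$ in $W$.

```latex
r(t) \;=\; \sum_{W \in \mathcal{W}} \Pr[W] \cdot \operatorname{rank}_W(t),
\qquad \text{where} \quad \sum_{W \in \mathcal{W}} \Pr[W] = 1 .
```

Under this reading, a query for the top $k$ tuples would return the $k$ tuples with the smallest values of $r(t)$.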

19 Claims

1. A method implemented by a database server to rank non-deterministic data stored in a database, the method comprising:
- storing a set of data tuples representing a plurality of possible instantiations of the non-deterministic data in a database memory, a first data tuple in the set of data tuples taking on a first one of a set of possible data tuple instantiations selectable according to a first probability representing a likelihood of occurrence of the first one of the set of possible data tuple instantiations, respective ones of the possible instantiations of the non-deterministic data being realized when the database server selects a respective combination of possible data tuple instantiations for a combination of data tuples in the set of data tuples, the respective ones of the possible instantiations of the non-deterministic data being associated with respective instantiation probabilities;
  
  in response to a query requesting a first number of data tuples collectively exhibiting a highest ranking among the set of data tuples, determining, using a processor, a ranking of the set of data tuples for output by the database server by determining an expected rank for the first data tuple, the expected rank for the first data tuple corresponding to a combination of multiple weighted component ranks for the first data tuple across the plurality of possible instantiations of the non-deterministic data, respective ones of the weighted component ranks for the first data tuple comprising a component rank, which represents a respective rank of the first data tuple in a respective one of the possible instantiations of the non-deterministic data, weighted by the respective instantiation probability for the respective one of the possible instantiations of the non-deterministic data; and
  
  outputting a first ranked subset of the data tuples containing the first number of data tuples determined to exhibit the highest ranking among the set of data tuples, wherein the first ranked subset of the data tuples exhibits all of an exactness property, a containment property, a unique ranking property, a value invariance property and a stability property.
- - 2. The method as defined in claim 1 wherein:
    - the first ranked subset of the data tuples exhibits the exactness property when the first ranked subset of the data tuples comprises exactly the first number of data tuples;
      
      the first ranked subset of the data tuples exhibits the containment property when the first ranked subset of the subset of the data tuples is included in a second ranked subset of the subset of the data tuples determined in response to a query requesting a second number of data tuples collectively exhibiting a highest ranking among the set of data tuples, the second number of data tuples being greater than the first number of data tuples;
      
      the first ranked subset of the data tuples exhibits the unique ranking property when each data tuple included in the first ranked subset of the data tuples is assigned a unique rank;
      
      the first ranked subset of the data tuples exhibits the value invariance property when an ordering of the data tuples included in the first ranked subset of the data tuples is to remain unchanged if a first score associated with the first data tuple is to be altered without changing an ordering of all scores associated with all data tuples included in the first ranked subset of the data tuples; and
      
      the first ranked subset of the data tuples exhibits the stability property when increasing a likelihood or an importance of the first data tuple relative to other data tuples included in the first ranked subset of the data tuples does not remove the first data tuple from the first ranked subset of the data tuples.
  - 3. The method as defined in claim 1 wherein each data tuple corresponds to a measurement obtained from a sensor, the measurement having a non-deterministic characteristic.
  - 4. The method as defined in claim 1 further comprising:
    - determining a first rank of the first data tuple, the first rank representative of a first ranking of the first data tuple when realized into a first one of the possible instantiations of the non-deterministic data;
      
      determining a second rank of the first data tuple, the second rank representative of a second ranking of the first data tuple when realized into a second one of the possible instantiations of the non-deterministic data; and
      
      combining the first rank weighted by a first instantiation probability with the second rank weighted by a second instantiation probability to determine the expected rank of the first data tuple across the plurality of possible instantiations of the non-deterministic data instantiations.
  - 5. The method as defined in claim 1 wherein each data tuple is associated with a respective set of scores paired with a respective set of score probabilities forming a respective set of score and score probability pairings, a particular score and score probability pairing representing a particular data tuple instantiation of the respective data tuple, a score in the particular pairing determined by evaluating a scoring function for a value of an uncertain attribute associated with the particular data tuple instantiation and a corresponding score probability in the particular pairing representing how likely the particular data tuple instantiation of the respective data tuple is to occur.
  - 6. The method as defined in claim 5 wherein the first data tuple is associated with a respective first set of scores paired with a respective first set of score probabilities, the method further comprising:
    - sorting a combined score set comprising all sets of scores associated with all data tuples in the set of data tuples to determine a sorted combined score set;
      
      for each score in the sorted combined score set and each data tuple in the set of data tuples, determining a respective comparison probability representing how likely the respective score is exceeded by the respective data tuple;
      
      for each score, summing the respective comparison probabilities for all data tuples to determine a comparison probability sum for the respective score; and
      
      summing the comparison probability sums corresponding to only the first set of scores weighted respectively by the first set of score probabilities to determine the expected rank for the first data tuple.
  - 7. The method as defined in claim 5 further comprising:
    - for each data tuple included in the set of data tuples, determining an expected score for the respective data tuple by combining the respective set of score and score probability pairings associated with the respective data tuple;
      
      selecting data tuples in decreasing order of expected score for inclusion in a second subset of the set of data tuples;
      
      upon each data tuple selection, determining an upper bound for an expected rank for each selected data tuple based on a smallest expected score among all selected data tuples and a size of the second subset of the set of data tuples;
      
      upon each data tuple selection, determining a lower bound for all expected ranks for all unselected data tuples based on the smallest expected score among all selected data tuples and the size of the second subset of the set of data tuples; and
      
      stopping selection of data tuples for inclusion in the second subset of the set of data tuples when the determined lower bound for all expected ranks for all unselected data tuples exceeds the determined upper bound for the expected rank for a particular selected data tuple.
  - 8. The method as defined in claim 7 further comprising:
    - determining a curtailed set of data tuples comprising all selected data tuples but not any unselected data tuples; and
      
      determining an approximate expected rank for each data tuple in the curtailed set of data tuples.
  - 9. The method as defined in claim 1 wherein each data tuple is associated with a respective score and a respective score probability, the respective score determined by evaluating a scoring function for a value of an uncertain attribute associated with the particular data tuple, the respective score probability representing how likely the respective data tuple will be included in the plurality of possible instantiations of the non-deterministic data, the first set of possible data tuple instantiations for the first data tuple including (1) a first possible data tuple instantiation corresponding to the first data tuple being selected for inclusion in one of the possible instantiations of the non-deterministic data, and (2) a second possible data tuple instantiation corresponding to the first data tuple not being selected for inclusion in the one of the possible instantiations of the non-deterministic data.
  - 10. The method as defined in claim 9 further comprising:
    - using a set of exclusion rules to determine which data tuples are included in each of the plurality of possible instantiations of the non-deterministic data, each data tuple being included in only one exclusion rule, each exclusion rule including one or more data tuples, any pair of data tuples occurring in a particular exclusion rule not being included together in any of the possible instantiations of the non-deterministic data;
      
      sorting the set of data tuples in decreasing order of score to determine a sorted set of data tuples;
      
      for each data tuple in the sorted set of data tuples, summing the score probabilities associated with all data tuples ordered before the respective data tuple in decreasing order of score to determine a first quantity;
      
      summing the respective score probabilities associated with all data tuples to determine a second quantity; and
      
      for the first data tuple, combining the first quantity and the second quantity with score probabilities associated with at least some data tuples included with the first data tuple in a first exclusion rule to determine the expected rank for the first data tuple.
  - 11. The method as defined in claim 10 wherein the first data tuple corresponds to a highest score in the sorted set of data tuples, the method further comprising:
    - in decreasing order of score, selecting data tuples from the sorted set of data tuples for inclusion in a second subset of the set of data tuples;
      
      upon data tuple selection, determining an expected rank for each selected data tuple by combining the first quantity and the second quantity with score probabilities associated with at least some of the data tuples included in an exclusion rule also including the selected data tuple;
      
      upon data tuple selection, summing the score probabilities associated with the respective data tuples ordered before the selected data tuple in the sorted set of data tuples to determine a lower bound for all expected ranks for all unselected data tuples in the sorted set of data tuples; and
      
      stopping selection of data tuples for inclusion in the second subset of the set of data tuples when the determined lower bound for all expected ranks for all unselected data tuples exceeds the determined expected rank for a particular selected data tuple.
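
As an illustration of the computation recited in claims 5 and 6, the following Python sketch sorts the combined score set once and derives each expected rank from running probability sums. It rests on assumptions that the claims do not state: data tuples are mutually independent, a rank counts the tuples with a strictly higher score (so the best rank is 0), and a tuple's own contribution to the comparison probability sum is subtracted before weighting. The function name, the dictionary layout, and the `EXAMPLE_SCORES` data are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical data: each tuple maps to (score, score probability) pairings
# whose probabilities sum to 1.
EXAMPLE_SCORES = {
    "t1": [(100, 0.4), (70, 0.6)],
    "t2": [(92, 0.6), (80, 0.4)],
    "t3": [(85, 1.0)],
}


def expected_ranks_attribute_level(tuples):
    """Return {name: expected rank}; assumes independent tuples, rank 0 is best."""
    # Sorted combined score set (ascending) with running probability mass.
    pairs = sorted(pair for alts in tuples.values() for pair in alts)
    scores = [s for s, _ in pairs]
    prefix = [0.0] + list(accumulate(p for _, p in pairs))
    total = prefix[-1]

    def mass_above(v):
        # Comparison probability sum: over all tuples, Pr[tuple's score > v].
        return total - prefix[bisect_right(scores, v)]

    expected = {}
    for name, alts in tuples.items():
        rank = 0.0
        for v, p in alts:
            # Assumption: remove the tuple's own chance of exceeding v,
            # since a tuple does not outrank itself.
            own = sum(q for s, q in alts if s > v)
            rank += p * (mass_above(v) - own)
        expected[name] = rank
    return expected
```

For the hypothetical data above, the expected ranks come out at approximately 1.2 for `t1`, 0.8 for `t2`, and 1.0 for `t3`, so a top-1 query would return `t2` under these assumptions.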

12. A tangible computer readable memory comprising computer-readable instructions, which, when executed, cause a database server to perform operations comprising:
- storing a set of data tuples representing a plurality of possible instantiations of the non-deterministic data in a database memory, a first data tuple in the set of data tuples taking on a first one of a set of possible data tuple instantiations selectable according to a first probability representing a likelihood of occurrence of the first one of the set of possible data tuple instantiations, respective ones of the possible instantiations of the non-deterministic data being realized upon selection of a respective combination of possible data tuple instantiations for a combination of data tuples in the set of data tuples, the respective ones of the possible instantiations of the non-deterministic data being associated with respective instantiation probabilities;
  
  determining a ranking of the set of data tuples for output by the database server in response to a query requesting a first number of data tuples collectively exhibiting a highest ranking among the set of data tuples, wherein the determining the ranking comprises determining an expected rank for the first data tuple, the expected rank for the first data tuple corresponding to a combination of multiple weighted component ranks for the first data tuple across the plurality of possible instantiations of the non-deterministic data, respective ones of the weighted component ranks for the first data tuple comprising a component rank, which represents a respective rank of the first data tuple in a respective one of the possible instantiations of the non-deterministic data, weighted by the respective instantiation probability for the respective one of the possible instantiations of the non-deterministic data; and
  
  outputting a first ranked subset of the data tuples containing the first number of data tuples determined to exhibit the highest ranking among the set of data tuples, wherein the first ranked subset of the data tuples exhibits all of an exactness property, a containment property, a unique ranking property, a value invariance property and a stability property.
- - 13. The tangible computer readable memory as defined in claim 12 wherein:
    - the first ranked subset of the data tuples exhibits the exactness property when the first ranked subset of the data tuples comprises exactly the first number of data tuples;
      
      the first ranked subset of the data tuples exhibits the containment property when the first ranked subset of the subset of the data tuples is included in a second ranked subset of the subset of the data tuples determined in response to a query requesting a second number of data tuples collectively exhibiting a highest ranking among the set of data tuples, the second number of data tuples being greater than the first number of data tuples;
      
      the first ranked subset of the data tuples exhibits the unique ranking property when each data tuple included in the first ranked subset of the data tuples is assigned a unique rank;
      
      the first ranked subset of the data tuples exhibits the value invariance property when an ordering of the data tuples included in the first ranked subset of the data tuples is to remain unchanged if a first score associated with the first data tuple is to be altered without changing an ordering of all scores associated with all data tuples included in the first ranked subset of the data tuples; and
      
      the first ranked subset of the data tuples exhibits the stability property when increasing a likelihood or an importance of the first data tuple relative to other data tuples included in the first ranked subset of the data tuples does not remove the first data tuple from the first ranked subset of the data tuples.
  - 14. The tangible computer readable memory as defined in claim 12 wherein each data tuple is associated with a respective set of scores paired with a respective set of score probabilities forming a respective set of score and score probability pairings, a particular score and score probability pairing representing a particular data tuple instantiation of the respective data tuple, a score in the particular pairing determined by evaluating a scoring function for a value of an uncertain attribute associated with the particular data tuple instantiation and a corresponding score probability in the particular pairing representing how likely the particular data tuple instantiation of the respective data tuple is to occur, and wherein the operations further comprise:
    - selecting a respective first score and a respective first score probability for each data tuple to realize a first one of the possible instantiations of the non-deterministic data;
      
      selecting a respective second score and a respective second score probability for each data tuple to realize a second one of the possible instantiations of the non-deterministic data;
      
      combining the respective first probabilities for all data tuples in the set of data tuples to determine a first instantiation probability;
      
      combining the respective second probabilities for all data tuples in the set of data tuples to determine a second instantiation probability;
      
      determining a first rank of the first data tuple, the first rank representative of a first ranking of the first data tuple when realized into the first one of the possible instantiations of the non-deterministic data;
      
      determining a second rank of the first data tuple, the second rank representative of a second ranking of the first data tuple when realized into the second one of the possible instantiations of the non-deterministic data; and
      
      combining the first rank weighted by the first instantiation probability with the second rank weighted by the second instantiation probability to determine the expected rank of the first data tuple across the plurality of possible instantiations of the non-deterministic data.
  - 15. The tangible computer readable memory as defined in claim 12 wherein each data tuple is associated with a respective score and a respective score probability, the respective score determined by evaluating a scoring function for a value of an uncertain attribute associated with the particular data tuple, the respective score probability representing how likely the respective data tuple will be included in the plurality of possible instantiations of the non-deterministic data, the first set of possible data tuple instantiations for the first data tuple including (1) a first possible data tuple instantiation corresponding to the first data tuple being selected for inclusion in one of the possible instantiations of the non-deterministic data, and (2) a second possible data tuple instantiation corresponding to the first data tuple not being selected for inclusion in the one of the possible instantiations of the non-deterministic data, and wherein operations further comprise:
    - using a set of exclusion rules to determine which data tuples are included in each of the plurality of possible instantiations of the non-deterministic data, each data tuple being included in only one exclusion rule, each exclusion rule including one or more data tuples, any pair of data tuples occurring in a particular exclusion rule not being included together in any of the possible instantiations of the non-deterministic data;
      
      for the first data tuple, summing the score probabilities associated with the data tuples ordered after the first data tuple in decreasing order of score to determine a first quantity;
      
      summing the respective score probabilities associated with all data tuples to determine a second quantity; and
      
      for the first data tuple, combining the first quantity and the second quantity with score probabilities associated with at least some data tuples included with the first data tuple in a first exclusion rule to determine the expected rank for the first data tuple.
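
The steps of claim 14 (select one score per tuple to realize an instantiation, combine the selected probabilities, rank, then weight) can be sketched as a brute-force enumeration. This is a minimal sketch, assuming that tuples are independent so that "combining" the probabilities means multiplying them, and that a rank counts the tuples with a strictly higher score; neither assumption is stated in the claim. The function names and `EXAMPLE_WORLDS` data are hypothetical, and the approach is exponential in the number of tuples, so it is only suitable for checking small cases.

```python
from itertools import product
from math import prod

# Hypothetical data: each tuple maps to (score, score probability) pairings.
EXAMPLE_WORLDS = {
    "t1": [(100, 0.4), (70, 0.6)],
    "t2": [(92, 0.6), (80, 0.4)],
    "t3": [(85, 1.0)],
}


def expected_ranks_brute_force(tuples):
    """Enumerate every possible instantiation and average the ranks."""
    names = list(tuples)
    expected = {name: 0.0 for name in names}
    # Each instantiation picks exactly one (score, probability) pair per tuple.
    for world in product(*(tuples[name] for name in names)):
        # Assumption: independence, so the instantiation probability is a product.
        world_prob = prod(p for _, p in world)
        scores = [s for s, _ in world]
        for i, name in enumerate(names):
            # Component rank: number of tuples scoring strictly higher here.
            rank = sum(1 for s in scores if s > scores[i])
            expected[name] += world_prob * rank
    return expected


def top_k(expected, k):
    """Return the k names with the smallest expected rank (ties broken by name)."""
    return sorted(expected, key=lambda name: (expected[name], name))[:k]
```

With the hypothetical data, `top_k(expected_ranks_brute_force(EXAMPLE_WORLDS), 2)` yields `t2` followed by `t3`.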

16. A database server for use in ranking non-deterministic data, the database server comprising:
- a memory having machine readable instructions stored thereon;
  
  a probabilistic database to store a set of data tuples representing a plurality of possible instantiations of the non-deterministic data, a first data tuple in the set of data tuples taking on a first one of a set of possible data tuple instantiations selectable according to a first probability representing a likelihood of occurrence of the first one of the set of possible data tuple instantiations, respective ones of the possible instantiations of the non-deterministic data being realized upon selection of a respective combination of possible data tuple instantiations for a combination of data tuples in the set of data tuples, the respective ones of the possible instantiations of the non-deterministic data being associated with respective instantiation probabilities; and
  
  an expected ranking processor to execute the machine readable instructions to perform operations comprising:
  
  determining a ranking of the set of data tuples for output by the database server in response to a query requesting a first number of data tuples collectively exhibiting a highest ranking among the set of data tuples, wherein the determining the ranking comprises determining an expected rank for the first data tuple, the expected rank for the first data tuple corresponding to a combination of multiple weighted component ranks for the first data tuple across the plurality of possible instantiations of the non-deterministic data, respective ones of the weighted component ranks for the first data tuple comprising a component rank, which represents a respective rank of the first data tuple in a respective one of the possible instantiations of the non-deterministic data, weighted by the respective instantiation probability for the respective one of the possible instantiations of the non-deterministic data; and
  
  outputting a first ranked subset of the data tuples containing the first number of data tuples determined to exhibit the highest ranking among the set of data tuples, wherein the first ranked subset of the data tuples exhibits all of an exactness property, a containment property, a unique ranking property, a value invariance property and a stability property.
- - 17. The database server as defined in claim 16 wherein:
    - the first ranked subset of the data tuples exhibits the exactness property when the first ranked subset of the data tuples comprises exactly the first number of data tuples;
      
      the first ranked subset of the data tuples exhibits the containment property when the first ranked subset of the subset of the data tuples is included in a second ranked subset of the subset of the data tuples determined in response to a query requesting a second number of data tuples collectively exhibiting a highest ranking among the set of data tuples, the second number of data tuples being greater than the first number of data tuples;
      
      the first ranked subset of the data tuples exhibits the unique ranking property when each data tuple included in the first ranked subset of the data tuples is assigned a unique rank;
      
      the first ranked subset of the data tuples exhibits the value invariance property when an ordering of the data tuples included in the first ranked subset of the data tuples is to remain unchanged if a first score associated with the first data tuple is to be altered without changing an ordering of all scores associated with all data tuples included in the first ranked subset of the data tuples; and
      
      the first ranked subset of the data tuples exhibits the stability property when increasing a likelihood or an importance of the first data tuple relative to other data tuples included in the first ranked subset of the data tuples does not remove the first data tuple from the first ranked subset of the data tuples.
  - 18. The database server as defined in claim 16 wherein each data tuple is associated with a respective set of scores paired with a respective set of score probabilities forming a respective set of score and score probability pairings, a particular score and score probability pairing representing a particular data tuple instantiation of the respective data tuple, a score in the particular pairing determined by evaluating a scoring function for a value of an uncertain attribute associated with the particular data tuple instantiation and a corresponding score probability in the particular pairing representing how likely the particular data tuple instantiation of the respective data tuple is to occur, and wherein the operations further comprise:
    - selecting a respective first score and a respective first score probability for each data tuple to realize a first one of the possible instantiations of the non-deterministic data;
      
      selecting a respective second score and a respective second score probability for each data tuple to realize a second one of the possible instantiations of the non-deterministic data;
      
      combining the respective first probabilities for all data tuples in the set of data tuples to determine a first instantiation probability;
      
      combining the respective second probabilities for all data tuples in the set of data tuples to determine a second instantiation probability;
      
      determining a first rank of the first data tuple, the first rank representative of a first ranking of the first data tuple when realized into the first one of the possible instantiations of the non-deterministic data;
      
      determining a second rank of the first data tuple, the second rank representative of a second ranking of the first data tuple when realized into the second one of the possible instantiations of the non-deterministic data; and
      
      combining the first rank weighted by the first instantiation probability with the second rank weighted by the second instantiation probability to determine the expected rank of the first data tuple across the plurality of possible instantiations of the non-deterministic data.
  - 19. The database server as defined in claim 16 wherein each data tuple is associated with a respective score and a respective score probability, the respective score determined by evaluating a scoring function for a value of an uncertain attribute associated with the particular data tuple, the respective score probability representing how likely the respective data tuple will be included in the plurality of possible instantiations of the non-deterministic data, the first set of possible data tuple instantiations for the first data tuple including (1) a first possible data tuple instantiation corresponding to the first data tuple being selected for inclusion in one of the possible instantiations of the non-deterministic data, and (2) a second possible data tuple instantiation corresponding to the first data tuple not being selected for inclusion in the one of the possible instantiations of the non-deterministic data, and wherein operations further comprise:
    - using a set of exclusion rules to determine which data tuples are included in each of the possible instantiations of the non-deterministic data, each data tuple being included in only one exclusion rule, each exclusion rule including one or more data tuples, any pair of data tuples occurring in a particular exclusion rule not being included together in any of the possible instantiations of the non-deterministic data;
      
      for the first data tuple, summing the score probabilities associated with the data tuples ordered after the first data tuple in decreasing order of score to determine a first quantity;
      
      summing the respective score probabilities associated with all data tuples to determine a second quantity; and
      
      for the first data tuple, combining the first quantity and the second quantity with score probabilities associated with at least some data tuples included with the first data tuple in a first exclusion rule to determine the expected rank for the first data tuple.
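
The exclusion-rule computation in claims 10, 15 and 19 combines a per-tuple probability sum, the total probability mass, and the probabilities of tuples sharing an exclusion rule. The claims do not give a closed form, so the Python sketch below uses one possible formula under loudly stated assumptions: scores are distinct, tuples in different rules are independent, a present tuple's rank is the number of present tuples with a higher score, and an absent tuple's rank is the number of tuples present in that instantiation. The sketch accumulates the mass of higher-scoring tuples; the mass of lower-scoring tuples can be derived from it and the total. All names and the `EXAMPLE_RULES` data are hypothetical.

```python
# Hypothetical data: name -> (score, probability of being included).
EXAMPLE_TUPLES = {"t1": (100, 0.4), "t2": (92, 0.5), "t3": (80, 1.0), "t4": (70, 0.5)}
# Each tuple appears in exactly one exclusion rule; t2 and t4 never co-occur.
EXAMPLE_RULES = [["t1"], ["t2", "t4"], ["t3"]]


def expected_ranks_tuple_level(tuples, rules):
    """Return {name: expected rank} under the assumptions listed above."""
    rule_of = {name: idx for idx, rule in enumerate(rules) for name in rule}
    rule_mass = [sum(tuples[name][1] for name in rule) for rule in rules]
    total = sum(p for _, p in tuples.values())  # expected instantiation size

    # Sorted set of tuples in decreasing order of score.
    order = sorted(tuples, key=lambda name: -tuples[name][0])
    before = 0.0                        # mass of tuples ordered before the current one
    before_in_rule = [0.0] * len(rules)  # same, restricted to each rule
    expected = {}
    for name in order:
        _, p = tuples[name]
        idx = rule_of[name]
        higher_outside = before - before_in_rule[idx]  # can outrank when present
        same_rule_others = rule_mass[idx] - p          # their presence implies absence
        outside_rule = total - rule_mass[idx]          # independent of this tuple
        expected[name] = (
            p * higher_outside + same_rule_others + (1 - p) * outside_rule
        )
        before += p
        before_in_rule[idx] += p
    return expected
```

For the hypothetical data, the expected ranks are approximately 1.2, 1.4, 0.9 and 1.9 for `t1` through `t4`, so `t3` would rank first even though its score is not the highest, because it is certain to be present.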

Current Assignee: AT&T Intellectual Property I LP (AT&T, Inc.)
Original Assignee: AT&T Intellectual Property I LP (AT&T, Inc.)
Inventors: Cormode, Graham; Li, Feifei; Yi, Ke
Primary Examiner(s): Shmatov, Alexey
Assistant Examiner(s): Cheung, Hubert G
Application Number: US12/404,906
Publication Number: US 20100235362A1
Time in Patent Office: 1,996 Days
US Class Current: 707/723
CPC Class Codes:
G06F 16/20 of structured data, e.g. re...
G06F 16/24578 using ranking
