Method and apparatus for substring selectivity estimation
First Claim
Patent Images
1. A method for determining an estimate for string-occurrence probability in a database for a string, the method comprising:
- (a) receiving a first probability of occurrence for each maximal substring from a plurality of substrings, each maximal substring in the plurality of substrings belonging to the string;
(b) obtaining an overall probability of occurrence;
(c) receiving a probability of occurrence for a maximal overlap of each maximal substring in the plurality of maximal substrings;
(d) obtaining a normalization factor; and
(e) dividing the overall probability of occurrence by the normalization factor to obtain the estimate.
1 Assignment
0 Petitions
Accused Products
Abstract
A method for estimating string-occurrence probability in a database comprises receiving a first probability of occurrence for each maximal substring from a plurality of substrings, each maximal substring in the plurality of substrings belonging to the string; obtaining an overall probability of occurrence; receiving a probability of occurrence for a maximal overlap of each maximal substring in the plurality of maximal substrings; obtaining a normalization factor; and dividing the overall probability of occurrence by the normalization factor to obtain the estimate.
-
Citations
35 Claims
-
1. A method for determining an estimate for string-occurrence probability in a database for a string, the method comprising:
-
(a) receiving a first probability of occurrence for each maximal substring from a plurality of substrings, each maximal substring in the plurality of substrings belonging to the string;
(b) obtaining an overall probability of occurrence;
(c) receiving a probability of occurrence for a maximal overlap of each maximal substring in the plurality of maximal substrings;
(d) obtaining a normalization factor; and
(e) dividing the overall probability of occurrence by the normalization factor to obtain the estimate. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
(f) using the obtained estimate for query-cost determination. -
3. The method of claim 1, wherein the overall probability of occurrence is obtained by multiplying together every probability of occurrence received in (a), and the normalization factor is obtained by multiplying together the probabilities received in (c).
-
4. The method of claim 3, wherein the database contains a number of strings, and the first probability of occurrence is defined as the number of strings in the database that contain the string, divided by a root count.
-
5. The method of claim 3, wherein the database is a pruned suffix tree.
-
6. The method of claim 3, further comprising the steps of:
-
(f) obtaining an upper limit on the string occurrence probability in a database for the string; and
(g) returning, as a new result, the smaller of the upper limit obtained in step (f) and the estimate obtained in step (e).
-
-
7. The method of claim 6, further comprising the step of:
(h) using the new result for query-cost determination.
-
8. The method of claim 6, wherein said obtaining an upper limit on the estimate obtained in step (e) includes:
-
(i) determining a maximal number of occurrences of the string based on the string'"'"'s constraints; and
(ii) dividing the maximal number of occurrences by a root count associated with the database, to produce an upper limit on the estimate.
-
-
9. The method of claim 8, wherein said determining a maximal number of occurrences includes applying a ConProj operator.
-
-
10. A method for determining an estimate for string-occurrence probability in a database for a string, the method comprising:
-
(a) multiplying together every probability of occurrence of each maximal substring from a plurality of maximal substrings to generate a product, each substring in the plurality of maximal substrings belonging to the string;
(b) multiplying together every probability of occurrence for a maximal overlap of each maximal substring in the plurality of maximal substrings to obtain a normalization factor; and
(c) dividing the product by the normalization factor to obtain the estimate. - View Dependent Claims (11, 12, 13, 14)
(d) determining a maximal number of occurrences of the string based on the string'"'"'s constraints;
(e) dividing the maximal number of occurrences by a root count associated with the database to produce an upper limit on the second estimate; and
(f) returning, as a new result, the smaller of the upper limit obtained in (e) and the second estimate obtained in (c).
-
-
14. The method of claim 13, wherein said determining a maximal number of occurrences includes applying a ConProj operator.
-
15. A method for estimating the cost of a query to a database that contains a plurality of strings and substrings, the method comprising:
-
(a) receiving a probability of occurrence for each maximal substring from a plurality of substrings, each maximal substring in the plurality of substrings belonging to the string;
(b) obtaining an overall probability of occurrence by multiplying together every probability of occurrence received in step (a);
(c) receiving a probability of occurrence for a maximal overlap of each maximal substring in the plurality of maximal substrings;
(d) obtaining a normalization factor by multiplying together the probabilities received in (c);
(e) obtaining a string-occurrence probability for the string in the database by dividing the overall probability of occurrence by the normalization factor; and
(f) using the string-occurrence probability for query-cost determination.
-
-
16. A method for determining the cost of a query to a database that contains a plurality of strings and substrings, the method comprising:
-
(a) receiving a probability of occurrence for each maximal substring from a plurality of substrings, each maximal substring in the plurality of substrings belonging to a string;
(b) obtaining an overall probability of occurrence by multiplying together every probability of occurrence received in step (a);
(c) receiving a probability of occurrence for a maximal overlap of each maximal substring in the plurality of maximal substrings;
(d) obtaining a normalization factor by multiplying together the probabilities received in (c);
(e) obtaining a string-occurrence probability for the string in the database by dividing the overall probability of occurrence by the normalization factor;
(f) obtaining an upper limit on the string-occurrence probability for the string; and
(g) performing query-cost determination with the smaller of the upper limit obtained in step (f) and the string-occurrence probability obtained in step (e). - View Dependent Claims (17)
(i) determining a maximal number of occurrences of the string based on the string'"'"'s constraints; and
(ii) dividing the maximal number of occurrences by a root count associated with the database, to produce an upper limit on the string-occurrence probability.
-
-
18. An apparatus for estimating string-occurrence probability in a database for a string, the apparatus comprising:
-
(a) a processor;
(b) a port coupled to said processor; and
(c) a memory coupled to said processor, said memory storing instructions adapted to be executed on said processor, the instructions including;
(i) multiplying together every probability of occurrence of each maximal substring from a plurality of maximal substrings to generate a product, each substring in the plurality of maximal substrings belonging to the string;
(ii) multiplying together every probability of occurrence for a maximal overlap of each maximal substring in the plurality of maximal substrings to obtain a normalization factor; and
(iii) dividing the product by the normalization factor to obtain the estimate. - View Dependent Claims (19, 20, 21, 22)
(iv) determining a maximal number of occurrences of the string based on the string'"'"'s constraints;
(v) dividing the maximal number of occurrences by a root count associated with the database to produce an upper limit on the second estimate; and
(vi) returning, as a new result, the smaller of the upper limit obtained in (e) and the second estimate obtained in (iii).
-
-
22. The apparatus of claim 21, wherein said determining a maximal number of occurrences includes applying a ConProj operator.
-
23. An apparatus for estimating string-occurrence probability in a database for a string, the apparatus comprising:
-
(a) a processor;
(b) a port coupled to said processor; and
(c) a memory coupled to said processor, said memory storing instructions adapted to be executed on said processor, the instructions including;
(i) receiving a probability of occurrence for each maximal substring from a plurality of substrings, each maximal substring in the plurality of substrings belonging to the string;
(ii) obtaining an overall probability of occurrence by multiplying together every probability of occurrence received in step (i);
(iii) receiving a probability of occurrence for a maximal overlap of each maximal substring in the plurality of maximal substrings;
(iv) obtaining a normalization factor by multiplying together the probabilities received in (iii);
(v) obtaining a string-occurrence probability for the string in the database by dividing the overall probability of occurrence by the normalization factor; and
(vi) using the string-occurrence probability for query-cost determination.
-
-
24. An apparatus for determining the cost of a query to a database that contains a plurality of strings and substrings, the apparatus comprising:
-
(a) a processor;
(b) a port coupled to said processor; and
(c) a memory coupled to said processor, said memory storing instructions adapted to be executed on said processor, the instructions including;
(i) receiving a probability of occurrence for each maximal substring from a plurality of substrings, each maximal substring in the plurality of substrings belonging to a string;
(ii) obtaining an overall probability of occurrence by multiplying together every probability of occurrence received in step (i);
(iii) receiving a probability of occurrence for a maximal overlap of each maximal substring in the plurality of maximal substrings;
(iv) obtaining a normalization factor by multiplying together the probabilities received in (iii);
(v) obtaining a string-occurrence probability for the string in the database by dividing the overall probability of occurrence by the normalization factor;
(vi) obtaining an upper limit on the string-occurrence probability for the string; and
(vii) performing query-cost determination with the smaller of the upper limit obtained in step (vi) and the string-occurrence probability obtained in step (v). - View Dependent Claims (25)
(i) determining a maximal number of occurrences of the string based on the string'"'"'s constraints; and
(ii) dividing the maximal number of occurrences by a root count associated with the database, to produce an upper limit on the string-occurrence probability.
-
-
26. A medium for determining an estimate for string-occurrence probability in a database for a string, the medium containing instructions adapted to be executed by a processor, the instructions comprising:
-
(a) receiving a first probability of occurrence for each maximal substring from a plurality of substrings, each maximal substring in the plurality of substrings belonging to the string;
(b) obtaining an overall probability of occurrence;
(c) receiving a probability of occurrence for a maximal overlap of each maximal substring in the plurality of maximal substrings;
(d) obtaining a normalization factor; and
(e) dividing the overall probability of occurrence by the normalization factor to obtain the estimate. - View Dependent Claims (27, 28, 29, 30, 31, 32)
(f) using the obtained estimate for query-cost determination.
-
-
28. The medium of claim 26, wherein the overall probability of occurrence is obtained by multiplying together every probability of occurrence received in (a), and the normalization factor is obtained by multiplying together the probabilities received in (c).
-
29. The medium of claim 28, wherein the database contains a number of strings, and the first probability of occurrence is defined as the number of strings in the database that contain the string, divided by a root count.
-
30. The medium of claim 28, further comprising the steps of:
-
(f) obtaining an upper limit on the string occurrence probability in a database for the string; and
(g) returning, as a new result, the smaller of the upper limit obtained in step (f) and the estimate obtained in step (e).
-
-
31. The medium of claim 30, further comprising the step of:
(h) using the new result for query-cost determination.
-
32. The medium of claim 30, wherein said obtaining an upper limit on the estimate obtained in step (e) includes:
-
(i) determining a maximal number of occurrences of the string based on the string'"'"'s constraints; and
(ii) dividing the maximal number of occurrences by a root count associated with the database, to produce an upper limit on the estimate.
-
-
33. A medium for determining an estimate for string-occurrence probability in a database for a string, the medium storing instructions adapted to be executed by a processor, the instructions comprising:
-
(a) multiplying together every probability of occurrence of each maximal substring from a plurality of maximal substrings to generate a product, each substring in the plurality of maximal substrings belonging to the string;
(b) multiplying together every probability of occurrence for a maximal overlap of each maximal substring in the plurality of maximal substrings to obtain a normalization factor; and
(c) dividing the product by the normalization factor to obtain the estimate. - View Dependent Claims (34, 35)
(d) determining a maximal number of occurrences of the string based on the string'"'"'s constraints;
(e) dividing the maximal number of occurrences by a root count associated with the database to produce an upper limit on the second estimate; and
(f) returning, as a new result, the smaller of the upper limit obtained in (e) and the second estimate obtained in (c).
-
Specification