Method and apparatus for substring selectivity estimation

US 6,401,088 B1
Filed: 12/30/1999
Issued: 06/04/2002
Est. Priority Date: 02/09/1999
Status: Expired due to Fees

First Claim

Patent Images

1. A method for determining an estimate for string-occurrence probability in a database for a string, the method comprising:

(a) receiving a first probability of occurrence for each maximal substring from a plurality of substrings, each maximal substring in the plurality of substrings belonging to the string;

(b) obtaining an overall probability of occurrence;

(c) receiving a probability of occurrence for a maximal overlap of each maximal substring in the plurality of maximal substrings;

(d) obtaining a normalization factor; and

(e) dividing the overall probability of occurrence by the normalization factor to obtain the estimate.

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A method for estimating string-occurrence probability in a database comprises receiving a first probability of occurrence for each maximal substring from a plurality of substrings, each maximal substring in the plurality of substrings belonging to the string; obtaining an overall probability of occurrence; receiving a probability of occurrence for a maximal overlap of each maximal substring in the plurality of maximal substrings; obtaining a normalization factor; and dividing the overall probability of occurrence by the normalization factor to obtain the estimate.

Citations

35 Claims

1. A method for determining an estimate for string-occurrence probability in a database for a string, the method comprising:
- (a) receiving a first probability of occurrence for each maximal substring from a plurality of substrings, each maximal substring in the plurality of substrings belonging to the string;
  
  (b) obtaining an overall probability of occurrence;
  
  (c) receiving a probability of occurrence for a maximal overlap of each maximal substring in the plurality of maximal substrings;
  
  (d) obtaining a normalization factor; and
  
  (e) dividing the overall probability of occurrence by the normalization factor to obtain the estimate.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
- - 2. The method of claim 1, further comprising the step of
- 3. The method of claim 1, wherein the overall probability of occurrence is obtained by multiplying together every probability of occurrence received in (a), and the normalization factor is obtained by multiplying together the probabilities received in (c).
- 4. The method of claim 3, wherein the database contains a number of strings, and the first probability of occurrence is defined as the number of strings in the database that contain the string, divided by a root count.
- 5. The method of claim 3, wherein the database is a pruned suffix tree.
- 6. The method of claim 3, further comprising the steps of:
  - (f) obtaining an upper limit on the string occurrence probability in a database for the string; and
    
    (g) returning, as a new result, the smaller of the upper limit obtained in step (f) and the estimate obtained in step (e).
- 7. The method of claim 6, further comprising the step of:
  - (h) using the new result for query-cost determination.
- 8. The method of claim 6, wherein said obtaining an upper limit on the estimate obtained in step (e) includes:
  - (i) determining a maximal number of occurrences of the string based on the string'"'"'s constraints; and
    
    (ii) dividing the maximal number of occurrences by a root count associated with the database, to produce an upper limit on the estimate.
- 9. The method of claim 8, wherein said determining a maximal number of occurrences includes applying a ConProj operator.

10. A method for determining an estimate for string-occurrence probability in a database for a string, the method comprising:
- (a) multiplying together every probability of occurrence of each maximal substring from a plurality of maximal substrings to generate a product, each substring in the plurality of maximal substrings belonging to the string;
  
  (b) multiplying together every probability of occurrence for a maximal overlap of each maximal substring in the plurality of maximal substrings to obtain a normalization factor; and
  
  (c) dividing the product by the normalization factor to obtain the estimate.
- View Dependent Claims (11, 12, 13, 14)
- - 11. The method of claim 10, wherein the database contains a number of strings and a root count, and wherein the probability of occurrence is defined as the number of strings in the database that contain the string, divided by the root count.
  - 12. The method of claim 11, wherein the database is a pruned suffix tree.
  - 13. The method of claim 11, further comprising the steps of:
14. The method of claim 13, wherein said determining a maximal number of occurrences includes applying a ConProj operator.

15. A method for estimating the cost of a query to a database that contains a plurality of strings and substrings, the method comprising:
- (a) receiving a probability of occurrence for each maximal substring from a plurality of substrings, each maximal substring in the plurality of substrings belonging to the string;
  
  (b) obtaining an overall probability of occurrence by multiplying together every probability of occurrence received in step (a);
  
  (c) receiving a probability of occurrence for a maximal overlap of each maximal substring in the plurality of maximal substrings;
  
  (d) obtaining a normalization factor by multiplying together the probabilities received in (c);
  
  (e) obtaining a string-occurrence probability for the string in the database by dividing the overall probability of occurrence by the normalization factor; and
  
  (f) using the string-occurrence probability for query-cost determination.

16. A method for determining the cost of a query to a database that contains a plurality of strings and substrings, the method comprising:
- (a) receiving a probability of occurrence for each maximal substring from a plurality of substrings, each maximal substring in the plurality of substrings belonging to a string;
  
  (b) obtaining an overall probability of occurrence by multiplying together every probability of occurrence received in step (a);
  
  (c) receiving a probability of occurrence for a maximal overlap of each maximal substring in the plurality of maximal substrings;
  
  (d) obtaining a normalization factor by multiplying together the probabilities received in (c);
  
  (e) obtaining a string-occurrence probability for the string in the database by dividing the overall probability of occurrence by the normalization factor;
  
  (f) obtaining an upper limit on the string-occurrence probability for the string; and
  
  (g) performing query-cost determination with the smaller of the upper limit obtained in step (f) and the string-occurrence probability obtained in step (e).
- View Dependent Claims (17)
- - 17. The method of claim 16, wherein said obtaining an upper limit on the string-occurrence probability includes:

18. An apparatus for estimating string-occurrence probability in a database for a string, the apparatus comprising:
- (a) a processor;
  
  (b) a port coupled to said processor; and
  
  (c) a memory coupled to said processor, said memory storing instructions adapted to be executed on said processor, the instructions including;
  
  (i) multiplying together every probability of occurrence of each maximal substring from a plurality of maximal substrings to generate a product, each substring in the plurality of maximal substrings belonging to the string;
  
  (ii) multiplying together every probability of occurrence for a maximal overlap of each maximal substring in the plurality of maximal substrings to obtain a normalization factor; and
  
  (iii) dividing the product by the normalization factor to obtain the estimate.
- View Dependent Claims (19, 20, 21, 22)
- - 19. The apparatus of claim 18, wherein the database contains a number of strings and a root count, and wherein the probability of occurrence is defined as the number of strings in the database that contain the string, divided by the root count.
  - 20. The apparatus of claim 18, wherein the database is a pruned suffix tree.
  - 21. The apparatus of claim 18, said memory storing further instructions adapted to be executed on said processor, said further instructions comprising:
22. The apparatus of claim 21, wherein said determining a maximal number of occurrences includes applying a ConProj operator.

23. An apparatus for estimating string-occurrence probability in a database for a string, the apparatus comprising:
- (a) a processor;
  
  (b) a port coupled to said processor; and
  
  (c) a memory coupled to said processor, said memory storing instructions adapted to be executed on said processor, the instructions including;
  
  (i) receiving a probability of occurrence for each maximal substring from a plurality of substrings, each maximal substring in the plurality of substrings belonging to the string;
  
  (ii) obtaining an overall probability of occurrence by multiplying together every probability of occurrence received in step (i);
  
  (iii) receiving a probability of occurrence for a maximal overlap of each maximal substring in the plurality of maximal substrings;
  
  (iv) obtaining a normalization factor by multiplying together the probabilities received in (iii);
  
  (v) obtaining a string-occurrence probability for the string in the database by dividing the overall probability of occurrence by the normalization factor; and
  
  (vi) using the string-occurrence probability for query-cost determination.

24. An apparatus for determining the cost of a query to a database that contains a plurality of strings and substrings, the apparatus comprising:
- (a) a processor;
  
  (b) a port coupled to said processor; and
  
  (c) a memory coupled to said processor, said memory storing instructions adapted to be executed on said processor, the instructions including;
  
  (i) receiving a probability of occurrence for each maximal substring from a plurality of substrings, each maximal substring in the plurality of substrings belonging to a string;
  
  (ii) obtaining an overall probability of occurrence by multiplying together every probability of occurrence received in step (i);
  
  (iii) receiving a probability of occurrence for a maximal overlap of each maximal substring in the plurality of maximal substrings;
  
  (iv) obtaining a normalization factor by multiplying together the probabilities received in (iii);
  
  (v) obtaining a string-occurrence probability for the string in the database by dividing the overall probability of occurrence by the normalization factor;
  
  (vi) obtaining an upper limit on the string-occurrence probability for the string; and
  
  (vii) performing query-cost determination with the smaller of the upper limit obtained in step (vi) and the string-occurrence probability obtained in step (v).
- View Dependent Claims (25)
- - 25. The apparatus of claim 24, wherein said obtaining an upper limit on the string-occurrence probability includes:

26. A medium for determining an estimate for string-occurrence probability in a database for a string, the medium containing instructions adapted to be executed by a processor, the instructions comprising:
- (a) receiving a first probability of occurrence for each maximal substring from a plurality of substrings, each maximal substring in the plurality of substrings belonging to the string;
  
  (b) obtaining an overall probability of occurrence;
  
  (c) receiving a probability of occurrence for a maximal overlap of each maximal substring in the plurality of maximal substrings;
  
  (d) obtaining a normalization factor; and
  
  (e) dividing the overall probability of occurrence by the normalization factor to obtain the estimate.
- View Dependent Claims (27, 28, 29, 30, 31, 32)
- - 27. The medium of claim 26, the medium containing further instructions adapted to be executed by a processor, the further instructions comprising:
28. The medium of claim 26, wherein the overall probability of occurrence is obtained by multiplying together every probability of occurrence received in (a), and the normalization factor is obtained by multiplying together the probabilities received in (c).
29. The medium of claim 28, wherein the database contains a number of strings, and the first probability of occurrence is defined as the number of strings in the database that contain the string, divided by a root count.
30. The medium of claim 28, further comprising the steps of:
- (f) obtaining an upper limit on the string occurrence probability in a database for the string; and
  
  (g) returning, as a new result, the smaller of the upper limit obtained in step (f) and the estimate obtained in step (e).
31. The medium of claim 30, further comprising the step of:
- (h) using the new result for query-cost determination.
32. The medium of claim 30, wherein said obtaining an upper limit on the estimate obtained in step (e) includes:
- (i) determining a maximal number of occurrences of the string based on the string'"'"'s constraints; and
  
  (ii) dividing the maximal number of occurrences by a root count associated with the database, to produce an upper limit on the estimate.

33. A medium for determining an estimate for string-occurrence probability in a database for a string, the medium storing instructions adapted to be executed by a processor, the instructions comprising:
- (a) multiplying together every probability of occurrence of each maximal substring from a plurality of maximal substrings to generate a product, each substring in the plurality of maximal substrings belonging to the string;
  
  (b) multiplying together every probability of occurrence for a maximal overlap of each maximal substring in the plurality of maximal substrings to obtain a normalization factor; and
  
  (c) dividing the product by the normalization factor to obtain the estimate.
- View Dependent Claims (34, 35)
- - 34. The medium of claim 33, wherein the database contains a number of strings and a root count, and wherein the probability of occurrence is defined as the number of strings in the database that contain the string, divided by the root count.
  - 35. The medium of claim 34, further comprising the steps of:

Specification

Resources

Litigation Campaign Assessment

Current Assignee
AT&T Corporation (AT&T, Inc.)
Original Assignee
AT&T Corporation (AT&T, Inc.)
Inventors
Jagadish, Hosagrahar Visvesvaraya, Ng, Raymond, Srivastava, Divesh
Primary Examiner(s)
AMSBURY, WAYNE P

Application Number

US09/476,715
Time in Patent Office

887 Days
Field of Search

707/6
US Class Current

707/719
CPC Class Codes

G06F 16/2448   for particular applications...

G06F 16/24545   Selectivity estimation or d...

G06F 17/18   for evaluating statistical ...

Y10S 707/99936   Pattern matching access

Method and apparatus for substring selectivity estimation

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

Citations

35 Claims

Specification

Solutions

Use Cases

Quick Links

Method and apparatus for substring selectivity estimation

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

Citations

35 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links