Method and apparatus for performing pattern dictionary formation for use in sequence homology detection
First Claim
Patent Images
1. A computer-based method of processing a plurality of sequences in a database, the method comprising the steps of:
- evaluating each of the plurality of sequences including characters which form each sequence; and
generating at least one pattern of characters representing at least a subset of the sequences in the database, the pattern having a statistical significance associated therewith, the statistical significance of the pattern being determined by a value representing a minimum number of sequences that the pattern supports in the database.
1 Assignment
0 Petitions
Accused Products
Abstract
In a dictionary formation aspect of the invention, a computer-based method of processing a plurality of sequences in a database comprises the following steps. First, the method includes evaluating each of the plurality of sequences including characters which form each sequence. Then, at least one pattern of characters is generated representing at least a subset of the sequences in the database. The pattern has a statistical significance associated therewith, the statistical significance of the pattern being determined by a value representing a minimum number of sequences that the pattern supports in the database.
22 Citations
21 Claims
-
1. A computer-based method of processing a plurality of sequences in a database, the method comprising the steps of:
-
evaluating each of the plurality of sequences including characters which form each sequence; and
generating at least one pattern of characters representing at least a subset of the sequences in the database, the pattern having a statistical significance associated therewith, the statistical significance of the pattern being determined by a value representing a minimum number of sequences that the pattern supports in the database. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
is satisfied, wherein NB,K represents a number of patterns with a backbone structure B which has support K within the database D, XB,K represents a random variable defined over i processed versions of D, i=1, . . . , N, and threshold represents a predefined probability value.
-
-
5. The method of claim 4, wherein a pattern is represented by the expression:
-
6. The method of claim 5, wherein the backbone structure B of a pattern is a string over a numerical value set {1, 0} obtained when replacing characters of the pattern with a value of ‘
- 1’ and
the don'"'"'t care characters of the pattern by a value of ‘
0.’
- 1’ and
-
7. The method of claim 4, further comprising the step of generating each of the processed versions of D associated with the random variable XB,K by computing a random permutation of the characters in each sequence of D.
-
8. The method of claim 7, further comprising the step of computing a mean sB,K and a variance mB,K of the random variable XB,K.
-
9. The method of claim 8, further comprising the step of computing a constant value C as a function of NB,K and the mean sB,K and the variance mB,K of the random variable XB,K.
-
10. The method of claim 9, further comprising the step of computing an upper bound pB,K for the probability expression Pr[XB,K>
- NB,K] as the inverse of the square of the constant value C.
-
11. The method of claim 10, wherein the minimum support value is the smallest number K wherein maxB {pB,K}<
- threshold.
-
12. The method of claim 4, wherein threshold is selected to represent a confidence level associated with the minimum support value resulting from the inequality.
-
13. The method of claim 4, wherein the magnitude of threshold is inversely related to the magnitude of the minimum support value.
-
14. The method of claim 4, wherein the statistical significance of a pattern is directly related to the magnitude of the minimum support value.
-
15. The method of claim 1, further comprising the step of grouping two or more similar sequences in the database into a group prior to the evaluating step.
-
16. The method of claim 15, wherein two or more sequences are similar when, after alignment, a first sequence has at least a given percentage of characters in common with a second sequence.
-
17. The method of claim 16, wherein the longest sequence from the group is used in the evaluating step.
-
18. The method of claim 1, wherein the database includes sequences having both known and unknown sequence features.
-
19. The method of claim 1, wherein the sequences represent proteins.
-
20. Apparatus for processing a plurality of sequences in a database, the apparatus comprising:
at least one processor operative to;
(i) evaluate each of the plurality of sequences including characters which form each sequence; and
(ii) generate at least one pattern of characters representing at least a subset of the sequences in the database, the pattern having a statistical significance associated therewith, the statistical significance of the pattern being determined by a value representing a minimum number of sequences that the pattern supports in the database.
-
21. An article of manufacture for processing a plurality of sequences in a database, comprising a machine readable medium containing one or more programs which when executed implement the steps of:
-
evaluating each of the plurality of sequences including characters which form each sequence; and
generating at least one pattern of characters representing at least a subset of the sequences in the database, the pattern having a statistical significance associated therewith, the statistical significance of the pattern being determined by a value representing a minimum number of sequences that the pattern supports in the database.
-
Specification