Method and apparatus for performing pattern dictionary formation for use in sequence homology detection

US 6,571,199 B1
Filed: 06/21/2000
Issued: 05/27/2003
Est. Priority Date: 10/30/1998
Status: Expired due to Term

First Claim

1. A computer-based method of processing a plurality of sequences in a database, the method comprising the steps of:

evaluating each of the plurality of sequences including characters which form each sequence; and

generating at least one pattern of characters representing at least a subset of the sequences in the database, the pattern having a statistical significance associated therewith, the statistical significance of the pattern being determined by a value representing a minimum number of sequences that the pattern supports in the database.

Abstract

In a dictionary formation aspect of the invention, a computer-based method of processing a plurality of sequences in a database comprises the following steps. First, the method includes evaluating each of the plurality of sequences including characters which form each sequence. Then, at least one pattern of characters is generated representing at least a subset of the sequences in the database. The pattern has a statistical significance associated therewith, the statistical significance of the pattern being determined by a value representing a minimum number of sequences that the pattern supports in the database.
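The notion of support in the abstract can be illustrated with a short sketch. It assumes, purely for illustration, that a pattern is a string in which `.` is a don't-care position, and that a pattern "supports" a sequence when it matches somewhere inside it. The function names `matches_at`, `support`, and `significant_patterns`, the toy database, and the candidate patterns are all hypothetical and not taken from the patent.

```python
def matches_at(pattern: str, sequence: str, offset: int) -> bool:
    """Return True if pattern matches sequence starting at offset.

    Assumption: '.' in the pattern is a don't-care position that matches
    any single character; every other character must match exactly.
    """
    window = sequence[offset:offset + len(pattern)]
    if len(window) < len(pattern):
        return False
    return all(p == "." or p == s for p, s in zip(pattern, window))


def support(pattern: str, database: list) -> int:
    """Count the sequences in the database that contain the pattern."""
    return sum(
        any(matches_at(pattern, seq, i)
            for i in range(len(seq) - len(pattern) + 1))
        for seq in database
    )


def significant_patterns(candidates, database, k_min):
    """Keep only candidate patterns supported by at least k_min sequences."""
    return [p for p in candidates if support(p, database) >= k_min]


# Toy data (hypothetical).
database = ["MKTAYIAK", "GKSAYLAK", "AKTAYWWW", "QQQQQQQQ"]
candidates = ["K.AY", "AY.AK", "QQQ", "WWW"]
supports = {p: support(p, database) for p in candidates}
kept = significant_patterns(candidates, database, k_min=2)
print(supports)
print(kept)
```

With a minimum support of 2, only the two patterns that occur in at least two sequences survive the filter; how that minimum is chosen is the subject of claims 4 to 14.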

22 Citations

21 Claims

1. A computer-based method of processing a plurality of sequences in a database, the method comprising the steps of:
- evaluating each of the plurality of sequences including characters which form each sequence; and
  
  generating at least one pattern of characters representing at least a subset of the sequences in the database, the pattern having a statistical significance associated therewith, the statistical significance of the pattern being determined by a value representing a minimum number of sequences that the pattern supports in the database.
2. The method of claim 1, further comprising the step of collecting the generated patterns having the minimum support value into a set for use in comparison with a query sequence to detect one or more homologies between the query sequence and one or more sequences in the database.
3. The method of claim 1, wherein the statistical significance of a pattern further depends on at least one of a length parameter and a width parameter associated with the pattern.
4. The method of claim 1, wherein the minimum support value, represented as $K_{\min}$, is calculated to be the first number K for which an inequality represented as:
   $\max_{B} \{\Pr[X_{B,K} \geq N_{B,K}]\} \leq \text{threshold}$
5. The method of claim 4, wherein a pattern is represented by the expression:
6. The method of claim 5, wherein the backbone structure B of a pattern is a string over a numerical value set {1, 0} obtained when replacing characters of the pattern with a value of ‘1’ and the don't care characters of the pattern by a value of ‘0.’
7. The method of claim 4, further comprising the step of generating each of the processed versions of D associated with the random variable $X_{B,K}$ by computing a random permutation of the characters in each sequence of D.
8. The method of claim 7, further comprising the step of computing a mean $s_{B,K}$ and a variance $m_{B,K}$ of the random variable $X_{B,K}$.
9. The method of claim 8, further comprising the step of computing a constant value C as a function of $N_{B,K}$ and the mean $s_{B,K}$ and the variance $m_{B,K}$ of the random variable $X_{B,K}$.
10. The method of claim 9, further comprising the step of computing an upper bound $p_{B,K}$ for the probability expression $\Pr[X_{B,K} \geq N_{B,K}]$ as the inverse of the square of the constant value C.
11. The method of claim 10, wherein the minimum support value is the smallest number K wherein $\max_{B} \{p_{B,K}\} \leq \text{threshold}$.
12. The method of claim 4, wherein threshold is selected to represent a confidence level associated with the minimum support value resulting from the inequality.
13. The method of claim 4, wherein the magnitude of threshold is inversely related to the magnitude of the minimum support value.
14. The method of claim 4, wherein the statistical significance of a pattern is directly related to the magnitude of the minimum support value.
15. The method of claim 1, further comprising the step of grouping two or more similar sequences in the database into a group prior to the evaluating step.
16. The method of claim 15, wherein two or more sequences are similar when, after alignment, a first sequence has at least a given percentage of characters in common with a second sequence.
17. The method of claim 16, wherein the longest sequence from the group is used in the evaluating step.
18. The method of claim 1, wherein the database includes sequences having both known and unknown sequence features.
19. The method of claim 1, wherein the sequences represent proteins.
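Claims 6 to 11 can be read as a small numerical procedure, sketched below under several assumptions. The claims as reproduced here do not define $X_{B,K}$ or $N_{B,K}$; the sketch assumes $N_{B,K}$ is a count observed in the original database D and $X_{B,K}$ is the same count measured on shuffled copies of D. The constant C is assumed to take the Chebyshev form (observed value minus mean, divided by the standard deviation); the claims only say that C is a function of these quantities. All function names, the `.` don't-care symbol, and the toy numbers are hypothetical.

```python
import math
import random
import statistics


def backbone(pattern: str, dont_care: str = ".") -> str:
    """Claim 6: regular characters become '1', don't-care characters '0'."""
    return "".join("0" if ch == dont_care else "1" for ch in pattern)


def shuffle_database(database, rng):
    """Claim 7: randomly permute the characters of each sequence."""
    return ["".join(rng.sample(seq, len(seq))) for seq in database]


def upper_bound(n_observed, samples):
    """Claims 8-10: bound Pr[X >= N] by 1 / C**2.

    Assumption: C = (N - mean) / sqrt(variance), the Chebyshev form.
    """
    mean = statistics.fmean(samples)
    variance = statistics.pvariance(samples)
    if n_observed <= mean:
        return 1.0  # the bound carries no information in this case
    if variance == 0:
        return 0.0  # no spread in the samples and N lies above all of them
    c = (n_observed - mean) / math.sqrt(variance)
    return min(1.0, 1.0 / (c * c))


def minimum_support(observed, shuffled, threshold):
    """Claim 11: smallest K whose worst-case bound over B is <= threshold.

    observed[(B, K)] holds N_{B,K}; shuffled[(B, K)] holds sampled X_{B,K}.
    Returns None if no K in the table satisfies the inequality.
    """
    for k in sorted({k for _, k in observed}):
        worst = max(upper_bound(n, shuffled[(b, kk)])
                    for (b, kk), n in observed.items() if kk == k)
        if worst <= threshold:
            return k
    return None


# Toy counts (hypothetical) for two backbones and two support levels.
observed = {("11", 2): 10, ("101", 2): 6, ("11", 3): 20, ("101", 3): 5}
shuffled = {("11", 2): [8, 9, 10, 11, 12], ("101", 2): [5, 6, 7],
            ("11", 3): [2, 4, 6], ("101", 3): [1, 1, 1, 1]}
k_min = minimum_support(observed, shuffled, threshold=0.05)
print(backbone("K.AY"), k_min)
```

In the toy table, the counts at K = 2 are no larger than what shuffling produces, so the bound stays at 1; at K = 3 the observed counts sit far above the shuffled ones and the bound drops below the threshold. A smaller threshold can only push the selected K upward, which is the inverse relation stated in claim 13.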

20. Apparatus for processing a plurality of sequences in a database, the apparatus comprising:
- at least one processor operative to:
  
  (i) evaluate each of the plurality of sequences including characters which form each sequence; and
  
  (ii) generate at least one pattern of characters representing at least a subset of the sequences in the database, the pattern having a statistical significance associated therewith, the statistical significance of the pattern being determined by a value representing a minimum number of sequences that the pattern supports in the database.

21. An article of manufacture for processing a plurality of sequences in a database, comprising a machine readable medium containing one or more programs which when executed implement the steps of:
- evaluating each of the plurality of sequences including characters which form each sequence; and
  
  generating at least one pattern of characters representing at least a subset of the sequences in the database, the pattern having a statistical significance associated therewith, the statistical significance of the pattern being determined by a value representing a minimum number of sequences that the pattern supports in the database.
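The pre-processing in claims 15 to 17 (group similar sequences, then keep the longest member of each group) can be sketched as follows. The claims do not say which alignment method is used, so this sketch substitutes Python's `difflib.SequenceMatcher` as a stand-in for a real sequence alignment. The greedy grouping against the first member of each group, the choice of the shorter sequence as the denominator, the 0.8 cutoff, and all function names are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def shared_fraction(a: str, b: str) -> float:
    """Fraction of the shorter sequence's characters matched in the other.

    SequenceMatcher is a stand-in here, not a biological alignment.
    """
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def group_similar(database, min_fraction=0.8):
    """Claims 15-16: greedily place each sequence in the first similar group."""
    groups = []
    for seq in database:
        for group in groups:
            # Compare against the group's first member only (an assumption).
            if shared_fraction(seq, group[0]) >= min_fraction:
                group.append(seq)
                break
        else:
            groups.append([seq])
    return groups


def representatives(groups):
    """Claim 17: use the longest sequence of each group."""
    return [max(group, key=len) for group in groups]


# Toy data (hypothetical).
database = ["MKTAYIAKQR", "MKTAYIAK", "GGGGPPPP"]
groups = group_similar(database)
reps = representatives(groups)
print(groups)
print(reps)
```

Only the representatives would then be passed to the evaluating step, so that near-duplicate sequences do not inflate the support of a pattern.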

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Rigoutsos, Isidore; Floratos, Aris
Primary Examiner(s): Hoff, Marc S.
Assistant Examiner(s): Raymond, Edward
Application Number: US09/582,045
Time in Patent Office: 1,070 Days
Field of Search: 702/179, 702/180, 702/182, 930/10, 930/20, 930/30, 930/290, 930/300, 930/310
US Class Current: 702/179
CPC Class Codes

G06F 18/28   Determining representative ...

G16B 30/00   ICT specially adapted for s...

G16B 30/10   Sequence alignment; Homolog...

Y10S 707/959   Network

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99934   Query formulation, input pr...

Y10S 707/99936   Pattern matching access

Y10S 930/31   Linker sequence
