Method for the identification of gene transcripts with improved efficiency in the treatment of errors

US 7,101,665 B2
Filed: 12/27/2000
Issued: 09/05/2006
Est. Priority Date: 12/27/1999
Status: Expired due to Fees

Abstract

A method for identification of gene transcripts comprises the steps of: generating a set of raw sequences by sequencing of biological material; isolating ditags from the set of raw sequences; isolating tags from the ditags; determining abundance of the tags; and identifying the tags, the method providing a step of reducing the amount of sequencing errors by using a statistical model for sequencing errors.
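The steps named in the abstract (raw sequences, ditags, tags, abundance) can be sketched as follows. This is a minimal illustration, not the patented implementation. The anchoring site `CATG`, the 10-base tag length, the 20 to 24 base ditag window, and all function names are assumptions drawn from common SAGE practice and are not stated in this record. Treating a repeated ditag as a duplicate to be dropped is one possible reading of the duplicate check.

```python
from collections import Counter

# All constants below are illustrative assumptions, not values from the patent.
ANCHOR = "CATG"                  # assumed anchoring-enzyme recognition site
TAG_LEN = 10                     # assumed tag length in bases
MIN_DITAG, MAX_DITAG = 20, 24    # assumed acceptable ditag lengths

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def isolate_ditags(raw_sequences):
    """Cut raw sequences at the anchor site and keep plausible ditags.

    Only pieces flanked by the anchor on both sides are considered.
    Pieces of implausible length or with ambiguous bases are rejected,
    and a ditag seen before is kept only once.
    """
    seen = set()
    ditags = []
    for raw in raw_sequences:
        pieces = raw.upper().split(ANCHOR)
        for piece in pieces[1:-1]:          # drop the unflanked end pieces
            if not MIN_DITAG <= len(piece) <= MAX_DITAG:
                continue
            if not set(piece) <= set("ACGT"):
                continue
            if piece in seen:
                continue
            seen.add(piece)
            ditags.append(piece)
    return ditags


def isolate_tags(ditags):
    """Split each ditag into its left tag and its reverse-complemented right tag."""
    tags = []
    for ditag in ditags:
        tags.append(ditag[:TAG_LEN])
        tags.append(reverse_complement(ditag[-TAG_LEN:]))
    return tags


def tag_abundance(raw_sequences):
    """Count how often each tag occurs across all accepted ditags."""
    return Counter(isolate_tags(isolate_ditags(raw_sequences)))
```

Calling `tag_abundance` on a list of concatemer reads returns a `Counter` keyed by tag sequence, which is the input a later error-correction step would work on.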

29 Claims

1. A method for identification of gene transcripts comprising the steps of:
- a) generating at least a first set of raw sequences by sequencing at least a first type of biological material;
  
  b) isolating first ditags from said at least first set of raw sequences;
  
  c) isolating first tags from said isolated at least first ditags;
  
  d) determining abundance of said first tags; and
  
  e) identifying said first tags, further comprising a step of
  
  f) rejecting said isolated first tags that are wrongly sequenced by means of a statistical model for sequencing errors to be applied to said isolated first tags, said statistical model being defined by a probability function F(a,b), wherein said function F is intended to modelize the probability that a given tag a can be sequenced as b.
- - 2. The method of claim 1, wherein said statistical model of said step f) uses a confidence level of a base-calling method that has been converted into a probability that a base-pair is not correct.
  - 3. The method of claim 1 or 2, wherein said step f) is applied after a step of g) rejecting said isolated first tags that are wrongly sequenced by checking correctness of said isolated first ditags or tags through a base-calling method.
  - 4. The method of claim 2, wherein said base-calling method is performed through a base-calling software.
  - 5. The method according to claim 1, wherein said ditags are isolated from said set of raw sequences through a method for serial analysis of gene expression (SAGE).
  - 6. The method according to claim 3, wherein said steps a) through f) or a) through g) are applied to at least a second set of raw sequences and to respective second ditags and second tags, and further comprising a step of h) comparing said first tags and said second tags in order to determine relative abundance thereof.
  - 7. The method according to claim 1, further comprising a step of predicting an extra base-pair for said tags.
  - 8. The method of claim 7, wherein said step of predicting an extra base-pair for said tags comprises the steps of:
    - assigning an extra base-pair to each tag of a ditag each time a ditag of at least a predetermined number of base-pairs is isolated;
      
      imposing for each of the tags to which an extra base-pair has been assigned to be as much abundant as a predetermined threshold;
      
      choosing, for each of the remaining tags, its most dominant extra base-pair; and
      
      imposing, for said most dominant extra base-pair, to represent at least a certain percentage of the number of occurrences of said tag.
  - 9. The method according to claim 1, further comprising a step of checking correctness of said isolated ditags by determining the length of said ditags and rejecting ditags smaller and/or larger than a predetermined length.
  - 10. The method according to claim 1, further comprising a step of checking correctness of said isolated ditags by determining an occurrence of said ditags and rejecting ditags occurring twice.
  - 11. The method according to claim 1, wherein said statistical model of said step f) modelizes the probability that a given tag a can be sequenced as b by applying an average sequencing error rate to said function F(a,b).
  - 12. The method of claim 11, wherein said average sequencing error rate is calculated through the system
  - 13. The method according to claim 1, wherein said step f) comprises a step of:
    - f1) calculating an estimate of the abundance of said isolated first tags not rejected by means of said statistical model.
  - 14. The method according to claim 13, wherein said step f1) is performed through a sparse diagonal-dominant linear system.
  - 15. The method according to either one of claims 13 or 14, wherein said step f) further comprises the steps of:
    - f2) calculating a difference between said estimate of the abundance of the tags and a counted abundance of the tags; and
      
      f3) eliminating tags for which said difference is bigger than a given threshold.
  - 16. The method according to claim 6, wherein said step h) of comparing said first and at least second ditags in order to determine relative abundance thereof comprises the steps of:
    - h1) merging said first and second tags;
      
      h2) normalizing the abundance of said first and second tags;
      
      h3) determining a difference of abundance between said first and second ditags; and
      
      h4) estimating correctness of said difference.
  - 17. The method of claim 16, wherein said step h4) of estimating correctness of said difference applies a Claverie method.
  - 18. The method according to claim 16, wherein said step h) of comparing said first and second ditags in order to determine relative abundance thereof further comprises the step of:
    - h5) checking consistence of the predicted extra base-pair.
  - 19. The method according to claim 6, wherein said step e) of identifying said at least first tags comprises the step of:
    - e1) performing a first identification of the tags by comparison with a first database.
  - 20. The method of claim 19, wherein said step e) of identifying said at least first tags further comprises the step of:
    - e2) assigning a score to each of the tags identified by said first identification.
  - 21. The method of claim 19, wherein said step e) of identifying said first and second tags further comprises the steps of:
    - e3) performing at least a second identification of the tags by comparison with at least a second database; and
      
      e4) assigning a score to each of the tags identified by said at least second identification.
  - 22. The method of claim 19, wherein said first database is an EST database.
  - 23. The method of claim 22, wherein the step b) is performed by using MmeI as a restriction enzyme.
  - 24. The method according to claim 1, further comprising a step of visualizing the identified tags.
  - 25. The method of claim 24, further comprising a step of providing means for accessing the information concerning said identified tags.
  - 26. The method of claim 24, further comprising a step for assigning a unique number for each identified tag.
  - 27. The method according to claim 24, further comprising a step for assigning a unique position on a screen for each identified tag.
  - 28. The method of claim 11, wherein said average sequencing error rate equals 1%.
  - 29. The method according to claim 28, wherein, for every tag a ∈ S_T, S_T being the set of every possible tag to be considered, the value of F(a,b), b ∈ S_T, is:
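The error-model claims (a probability function F(a,b), an average error rate of 1%, an abundance estimate obtained from a diagonal-dominant linear system, and elimination of tags by thresholding a difference) can be illustrated with the sketch below. Every modelling choice in it is an assumption. In particular, this F(a,b) is not the formula of claim 29, which is not reproduced in this record. It assumes independent substitution errors at a uniform per-base rate, with each wrong base equally likely. The Jacobi iteration, the clamping at zero, the relative threshold `max_error_share`, and all function names are likewise hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length tags differ."""
    return sum(x != y for x, y in zip(a, b))


def error_prob(a, b, eps=0.01):
    """Assumed F(a,b): probability that tag a is read as tag b.

    Each base is read correctly with probability 1 - eps and as any one
    of the three other bases with probability eps / 3 (an assumption).
    """
    d = hamming(a, b)
    return (1 - eps) ** (len(a) - d) * (eps / 3) ** d


def estimate_abundance(counts, eps=0.01, iterations=50):
    """Estimate true abundances n from counted abundances.

    Solves sum_a n[a] * F(a, b) = counts[b] over the observed tags by
    Jacobi iteration. The diagonal entry (1 - eps) ** L dominates each
    row for small eps, which is what lets the iteration converge.
    Estimates are clamped at zero. All tags must have the same length.
    """
    tags = list(counts)
    diag = (1 - eps) ** len(tags[0])
    est = {t: float(counts[t]) for t in tags}
    for _ in range(iterations):
        new = {}
        for b in tags:
            # Reads of b that the model attributes to misread copies of other tags.
            leak = sum(est[a] * error_prob(a, b, eps) for a in tags if a != b)
            new[b] = max(0.0, (counts[b] - leak) / diag)
        est = new
    return est


def reject_error_tags(counts, eps=0.01, max_error_share=0.5):
    """Drop tags whose counted abundance is mostly explained by errors.

    The criterion (counted minus estimated, relative to counted, above a
    threshold) is one hypothetical reading of the thresholding step.
    """
    est = estimate_abundance(counts, eps)
    kept = {}
    for tag, count in counts.items():
        error_share = (count - est[tag]) / count
        if error_share <= max_error_share:
            kept[tag] = count
    return kept
```

With this criterion, a tag seen once that differs by a single base from a tag seen a thousand times is rejected, because the model expects roughly three such misreads from the abundant tag alone. The all-pairs loop is quadratic in the number of observed tags; a practical version would restrict each row to near neighbours, which is what makes the system sparse.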

Current Assignee: Laboratoires Serono SA (Merck & Co., Inc.)
Original Assignee: Applied Research Systems ARS Holding NV (Merck & Co., Inc.)
Inventors: Colinge, Jacques; Feger, Georg
Primary Examiner(s): Martinell, James
Application Number: US10/169,134
Publication Number: US 20030138794A1
Time in Patent Office: 2,078 Days
Field of Search: 435/6, 702/19, 702/20
US Class Current: 435/6.11
CPC Class Codes:
G16B 25/00 ICT specially adapted for h...
G16B 25/10 Gene or protein expression ...
