Efficient methods and apparatus for high-throughput processing of gene sequence data

US 7,110,885 B2
Filed: 09/26/2001
Issued: 09/19/2006
Est. Priority Date: 03/08/2001
Status: Expired due to Fees

Abstract

One disclosed method of processing gene sequence data includes the steps of reading gene sequence data corresponding to a gene sequence and coding sequence data corresponding to a plurality of coding sequences within the gene sequence; identifying and storing, by following a set of primer selection rules, primer pair data within the gene sequence data for one of the coding sequences; repeating the acts of identifying and storing such that primer pair data are obtained for each coding sequence of the plurality of coding sequences; and simultaneously amplifying the plurality of coding sequences in gene sequences from three or more individuals using the identified pairs of primer sequences. The set of primer selection rules includes a rule specifying that all of the primer pair data for the plurality of coding sequences be obtained for a predetermined annealing temperature, which allows for the subsequent simultaneous amplification of sequences from hundreds of individuals in a single amplification run.
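
The rule-based primer selection summarized in the abstract can be illustrated with a short Python sketch. It is not the patented implementation. The Wallace-rule melting-temperature estimate, the 5 °C offset used to turn it into an annealing temperature, the tolerance, the length range, the exact-substring test against gene family sequences, and every function name below are assumptions made for illustration. The record itself states only that primers are obtained for a predetermined annealing temperature and must fail to match the gene family data.

```python
from typing import Optional, Sequence, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def wallace_tm(primer: str) -> int:
    """Rough melting temperature in deg C: 2*(A+T) + 4*(G+C) (Wallace rule)."""
    p = primer.upper()
    return 2 * (p.count("A") + p.count("T")) + 4 * (p.count("G") + p.count("C"))


def passes_rules(primer: str, target_anneal: int, family: Sequence[str],
                 tol: int = 2, min_len: int = 18, max_len: int = 24,
                 tm_offset: int = 5) -> bool:
    """Check one candidate primer against the illustrative (assumed) rule set."""
    primer = primer.upper()
    # Length range (hypothetical bounds).
    if not min_len <= len(primer) <= max_len:
        return False
    # Rule 1, assumed form: estimated annealing temperature close to the target.
    if abs((wallace_tm(primer) - tm_offset) - target_anneal) > tol:
        return False
    # Rule 2, simplified: no exact match on either strand in any family member.
    rc = reverse_complement(primer)
    return not any(primer in m.upper() or rc in m.upper() for m in family)


def pick_primer(region: str, target_anneal: int, family: Sequence[str],
                min_len: int = 18, max_len: int = 24) -> Optional[str]:
    """Return the first window of `region` that passes every rule, or None."""
    region = region.upper()
    for length in range(min_len, max_len + 1):
        for start in range(len(region) - length + 1):
            candidate = region[start:start + length]
            if passes_rules(candidate, target_anneal, family,
                            min_len=min_len, max_len=max_len):
                return candidate
    return None


def select_primer_pair(gene: str, cds_start: int, cds_end: int,
                       target_anneal: int, family: Sequence[str],
                       flank: int = 40) -> Optional[Tuple[str, str]]:
    """Pick a forward primer upstream and a reverse primer downstream of one
    coding sequence located at gene[cds_start:cds_end]."""
    upstream = gene[max(0, cds_start - flank):cds_start]
    downstream = gene[cds_end:cds_end + flank]
    forward = pick_primer(upstream, target_anneal, family)
    reverse = pick_primer(reverse_complement(downstream), target_anneal, family)
    if forward is None or reverse is None:
        return None
    return forward, reverse
```

Calling `select_primer_pair` once per coding sequence with the same `target_anneal` mirrors the "repeating" step: every stored pair is chosen for one shared temperature, which is what makes a single amplification run across many samples plausible.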

26 Claims

1. A method of processing gene sequence data with use of one or more computers, the method comprising:
- reading, by the computer, gene sequence data corresponding to a gene sequence and coding sequence data corresponding to a plurality of coding sequences within the gene sequence;
  
  identifying, by the computer following a set of primer selection rules, primer pair data within the gene sequence data, the primer pair data corresponding to a pair of primer sequences for one of the coding sequences, the set of primer selection rules including a first rule specifying that the primer pair data for the coding sequence be obtained for a predetermined annealing temperature;
  
  the set of primer selection rules including a second rule specifying that, based on a comparison of the primer pair data and gene family data, wherein the gene family data represents a gene family member of the gene sequence other than the gene sequence, stored in a file, the primer pair data for the coding sequence must fail to match the gene family data;
  
  storing the primer pair data;
  
  repeating the acts of identifying and storing such that primer pair data are obtained for each coding sequence of the plurality of coding sequences at the predetermined annealing temperature; and
  
  simultaneously amplifying the plurality of coding sequences in gene sequences from three or more individuals at the predetermined annealing temperature using the identified pairs of primer sequences, such that a plurality of amplified coding sequences from the three or more individuals are obtained.
- Dependent claims 2 to 13:
  - 2. The method of claim 1, wherein the first rule further specifies that each primer sequence have a length that falls within one or more predetermined ranges of lengths.
  - 3. The method of claim 1, wherein the set of primer selection rules includes a third rule specifying that a single primer pair be identified for two coding regions if one coding region is within a predetermined number of nucleotide base identifiers from the other coding region.
  - 4. The method of claim 1, further comprising:
    - sequencing the plurality of amplified coding sequences to produce a plurality of nucleotide base identifier strings.
  - 5. The method of claim 4, wherein the plurality of nucleotide base identifier strings includes nucleotide base identifiers represented by the letters G, A, T, and C.
  - 6. The method of claim 5, further comprising:
    - positionally aligning, by the computer, the plurality of nucleotide base identifier strings to produce a plurality of aligned nucleotide base identifier strings.
  - 7. The method of claim 6, further comprising:
    - performing, by the computer, a comparison amongst aligned nucleotide base identifiers at each nucleotide base position of the plurality of aligned nucleotide base identifier strings.
  - 8. The method of claim 7, performing the following additional acts at each nucleotide base position where a difference amongst aligned nucleotide base identifiers exists:
    - reading, by the computer, nucleotide base quality information associated with the aligned nucleotide base identifiers where the difference exists;
      
      comparing, by the computer, the nucleotide base quality information with predetermined qualification data;
      
      visually displaying, from the computer, the nucleotide base quality information for acceptance or rejection; and
      
      if the nucleotide base quality information meets the predetermined qualification data and is accepted, providing and storing resulting data that identifies where the difference amongst the aligned base identifiers exists.
  - 9. The method of claim 8, wherein the resulting data comprise single nucleotide polymorphism (SNP) identification data.
  - 10. The method of claim 8, wherein the nucleotide base quality information comprises one or more phred values.
  - 11. The method of claim 9, wherein after providing and storing all resulting data that identifies where the differences amongst the aligned nucleotide base identifiers exist, performing the following additional acts for each aligned nucleotide base identifier at each nucleotide base position where a difference exists:
    - comparing, by the computer, the nucleotide base identifier with a predetermined nucleotide base identifier to identify whether the nucleotide base identifier is a variant; and
      
      providing and storing, by the computer, additional resulting data that identifies whether the nucleotide base identifier is a variant.
  - 12. The method of claim 11, wherein the additional resulting data comprises haplotype identification data.
  - 13. The method of claim 12, wherein providing and storing additional resulting data comprises providing and storing a binary value of ‘0’ for those nucleotide base identifiers that are identified as variants and a binary value of ‘1’ for those nucleotide base identifiers that are not.
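
Claims 6 to 10 describe comparing aligned base strings column by column and checking base quality wherever the strings disagree. A minimal Python sketch of that idea follows. The function name, the default threshold of phred 20 (a base-call error probability of 1%), and the choice to require every call in a differing column to meet the threshold are assumptions, not details taken from the claims. The visual accept/reject step of claim 8 is omitted here.

```python
from typing import List, Sequence


def candidate_snp_positions(aligned: Sequence[str],
                            quals: Sequence[Sequence[int]],
                            min_phred: int = 20) -> List[int]:
    """Return 0-based alignment columns where the aligned strings differ and
    every base call in that column meets the (assumed) phred threshold."""
    if not aligned:
        return []
    width = len(aligned[0])
    if (len(quals) != len(aligned)
            or any(len(s) != width for s in aligned)
            or any(len(q) != width for q in quals)):
        raise ValueError("aligned strings and quality rows must share one length")

    positions = []
    for col in range(width):
        column = {s[col].upper() for s in aligned}
        if len(column) < 2:
            continue  # all strings agree at this position
        # Quality is consulted only where a difference exists.
        if all(q[col] >= min_phred for q in quals):
            positions.append(col)
    return positions
```

Note that gap characters, if the alignment contains any, are treated like any other symbol in this sketch, so a column with a gap counts as a difference.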

14. A computer program product comprising:
- a computer-usable storage medium;
  
  computer-readable program code embodied on said computer-usable storage medium; and
  
  the computer-readable program code for effecting the following acts on a computer;
  
  reading gene sequence data corresponding to a gene sequence and coding sequence data corresponding to a plurality of coding sequences within the gene sequence;
  
  identifying primer pair data within the gene sequence data by following a set of primer selection rules, the primer pair data corresponding to a pair of primer sequences for one of the coding sequences, the set of primer selection rules including a first rule specifying that the primer pair data for the coding sequence be obtained for a predetermined annealing temperature;
  
  the set of primer selection rules including a second rule specifying that, based on a comparison of the primer pair data and gene family data, wherein the gene family data represents a gene family member of the gene sequence other than the gene sequence, stored in a file, the primer pair data for the coding sequence must fail to match the gene family data;
  
  storing the primer pair data; and
  
  repeating the acts of identifying and storing such that primer pair data are obtained for each coding sequence of the plurality of coding sequences at the predetermined annealing temperature, so that the plurality of coding sequences can be simultaneously amplified in gene sequences from three or more individuals at the predetermined annealing temperature using the identified pairs of primer sequences to produce a plurality of amplified coding sequences from the three or more individuals.
- Dependent claims 15 to 26:
  - 15. The computer program product of claim 14, wherein the first rule further specifies that each primer sequence have a length that falls within one or more predetermined ranges of lengths.
  - 16. The computer program product of claim 14, wherein the set of primer selection rules includes a third rule specifying that a single primer pair be identified for two coding regions if one coding region is within a predetermined number of nucleotide base identifiers from the other coding region.
  - 17. The computer program product of claim 14, wherein the plurality of amplified coding sequences are sequenced to produce a plurality of nucleotide base identifier strings.
  - 18. The computer program product of claim 17, wherein the plurality of nucleotide base identifier strings includes nucleotide base identifiers represented by the letters G, A, T, and C.
  - 19. The computer program product of claim 18, wherein the computer-readable program code is for effecting the following further acts on the computer:
    - positionally aligning the plurality of nucleotide base identifier strings to produce a plurality of aligned nucleotide base identifier strings.
  - 20. The computer program product of claim 19, wherein the computer-readable program code is for effecting the following further acts on the computer:
    - performing a comparison amongst aligned nucleotide base identifiers at each nucleotide base position of the plurality of aligned nucleotide base identifier strings.
  - 21. The computer program product of claim 20, wherein the computer-readable program code is for effecting the following additional acts at each nucleotide base position where a difference amongst aligned nucleotide base identifiers exists:
    - reading nucleotide base quality information associated with the aligned nucleotide base identifiers where the difference exists;
      
      comparing the nucleotide base quality information with predetermined qualification data;
      
      visually displaying the nucleotide base quality information for acceptance or rejection; and
      
      if the nucleotide base quality information meets the predetermined qualification data and is accepted, providing and storing resulting data that identifies where the difference amongst the aligned base identifiers exists.
  - 22. The computer program product of claim 21, wherein the resulting data comprise single nucleotide polymorphism (SNP) identification data.
  - 23. The computer program product of claim 21, wherein the nucleotide base quality information comprises one or more phred values.
  - 24. The computer program product of claim 22, wherein after providing and storing all resulting data that identifies where the differences amongst the aligned nucleotide base identifiers exist, performing the following additional acts for each aligned nucleotide base identifier at each nucleotide base position where such difference exists:
    - comparing the nucleotide base identifier with a predetermined nucleotide base identifier to identify whether the nucleotide base identifier is a variant; and
      
      providing and storing additional resulting data that identifies whether the nucleotide base identifier is a variant.
  - 25. The computer program product of claim 24, wherein the additional resulting data comprises haplotype identification data.
  - 26. The computer program product of claim 25, wherein providing and storing additional resulting data comprises providing and storing a binary value of ‘0’ for those nucleotide base identifiers that are identified as variants and a binary value of ‘1’ for those nucleotide base identifiers that are not.
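
The binary encoding recited in claims 13 and 26 assigns 0 to a base that is a variant and 1 to a base that is not. One way this could look in Python is sketched below. The function name, the use of a reference string as the "predetermined nucleotide base identifier", and the per-string bit-string output are assumptions made for illustration. Whether such a bit string would serve directly as haplotype identification data is not stated in this record.

```python
from typing import List, Sequence


def encode_variants(aligned: Sequence[str],
                    positions: Sequence[int],
                    reference: str) -> List[str]:
    """For each aligned string, emit one character per listed position:
    '0' if the base differs from the reference (variant), '1' if it matches."""
    ref = reference.upper()
    return [
        "".join("1" if s[pos].upper() == ref[pos] else "0" for pos in positions)
        for s in aligned
    ]
```

For example, with the reference `GATTACA` and the single position 2, the strings `GATTACA` and `GACTACA` would encode as `1` and `0` respectively.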

Current Assignee: DNA Diagnostics Center Incorporated (Eurofins Scientific SE)
Original Assignee: DNAPrint Genomics, Inc.
Inventors: Frudakis, Tony Nick
Primary Examiner(s): Marschel, Ardin H.
Assistant Examiner(s): Lin, Jerry
Application Number: US09/964,059
Publication Number: US 20030171875A1
Time in Patent Office: 1,819 Days
Field of Search: 702/19, 702/20, 435/6
US Class Current: 702/20
CPC Class Codes:
G16B 30/00 ICT specially adapted for s...
G16B 30/10 Sequence alignment; Homolog...
