Virtual representations of nucleotide sequences

US 20050032095A1
Filed: 05/21/2004
Published: 02/10/2005
Est. Priority Date: 05/23/2003
Status: Active Grant



Abstract

The invention provides oligonucleotide probes that can be used to hybridize to a representation of nucleic acid sequences. Compositions containing the probes such as microarrays are also provided. The invention also provides methods of using these probes and compositions in therapeutic, diagnostic, and research applications. Systems and methods for using a word counting algorithm that can quickly and accurately count the number of times a particular string of characters (i.e., nucleotides) appears in a nucleotide sequence (e.g., a genome) are provided. This algorithm can be used to identify the oligonucleotide probes of the invention. The algorithm uses a transform of a genome and an auxiliary data structure to count the number of times a particular word occurs in the genome.
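
To make the quantity being counted concrete, the following is a minimal Python sketch of a naive word count: it tallies every overlapping word of a given length in a sequence. It is a baseline for illustration only, not the transform-based algorithm the abstract refers to; the function name `word_counts` and the toy sequence are assumptions.

```python
from collections import Counter


def word_counts(sequence: str, length: int) -> Counter:
    """Count every overlapping word of `length` characters in `sequence`."""
    last_start = len(sequence) - length + 1
    return Counter(sequence[i:i + length] for i in range(last_start))


# Toy sequence (hypothetical, far smaller than any genome).
counts = word_counts("ACGTACGTTACG", 3)
print(counts["ACG"])  # 3
print(counts["GGG"])  # 0 (a Counter returns 0 for words it never saw)
```

A table like this holds one entry per distinct word, so for a genome of billions of characters it grows too large to be practical, which is the motivation for counting against a transform of the genome instead.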

120 Citations

61 Claims

1. A plurality of nucleic acid molecules, wherein:
- (a) said plurality consists of N nucleic acid molecules;
  
  (b) each of said plurality of nucleic acid molecules has a nucleotide sequence that hybridizes specifically to a sequence in a genome of Z basepairs; and
  
  (c) at least P % of said plurality of nucleic acid molecules have (i) a length of K nucleotides;
  
  (ii) hybridizes specifically to at least one nucleic acid molecule present in or predicted to be present in a representation derived from said genome, said representation having no more than R % of the complexity of said genome; and
  
  (iii) no more than X exact matches of L₁ nucleotides to said genome and no fewer than Y exact matches of L₁ nucleotides to said genome; and
  
  wherein:
  
  (A) N ≧ 500;
  
  (B) Z ≧ 1×10⁸;
  
  (C) 300 ≧ K ≧ 30;
  
  (D) 70 ≧ R ≧ 0.001;
  
  (E) P = (N×R + (3×sigma))/N;
  
  (F) sigma is the square root of (N×R×(1−R));
  
  (G) the integer closest to (log₄(Z)+2) ≧ L₁ ≧ the integer closest to log₄(Z);
  
  (H) X is the integer closest to D₁×(K−L₁+1);
  
  (I) Y is the integer closest to D₂×(K−L₁+1);
  
  (J) 1.5 ≧ D₁ ≧ 1; and
  
  (K) 1 ≧ D₂ ≧ 0.5.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48)
- - 2. The plurality of nucleic acid molecules of claim 1, wherein N is selected from the group consisting of at least 500; at least 1,000; at least 2,500; at least 5,000; at least 10,000; at least 25,000; at least 50,000; at least 85,000; at least 190,000; at least 350,000; and at least 550,000 nucleic acid molecules.
  - 3. The plurality of nucleic acid molecules of claim 1, wherein Z is selected from the group consisting of at least 3×10⁸, at least 1×10⁹, at least 1×10¹⁰ and at least 1×10¹¹.
  - 4. The plurality of nucleic acid molecules of claim 1, wherein the genome is a mammalian genome.
  - 5. The plurality of nucleic acid molecules of claim 4, wherein the genome is a human genome.
  - 6. The plurality of nucleic acid molecules of claim 1, wherein R is selected from the group consisting of 0.001, 1, 2, 4, 10, 15, 20, 30, 40, 50 and 70.
  - 7. The plurality of nucleic acid molecules of claim 1, wherein P is selected from the group consisting of at least 70, at least 80, at least 90, at least 95, at least 97 and at least 99.
  - 8. The plurality of nucleic acid molecules of claim 1, wherein D₁ is 1.
  - 9. The plurality of nucleic acid molecules of claim 1, wherein D₂ is 1.
  - 10. The plurality of nucleic acid molecules of claim 1, wherein L₁ is selected from the group consisting of 15, 16, 17, 18, 19, 20, 21, 22, 23 and 24.
  - 11. The plurality of nucleic acid molecules of claim 1, wherein each of said P % of said plurality of nucleic acid molecules further have no more than A exact matches of L₂ nucleotides to said genome and no fewer than B exact matches of L₂ nucleotides to said genome;
    - and wherein (a) L₁ > L₂ ≧ the integer closest to log₄(Z)−3;
      
      (b) A is the integer closest to D₃×((K−L₂+1)×(Z/4^L₂));
      
      (c) B is the integer closest to D₄×((K−L₂+1)×(Z/4^L₂));
      
      (d) 4 ≧ D₃ ≧ 1; and
      
      (e) 1 > D₄ ≧ 0.5.
  - 12. The plurality of nucleic acid molecules of claim 11, wherein D₃ ≦ 3, 2, or 1.5.
  - 13. The plurality of nucleic acid molecules of claim 1, wherein said P % of said plurality of nucleic acid molecules have at least 91, 92, 93, 94, 95, 96, 97, 98, 99 or 100% sequence identity to at least one nucleic acid molecule present or predicted to be present in said representation.
  - 14. The plurality of nucleic acid molecules of claim 1, wherein K is selected from the group consisting of 40, 50, 60, 70, 80, 90, 100, 110, 120, 140, 160, 180, 200 and 250.
  - 28. The plurality of nucleic acid molecules of claim 1, wherein said representation is produced by sequence-specific cleavage of said genome.
  - 29. The plurality of nucleic acid molecules of claim 28, wherein sequence-specific cleavage is accomplished with a restriction endonuclease.
  - 30. The plurality of nucleic acid molecules of claim 1, wherein said representation is a compound representation.
  - 31. The plurality of nucleic acid molecules of claim 1, wherein said plurality of nucleic acid molecules are immobilized on the surface of a solid phase.
  - 32. The plurality of nucleic acid molecules of claim 31, wherein said solid phase is selected from the group consisting of a nylon membrane, a nitrocellulose membrane, a glass slide, and a microsphere.
  - 33. The plurality of nucleic acid molecules of claim 31, wherein the positions of said plurality of nucleic acid molecules on said solid phase are known.
  - 34. The plurality of nucleic acid molecules of claim 33, wherein said plurality of nucleic acid molecules is on a microarray.
  - 35. The plurality of nucleic acid molecules of claim 33, wherein said plurality of nucleic acid molecules is immobilized on microspheres.
  - 36. A method of analyzing a nucleic acid sample, said method comprising:
    - (a) hybridizing the sample to the plurality of nucleic acid molecules of claim 1;
      
      and (b) determining to which of said plurality of nucleic acid molecules said sample hybridizes.
  - 37. The method of claim 36, wherein said sample is a representation.
  - 38. The method of claim 36, wherein said plurality of nucleic acid molecules is immobilized on the surface of a solid phase.
  - 39. The method of claim 38, wherein said solid phase is selected from the group consisting of a nylon membrane, a nitrocellulose membrane, a glass slide, and a microsphere.
  - 40. The method of claim 38, wherein the positions of said plurality of nucleic acid molecules on said solid phase are known.
  - 41. The method of claim 40, wherein said plurality of nucleic acid molecules is on a microarray.
  - 42. The method of claim 38, wherein said plurality of nucleic acid molecules is immobilized on microspheres.
  - 43. A method of analyzing copy number variation of a genomic sequence between two genomes, said method comprising:
    - (a) providing a first genome and a second genome;
      
      (b) preparing detectably labeled representations of each genome using at least one identical restriction enzyme;
      
      (c) contacting said representations with the plurality of nucleic acid molecules of claim 1 or 31 to allow hybridization between the representations and said plurality of nucleic acid molecules; and
      
      (d) comparing levels of the hybridization of said representations, wherein a difference in said levels indicates a copy number variation between the two genomes with regard to a genomic sequence targeted by said member.
  - 44. The method of claim 43, wherein the two representations are distinguishably labeled.
  - 45. The method of claim 44, wherein said representations are simultaneously contacted with said plurality of nucleic acid molecules.
  - 46. A method of comparing methylation status of a genomic sequence between two genomes, said method comprising:
    - (a) providing a first genome and a second genome;
      
      (b) preparing detectably labeled representations of each genome using at least one identical enzyme, wherein said representations are prepared by a methylation-sensitive method;
      
      (c) contacting said representations with the plurality of nucleic acid molecules of claim 1 or 31 to allow hybridization between the representations and said plurality of nucleic acid molecules; and
      
      (d) comparing levels of the hybridization of said representations, wherein a difference in said levels indicates a difference in methylation status between the two genomes with regard to a genomic sequence targeted by said member.
  - 47. The method of claim 46, wherein said methylation sensitive method involves preparing a first representation using a first restriction enzyme and a second representation using a second restriction enzyme, wherein said first and second restriction enzymes recognize the same restriction site but one is methylation-sensitive and the other is not.
  - 48. The method of claim 46, wherein said methylation sensitive method involves chemical cleavage of methyl-C sequences after making a representation with a non-methylation sensitive restriction enzyme, such that a representation derived from a methylated genome is distinguishable from a representation derived from a non-methylated genome.
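
The numeric limits in claim 1 can be worked through with a short Python sketch. It is illustrative only: the function names are hypothetical, ties in "the integer closest to" are rounded up because the claim does not say how to break them, and R is converted from a percentage to a fraction before applying the sigma formula, which is an assumption made so that the term (1−R) stays between 0 and 1.

```python
import math


def nearest(x: float) -> int:
    """Integer closest to x; exact halves round up (assumed tie rule)."""
    return math.floor(x + 0.5)


def l1_bounds(z: float) -> tuple[int, int]:
    """Item (G): lower and upper limits on the word length L1 for a genome of z bp."""
    base = math.log(z, 4)
    return nearest(base), nearest(base + 2)


def match_bounds(k: int, l1: int, d1: float, d2: float) -> tuple[int, int]:
    """Items (H) and (I): X and Y for an oligonucleotide of k nucleotides."""
    words = k - l1 + 1  # number of overlapping L1-words in the oligonucleotide
    return nearest(d1 * words), nearest(d2 * words)


def min_percent(n: int, r_percent: float) -> float:
    """Items (E) and (F): P in percent, with R read as a fraction (assumption)."""
    r = r_percent / 100.0
    sigma = math.sqrt(n * r * (1.0 - r))
    return 100.0 * (n * r + 3.0 * sigma) / n


# Illustrative values: a 3e9 bp genome, K = 70, N = 10,000 probes, R = 2 %.
print(l1_bounds(3e9))                  # (16, 18)
print(match_bounds(70, 17, 1.5, 0.5))  # (81, 27)
print(round(min_percent(10_000, 2), 2))
```

Under this reading, P is the share of probes expected to fall in the representation by chance (N×R) plus three standard deviations of a binomial count, so a set that meets the limit is enriched well beyond random selection.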

15. A plurality of nucleic acid molecules, wherein:
- (a) said plurality consists of at least 100 nucleic acid molecules;
  
  (b) each of said plurality of nucleic acid molecules has a nucleotide sequence that is at least 90% identical to a sequence in a genome of at least Z basepairs; and
  
  (c) at least P % of said plurality of nucleic acid molecules have (i) a length of K nucleotides;
  
  (ii) at least 90% sequence identity to at least one nucleic acid molecule present in or predicted to be present in a representation derived from said genome, said representation having no more than R % of the complexity of said genome; and
  
  (iii) no more than X exact matches of L₁ nucleotides to said representation and no fewer than Y exact matches of L₁ nucleotides to said representation; and
  
  wherein:
  
  (A) Z ≧ 1×10⁸;
  
  (B) 300 ≧ K ≧ 30;
  
  (C) 70 ≧ R ≧ 0.001;
  
  (D) P ≧ 90−R;
  
  (E) the integer closest to (log₄((Z×R)/100)+2) ≧ L₁ ≧ the integer closest to log₄((Z×R)/100);
  
  (F) X is the integer closest to D₁×(K−L₁+1);
  
  (G) Y is the integer closest to D₂×(K−L₁+1);
  
  (H) 1.5 ≧ D₁ ≧ 1; and
  
  (I) 1 > D₂ ≧ 0.5.
- View Dependent Claims (16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27)
- - 16. The plurality of nucleic acid molecules of claim 15, comprising at least 500; at least 1,000; at least 2,500; at least 5,000; at least 10,000; at least 25,000; at least 50,000; at least 85,000; at least 190,000; at least 350,000; or at least 550,000 nucleic acid molecules.
  - 17. The plurality of nucleic acid molecules of claim 15, wherein Z is selected from the group consisting of at least 3×10⁸, at least 1×10⁹, at least 1×10¹⁰ and at least 1×10¹¹.
  - 18. The plurality of nucleic acid molecules of claim 15, wherein the genome is a mammalian genome.
  - 19. The plurality of nucleic acid molecules of claim 18, wherein the genome is a human genome.
  - 20. The plurality of nucleic acid molecules of claim 15, wherein R is selected from the group consisting of 0.001, 1, 2, 4, 10, 15, 20, 30, 40, 50 and 70.
  - 21. The plurality of nucleic acid molecules of claim 15, wherein P is selected from the group consisting of at least 70, at least 80, at least 90, at least 95, at least 97 and at least 99.
  - 22. The plurality of nucleic acid molecules of claim 15, wherein D₁ is 1.
  - 23. The plurality of nucleic acid molecules of claim 15, wherein D₂ is 1.
  - 24. The plurality of nucleic acid molecules of claim 15, wherein L₁ is selected from the group consisting of 15, 16, 17, 18, 19, 20, 21, 22, 23 and 24.
  - 25. The plurality of nucleic acid molecules of claim 15, wherein each of said P % of said plurality of nucleic acid molecules further have no more than A exact matches of L₂ nucleotides to said genome and no fewer than B exact matches of L₂ nucleotides to said genome;
    - and wherein (a) L₁ ≧ L₂ ≧ the integer closest to log₄(Z)−3;
      
      (b) A is the integer closest to D₃×((K−L₂+1)×(Z/4^L₂));
      
      (c) B is the integer closest to D₄×((K−L₂+1)×(Z/4^L₂));
      
      (d) 4 ≧ D₃ ≧ 1; and
      
      (e) 1 > D₄ ≧ 0.5.
  - 26. The plurality of nucleic acid molecules of claim 15, wherein said P % of said plurality of nucleic acid molecules have at least 91, 92, 93, 94, 95, 96, 97, 98, 99 or 100% sequence identity to at least one nucleic acid molecule present or predicted to be present in said representation.
  - 27. The plurality of nucleic acid molecules of claim 15, wherein K is selected from the group consisting of 40, 50, 60, 70, 80, 90, 100, 110, 120, 140, 160, 180, 200 and 250.
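
Claim 15 sizes the word length L₁ against the representation rather than the whole genome, using (Z×R)/100 as the effective size. A worked example follows, with values chosen purely for illustration from ranges the claims allow (Z = 3×10⁹, R = 2, K = 70, L₁ = 15, D₁ = 1, D₂ = 0.5); none of these is stated in the source as a preferred setting.

```latex
\begin{aligned}
\frac{Z \times R}{100} &= \frac{3\times10^{9} \times 2}{100} = 6\times10^{7} \\
\log_4\!\left(6\times10^{7}\right) &\approx 12.92
  \;\Rightarrow\; 13 \le L_1 \le 15 \\
P &\ge 90 - R = 88 \\
K - L_1 + 1 &= 70 - 15 + 1 = 56 \\
X &= D_1 \times 56 = 56 \\
Y &= D_2 \times 56 = 28
\end{aligned}
```

With these values, at least 88 % of the molecules would need between 28 and 56 exact 15-nucleotide matches to the representation.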

49. A method of identifying an oligonucleotide that has:
- (a) a length of K nucleotides;
  
  (b) at least 90% sequence identity to at least one nucleic acid molecule present in or predicted to be present in a representation derived from a genome of at least Z basepairs; and
  
  (c) no more than X exact matches of L₁ nucleotides to said genome and no fewer than Y exact matches of L₁ nucleotides to said genome;
  
  wherein:
  
  (i) Z ≧ 1×10⁸;
  
  (ii) 300 ≧ K ≧ 30;
  
  (iii) the integer closest to (log₄(Z)+2) ≧ L₁ ≧ the integer closest to log₄(Z);
  
  (iv) X is the integer closest to D₁×(K−L₁+1);
  
  (v) Y is the integer closest to D₂×(K−L₁+1);
  
  (vi) 1.5 ≧ D₁ ≧ 1; and
  
  (vii) 1 ≧ D₂ ≧ 0.5;
  
  the method comprising:
  
  (A) cleaving said genome in silico with a restriction enzyme to generate a plurality of predicted nucleic acid molecules;
  
  (B) generating a virtual representation of said genome by identifying predicted nucleic acid molecules each having a length of 200-1,200 basepairs, inclusive;
  
  (C) selecting an oligonucleotide having a length of 30-300 nucleotides, inclusive, and at least 90% sequence identity to a predicted nucleic acid molecule in (B);
  
  (D) identifying all of the stretches of L₁ nucleotides occurring in said oligonucleotide; and
  
  (E) confirming that the number of times each of said stretches occurs in said genome satisfies the requirements of (c).
- View Dependent Claims (50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61)
- - 50. The method of claim 49, wherein (E) comprises:
    - providing a compressed transform of said genome;
      
      providing an auxiliary data structure that includes information related to said genome; and
      
      determining a word count for the L₁ nucleotides using the compressed transform and the auxiliary data structure.
  - 51. The method of claim 49, wherein (E) comprises:
    - providing a compressed transform of said genome;
      
      iterating through each nucleotide of said stretch of L₁ nucleotides, starting with the last nucleotide and advancing to the first nucleotide one character per iteration, wherein the nucleotide corresponding to a particular iteration is stored as an index nucleotide, said iterating further comprising;
      
      defining a search region that delineates a contiguous range of nucleotides within said transform;
      
      counting the number of times the nucleotide preceding said index nucleotide occurs in said search range; and
      
      wherein said iterating ceases if no occurrences of the nucleotide preceding said index nucleotide occurs in said search range; and
      
      outputting the number of times the first nucleotide of said stretch of L₁ nucleotides is counted, this number being equivalent to the number of times said stretch of L₁ nucleotides appears in said genome.
  - 52. The method of claim 51, further comprising:
    - providing an auxiliary data structure, said auxiliary data structure comprising;
      
      a K-intervals data structure that maintains a running total of each nucleotide that has appeared in said transform up to and including a particular predetermined location in said compressed transform; and
      
      a dictionary-counts data structure that provides fast look-up access to the compressed transform; and
      
      wherein said counting and said defining are performed using said auxiliary data structure and said compressed transform.
  - 53. The method of claim 52, wherein said transform remains compressed while said counting is being performed.
  - 54. The method of claim 52, wherein said compressed transform is compressed such that every three characters in the uncompressed transform are compressed to form a byte, and wherein said counting uncompresses at most one such byte during one of said iterations.
  - 55. The method according to claim 51, wherein said genome comprises at least three billion characters.
  - 56. The method according to claim 51, wherein said compressed transform is a Burrows-Wheeler transform of the genome.
  - 57. The method according to claim 51, further comprising providing data which is based on said transform, wherein said defining comprises using said data and said index nucleotide to define said search region.
  - 58. The method according to claim 51, further comprising:
    - providing data which is based on said transform; and
      
      determining a prior nucleotide count, said prior nucleotide count being the number of times the nucleotide preceding the index nucleotide occurs in said transform before the beginning of said search region;
      
      wherein said defining comprises using said data, said index nucleotide, and said prior nucleotide count to define said search region.
  - 59. The method according to claim 58, wherein said prior nucleotide count is obtained using K-intervals, said K-intervals being stored at predetermined locations along said transform and maintaining a running total of each nucleotide that has appeared in said transform up to and including a particular predetermined location.
  - 60. A plurality of oligonucleotides each of which produced by the method of claim 49, said plurality comprising at least 500 oligonucleotides.
  - 61. A plurality of oligonucleotides each of which produced by the method of claim 49, said plurality comprising at least 1,000; at least 2,500; at least 5,000; at least 10,000; at least 25,000; at least 50,000; at least 85,000; at least 190,000; at least 350,000; or at least 550,000 oligonucleotides.
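
Claims 51 to 59 describe counting a word by walking it from its last nucleotide to its first while narrowing a search region in a Burrows-Wheeler transform, with running totals stored at intervals along the transform. A minimal Python sketch of that idea follows. The function names (`bwt_index`, `rank`, `count_word`), the `step` spacing, and the `$` sentinel are assumptions for illustration. The sketch keeps the transform uncompressed and builds it by naive suffix sorting, so it does not reproduce the byte packing of claim 54 or the dictionary-counts structure of claim 52, and it would not scale to a real genome.

```python
def bwt_index(text: str, step: int = 4):
    """Build a toy Burrows-Wheeler transform plus helper tables (uppercase ACGT input)."""
    text += "$"  # sentinel; sorts before A, C, G and T
    # Naive suffix sort: fine for a toy string, far too slow for a genome.
    order = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in order)  # i == 0 wraps to the sentinel
    # first[c]: how many characters of the text sort strictly before c.
    first, total = {}, 0
    for c in sorted(set(text)):
        first[c] = total
        total += text.count(c)
    # checkpoints[j]: running total of each character in bwt[:j * step].
    running = dict.fromkeys(first, 0)
    checkpoints = [dict(running)]
    for pos, c in enumerate(bwt, 1):
        running[c] += 1
        if pos % step == 0:
            checkpoints.append(dict(running))
    return bwt, first, checkpoints, step


def rank(index, c: str, pos: int) -> int:
    """Occurrences of c in bwt[:pos]: nearest earlier checkpoint plus a short scan."""
    bwt, _, checkpoints, step = index
    block = pos // step
    return checkpoints[block][c] + bwt.count(c, block * step, pos)


def count_word(index, word: str) -> int:
    """Backward search: number of times `word` occurs in the indexed text."""
    bwt, first, _, _ = index
    lo, hi = 0, len(bwt)  # search region starts as the whole transform
    for c in reversed(word):  # last nucleotide first
        if c not in first:
            return 0
        lo = first[c] + rank(index, c, lo)
        hi = first[c] + rank(index, c, hi)
        if lo >= hi:  # region is empty, so the word does not occur
            return 0
    return hi - lo


index = bwt_index("ACGTACGTTACG")  # hypothetical toy sequence
print(count_word(index, "ACG"))  # 3
print(count_word(index, "GGG"))  # 0
```

The checkpoints play the role of the running totals in claim 59: a count up to any position needs only the nearest stored total and a scan of at most `step − 1` characters, so a smaller `step` trades memory for speed.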

Current Assignee: Cold Spring Harbor Laboratory
Original Assignee: Cold Spring Harbor Laboratory
Inventors: Healy, John; Lucito, Robert; Wigler, Michael H.

Granted Patent: US 8,694,263 B2
Field of Search
US Class Current: 435/6
CPC Class Codes: C07H 21/04 with deoxyribosyl as saccha...
