Virtual representations of nucleotide sequences
Abstract
The invention provides oligonucleotide probes that can be used to hybridize to a representation of nucleic acid sequences. Compositions containing the probes such as microarrays are also provided. The invention also provides methods of using these probes and compositions in therapeutic, diagnostic, and research applications. Systems and methods for using a word counting algorithm that can quickly and accurately count the number of times a particular string of characters (i.e., nucleotides) appears in a nucleotide sequence (e.g., a genome) are provided. This algorithm can be used to identify the oligonucleotide probes of the invention. The algorithm uses a transform of a genome and an auxiliary data structure to count the number of times a particular word occurs in the genome.
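
The abstract does not name the transform or the auxiliary data structure in this passage. As a rough illustration of word counting only, the sketch below swaps in a plain sorted suffix index (a suffix array) with binary search. The function names `build_suffix_array` and `count_word` are hypothetical, and the naive construction is suitable only for short toy sequences, not a genome.

```python
def build_suffix_array(genome: str) -> list:
    """Return suffix start positions sorted by the suffix text.

    Naive construction for toy inputs; an assumption, not the patent's method.
    """
    return sorted(range(len(genome)), key=lambda i: genome[i:])


def count_word(genome: str, suffix_array: list, word: str) -> int:
    """Count (possibly overlapping) occurrences of `word` in `genome`."""
    m = len(word)

    def prefix(rank: int) -> str:
        start = suffix_array[rank]
        return genome[start:start + m]

    # Lower bound: first suffix whose m-letter prefix is >= word.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < word:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-letter prefix is > word.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= word:
            lo = mid + 1
        else:
            hi = mid
    return lo - first


toy_genome = "ACGTACGTAC"
toy_index = build_suffix_array(toy_genome)
print(count_word(toy_genome, toy_index, "ACG"))
```

Because the suffixes are sorted, all suffixes that begin with a given word sit in one contiguous block, so the count is the width of that block and each query costs two binary searches.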
61 Claims
1. A plurality of nucleic acid molecules, wherein:
  (a) said plurality consists of N nucleic acid molecules;
  (b) each of said plurality of nucleic acid molecules has a nucleotide sequence that hybridizes specifically to a sequence in a genome of Z basepairs; and
  (c) at least P % of said plurality of nucleic acid molecules have
    (i) a length of K nucleotides;
    (ii) hybridizes specifically to at least one nucleic acid molecule present in or predicted to be present in a representation derived from said genome, said representation having no more than R % of the complexity of said genome; and
    (iii) no more than X exact matches of L1 nucleotides to said genome and no fewer than Y exact matches of L1 nucleotides to said genome; and
  wherein:
  (A) N ≥ 500;
  (B) Z ≥ 1 × 10^8;
  (C) 300 ≥ K ≥ 30;
  (D) 70 ≥ R ≥ 0.001;
  (E) P = (N × R + (3 × sigma)) / N;
  (F) sigma is the square root of (N × R × (1 − R));
  (G) the integer closest to (log_4(Z) + 2) ≥ L1 ≥ the integer closest to log_4(Z);
  (H) X is the integer closest to D1 × (K − L1 + 1);
  (I) Y is the integer closest to D2 × (K − L1 + 1);
  (J) 1.5 ≥ D1 ≥ 1; and
  (K) 1 ≥ D2 ≥ 0.5.
Dependent claims: 2-14, 28-48.
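
The numeric limitations (E) through (K) of claim 1 can be worked through for concrete values. The sketch below is an illustration, not part of the claim. It rests on two labeled assumptions: R, which the claim states as a percentage, is converted to a fraction inside formulas (E) and (F), and "the integer closest to" is implemented as round-half-up. All function names are hypothetical.

```python
import math


def nearest_int(x: float) -> int:
    """Closest integer, with halves rounded up (an assumption)."""
    return math.floor(x + 0.5)


def l1_bounds(z: float) -> tuple:
    """Limitation (G): allowed range of the word length L1 for a genome of Z bp."""
    log4_z = math.log(z, 4)
    return nearest_int(log4_z), nearest_int(log4_z + 2)


def match_limits(k: int, l1: int, d1: float, d2: float) -> tuple:
    """Limitations (H) and (I): X and Y from the K - L1 + 1 words in an oligo."""
    words = k - l1 + 1
    return nearest_int(d1 * words), nearest_int(d2 * words)


def p_threshold_percent(n: int, r_percent: float) -> float:
    """Limitations (E) and (F), with R taken as a fraction (an assumption)."""
    r = r_percent / 100.0
    sigma = math.sqrt(n * r * (1.0 - r))
    return 100.0 * (n * r + 3.0 * sigma) / n


# Example values chosen for illustration: a 3 x 10^9 bp genome, 70-mers.
low, high = l1_bounds(3e9)
x, y = match_limits(k=70, l1=high, d1=1.2, d2=0.8)
print(low, high, x, y, p_threshold_percent(10000, 10))
```

Read this way, N × R is the number of molecules expected to fall in the representation by chance, and sigma is the binomial standard deviation, so P sits three standard deviations above that chance level.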
15. A plurality of nucleic acid molecules, wherein:
  (a) said plurality consists of at least 100 nucleic acid molecules;
  (b) each of said plurality of nucleic acid molecules has a nucleotide sequence that is at least 90% identical to a sequence in a genome of at least Z basepairs; and
  (c) at least P % of said plurality of nucleic acid molecules have
    (i) a length of K nucleotides;
    (ii) at least 90% sequence identity to at least one nucleic acid molecule present in or predicted to be present in a representation derived from said genome, said representation having no more than R % of the complexity of said genome; and
    (iii) no more than X exact matches of L1 nucleotides to said representation and no fewer than Y exact matches of L1 nucleotides to said representation; and
  wherein:
  (A) Z ≥ 1 × 10^8;
  (B) 300 ≥ K ≥ 30;
  (C) 70 ≥ R ≥ 0.001;
  (D) P ≥ 90 − R;
  (E) the integer closest to (log_4((Z × R)/100) + 2) ≥ L1 ≥ the integer closest to log_4((Z × R)/100);
  (F) X is the integer closest to D1 × (K − L1 + 1);
  (G) Y is the integer closest to D2 × (K − L1 + 1);
  (H) 1.5 ≥ D1 ≥ 1; and
  (I) 1 > D2 ≥ 0.5.
Dependent claims: 16-27.
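
In claim 15 the word length L1 is tied to the size of the representation, (Z × R)/100, rather than to the whole genome. A short worked example of limitations (D) and (E) follows. It is a sketch only: the function names are hypothetical, and "the integer closest to" is assumed to mean round-half-up.

```python
import math


def nearest_int(x: float) -> int:
    """Closest integer, with halves rounded up (an assumption)."""
    return math.floor(x + 0.5)


def claim15_l1_bounds(z: float, r_percent: float) -> tuple:
    """Limitation (E): L1 range from the representation size (Z x R)/100."""
    representation_bp = z * r_percent / 100.0
    log4_size = math.log(representation_bp, 4)
    return nearest_int(log4_size), nearest_int(log4_size + 2)


def claim15_min_p(r_percent: float) -> float:
    """Limitation (D): the smallest P allowed, P >= 90 - R."""
    return 90.0 - r_percent


# Example values chosen for illustration: 3 x 10^9 bp genome, R = 2 %.
print(claim15_l1_bounds(3e9, 2), claim15_min_p(2))
```

For these example values the representation covers about 6 × 10^7 bp, so the admissible word lengths are shorter than those computed from the full genome size.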
49. A method of identifying an oligonucleotide that has:
  (a) a length of K nucleotides;
  (b) at least 90% sequence identity to at least one nucleic acid molecule present in or predicted to be present in a representation derived from a genome of at least Z basepairs, and
  (c) no more than X exact matches of L1 nucleotides to said genome and no fewer than Y exact matches of L1 nucleotides to said genome;
  wherein:
  (i) Z ≥ 1 × 10^8;
  (ii) 300 ≥ K ≥ 30;
  (iii) the integer closest to (log_4(Z) + 2) ≥ L1 ≥ the integer closest to log_4(Z);
  (iv) X is the integer closest to D1 × (K − L1 + 1);
  (v) Y is the integer closest to D2 × (K − L1 + 1);
  (vi) 1.5 ≥ D1 ≥ 1; and
  (vii) 1 ≥ D2 ≥ 0.5;
  the method comprising:
  (A) cleaving said genome in silico with a restriction enzyme to generate a plurality of predicted nucleic acid molecules,
  (B) generating a virtual representation of said genome by identifying predicted nucleic acid molecules each having a length of 200-1,200 basepairs, inclusive;
  (C) selecting an oligonucleotide having a length of 30-300 nucleotides, inclusive, and at least 90% sequence identity to a predicted nucleic acid molecule in (B);
  (D) identifying all of the stretches of L1 nucleotides occurring in said oligonucleotide; and
  (E) confirming that the number of times each of said stretches occurs in said genome satisfies the requirements of (c).
Dependent claims: 50-61.
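
Steps (A) through (E) of claim 49 can be sketched on a toy sequence. This is an illustration under stated assumptions, not the patented implementation: the six-base recognition site and cut offset are example choices, the size window is shrunk to fit the toy sequence, the oligonucleotide is taken as an exact substring of a fragment (100% identity), condition (c) is read as a bound on the total number of L1-mer matches, and every function name is hypothetical.

```python
def digest(genome: str, site: str, cut_offset: int) -> list:
    """Step (A): cut at every occurrence of `site`, `cut_offset` bases in."""
    cuts = []
    pos = genome.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = genome.find(site, pos + 1)
    bounds = [0] + cuts + [len(genome)]
    return [genome[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def virtual_representation(fragments: list, min_len: int = 200,
                           max_len: int = 1200) -> list:
    """Step (B): keep fragments inside the size window, inclusive."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


def l1_words(oligo: str, l1: int) -> list:
    """Step (D): all K - L1 + 1 overlapping stretches of L1 nucleotides."""
    return [oligo[i:i + l1] for i in range(len(oligo) - l1 + 1)]


def count_overlapping(genome: str, word: str) -> int:
    """Naive exact-match count; a genome-scale index would replace this."""
    count, pos = 0, genome.find(word)
    while pos != -1:
        count += 1
        pos = genome.find(word, pos + 1)
    return count


def total_matches(genome: str, oligo: str, l1: int) -> int:
    """Sum of genome-wide counts over every L1-mer of the oligo."""
    return sum(count_overlapping(genome, w) for w in l1_words(oligo, l1))


def passes(genome: str, oligo: str, l1: int, x: int, y: int) -> bool:
    """Step (E), read as Y <= total matches <= X (an assumption)."""
    return y <= total_matches(genome, oligo, l1) <= x


# Toy run: 28-base "genome", example site AGATCT cut after its first base.
toy_genome = "AAACAGATCTCCCCGGGGTTAGATCTTT"
fragments = digest(toy_genome, site="AGATCT", cut_offset=1)
representation = virtual_representation(fragments, min_len=10, max_len=20)
candidate = representation[0][:8]  # Step (C): exact substring of a fragment
print(candidate, total_matches(toy_genome, candidate, l1=4))
```

At real scale the naive `count_overlapping` scan would be the bottleneck, which is where a precomputed word-counting index over the genome becomes useful.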
Specification