Indexing a reference sequence for oligomer sequence mapping

US 8,738,296 B2
Filed: 02/02/2010
Issued: 05/27/2014
Est. Priority Date: 02/03/2009
Status: Active Grant



Abstract

Generating an index includes receiving a reference sequence and applying one or more key patterns to the reference sequence to obtain a plurality of keys in the index. Each of the one or more key patterns is derived based on a corresponding set of oligomer sequence relationships of a plurality of oligomer sequences that are expected to be generated from the reference, and the keys correspond to a plurality of candidate and/or validated locations in the reference sequence.
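
As an illustration of applying a key pattern, the following is a minimal Python sketch, not the patented implementation. It assumes the pattern described in claim 1 (N contiguous bases, a gap of K positions, then M contiguous bases), advances the pattern one base at a time, and stores each key with the reference locations where it occurs. The function names `make_key` and `build_index`, the in-memory dictionary, and the toy reference string are all hypothetical choices made for this example.

```python
from collections import defaultdict


def make_key(reference, pos, n, k, m):
    """Return the gapped key at `pos`, or None if the pattern runs past the end.

    The key is the N bases starting at `pos` followed by the M bases
    starting N+K positions after `pos`; the K bases in the gap are skipped.
    """
    end = pos + n + k + m
    if end > len(reference):
        return None
    return reference[pos:pos + n] + reference[pos + n + k:end]


def build_index(reference, n, k, m):
    """Map every key to the list of reference locations that produce it."""
    index = defaultdict(list)
    # Slide the pattern forward by a single base, using the same N, K, M.
    for pos in range(len(reference) - (n + k + m) + 1):
        index[make_key(reference, pos, n, k, m)].append(pos)
    return dict(index)


# Toy reference with a period of four bases, so each key occurs twice.
reference = "ACGTACGTACGT"
index = build_index(reference, n=2, k=1, m=2)
print(index["ACTA"])  # locations whose gapped key is "ACTA"
```

Because a key can occur at more than one location, each dictionary value is a list of candidate positions rather than a single position.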


28 Claims

1. A method of generating an index, the index operable to determine where, in a reference sequence, a data set of one or more related oligomer sequences maps to the reference sequence, the oligomer sequences of the data set obtained from a same fragment of genetic material, the method comprising:
- applying, with a computer system, a key pattern to the reference sequence to generate a plurality of keys, wherein the key pattern includes a first set of N contiguous positions separated by K positions from a second set of M contiguous positions, the separation being based on predicted relationships between oligomer sequences of the data set, wherein the key pattern is defined by predetermined values for N, K and M, the applying including:

  applying the key pattern to a first location of the reference sequence to obtain a first set of bases, the first set of bases including:

  N contiguous bases of the reference sequence starting from the first location, and

  M contiguous bases of the reference sequence starting from N+K positions after the first location, N, M, and K being integers greater than or equal to one;
  
  using the first set of bases to generate a first key;
  
  applying the key pattern to a plurality of other locations to generate other keys, wherein the applying of the key pattern to the first and other locations uses the same values for N, M, and K; and
  
  storing the keys in the index, the index being stored in a searchable computer readable medium;
  
  wherein each key corresponds to one or more possible locations within the reference sequence.
- Dependent claims (2 to 26):
  - 2. The method of claim 1, wherein the predicted sequence relationships comprise both variable and fixed separation distances between sequences of the data set.
  - 3. The method of claim 1, wherein the keys are generated by applying the key pattern using sequential base key generation.
  - 4. The method of claim 3, wherein the key pattern is sequentially forwarded on the reference sequence by a single base.
  - 5. The method of claim 1, wherein the index comprises keys generated using one or more genomic sequences as the reference sequence.
  - 6. The method of claim 5, wherein the one or more genomic sequences comprise a human sequence.
  - 7. The method of claim 5, wherein the one or more genomic sequences comprise substantially an entire genome.
  - 8. The method of claim 1, wherein the reference sequence comprises an RNA or cDNA sequence.
  - 9. The method of claim 8, wherein the RNA or cDNA sequence comprises a human sequence.
  - 10. The method of claim 1, wherein the reference sequence includes two or more variations of a single reference sequence.
  - 11. The method of claim 1, wherein the index includes keys generated from two or more reference sequences.
  - 12. The method of claim 1, wherein the predicted relationships between sequences of the data set are based in part on statistical distribution information for conserved positions in the data set.
  - 13. The method of claim 12, wherein the statistical distribution information includes a predicted distribution of distances between two or more adjacent oligomers.
  - 14. The method of claim 12, wherein the statistical distribution information includes a predicted distribution of distance combinations within a set of related oligomers.
  - 15. The method of claim 1, wherein the key pattern has a pattern length sufficiently long to avoid generating an undesired number of candidate locations in the reference sequence.
  - 16. The method of claim 1, wherein the index comprises keys having bases in a sequential order of the reference sequence.
  - 17. The method of claim 1, wherein the index includes two or more sub-indexes.
  - 18. The method of claim 17, wherein the index includes a prefix index.
  - 19. The method of claim 18, wherein the entries in the prefix index map to a plurality of offsets in a sub-index.
  - 20. The method of claim 17, wherein the index further includes a suffix index.
  - 21. The method of claim 20, wherein the entries in the suffix index map to a plurality of possible locations in the reference sequence.
  - 22. The method of claim 1, further comprising:
    - modifying the first key by reordering a sequence of bases in the first key before storing the first key in the index.
  - 23. The method of claim 1, further comprising:
    - comparing sequences of the data set to the plurality of keys to map the sequences to the reference sequence.
  - 24. The method of claim 1, wherein the predicted sequence relationships include at least one variable separation distance between two oligomer sequences of the data set.
  - 25. The method of claim 1, wherein the index stores the keys ordered by bases at one or more positions in the keys.
  - 26. The method of claim 25, wherein the keys are in lexical order.
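
Claims 17 to 21 and 25 to 26 mention sub-indexes, a prefix index, a suffix index, and lexically ordered keys without fixing a layout. The Python sketch below shows one possible arrangement under assumptions that are not taken from the patent: keys are sorted lexically, a prefix index maps the first `prefix_len` bases of a key to a range of offsets in a suffix table, and each suffix-table entry holds the remaining bases of the key and a reference location. The names `build_two_level_index`, `lookup`, and `prefix_len` are hypothetical.

```python
def build_two_level_index(reference, n, k, m, prefix_len):
    """Build a (prefix_index, suffix_table) pair from gapped N/K/M keys.

    Assumes 0 < prefix_len <= n + m so every key has a full prefix.
    """
    span = n + k + m
    # Generate (key, location) pairs and sort them so keys are in lexical order.
    entries = sorted(
        (reference[p:p + n] + reference[p + n + k:p + span], p)
        for p in range(len(reference) - span + 1)
    )
    # Suffix table: the rest of each key plus its location in the reference.
    suffix_table = [(key[prefix_len:], pos) for key, pos in entries]
    # Prefix index: prefix -> half-open range [start, end) of suffix-table offsets.
    prefix_index = {}
    for offset, (key, _pos) in enumerate(entries):
        prefix = key[:prefix_len]
        start, _end = prefix_index.get(prefix, (offset, offset))
        prefix_index[prefix] = (start, offset + 1)
    return prefix_index, suffix_table


def lookup(prefix_index, suffix_table, key, prefix_len):
    """Return the reference locations stored for `key` (empty list if none)."""
    start, end = prefix_index.get(key[:prefix_len], (0, 0))
    suffix = key[prefix_len:]
    return [pos for suf, pos in suffix_table[start:end] if suf == suffix]


prefix_index, suffix_table = build_two_level_index(
    "ACGTACGTACGT", n=2, k=1, m=2, prefix_len=2
)
print(lookup(prefix_index, suffix_table, "GTCG", prefix_len=2))
```

Splitting the key this way keeps the prefix index small, since it holds one entry per distinct prefix, while the suffix table is scanned only within the offset range selected by the prefix.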

27. A system for generating an index for oligomer sequence analysis, the index operable to determine where, in a reference sequence, a data set of one or more related oligomer sequences maps to the reference sequence, the oligomer sequences of the data set obtained from a same fragment of genetic material, the system comprising:
- an interface configured to receive the reference sequence; and
  
  a processor coupled to the interface, the processor configured to apply a key pattern to the reference sequence to obtain a plurality of keys for storing in the index, wherein the key pattern includes a first set of N contiguous positions separated by K positions from a second set of M contiguous positions, the separation being based on predicted relationships between oligomer sequences of the data set, wherein the key pattern is defined by predetermined values for N, K and M, the applying including:

  applying the key pattern to a first location of the reference sequence to obtain a first set of bases, the first set of bases including:

  N contiguous bases of the reference sequence starting from the first location, and

  M contiguous bases of the reference sequence starting from N+K positions after the first location, N, M, and K being integers greater than or equal to one;

  using the first set of bases to generate a first key;

  applying the key pattern to a plurality of other locations to generate other keys, wherein the applying of the key pattern to the first and other locations uses the same values for N, M, and K, wherein the keys correspond to possible locations in the reference sequence.

28. A computer program product for generating an index, the index operable to determine where, in a reference sequence, a data set of one or more related oligomer sequences maps to the reference sequence, the oligomer sequences of the data set obtained from a same fragment of genetic material, the computer program product being embodied in a non-transitory computer readable medium and comprising computer instructions for:
- receiving a reference sequence; and
  
  applying a key pattern to the reference sequence to obtain a plurality of keys for storing in the index, wherein the key pattern includes a first set of N contiguous positions separated by K positions from a second set of M contiguous positions, the separation being based on predicted relationships between oligomer sequences of the data set, wherein the key pattern is defined by predetermined values for N, K and M, the applying including:

  applying the key pattern to a first location of the reference sequence to obtain a first set of bases, the first set of bases including:

  N contiguous bases of the reference sequence starting from the first location, and

  M contiguous bases of the reference sequence starting from N+K positions after the first location, N, M, and K being integers greater than or equal to one;

  using the first set of bases to generate a first key;

  applying the key pattern to a plurality of other locations to generate other keys, wherein the applying of the key pattern to the first and other locations uses the same values for N, M, and K, wherein the keys correspond to possible locations in the reference sequence.


Current Assignee: Complete Genomics Incorporated (BGI Genomics Co., Ltd.)
Original Assignee: Complete Genomics Incorporated (BGI Genomics Co., Ltd.)
Inventors: Halpern, Aaron L.; Nazarenko, Igor
Primary Examiner(s): Skibinsky, Anna
Application Number: US12/698,986
Publication Number: US 20100287165A1
Time in Patent Office: 1,575 Days
Field of Search
US Class Current: 702/19
CPC Class Codes:
G06F 16/22   Indexing; Data structures t...
G16B 30/00   ICT specially adapted for s...
G16B 30/10   Sequence alignment; Homolog...
