Compressing, storing and searching sequence data

US 20130191351A1
Filed: 12/20/2012
Published: 07/25/2013
Est. Priority Date: 12/20/2011
Status: Active Grant

Abstract

The redundancy in genomic sequence data is exploited by compressing sequence data in such a way as to allow direct computation on the compressed data using methods that are referred to herein as “compressive” algorithms. This approach reduces the task of computing on many similar genomes to only slightly more than that of operating on just one. In this approach, the redundancy among genomes is translated into computational acceleration by storing genomes in a compressed format that respects the structure of similarities and differences important to analysis. Specifically, these differences are the nucleotide substitutions, insertions, deletions, and rearrangements introduced by evolution. Once such a compressed library has been created, analysis is performed on it in time proportional to its compressed size, rather than having to reconstruct the full data set every time one wishes to query it.

15 Citations

19 Claims

1. A computer program product comprising a non-transitory machine-readable medium that stores a program, the program being executed by a machine to perform a method of compressing and searching genomic sequence data, comprising:
- first program code adapted to compress an original data sequence into a data structure having first and second portions, the first portion comprising the original data sequence with one or more sequence fragments therein that have been found sufficiently similar to previously-identified fragments being replaced by links to similarly-identified fragments, the second portion comprising the links; and
  
  second program code adapted to use the first and second portions of the data structure, in lieu of the original data sequence, to identify a portion of the genomic sequence data in response to a query.
- Dependent claims (2, 3, 4, 5, 6, 19)
- - 2. The computer program product as described in claim 1 wherein the second program code comprises:
    - program code to search the first portion to locate one or more hits passing a first, coarse threshold;
      
      program code that, for each of the one or more hits, examines the second portion to identify other segments of the data sequence that potentially align to the hit;
      
      program code that, for each of the one or more other segments of the data sequence that potentially align to the hit, recovers an actual segment from the original data sequence; and
      
      program code that searches each actual segment so recovered to locate one or more hits passing a second, fine-grained threshold.
  - 3. The computer program product as described in claim 1, wherein the first program code comprises:
    - program code to iterate through a stream of the genomic sequence data;
      
      program code that is responsive to a determination that a fragment of the genomic sequence is sufficiently similar to a previously-identified fragment, to associate the fragment with a pointer together with a data string identifying one or more edits that, when applied to the previously-identified fragment, are useful to produce the fragment identified by the pointer; and
      
      program code that, with respect to one or more segments of the stream that are not sufficiently similar to any previously-identified fragment, store the one or more segments in the first portion of the data structure; and
      
      program code to store each pointer in a second portion of the data structure.
  - 4. The computer program product as described in claim 3 wherein the pointer identifies a position of the fragment within the stream, and a position of the previously-identified fragment within the first portion.
  - 5. The computer program product as described in claim 4 wherein the pointer also includes an identifier associated with the data string.
  - 6. The computer program product as described in claim 5 wherein the data string encodes one or more differences between the fragment and the previously-identified fragment, the differences being represented by one of:
    - an insertion, a substitution, and a deletion.
  - 19. The computer program product as described in claim 1 wherein the first portion of the data structure comprises the original data sequence with one or more sequence fragments therein that have been found sufficiently similar to a hierarchy of similar fragments replaced by links to similarly-identified fragments.
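
The preprocessing pass recited in claims 3 to 6 can be sketched in Python. This is a minimal illustration under simplifying assumptions, not the patented implementation: fragments are fixed-length windows, "sufficiently similar" means at most `max_subs` substitutions against a window already in the first portion, and the edit string is a list of `(offset, base)` pairs. The names `Link`, `Compressed`, `compress`, `frag_len`, and `max_subs` are hypothetical; insertions and deletions, which the claims also cover, are left out to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    """Hypothetical pointer record (claims 4 to 6)."""
    stream_pos: int   # position of the fragment within the input stream
    unique_pos: int   # position of the similar fragment in the first portion
    length: int       # fragment length
    edits: list       # [(offset, base)] substitutions: reference -> fragment


@dataclass
class Compressed:
    unique: str = ""                            # first portion: novel sequence
    links: list = field(default_factory=list)   # second portion: pointers


def compress(stream, frag_len=8, max_subs=1):
    """Split the stream into windows; store novel windows, link similar ones."""
    out = Compressed()
    for start in range(0, len(stream), frag_len):
        frag = stream[start:start + frag_len]
        match = None
        # Naive scan of the first portion; a real system would use an index.
        for pos in range(len(out.unique) - len(frag) + 1):
            ref = out.unique[pos:pos + len(frag)]
            edits = [(i, b) for i, (a, b) in enumerate(zip(ref, frag)) if a != b]
            if len(edits) <= max_subs:
                match = Link(start, pos, len(frag), edits)
                break
        if match is None:
            out.unique += frag          # not similar to anything seen so far
        else:
            out.links.append(match)     # similar: keep only pointer plus edits
    return out


c = compress("ACGTACGT" + "ACGTACGA" + "TTTTGGGG")
print(c.unique)   # ACGTACGTTTTTGGGG
print(c.links)    # one Link for the second window
```

Raising `max_subs` or shrinking `frag_len` stores less sequence in the first portion at the cost of more links to follow later, which is one way to picture the speed versus storage trade-off that claim 18 mentions.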

7. Apparatus, comprising:
- a processor;
  
  computer memory storing a first program to preprocess a stream of genomic sequence data into first and second data sets, and a second program;
  
  a first data store in which the first data set is stored, the first data set representing one or more portions of the stream that are unique;
  
  a second data store in which the second data set is stored, the second data set comprising one or more pointers, each pointer being associated with a fragment of the stream that has been identified as being sufficiently similar to a previously-identified fragment, the pointer having associated therewith a data string identifying one or more edits that, when applied to the previously-identified fragment, may be used to produce the fragment identified by the pointer;
  
  wherein the second program executes a genomic search algorithm against the first and second data sets to attempt to identify a match to a query string.
- Dependent claims (8, 9, 10, 11, 12, 13)
- - 8. The apparatus as described in claim 7 wherein the genomic search algorithm is executed against the first data set using a first threshold to identify one or more potential matches.
  - 9. The apparatus as described in claim 8 wherein at least one potential match has a data string associated therewith.
  - 10. The apparatus as described in claim 9 wherein the genomic search algorithm is executed against fragment data recovered from the data string to further attempt to identify the match.
  - 11. The apparatus as described in claim 10 wherein the genomic search algorithm is executed against the fragment data recovered from the data string using a second threshold, the second threshold being finer than the first threshold.
  - 12. The apparatus as described in claim 7 wherein the genomic search algorithm is BLAST.
  - 13. The apparatus as described in claim 7 wherein the genomic search algorithm is BLAT.
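
Claims 7 and 10 rely on recovering a fragment by applying a data string of edits to a previously-identified fragment. The Python sketch below shows one way this could work. The encoding is an assumption made for illustration: each edit is a tuple `(op, offset, base)` with `op` being `"S"` (substitution), `"I"` (insertion before the offset), or `"D"` (deletion), offsets refer to the reference fragment, and edits are sorted and non-overlapping. The names `apply_edits` and `recover` are hypothetical.

```python
def apply_edits(reference, edits):
    """Rebuild a fragment from a reference fragment and an edit list."""
    out = []
    cursor = 0  # next unread position in the reference
    for op, offset, base in edits:
        if offset < cursor:
            raise ValueError("edits must be sorted and non-overlapping")
        out.append(reference[cursor:offset])  # copy the unchanged run
        cursor = offset
        if op == "S":        # replace reference[offset] with base
            out.append(base)
            cursor += 1
        elif op == "I":      # insert base before reference[offset]
            out.append(base)
        elif op == "D":      # skip reference[offset]
            cursor += 1
        else:
            raise ValueError(f"unknown edit op: {op!r}")
    out.append(reference[cursor:])            # copy the tail
    return "".join(out)


def recover(unique, unique_pos, length, edits):
    """Follow a pointer into the first data set, then apply its edits."""
    return apply_edits(unique[unique_pos:unique_pos + length], edits)


edits = [("S", 1, "T"), ("D", 3, ""), ("I", 6, "A")]
print(apply_edits("ACGTACGT", edits))  # ATGACAGT
```

Because only the referenced window is read, recovery cost depends on the fragment length and the number of edits rather than on the size of the full data set.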

14. A method, executed in one or more computing entities, comprising:
- compressing an original data sequence into a data structure having first and second portions, the first portion comprising the original data sequence with one or more sequence fragments therein that have been found sufficiently similar to previously-identified fragments being replaced by links, the second portion comprising the links; and
  
  in response to a query, searching the first and second portions of the data structure, in lieu of the original data sequence, to identify a portion of the genomic sequence data.
- Dependent claims (15, 16, 17, 18)
- - 15. The method as described in claim 14 wherein the searching step comprises:
    - searching the first portion to locate one or more hits passing a first, coarse threshold;
      
      for each of the one or more hits, examining the second portion to identify other segments of the data sequence that potentially align to the hit;
      
      for each of the one or more other segments of the data sequence that potentially align to the hit, recovering an actual segment from the original data sequence; and
      
      searching each actual segment so recovered to locate one or more hits passing a second, fine-grained threshold.
  - 16. The method as described in claim 14 wherein the original data sequence is a genomic sequence.
  - 17. The method as described in claim 14 wherein the search of the data structure is carried out using one of:
    - BLAST, and BLAT.
  - 18. The method as described in claim 14 wherein the compressing step is executed to trade-off execution speed against an amount of data stored in the first portion of the data structure.
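
The two-stage search of claim 15 can be sketched as follows. This is an illustration only: ungapped percent identity over a sliding window stands in for BLAST or BLAT, links carry substitution-only edits, and the names `Link`, `identity`, `scan`, and `compressive_search` as well as the threshold values are assumptions, not taken from the patent.

```python
from typing import NamedTuple


class Link(NamedTuple):
    stream_pos: int   # where the fragment sat in the original stream
    unique_pos: int   # where its reference starts in the first portion
    length: int
    edits: tuple      # ((offset, base), ...) substitutions only


def identity(a, b):
    """Fraction of positions at which two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def scan(text, query, threshold):
    """Start offsets of windows whose identity with the query passes."""
    n = len(query)
    return [i for i in range(len(text) - n + 1)
            if identity(query, text[i:i + n]) >= threshold]


def compressive_search(unique, links, query, coarse=0.75, fine=1.0):
    n = len(query)
    results = set()
    # Stage 1: coarse search over the first portion only.
    for hit in scan(unique, query, coarse):
        if identity(query, unique[hit:hit + n]) >= fine:
            results.add(("unique", hit))
        # Stage 2: follow links whose reference span overlaps the hit.
        for link in links:
            ref_end = link.unique_pos + link.length
            if link.unique_pos < hit + n and hit < ref_end:
                frag = list(unique[link.unique_pos:ref_end])
                for off, base in link.edits:
                    frag[off] = base            # recover the actual segment
                for off in scan("".join(frag), query, fine):
                    results.add(("stream", link.stream_pos + off))
    return sorted(results)


unique = "ACGTACGTTTTTGGGG"
links = [Link(8, 0, 8, ((7, "A"),))]
print(compressive_search(unique, links, "ACGTACGA"))  # [('stream', 8)]
```

The coarse threshold has to be loose enough that a query matching a linked fragment still hits that fragment's reference in the first portion; otherwise the link is never examined. Hits labelled `"unique"` are offsets into the first portion, while hits labelled `"stream"` are offsets into the original stream.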

Current Assignee: Bonnie Berger Leighton, Michael H. Baym, Po-Ru Loh
Original Assignee: Bonnie Berger Leighton, Michael H. Baym, Po-Ru Loh
Inventors: Baym, Michael H.; Leighton, Bonnie Berger; Loh, Po-Ru
Granted Patent: US 9,715,574 B2
US Class Current: 707/693
CPC Class Codes:
G16B 50/00   ICT programming tools or da...
G16B 50/30   Data warehousing; Computing...
G16B 50/50   Compression of genetic data
H03M 7/3062   Compressive sampling or sen...
