Compressing, storing and searching sequence data
First Claim
1. A computer program product comprising a non-transitory machine-readable medium that stores a program, the program being executed by a machine to perform a method of compressing and searching genomic sequence data, comprising:
- first program code adapted to compress an original data sequence into a data structure having first and second portions, the first portion comprising the original data sequence with one or more sequence fragments therein that have been found sufficiently similar to previously-identified fragments being replaced by links to similarly-identified fragments, the second portion comprising the links; and
second program code adapted to use the first and second portions of the data structure, in lieu of the original data sequence, to identify a portion of the genomic sequence data in response to a query.
0 Assignments
0 Petitions
Accused Products
Abstract
The redundancy in genomic sequence data is exploited by compressing sequence data in such a way as to allow direct computation on the compressed data using methods that are referred to herein as “compressive” algorithms. This approach reduces the task of computing on many similar genomes to only slightly more than that of operating on just one. In this approach, the redundancy among genomes is translated into computational acceleration by storing genomes in a compressed format that respects the structure of similarities and differences important to analysis. Specifically, these differences are the nucleotide substitutions, insertions, deletions, and rearrangements introduced by evolution. Once such a compressed library has been created, analysis is performed on it in time proportional to its compressed size, rather than having to reconstruct the full data set every time one wishes to query it.
15 Citations
19 Claims
-
1. A computer program product comprising a non-transitory machine-readable medium that stores a program, the program being executed by a machine to perform a method of compressing and searching genomic sequence data, comprising:
-
first program code adapted to compress an original data sequence into a data structure having first and second portions, the first portion comprising the original data sequence with one or more sequence fragments therein that have been found sufficiently similar to previously-identified fragments being replaced by links to similarly-identified fragments, the second portion comprising the links; and second program code adapted to use the first and second portions of the data structure, in lieu of the original data sequence, to identify a portion of the genomic sequence data in response to a query. - View Dependent Claims (2, 3, 4, 5, 6, 19)
-
-
7. Apparatus, comprising:
-
a processor; computer memory storing a first program to preprocess a stream of genomic sequence data into first and second data sets, and a second program; a first data store in which the first data set is stored, the first data set representing one or more portions of the stream that are unique; a second data store in which the second data set is stored, the second data set comprising one or more pointers, each pointer being associated with a fragment of the stream that has been identified as being sufficiently similar to a previously-identified fragment, the pointer having associated therewith a data string identifying one or more edits that, when applied to the previously-identified fragment, may be used to produce the fragment identified by the pointer; wherein the second program executes a genomic search algorithm against the first and second data sets to attempt to identify a match to a query string. - View Dependent Claims (8, 9, 10, 11, 12, 13)
-
-
14. A method, executed in one or more computing entities, comprising:
-
compressing an original data sequence into a data structure having first and second portions, the first portion comprising the original data sequence with one or more sequence fragments therein that have been found sufficiently similar to previously-identified fragments being replaced by links, the second portion comprising the links; and in response to a query, searching the first and second portions of the data structure, in lieu of the original data sequence, to identify a portion of the genomic sequence data. - View Dependent Claims (15, 16, 17, 18)
-
Specification