Long fragment de novo assembly using short reads
First Claim
1. A method of determining a sequence of a first chromosomal region of an organism, the method comprising:
receiving, at a computer system, sequence data from a sequencing of a plurality of nucleic acid molecules of the organism, wherein the sequence data for each of the plurality of nucleic acid molecules includes:
one or more sequence reads of at least one portion of the nucleic acid molecule, and
a label corresponding to the one or more sequence reads, the label indicating an origin of the nucleic acid molecule, wherein the sequence data includes at least 1,000 sequence reads;
receiving, by the computer system, a first contig of the first chromosomal region;
analyzing the at least 1,000 sequence reads to determine a group of sequence reads of the sequence data that overlap with an end sequence of the first contig, the group including sequence reads with a plurality of different labels indicating different origins of corresponding nucleic acid molecules, the different origins including different haplotypes; and
extending, by the computer system, the first contig using the group of sequence reads of the sequence data that overlap with the end sequence of the first contig.
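The extension step recited in the claim can be illustrated with a small sketch. This is not the patented implementation: the `Read` type, the exact suffix/prefix matching, the `min_overlap` threshold, and the majority vote on the next base are all assumptions made for illustration. It collects the reads whose prefix overlaps the end of the contig, records the distinct labels (origins) in that group, and extends the contig by one base.

```python
from collections import Counter
from typing import NamedTuple


class Read(NamedTuple):
    seq: str
    label: str  # hypothetical origin tag, e.g. a well or barcode ID


def overlapping_reads(contig, reads, min_overlap):
    """Return (read, overlap_length) for reads whose prefix matches the contig end."""
    hits = []
    for read in reads:
        # Leave at least one base beyond the overlap so the read can extend the contig.
        max_len = min(len(contig), len(read.seq) - 1)
        for k in range(max_len, min_overlap - 1, -1):
            if contig.endswith(read.seq[:k]):
                hits.append((read, k))
                break  # keep the longest overlap for this read
    return hits


def extend_once(contig, reads, min_overlap=4):
    """Extend the contig by one base using the overlapping group of reads.

    Returns the (possibly unchanged) contig and the set of labels in the group.
    Majority voting is an illustrative choice, not taken from the claim.
    """
    hits = overlapping_reads(contig, reads, min_overlap)
    votes = Counter(read.seq[k] for read, k in hits)
    if not votes:
        return contig, set()
    base, _ = votes.most_common(1)[0]
    labels = {read.label for read, _ in hits}
    return contig + base, labels


reads = [Read("GTACGG", "w1"), Read("TACGT", "w2"), Read("CCCC", "w3")]
extended, labels = extend_once("ACGTAC", reads, min_overlap=3)
```

In this toy input, two reads carrying different labels overlap the contig end, so the group spans more than one origin, as the claim requires. A real assembler would also have to tolerate sequencing errors and handle branches where the overlapping reads disagree.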
Abstract
Techniques perform de novo assembly. The assembly can use labels that indicate origins of the nucleic acid molecules. For example, a representative set of labels identified from initial reads that overlap with a seed can be used. Mate pair information can be used. A sequence read that aligns to an end of a contig can lead to using the other sequence read of a mate pair, and the other sequence read can be used to determine which branch to use to extend, e.g., in an external cloud or helper contig. A kmer index can include labels indicating an origin of each of the nucleic acid molecules that include each kmer, memory addresses of the reads that correspond to each kmer in the index, and a position in each of the mate pairs that includes the kmer. Haploid seeds can also be determined using polymorphic loci identified in a population.
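The kmer index described in the abstract can be pictured as a mapping from each kmer to a list of entries. The sketch below is an assumption-laden illustration, not the patented data structure: `MatePair`, `Hit`, and `build_kmer_index` are hypothetical names, and a list index stands in for the memory address of a read. Each entry stores the label of the originating molecule, where the mate pair lives, which arm of the pair contains the kmer, and the offset within that arm.

```python
from collections import defaultdict
from typing import NamedTuple


class MatePair(NamedTuple):
    label: str  # hypothetical origin tag of the source molecule
    left: str   # first read of the mate pair
    right: str  # second read of the mate pair


class Hit(NamedTuple):
    label: str
    pair_index: int  # stands in for the memory address of the reads
    arm: int         # 0 = left read, 1 = right read
    offset: int      # start position of the kmer within that read


def build_kmer_index(pairs, k):
    """Map every kmer to the labelled mate-pair positions that contain it."""
    index = defaultdict(list)
    for i, pair in enumerate(pairs):
        for arm, seq in enumerate((pair.left, pair.right)):
            for off in range(len(seq) - k + 1):
                index[seq[off:off + k]].append(Hit(pair.label, i, arm, off))
    return index


def mate_of(pairs, hit):
    """Return the other read of the mate pair that produced this hit."""
    pair = pairs[hit.pair_index]
    return pair.right if hit.arm == 0 else pair.left


pairs = [MatePair("w1", "ACGT", "TTGA"), MatePair("w2", "CGTA", "GACC")]
index = build_kmer_index(pairs, k=3)
```

Looking up a kmer from a contig end yields the labels of the molecules that contain it, and `mate_of` retrieves the other read of the pair, which is the kind of information the abstract says can help choose a branch during extension.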
32 Claims
1. A method of determining a sequence of a first chromosomal region of an organism, the method comprising:
receiving, at a computer system, sequence data from a sequencing of a plurality of nucleic acid molecules of the organism, wherein the sequence data for each of the plurality of nucleic acid molecules includes:
one or more sequence reads of at least one portion of the nucleic acid molecule, and
a label corresponding to the one or more sequence reads, the label indicating an origin of the nucleic acid molecule, wherein the sequence data includes at least 1,000 sequence reads;
receiving, by the computer system, a first contig of the first chromosomal region;
analyzing the at least 1,000 sequence reads to determine a group of sequence reads of the sequence data that overlap with an end sequence of the first contig, the group including sequence reads with a plurality of different labels indicating different origins of corresponding nucleic acid molecules, the different origins including different haplotypes; and
extending, by the computer system, the first contig using the group of sequence reads of the sequence data that overlap with the end sequence of the first contig.
Dependent claims: 2 to 31.
32. A computer product comprising a non-transitory computer readable medium storing a plurality of instructions that when executed control a computer system to determine a sequence of a first chromosomal region of an organism, the instructions comprising:
receiving sequence data from a sequencing of a plurality of nucleic acid molecules of the organism, wherein the sequence data for each of the plurality of nucleic acid molecules includes:
one or more sequence reads of at least one portion of the nucleic acid molecule, and
a label corresponding to the one or more sequence reads, the label indicating an origin of the nucleic acid molecule, wherein the sequence data includes at least 1,000 sequence reads;
receiving a first contig of the first chromosomal region;
analyzing the at least 1,000 sequence reads to determine a group of sequence reads of the sequence data that overlap with an end sequence of the first contig, the group including sequence reads with a plurality of different labels indicating different origins of corresponding nucleic acid molecules, the different origins including different haplotypes; and
extending, by the computer system, the first contig using the group of sequence reads of the sequence data that overlap with the end sequence of the first contig.
Specification