Nucleic acid sequence assembly
Abstract
Disclosed herein are compositions, systems and methods related to sequence assembly, such as nucleic acid sequence assembly of single reads and contigs into larger contigs and scaffolds through the use of read pair sequence information, such as read pair information indicative of nucleic acid sequence phase or physical linkage.
34 Claims
1. A method for nucleic acid sequence data assembly, comprising:
(a) obtaining purified DNA;
(b) binding the purified DNA with a DNA binding agent to form DNA/chromatin complexes;
(c) incubating the DNA-chromatin complexes with restriction enzymes to leave sticky ends;
(d) performing ligation to join ends of DNA;
(e) sequencing ligated DNA junctions to generate paired end reads;
(f) obtaining standard paired-end read distance frequency data;
(g) obtaining grouped contig sequences; and
(h) scaffolding the grouped contig sequences such that read pair distance frequency data for read pairs that map to separate contigs approximates the standard paired-end read distance frequency data, thereby assembling the sequence data of the nucleic acid.
Dependent claims: 2, 31
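Step (h) can be read as a scoring problem: a candidate scaffold fixes the separation of every read pair whose reads map to separate contigs, and those separations can be compared with the standard data. The Python sketch below is an illustration only, not the claimed method. The function names, the forward-only orientation, the zero gap between contigs, the 500 bp bins, and the toy frequency table are all assumptions made for the example.

```python
import math

def scaffold_offsets(order, lengths, gap=0):
    """Start coordinate of each contig when laid end to end in `order`."""
    offsets, cursor = {}, 0
    for name in order:
        offsets[name] = cursor
        cursor += lengths[name] + gap
    return offsets

def implied_distances(order, lengths, read_pairs, gap=0):
    """Separations implied by the scaffold for pairs that map to separate contigs."""
    offsets = scaffold_offsets(order, lengths, gap)
    distances = []
    for (c1, p1), (c2, p2) in read_pairs:
        if c1 != c2:  # pairs inside one contig do not depend on the scaffold
            distances.append(abs((offsets[c1] + p1) - (offsets[c2] + p2)))
    return distances

def log_likelihood(distances, standard_freq, bin_size, floor=1e-9):
    """Sum of log frequencies of the binned distances under the standard data."""
    return sum(math.log(standard_freq.get(d // bin_size, floor)) for d in distances)

# Hypothetical inputs: three 1 kb contigs and three read pairs, each given as
# ((contig, position), (contig, position)).
lengths = {"A": 1000, "B": 1000, "C": 1000}
read_pairs = [(("A", 900), ("B", 100)), (("B", 950), ("C", 50)), (("A", 990), ("B", 10))]
# Hypothetical standard data: frequency per 500 bp bin, short separations most common.
standard_freq = {0: 0.6, 1: 0.25, 2: 0.1, 3: 0.05}

dist_abc = implied_distances(["A", "B", "C"], lengths, read_pairs)
dist_acb = implied_distances(["A", "C", "B"], lengths, read_pairs)
score_abc = log_likelihood(dist_abc, standard_freq, 500)
score_acb = log_likelihood(dist_acb, standard_freq, 500)
print(dist_abc, dist_acb, score_abc > score_acb)
```

In this toy case the ordering A, B, C places every linked read pair within a few hundred base pairs, so it scores higher against the assumed standard table than the ordering A, C, B.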
3. A method for scaffolding contigs of nucleic acid sequence information obtained from a biological sample, said method comprising:
(a) obtaining a set of contig sequences having an initial configuration, wherein the contig sequences are obtained by extracting DNA from a biological material and sequencing the DNA;
(b) obtaining a set of paired end reads, wherein the set of paired-end reads is obtained by digesting sample DNA to generate internal double strand breaks within the nucleic acid, allowing the double strand breaks to re-ligate randomly to form a plurality of re-ligation junctions, and sequencing across the plurality of re-ligation junctions;
(c) obtaining standard paired-end read distance frequency data;
(d) grouping contig pairs sharing sequence that coexists in at least one paired end read, thereby generating grouped contigs; and
(e) scaffolding the grouped contigs such that read pair distance frequency data for read pairs that map to separate contigs more closely approximates the standard paired-end read distance frequency data by at least 5% relative to the read pair frequency data of the grouped contigs in the initial configuration.
Dependent claims: 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 32
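The grouping in step (d) can be pictured as finding connected components: two contigs fall into the same group when at least one paired-end read has one read on each. The sketch below uses a union-find structure for this. It is a minimal illustration under assumptions: the name `group_contigs` and the input shape (one tuple of two contig names per read pair) are hypothetical, and the claim does not specify any particular data structure.

```python
from collections import defaultdict

def group_contigs(read_pairs):
    """Group contigs that are linked, directly or transitively, by a read pair."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for c1, c2 in read_pairs:
        r1, r2 = find(c1), find(c2)
        if r1 != r2:
            parent[r2] = r1  # merge the two groups

    groups = defaultdict(set)
    for contig in parent:
        groups[find(contig)].add(contig)
    return sorted(sorted(members) for members in groups.values())

# Hypothetical read pairs, each reduced to the contigs its two reads map to.
pairs = [("c1", "c2"), ("c2", "c3"), ("c4", "c5"), ("c6", "c6")]
groups = group_contigs(pairs)
print(groups)
```

Here c1 and c3 end up in one group even though no single read pair links them directly, because both are linked to c2.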
18. A method for scaffolding contigs of nucleic acid sequence information comprising:
(a) obtaining a set of contig sequences having an initial configuration;
(b) obtaining a set of paired end reads;
(c) obtaining standard paired-end read distance frequency data;
(d) grouping contig pairs sharing sequence that coexists in at least one paired end read, thereby generating grouped contigs; and
(e) scaffolding the grouped contigs such that read pair distance frequency data for read pairs that map to separate contigs more closely approximates the standard paired-end read distance frequency data by at least 5% relative to the read pair distance frequency data of the grouped contigs in the initial configuration.
Dependent claims: 19, 20, 21, 22, 23, 24, 25, 33
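The "at least 5%" condition in step (e) compares how far the read pair distance frequency data sits from the standard data before and after scaffolding. The claim does not name a distance measure, so the sketch below substitutes total variation distance between binned frequency tables as one plausible choice. The function names, bin size, and sample numbers are assumptions for illustration.

```python
from collections import Counter

def distance_frequencies(distances, bin_size):
    """Normalized histogram: fraction of read pairs falling in each distance bin."""
    counts = Counter(d // bin_size for d in distances)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {b: n / total for b, n in counts.items()}

def deviation(observed, standard):
    """Total variation distance between two frequency tables (0 means identical)."""
    bins = set(observed) | set(standard)
    return 0.5 * sum(abs(observed.get(b, 0.0) - standard.get(b, 0.0)) for b in bins)

def improves_by(initial, candidate, standard, bin_size, threshold=0.05):
    """Return (meets_threshold, relative_gain) for a candidate configuration."""
    d0 = deviation(distance_frequencies(initial, bin_size), standard)
    d1 = deviation(distance_frequencies(candidate, bin_size), standard)
    if d0 == 0:
        return False, 0.0  # nothing left to improve
    gain = (d0 - d1) / d0
    return gain >= threshold, gain

# Hypothetical standard data (frequency per 500 bp bin) and implied inter-contig
# separations under the initial and the candidate configuration.
standard = {0: 0.6, 1: 0.25, 2: 0.1, 3: 0.05}
initial = [1200, 1900, 1020, 300]
candidate = [200, 100, 20, 700]
ok, gain = improves_by(initial, candidate, standard, 500)
print(ok, round(gain, 2))
```

With these numbers the deviation falls from 0.6 to 0.15, a relative improvement of 75%, which clears the 5% threshold.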
26. A method of assembling contig sequence information into at least one scaffold, comprising:
(a) obtaining sequence information corresponding to a plurality of contigs, obtaining paired-end read information from a nucleic acid sample represented by the plurality of contigs, and
(b) configuring the plurality of contigs such that deviation of a read pair distance parameter from a predicted read pair distance data set is decreased by at least 5% compared to the read pair distance parameter of plurality of contigs in an initial configuration, wherein the configuring occurs in less than 8 hours.
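Step (b) combines a search over contig configurations with a limit on running time. The sketch below is a toy stand-in, not the patented procedure: it tries every ordering of a small group exhaustively, stops when a wall-clock budget runs out, and reports the relative decrease in deviation. The names `deviation` and `configure`, the negative log-likelihood measure, the forward-only orientation, and all inputs are assumptions. Exhaustive search is only feasible for a handful of contigs.

```python
import itertools
import math
import time

def deviation(order, lengths, read_pairs, standard_freq, bin_size, floor=1e-9):
    """Negative log-likelihood of inter-contig pair separations (lower is better)."""
    offsets, cursor = {}, 0
    for name in order:  # lay contigs end to end, forward orientation only
        offsets[name] = cursor
        cursor += lengths[name]
    total = 0.0
    for (c1, p1), (c2, p2) in read_pairs:
        if c1 != c2:
            d = abs((offsets[c1] + p1) - (offsets[c2] + p2))
            total -= math.log(standard_freq.get(d // bin_size, floor))
    return total

def configure(initial_order, lengths, read_pairs, standard_freq, bin_size,
              budget_seconds=1.0):
    """Try orderings until the time budget is spent; return best order and gain."""
    deadline = time.monotonic() + budget_seconds
    start = deviation(initial_order, lengths, read_pairs, standard_freq, bin_size)
    best_order, best = list(initial_order), start
    for perm in itertools.permutations(initial_order):
        if time.monotonic() > deadline:
            break  # keep the best configuration found so far
        score = deviation(perm, lengths, read_pairs, standard_freq, bin_size)
        if score < best:
            best_order, best = list(perm), score
    decrease = (start - best) / start if start > 0 else 0.0
    return best_order, decrease

# Hypothetical inputs: three 1 kb contigs, three linking read pairs, and a
# standard frequency table with 500 bp bins.
lengths = {"A": 1000, "B": 1000, "C": 1000}
read_pairs = [(("A", 900), ("B", 100)), (("B", 950), ("C", 50)), (("A", 990), ("B", 10))]
standard_freq = {0: 0.6, 1: 0.25, 2: 0.1, 3: 0.05}

best_order, decrease = configure(["A", "C", "B"], lengths, read_pairs, standard_freq, 500)
print(best_order, decrease >= 0.05)
```

The `budget_seconds` parameter plays the role of the time limit in the claim; a practical implementation would set it to the relevant limit and use a heuristic search instead of full enumeration.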
Specification