SYSTEMS AND METHODS FOR DETERMINING STRUCTURAL VARIATION AND PHASING USING VARIANT CALL DATA
First Claim
1. A method of determining a likelihood of a structural variation occurring in a test nucleic acid obtained from a single biological sample, the method comprising:
- at a computer system having one or more processors, and memory storing one or more programs for execution by the one or more processors;
(A) obtaining a plurality of sequence reads from a plurality of sequencing reactions in which the test nucleic acid is fragmented, wherein each respective sequence read in the plurality of sequence reads comprises a first portion that corresponds to a subset of the test nucleic acid and a second portion that encodes a respective barcode for the respective sequence read in a plurality of barcodes, and each respective barcode is independent of the sequencing data of the test nucleic acid, and the plurality of sequence reads collectively include the plurality of barcodes;
(B) obtaining bin information for a plurality of bins, wherein each respective bin in the plurality of bins represents a different portion of the test nucleic acid, the bin information identifies, for each respective bin in the plurality of bins, a set of sequence reads in a plurality of sets of sequence reads that are in the plurality of sequence reads, and the respective first portion of each respective sequence read in each respective set of sequence reads in the plurality of sets of sequence reads corresponds to a subset of the test nucleic acid that at least partially overlaps the different portion of the test nucleic acid that is represented by the bin corresponding to the respective set of sequence reads;
(C) identifying, from among the plurality of bins, a first bin and a second bin that correspond to portions of the test nucleic acid that are nonoverlapping, wherein the first bin is represented by a first set of sequence reads in the plurality of sequence reads and the second bin is represented by a second set of sequence reads in the plurality of sequence reads;
(D) determining a first value that represents a numeric probability or likelihood that the number of barcodes common to the first set and the second set is attributable to chance;
(E) responsive to a determination that the first value satisfies a predetermined cut-off value, for each barcode that is common to the first bin and the second bin, obtaining a fragment pair thereby obtaining one or more fragment pairs, each fragment pair in the one or more fragment pairs (i) corresponding to a different barcode that is common to the first bin and the second bin and (ii) consisting of a different first calculated fragment and a different second calculated fragment, wherein, for each respective fragment pair in the one or more fragment pairs;
the different first calculated fragment consists of a respective first subset of sequence reads in the plurality of sequence reads having the barcode corresponding to the respective fragment pair, wherein each sequence read in the respective first subset of sequence reads is within a predefined genetic distance of another sequence read in the respective first subset of sequence reads, the different first calculated fragment of the respective fragment pair originates with a first sequence read having the barcode corresponding to the respective fragment pair in the first bin, and each sequence read in the respective first subset of sequence reads is from the first bin, and the different second calculated fragment consists of a respective second subset of sequence reads in the plurality of sequence reads having the barcode corresponding to the respective fragment pair, wherein each sequence read in the respective second subset of sequence reads is within a predefined genetic distance of another sequence read in the respective second subset of sequence reads, the different second calculated fragment of the respective fragment pair originates with a second sequence read having the barcode corresponding to the respective fragment pair in the second bin, and each sequence read in the respective second subset of sequence reads is from the second bin; and
(F) computing a respective likelihood based upon a probability of occurrence of a first model and a probability of occurrence of a second model regarding the one or more fragment pairs to thereby provide a likelihood of a structural variation in the test nucleic acid, wherein (i) the first model specifies that the respective first calculated fragments and the respective second calculated fragments of the one or more fragment pairs are observed given no structural variation in the target nucleic acid sequence and are part of a common molecule, and (ii) the second model specifies that the respective first calculated fragments and the respective second calculated fragments of the one or more fragment pairs are observed given structural variation in the target nucleic acid sequence.
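As an informal illustration of step (E), the following minimal Python sketch groups the reads of each bin into calculated fragments per barcode and then pairs them across two bins. It is a sketch under stated assumptions, not the claimed implementation: reads are modeled as hypothetical `(barcode, position)` tuples with integer base-pair positions, the "predefined genetic distance" is modeled as a hypothetical `max_gap` parameter, and only the first fragment found for a barcode in each bin is used for the pair.

```python
from collections import defaultdict


def calculated_fragments(reads, max_gap):
    """Group one bin's reads into fragments per barcode.

    reads:   iterable of (barcode, position) tuples (hypothetical layout)
    max_gap: assumed stand-in for the predefined genetic distance
    Returns {barcode: [fragment, ...]}, each fragment a sorted position list.
    """
    by_barcode = defaultdict(list)
    for barcode, pos in reads:
        by_barcode[barcode].append(pos)

    fragments = {}
    for barcode, positions in by_barcode.items():
        positions.sort()
        groups = [[positions[0]]]
        for pos in positions[1:]:
            # Extend the current fragment while the next read is close enough.
            if pos - groups[-1][-1] <= max_gap:
                groups[-1].append(pos)
            else:
                groups.append([pos])
        fragments[barcode] = groups
    return fragments


def fragment_pairs(bin1_reads, bin2_reads, max_gap):
    """Return one (first fragment, second fragment) pair per shared barcode."""
    f1 = calculated_fragments(bin1_reads, max_gap)
    f2 = calculated_fragments(bin2_reads, max_gap)
    # Simplification: use the first fragment of each barcode in each bin.
    return {bc: (f1[bc][0], f2[bc][0]) for bc in f1.keys() & f2.keys()}


bin1 = [("bc1", 100), ("bc1", 150), ("bc1", 5000), ("bc2", 300)]
bin2 = [("bc1", 90000), ("bc1", 90040), ("bc3", 90010)]
pairs = fragment_pairs(bin1, bin2, max_gap=1000)
```

Here only `bc1` is common to both bins, so a single fragment pair results. The likelihood computation of step (F) is not sketched because the probability models are not given in this listing.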
Abstract
Systems and methods for determining structural variation and phasing using variant call data obtained from nucleic acid of a biological sample are provided. Sequence reads are obtained, each comprising a portion corresponding to a subset of the test nucleic acid and a portion encoding a barcode independent of the sequencing data. Bin information is obtained. Each bin represents a different portion of the sample nucleic acid. Each bin corresponds to a set of sequence reads in a plurality of sets of sequence reads formed from the sequence reads such that each sequence read in a respective set of sequence reads corresponds to a subset of the nucleic acid represented by the bin corresponding to the respective set. Binomial tests identify bin pairs having more sequence reads with the same barcode in common than expected by chance. Probabilistic models determine structural variation likelihood from the sequence reads of these bin pairs.
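The binomial test mentioned in the abstract can be pictured with a short Python sketch. The null model used here is an assumption made for illustration and is not taken from the patent: each barcode seen in the first bin is treated as landing in the second bin independently with probability `n2 / total_barcodes`. The function name and parameters are hypothetical.

```python
from math import comb


def shared_barcode_pvalue(barcodes_bin1, barcodes_bin2, total_barcodes):
    """Upper-tail binomial p-value for the number of shared barcodes.

    Assumed null model: X ~ Binomial(n1, n2 / total_barcodes), where n1 and
    n2 are the numbers of distinct barcodes in each bin. Returns P(X >= k)
    for the observed overlap k.
    """
    set1, set2 = set(barcodes_bin1), set(barcodes_bin2)
    n1, n2 = len(set1), len(set2)
    k = len(set1 & set2)
    p = n2 / total_barcodes
    # Sum the binomial probability mass from k up to n1.
    return sum(
        comb(n1, j) * p**j * (1 - p) ** (n1 - j) for j in range(k, n1 + 1)
    )


overlap_pvalue = shared_barcode_pvalue(
    {"A", "B", "C", "D"}, {"C", "D", "E", "F"}, total_barcodes=1000
)
no_overlap_pvalue = shared_barcode_pvalue({"A", "B"}, {"C", "D"}, total_barcodes=1000)
```

A small p-value suggests more shared barcodes than this null model would predict, which is the kind of bin pair that would then be passed to the probabilistic models. How the value is compared with a cut-off is left to the caller.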
53 Claims
-
1. A method of determining a likelihood of a structural variation occurring in a test nucleic acid obtained from a single biological sample, the method comprising:
at a computer system having one or more processors, and memory storing one or more programs for execution by the one or more processors; (A) obtaining a plurality of sequence reads from a plurality of sequencing reactions in which the test nucleic acid is fragmented, wherein each respective sequence read in the plurality of sequence reads comprises a first portion that corresponds to a subset of the test nucleic acid and a second portion that encodes a respective barcode for the respective sequence read in a plurality of barcodes, and each respective barcode is independent of the sequencing data of the test nucleic acid, and the plurality of sequence reads collectively include the plurality of barcodes; (B) obtaining bin information for a plurality of bins, wherein each respective bin in the plurality of bins represents a different portion of the test nucleic acid, the bin information identifies, for each respective bin in the plurality of bins, a set of sequence reads in a plurality of sets of sequence reads that are in the plurality of sequence reads, and the respective first portion of each respective sequence read in each respective set of sequence reads in the plurality of sets of sequence reads corresponds to a subset of the test nucleic acid that at least partially overlaps the different portion of the test nucleic acid that is represented by the bin corresponding to the respective set of sequence reads; (C) identifying, from among the plurality of bins, a first bin and a second bin that correspond to portions of the test nucleic acid that are nonoverlapping, wherein the first bin is represented by a first set of sequence reads in the plurality of sequence reads and the second bin is represented by a second set of sequence reads in the plurality of sequence reads; (D) determining a first value that represents a numeric probability or likelihood that the number of barcodes common to the first set and the second set is attributable to chance; (E) responsive to 
a determination that the first value satisfies a predetermined cut-off value, for each barcode that is common to the first bin and the second bin, obtaining a fragment pair thereby obtaining one or more fragment pairs, each fragment pair in the one or more fragment pairs (i) corresponding to a different barcode that is common to the first bin and the second bin and (ii) consisting of a different first calculated fragment and a different second calculated fragment, wherein, for each respective fragment pair in the one or more fragment pairs; the different first calculated fragment consists of a respective first subset of sequence reads in the plurality of sequence reads having the barcode corresponding to the respective fragment pair, wherein each sequence read in the respective first subset of sequence reads is within a predefined genetic distance of another sequence read in the respective first subset of sequence reads, the different first calculated fragment of the respective fragment pair originates with a first sequence read having the barcode corresponding to the respective fragment pair in the first bin, and each sequence read in the respective first subset of sequence reads is from the first bin, and the different second calculated fragment consists of a respective second subset of sequence reads in the plurality of sequence reads having the barcode corresponding to the respective fragment pair, wherein each sequence read in the respective second subset of sequence reads is within a predefined genetic distance of another sequence read in the respective second subset of sequence reads, the different second calculated fragment of the respective fragment pair originates with a second sequence read having the barcode corresponding to the respective fragment pair in the second bin, and each sequence read in the respective second subset of sequence reads is from the second bin; and (F) computing a respective likelihood based upon a probability of occurrence of a 
first model and a probability of occurrence of a second model regarding the one or more fragment pairs to thereby provide a likelihood of a structural variation in the test nucleic acid, wherein (i) the first model specifies that the respective first calculated fragments and the respective second calculated fragments of the one or more fragment pairs are observed given no structural variation in the target nucleic acid sequence and are part of a common molecule, and (ii) the second model specifies that the respective first calculated fragments and the respective second calculated fragments of the one or more fragment pairs are observed given structural variation in the target nucleic acid sequence. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
-
20. A computing system, comprising:
-
one or more processors; memory storing one or more programs to be executed by the one or more processors; the one or more programs comprising instructions for; (A) obtaining a plurality of sequence reads from a plurality of sequencing reactions in which the test nucleic acid is fragmented, wherein each respective sequence read in the plurality of sequence reads comprises a first portion that corresponds to a subset of the test nucleic acid and a second portion that encodes a respective barcode for the respective sequence read in a plurality of barcodes, each respective barcode is independent of the sequencing data of the test nucleic acid, and the plurality of sequence reads collectively include the plurality of barcodes; (B) obtaining bin information for a plurality of bins, wherein each respective bin in the plurality of bins represents a different portion of the test nucleic acid, the bin information identifies, for each respective bin in the plurality of bins, a set of sequence reads in a plurality of sets of sequence reads that are in the plurality of sequence reads, and the respective first portion of each respective sequence read in each respective set of sequence reads in the plurality of sets of sequence reads corresponds to a subset of the test nucleic acid that at least partially overlaps the different portion of the test nucleic acid that is represented by the bin corresponding to the respective set of sequence reads; (C) identifying, from among the plurality of bins, a first bin and a second bin that correspond to portions of the test nucleic acid that are nonoverlapping, wherein the first bin is represented by a first set of sequence reads in the plurality of sequence reads and the second bin is represented by a second set of sequence reads in the plurality of sequence reads; (D) determining a first value that represents a numeric probability or likelihood that the number of barcodes common to the first set and the second set is attributable to 
chance; (E) responsive to a determination that the first value satisfies a predetermined cut-off value, for each barcode that is common to the first bin and the second bin, obtaining a fragment pair thereby obtaining one or more fragment pairs, each fragment pair in the one or more fragment pairs (i) corresponding to a different barcode that is common to the first bin and the second bin and (ii) consisting of a different first calculated fragment and a different second calculated fragment, wherein, for each respective fragment pair in the one or more fragment pairs; the different first calculated fragment consists of a respective first subset of sequence reads in the plurality of sequence reads having the barcode corresponding to the respective fragment pair, wherein each sequence read in the respective first subset of sequence reads is within a predefined genetic distance of another sequence read in the respective first subset of sequence reads, the different first calculated fragment of the respective fragment pair originates with a first sequence read having the barcode corresponding to the respective fragment pair in the first bin, and each sequence read in the respective first subset of sequence reads is from the first bin, and the different second calculated fragment consists of a respective second subset of sequence reads in the plurality of sequence reads having the barcode corresponding to the respective fragment pair, wherein each sequence read in the respective second subset of sequence reads is within a predefined genetic distance of another sequence read in the respective second subset of sequence reads, the different second calculated fragment of the respective fragment pair originates with a second sequence read having the barcode corresponding to the respective fragment pair in the second bin, and each sequence read in the respective second subset of sequence reads is from the second bin; and (F) computing a respective likelihood based upon a 
probability of occurrence of a first model and a probability of occurrence of a second model regarding the one or more fragment pairs to thereby provide a likelihood of a structural variation in the test nucleic acid, wherein (i) the first model specifies that the respective first calculated fragments and the respective second calculated fragments of the one or more fragment pairs are observed given no structural variation in the target nucleic acid sequence and are part of a common molecule, and (ii) the second model specifies that the respective first calculated fragments and the respective second calculated fragments of the one or more fragment pairs are observed given structural variation in the target nucleic acid sequence.
-
-
21. A non-transitory computer readable storage medium storing one or more programs configured for execution by a computer, the one or more programs comprising instructions for:
-
(A) obtaining a plurality of sequence reads from a plurality of sequencing reactions in which the test nucleic acid is fragmented, wherein each respective sequence read in the plurality of sequence reads comprises a first portion that corresponds to a subset of the test nucleic acid and a second portion that encodes a respective barcode for the respective sequence read in a plurality of barcodes, each respective barcode is independent of the sequencing data of the test nucleic acid, and the plurality of sequence reads collectively include the plurality of barcodes; (B) obtaining bin information for a plurality of bins, wherein each respective bin in the plurality of bins represents a different portion of the test nucleic acid, the bin information identifies, for each respective bin in the plurality of bins, a set of sequence reads in a plurality of sets of sequence reads that are in the plurality of sequence reads, and the respective first portion of each respective sequence read in each respective set of sequence reads in the plurality of sets of sequence reads corresponds to a subset of the test nucleic acid that at least partially overlaps the different portion of the test nucleic acid that is represented by the bin corresponding to the respective set of sequence reads; (C) identifying, from among the plurality of bins, a first bin and a second bin that correspond to portions of the test nucleic acid that are nonoverlapping, wherein the first bin is represented by a first set of sequence reads in the plurality of sequence reads and the second bin is represented by a second set of sequence reads in the plurality of sequence reads; (D) determining a first value that represents a numeric probability or likelihood that the number of barcodes common to the first set and the second set is attributable to chance; (E) responsive to a determination that the first value satisfies a predetermined cut-off value, for each barcode that is common to the first bin and the 
second bin, obtaining a fragment pair thereby obtaining one or more fragment pairs, each fragment pair in the one or more fragment pairs (i) corresponding to a different barcode that is common to the first bin and the second bin and (ii) consisting of a different first calculated fragment and a different second calculated fragment, wherein, for each respective fragment pair in the one or more fragment pairs; the different first calculated fragment consists of a respective first subset of sequence reads in the plurality of sequence reads having the barcode corresponding to the respective fragment pair, wherein each sequence read in the respective first subset of sequence reads is within a predefined genetic distance of another sequence read in the respective first subset of sequence reads, the different first calculated fragment of the respective fragment pair originates with a first sequence read having the barcode corresponding to the respective fragment pair in the first bin, and each sequence read in the respective first subset of sequence reads is from the first bin, and the different second calculated fragment consists of a respective second subset of sequence reads in the plurality of sequence reads having the barcode corresponding to the respective fragment pair, wherein each sequence read in the respective second subset of sequence reads is within a predefined genetic distance of another sequence read in the respective second subset of sequence reads, the different second calculated fragment of the respective fragment pair originates with a second sequence read having the barcode corresponding to the respective fragment pair in the second bin, and each sequence read in the respective second subset of sequence reads is from the second bin; and (F) computing a respective likelihood based upon a probability of occurrence of a first model and a probability of occurrence of a second model regarding the one or more fragment pairs to thereby provide a likelihood 
of a structural variation in the test nucleic acid, wherein (i) the first model specifies that the respective first calculated fragments and the respective second calculated fragments of the one or more fragment pairs are observed given no structural variation in the target nucleic acid sequence and are part of a common molecule, and (ii) the second model specifies that the respective first calculated fragments and the respective second calculated fragments of the one or more fragment pairs are observed given structural variation in the target nucleic acid sequence.
-
-
22. A method of phasing sequencing data of a test nucleic acid sample obtained from a biological sample from a single organism of a species, wherein the test nucleic acid sample comprises a first set of haplotypes (H0) and a second set of haplotypes (H1), the method comprising:
at a computer system having one or more processors, and memory storing one or more programs for execution by the one or more processors; (A) obtaining a reference consensus sequence for all or a portion of a genome of the species; (B) obtaining a plurality of variant calls $A_{i;p}$ for the biological sample, wherein $i$ is an index to a position in the reference consensus sequence, and $p \in \{0, 1\}$ in which label 0 assigns a respective variant call in $A_{i;p}$ to H0 and label 1 assigns the respective variant call to H1; (C) obtaining a plurality of sequence reads $\vec{O}$ for the biological sample, wherein each respective sequence read $\vec{O}_i$ in the plurality of sequence reads comprises a first portion that corresponds to a subset of the reference sequence and a second portion that encodes a respective barcode, independent of the reference sequence, for the respective sequence read, in a plurality of barcodes, and each respective sequence read $\vec{O}_i$ in the plurality of sequence reads is $\in \{0, 1, -\}^n$, wherein (i) $n$ is the number of variant calls in $A_{i;p}$, (ii) each respective label 0 for the respective sequence read $\vec{O}_i$ assigns a corresponding variant call in $A_{i;p}$ to H0, (iii) each respective label 1 for the respective sequence read $\vec{O}_i$ assigns a corresponding variant call in $A_{i;p}$ to H0, and (iv) each respective label − for the respective sequence read $\vec{O}_i$ indicates that the corresponding variant call in $A_{i;p}$ is not covered; and (D) refining a phasing result $\vec{X}$ by optimization of haplotype assignments at individual positions $i$ in $A_{i;p}$ between H0 and H1 for the plurality of sequence reads using the relationship;
-
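The read representation recited in step (C) of claim 22, a vector over {0, 1, −} with one entry per variant call, can be illustrated with a small Python sketch. Everything named here is hypothetical: `encode_read` takes an assumed mapping from variant position to the label (0 or 1) the read carries at that position, and `label_counts` simply tallies labels per position. The sketch does not implement the optimization of step (D).

```python
from collections import Counter

NOT_COVERED = "-"  # label for a variant call the read does not cover


def encode_read(variant_positions, observed):
    """Encode one read as a length-n vector over {0, 1, '-'}.

    variant_positions: ordered positions of the n variant calls
    observed:          assumed dict {position: 0 or 1} for covered positions
    """
    return [observed.get(pos, NOT_COVERED) for pos in variant_positions]


def label_counts(read_vectors):
    """Per variant position, count reads carrying label 0 and label 1."""
    n = len(read_vectors[0])
    counts = []
    for j in range(n):
        tally = Counter(vec[j] for vec in read_vectors if vec[j] != NOT_COVERED)
        counts.append((tally[0], tally[1]))
    return counts


variant_positions = [10, 20, 30]
read_a = encode_read(variant_positions, {10: 0, 30: 1})
read_b = encode_read(variant_positions, {20: 1, 30: 1})
counts = label_counts([read_a, read_b])
```

Keeping uncovered positions explicit keeps every read vector at the same length n, so positions line up across reads when a phasing result is later evaluated.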
23. A method of addressing error in the zygosity of variant calls in phasing sequencing data of a test nucleic acid sample obtained from a biological sample from a single organism of a species, wherein the test nucleic acid sample comprises a first set of haplotypes (H0) and a second set of haplotypes (H1), the method comprising:
at a computer system having one or more processors, and memory storing one or more programs for execution by the one or more processors; (A) obtaining a reference consensus sequence for all or a portion of a genome of the species; (B) obtaining a plurality of variant calls $A_{i;p}$ for the biological sample, wherein $i$ is an index to a position in the reference consensus sequence, and $p \in \{0, 1, -1\}$ in which label 0 assigns a respective variant call in $A_{i;p}$ to H0, label 1 assigns the respective variant call to H1, and label −1 assigns the respective variant call to the zygosity error condition H−1; (C) obtaining a plurality of sequence reads $\vec{O}$ for the biological sample, wherein each respective sequence read $\vec{O}_i$ in the plurality of sequence reads comprises a first portion that corresponds to a subset of the reference sequence and a second portion that encodes a respective barcode, independent of the reference sequence, for the respective sequence read, in a plurality of barcodes, and each respective sequence read $\vec{O}_i$ in the plurality of sequence reads is $\in \{0, 1, -1, -\}^n$, wherein (i) $n$ is the number of variant calls in $A_{i;p}$, (ii) each respective label 0 for the respective sequence read $\vec{O}_i$ assigns a corresponding variant call in $A_{i;p}$ to H0, (iii) each respective label 1 for the respective sequence read $\vec{O}_i$ assigns a corresponding variant call in $A_{i;p}$ to H0, (iv) each respective label −1 for the respective sequence read $\vec{O}_i$ assigns a corresponding variant call in $A_{i;p}$ to H−1, and (v) each respective label − for the respective sequence read $\vec{O}_i$ indicates that the corresponding variant call in $A_{i;p}$ is not covered; and (D) refining a phasing vector result $\vec{X}$ by optimization of haplotype assignments at individual positions $i$ in $A_{i;p}$ between H0, H1 and H−1 for the plurality of sequence reads using an overall objective function; - View Dependent Claims (24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37)
-
38. A computing system, comprising:
-
one or more processors; memory storing one or more programs to be executed by the one or more processors; the one or more programs comprising instructions for phasing sequencing data of a test nucleic acid sample obtained from a biological sample from a single organism of a species, wherein the test nucleic acid sample comprises a first set of haplotypes (H0) and a second set of haplotypes (H1), by executing a method comprising; (A) obtaining a reference consensus sequence for all or a portion of a genome of the species; (B) obtaining a plurality of variant calls $A_{i;p}$ for the biological sample, wherein $i$ is an index to a position in the reference consensus sequence, and $p \in \{0, 1\}$ in which label 0 assigns a respective variant call in $A_{i;p}$ to H0 and label 1 assigns the respective variant call to H1; (C) obtaining a plurality of sequence reads $\vec{O}$ for the biological sample, wherein each respective sequence read $\vec{O}_i$ in the plurality of sequence reads comprises a first portion that corresponds to a subset of the reference sequence and a second portion that encodes a respective barcode, independent of the reference sequence, for the respective sequence read, in a plurality of barcodes, and each respective sequence read $\vec{O}_i$ in the plurality of sequence reads is $\in \{0, 1, -\}^n$, wherein (i) $n$ is the number of variant calls in $A_{i;p}$, (ii) each respective label 0 for the respective sequence read $\vec{O}_i$ assigns a corresponding variant call in $A_{i;p}$ to H0, (iii) each respective label 1 for the respective sequence read $\vec{O}_i$ assigns a corresponding variant call in $A_{i;p}$ to H0, and (iv) each respective label − for the respective sequence read $\vec{O}_i$ indicates that the corresponding variant call in $A_{i;p}$ is not covered; and (D) refining a phasing result $\vec{X}$ by optimization of haplotype assignments at individual positions $i$ in $A_{i;p}$ between H0 and H1 for the plurality of sequence reads using the relationship;
-
-
39. A computing system, comprising:
-
one or more processors; memory storing one or more programs to be executed by the one or more processors; the one or more programs comprising instructions addressing error in the zygosity of variant calls in phasing sequencing data of a test nucleic acid sample obtained from a biological sample from a single organism of a species, wherein the test nucleic acid sample comprises a first set of haplotypes (H0) and a second set of haplotypes (H1), by executing a method comprising; (A) obtaining a reference consensus sequence for all or a portion of a genome of the species; (B) obtaining a plurality of variant calls $A_{i;p}$ for the biological sample, wherein $i$ is an index to a position in the reference consensus sequence, and $p \in \{0, 1, -1\}$ in which label 0 assigns a respective variant call in $A_{i;p}$ to H0, label 1 assigns the respective variant call to H1, and label −1 assigns the respective variant call to the zygosity error condition H−1; (C) obtaining a plurality of sequence reads $\vec{O}$ for the biological sample, wherein each respective sequence read $\vec{O}_i$ in the plurality of sequence reads comprises a first portion that corresponds to a subset of the reference sequence and a second portion that encodes a respective barcode, independent of the reference sequence, for the respective sequence read, in a plurality of barcodes, and each respective sequence read $\vec{O}_i$ in the plurality of sequence reads is $\in \{0, 1, -1, -\}^n$, wherein (i) $n$ is the number of variant calls in $A_{i;p}$, (ii) each respective label 0 for the respective sequence read $\vec{O}_i$ assigns a corresponding variant call in $A_{i;p}$ to H0, (iii) each respective label 1 for the respective sequence read $\vec{O}_i$ assigns a corresponding variant call in $A_{i;p}$ to H0, (iv) each respective label −1 for the respective sequence read $\vec{O}_i$ assigns a corresponding variant call in $A_{i;p}$ to H−1, and (v) each respective label − for the respective sequence read $\vec{O}_i$ indicates that the corresponding variant call in $A_{i;p}$ is not covered; and (D) refining a phasing vector result $\vec{X}$ by optimization of haplotype assignments at individual positions $i$ in $A_{i;p}$ between H0, H1 and H−1 for the plurality of sequence reads using an overall objective function; - View Dependent Claims (40)
-
-
41. A non-transitory computer readable storage medium storing one or more programs configured for execution by a computer, the one or more programs comprising instructions for phasing sequencing data of a test nucleic acid sample obtained from a biological sample from a single organism of a species, wherein the test nucleic acid sample comprises a first set of haplotypes (H0) and a second set of haplotypes (H1), the one or more programs collectively executing a method comprising:
-
(A) obtaining a reference consensus sequence for all or a portion of a genome of the species; (B) obtaining a plurality of variant calls $A_{i;p}$ for the biological sample, wherein $i$ is an index to a position in the reference consensus sequence, and $p \in \{0, 1\}$ in which label 0 assigns a respective variant call in $A_{i;p}$ to H0 and label 1 assigns the respective variant call to H1; (C) obtaining a plurality of sequence reads $\vec{O}$ for the biological sample, wherein each respective sequence read $\vec{O}_i$ in the plurality of sequence reads comprises a first portion that corresponds to a subset of the reference sequence and a second portion that encodes a respective barcode, independent of the reference sequence, for the respective sequence read, in a plurality of barcodes, and each respective sequence read $\vec{O}_i$ in the plurality of sequence reads is $\in \{0, 1, -\}^n$, wherein (i) $n$ is the number of variant calls in $A_{i;p}$, (ii) each respective label 0 for the respective sequence read $\vec{O}_i$ assigns a corresponding variant call in $A_{i;p}$ to H0, (iii) each respective label 1 for the respective sequence read $\vec{O}_i$ assigns a corresponding variant call in $A_{i;p}$ to H0, and (iv) each respective label − for the respective sequence read $\vec{O}_i$ indicates that the corresponding variant call in $A_{i;p}$ is not covered; and (D) refining a phasing result $\vec{X}$ by optimization of haplotype assignments at individual positions $i$ in $A_{i;p}$ between H0 and H1 for the plurality of sequence reads using the relationship;
-
-
42. A non-transitory computer readable storage medium storing one or more programs configured for execution by a computer, the one or more programs comprising instructions for addressing error in the zygosity of variant calls in phasing sequencing data of a test nucleic acid sample obtained from a biological sample from a single organism of a species, wherein the test nucleic acid sample comprises a first set of haplotypes ($H_0$) and a second set of haplotypes ($H_1$), the one or more programs collectively executing a method comprising:
(A) obtaining a reference consensus sequence for all or a portion of a genome of the species;
(B) obtaining a plurality of variant calls $A_{i;p}$ for the biological sample, wherein i is an index to a position in the reference consensus sequence, and $p \in \{0, 1, -1\}$, in which label 0 assigns a respective variant call in $A_{i;p}$ to $H_0$, label 1 assigns the respective variant call to $H_1$, and label $-1$ assigns the respective variant call to the zygosity error condition $H_{-1}$;
(C) obtaining a plurality of sequence reads $\vec{O}$ for the biological sample, wherein each respective sequence read $\vec{O}_i$ in the plurality of sequence reads comprises a first portion that corresponds to a subset of the reference sequence and a second portion that encodes a respective barcode, independent of the reference sequence, for the respective sequence read, in a plurality of barcodes, and each respective sequence read $\vec{O}_i$ in the plurality of sequence reads is $\in \{0, 1, -1, -\}^n$, wherein (i) n is the number of variant calls in $A_{i;p}$, (ii) each respective label 0 for the respective sequence read $\vec{O}_i$ assigns a corresponding variant call in $A_{i;p}$ to $H_0$, (iii) each respective label 1 for the respective sequence read $\vec{O}_i$ assigns a corresponding variant call in $A_{i;p}$ to $H_1$, (iv) each respective label $-1$ for the respective sequence read $\vec{O}_i$ assigns a corresponding variant call in $A_{i;p}$ to $H_{-1}$, and (v) each respective label $-$ for the respective sequence read $\vec{O}_i$ indicates that the corresponding variant call in $A_{i;p}$ is not covered; and
(D) refining a phasing vector result $\vec{X}$ by optimization of haplotype assignments at individual positions i in $A_{i;p}$ between $H_0$, $H_1$ and $H_{-1}$ for the plurality of sequence reads using an overall objective function;

43. The non-transitory computer readable storage medium of claim 42, wherein
$$P(O_{1,f}, \ldots, O_{N,f} \mid \vec{X}, H_f = 0) = \prod_i P(O_{i,f} \mid A_{i,X_i}),$$
$$P(O_{1,f}, \ldots, O_{N,f} \mid \vec{X}, H_f = 1) = \prod_i P(O_{i,f} \mid A_{i,1-X_i}),$$
$$P(O_{1,f}, \ldots, O_{N,f} \mid \vec{X}, H_f = M) = \prod_i 0.5,$$
M indicates a mixture of $H_f = 0$ and $H_f = 1$ for the respective barcode f,
-
iP(Oi,f|Ai,X
-
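The three per-barcode conditional probabilities recited in claim 43 can be sketched numerically as follows. The per-observation model (a match has probability `1 - error_rate` and a mismatch has probability `error_rate`), the decision to skip uncovered positions, and the names `barcode_log_likelihoods` and `error_rate` are assumptions made for illustration; this passage of the claims does not define $P(O_{i,f} \mid A_{i,X_i})$.

```python
import math
from typing import Dict, Sequence, Union

NOT_COVERED = "-"

def barcode_log_likelihoods(
    obs: Sequence[Union[int, str]],
    phasing: Sequence[int],
    error_rate: float = 0.01,
) -> Dict[Union[int, str], float]:
    """Return log P(O_f | X, H_f = h) for h in (0, 1, "M").

    obs[i] is the label barcode f shows at variant i (0, 1 or "-");
    phasing[i] is X_i. The error model is a hypothetical stand-in.
    """
    log_h0 = log_h1 = log_mix = 0.0
    for o, x in zip(obs, phasing):
        if o == NOT_COVERED:
            continue  # assumption: uncovered variants contribute no factor
        # H_f = 0: the barcode is expected to show allele X_i.
        log_h0 += math.log(1 - error_rate if o == x else error_rate)
        # H_f = 1: the barcode is expected to show allele 1 - X_i.
        log_h1 += math.log(1 - error_rate if o == 1 - x else error_rate)
        # H_f = M: each covered observation contributes a factor of 0.5.
        log_mix += math.log(0.5)
    return {0: log_h0, 1: log_h1, "M": log_mix}

ll = barcode_log_likelihoods([0, "-", 1, 1], [0, 1, 1, 1])
print(ll[0] > ll["M"] > ll[1])  # True
```

Working in log space avoids underflow when a barcode covers many variants. A barcode whose observations agree with neither haplotype scores highest under the mixture case, because that case is insensitive to the phasing.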
44. A method of phasing sequencing data of a test nucleic acid sample obtained from a biological sample from a single organism of a species, wherein the test nucleic acid sample comprises a first set of haplotypes ($H_0$) and a second set of haplotypes ($H_1$), the method comprising:
at a computer system having one or more processors, and memory storing one or more programs for execution by the one or more processors;
(A) obtaining a plurality of variant calls $A_{i;p}$ for the test nucleic acid sample, wherein i is an index to a position in a reference consensus sequence for all or a portion of a genome of the species, and $p \in \{0, 1\}$, in which label 0 assigns a respective variant call in $A_{i;p}$ to $H=0$ and label 1 assigns the respective variant call to $H=1$;
(B) for each respective local block of variant calls in $A_{i;p}$ that are localized to a corresponding subset of the reference consensus sequence, using a beam search over the haplotype assignments of local phasing vectors $X_k, X_{k+1}, \ldots, X_{k+j}$ in the respective local block of variant calls, wherein k is the first variant in the respective local block of variant calls, j is a number of variant calls in the respective local block of variant calls, assignments of $X_k, X_{k+1}, \ldots, X_{k+j}$ are found by computing an objective function in which the phasing vector of the objective function in respective computations is limited to $X_k, X_{k+1}, \ldots, X_{k+j}$, and the objective function is calculated by matching observed sequence reads of the test nucleic acid sample against the respective local block of variant calls in $A_{i;p}$, thereby finding a phasing solution for each respective local block of variant calls in $A_{i;p}$; and
(C) greedily joining, upon completion of the beam search for each respective local block of variant calls in $A_{i;p}$, neighboring local blocks of variant calls in $A_{i;p}$ using the phasing solution for each respective local block of variant calls, thereby obtaining a phasing configuration $\hat{X}$ for the single organism of the species.

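The beam search of step (B) can be illustrated with a minimal sketch on one local block. The objective used here (each read equally likely to originate from either haplotype, with a fixed per-observation `error_rate`), the `beam_width` parameter, and the names `block_score` and `beam_search_block` are assumptions for illustration; the claim only requires that the objective be computed by matching observed reads against the block's variant calls.

```python
import math
from typing import List, Sequence, Tuple, Union

NOT_COVERED = "-"
Read = Sequence[Union[int, str]]

def block_score(reads: List[Read], assignment: Tuple[int, ...],
                error_rate: float = 0.01) -> float:
    """Log-likelihood of the reads over the first len(assignment) variants.

    Hypothetical objective: each read comes from H0 or H1 with equal
    prior probability, and each covered observation mismatches with
    probability error_rate.
    """
    total = 0.0
    for read in reads:
        p0 = p1 = 1.0
        for o, x in zip(read, assignment):
            if o == NOT_COVERED:
                continue
            p0 *= (1 - error_rate) if o == x else error_rate
            p1 *= (1 - error_rate) if o == 1 - x else error_rate
        total += math.log(0.5 * p0 + 0.5 * p1)
    return total

def beam_search_block(reads: List[Read], n_variants: int,
                      beam_width: int = 4) -> Tuple[float, Tuple[int, ...]]:
    """Extend partial assignments one variant at a time, keeping the best."""
    beam: List[Tuple[float, Tuple[int, ...]]] = [(0.0, ())]
    for _ in range(n_variants):
        candidates = []
        for _, partial in beam:
            for label in (0, 1):
                extended = partial + (label,)
                candidates.append((block_score(reads, extended), extended))
        # Keep only the beam_width highest-scoring partial assignments.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
    return beam[0]

reads = [[0, 0, "-"], [1, 1, "-"], ["-", 0, 1], ["-", 1, 0]]
score, assignment = beam_search_block(reads, n_variants=3)
print(assignment)
```

Under this symmetric objective an assignment and its complement score identically, so the result is one of two equivalent labelings, here `(0, 0, 1)` or `(1, 1, 0)`. The sketch rescores each candidate from scratch for clarity; an incremental score would be the natural optimization.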
45. A method of phasing sequencing data of a test nucleic acid sample obtained from a biological sample from a single organism of a species while accounting for error in variant call zygosity, wherein the test nucleic acid sample comprises a first set of haplotypes ($H_0$) and a second set of haplotypes ($H_1$), the method comprising:
at a computer system having one or more processors, and memory storing one or more programs for execution by the one or more processors;
(A) obtaining a plurality of variant calls $A_{i;p}$, wherein i is an index to a position in a reference consensus sequence for all or a portion of a genome of the species, and $p \in \{0, 1, -1\}$, in which label 0 assigns a respective variant call in $A_{i;p}$ to $H_0$, label 1 assigns the respective variant call to $H_1$, and label $-1$ assigns the respective variant call to a zygosity error condition $H_{-1}$;
(B) for each respective local block of variant calls in $A_{i;p}$ that are localized to a corresponding subset of the reference consensus sequence, using a beam search over the haplotype assignments of local phasing vectors $X_k, X_{k+1}, \ldots, X_{k+j}$ in the respective local block of variant calls, wherein k is the first variant in the respective local block of variant calls, j is a number of variant calls in the respective local block of variant calls, assignments of $X_k, X_{k+1}, \ldots, X_{k+j}$ are found by computing an objective function in which the phasing vector of the objective function in respective computations is limited to $X_k, X_{k+1}, \ldots, X_{k+j}$, and the objective function is calculated by matching observed sequence reads of the test nucleic acid sample against the respective local block of variant calls in $A_{i;p}$, thereby finding a phasing solution for each respective local block of variant calls in $A_{i;p}$; and
(C) greedily joining, upon completion of the beam search for each respective local block of variant calls in $A_{i;p}$, neighboring local blocks of variant calls in $A_{i;p}$ using the phasing solution for each respective local block of variant calls, thereby obtaining a phasing configuration $\hat{X}$ for the single organism of the species.

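The greedy joining of step (C) can be sketched as a left-to-right pass that fixes the orientation of each block relative to the blocks already joined. The orientation test (keep or flip the next block, whichever scores higher on the reads), the objective in `joint_score`, and the names `flip` and `greedy_join` are assumptions for illustration and are not taken from the claims.

```python
import math
from typing import List, Sequence, Tuple, Union

NOT_COVERED = "-"
Read = Sequence[Union[int, str]]

def joint_score(reads: List[Read], assignment: Tuple[int, ...],
                error_rate: float = 0.01) -> float:
    """Hypothetical objective: equal-prior two-haplotype read likelihood."""
    total = 0.0
    for read in reads:
        p0 = p1 = 1.0
        for o, x in zip(read, assignment):
            if o == NOT_COVERED:
                continue
            p0 *= (1 - error_rate) if o == x else error_rate
            p1 *= (1 - error_rate) if o == 1 - x else error_rate
        total += math.log(0.5 * p0 + 0.5 * p1)
    return total

def flip(block: Tuple[int, ...]) -> Tuple[int, ...]:
    """Swap the haplotype labels of every variant in a block."""
    return tuple(1 - x for x in block)

def greedy_join(reads: List[Read],
                blocks: List[Tuple[int, ...]]) -> Tuple[int, ...]:
    """Join locally phased blocks, choosing each orientation once."""
    joined = tuple(blocks[0])
    for block in blocks[1:]:
        keep = joined + tuple(block)
        flipped = joined + flip(tuple(block))
        # Greedy choice: commit to the better orientation and move on.
        if joint_score(reads, keep) >= joint_score(reads, flipped):
            joined = keep
        else:
            joined = flipped
    return joined

# The third read spans both blocks and shows variants 1 and 2 on
# opposite haplotype labels, so the second block must be flipped.
reads = [[0, 0, "-", "-"], ["-", "-", 1, 0], ["-", 0, 1, "-"]]
print(greedy_join(reads, [(0, 0), (0, 1)]))  # (0, 0, 1, 0)
```

Only reads that span a block boundary influence the choice, because reads confined to one block score the same under either orientation of that block.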
50. A computing system, comprising:
one or more processors;
memory storing one or more programs to be executed by the one or more processors;
the one or more programs comprising instructions for phasing sequencing data of a test nucleic acid sample obtained from a biological sample from a single organism of a species, wherein the test nucleic acid sample comprises a first set of haplotypes ($H_0$) and a second set of haplotypes ($H_1$), by executing a method comprising:
(A) obtaining a plurality of variant calls $A_{i;p}$, wherein i is an index to a position in a reference consensus sequence for all or a portion of a genome of the species, and $p \in \{0, 1, -1\}$, in which label 0 assigns a respective variant call in $A_{i;p}$ to $H_0$, label 1 assigns the respective variant call to $H_1$, and label $-1$ assigns the respective variant call to a zygosity error condition $H_{-1}$;
(B) for each respective local block of variant calls in $A_{i;p}$ that are localized to a corresponding subset of the reference consensus sequence, using a beam search over the haplotype assignments of local phasing vectors $X_k, X_{k+1}, \ldots, X_{k+j}$ in the respective local block of variant calls, wherein k is the first variant in the respective local block of variant calls, j is a number of variant calls in the respective local block of variant calls, assignments of $X_k, X_{k+1}, \ldots, X_{k+j}$ are found by computing an objective function in which the phasing vector of the objective function in respective computations is limited to $X_k, X_{k+1}, \ldots, X_{k+j}$, and the objective function is calculated by matching observed sequence reads of the test nucleic acid sample against the respective local block of variant calls in $A_{i;p}$, thereby finding a phasing solution for each respective local block of variant calls in $A_{i;p}$; and
(C) greedily joining, upon completion of the beam search for each respective local block of variant calls in $A_{i;p}$, neighboring local blocks of variant calls in $A_{i;p}$ using the phasing solution for each respective local block of variant calls, thereby obtaining a phasing configuration $\hat{X}$ for the single organism of the species.

51. A computing system, comprising:
one or more processors;
memory storing one or more programs to be executed by the one or more processors;
the one or more programs comprising instructions for phasing sequencing data of a test nucleic acid sample obtained from a biological sample from a single organism of a species while accounting for error in variant call zygosity, wherein the test nucleic acid sample comprises a first set of haplotypes ($H_0$) and a second set of haplotypes ($H_1$), the one or more programs executing a method comprising:
(A) obtaining a plurality of variant calls $A_{i;p}$, wherein i is an index to a position in a reference consensus sequence for all or a portion of a genome of the species, and $p \in \{0, 1\}$, in which label 0 assigns a respective variant call in $A_{i;p}$ to $H_0$ and label 1 assigns the respective variant call to $H_1$;
(B) for each respective local block of variant calls in $A_{i;p}$ that are localized to a corresponding subset of the reference consensus sequence, using a beam search over the haplotype assignments of local phasing vectors $X_k, X_{k+1}, \ldots, X_{k+j}$ in the respective local block of variant calls, wherein k is the first variant in the respective local block of variant calls, j is a number of variant calls in the respective local block of variant calls, assignments of $X_k, X_{k+1}, \ldots, X_{k+j}$ are found by computing an objective function in which the phasing vector of the objective function in respective computations is limited to $X_k, X_{k+1}, \ldots, X_{k+j}$, and the objective function is calculated by matching observed sequence reads of the test nucleic acid sample against the respective local block of variant calls in $A_{i;p}$, thereby finding a phasing solution for each respective local block of variant calls in $A_{i;p}$; and
(C) greedily joining, upon completion of the beam search for each respective local block of variant calls in $A_{i;p}$, neighboring local blocks of variant calls in $A_{i;p}$ using the phasing solution for each respective local block of variant calls, thereby obtaining a phasing configuration $\hat{X}$ for the single organism of the species.

52. A non-transitory computer readable storage medium storing one or more programs configured for execution by a computer, the one or more programs comprising instructions for phasing sequencing data of a test nucleic acid sample obtained from a biological sample from a single organism of a species, wherein the test nucleic acid sample comprises a first set of haplotypes ($H_0$) and a second set of haplotypes ($H_1$), the one or more programs collectively executing a method comprising:
(A) obtaining a plurality of variant calls $A_{i;p}$, wherein i is an index to a position in a reference consensus sequence for all or a portion of a genome of the species, and $p \in \{0, 1\}$, in which label 0 assigns a respective variant call in $A_{i;p}$ to $H_0$ and label 1 assigns the respective variant call to $H_1$;
(B) for each respective local block of variant calls in $A_{i;p}$ that are localized to a corresponding subset of the reference consensus sequence, using a beam search over the haplotype assignments of local phasing vectors $X_k, X_{k+1}, \ldots, X_{k+j}$ in the respective local block of variant calls, wherein k is the first variant in the respective local block of variant calls, j is a number of variant calls in the respective local block of variant calls, assignments of $X_k, X_{k+1}, \ldots, X_{k+j}$ are found by computing an objective function in which the phasing vector of the objective function in respective computations is limited to $X_k, X_{k+1}, \ldots, X_{k+j}$, and the objective function is calculated by matching observed sequence reads of the test nucleic acid sample against the respective local block of variant calls in $A_{i;p}$, thereby finding a phasing solution for each respective local block of variant calls in $A_{i;p}$; and
(C) greedily joining, upon completion of the beam search for each respective local block of variant calls in $A_{i;p}$, neighboring local blocks of variant calls in $A_{i;p}$ using the phasing solution for each respective local block of variant calls, thereby obtaining a phasing configuration $\hat{X}$ for the single organism of the species.

53. A non-transitory computer readable storage medium storing one or more programs configured for execution by a computer, the one or more programs comprising instructions for phasing sequencing data of a test nucleic acid sample obtained from a biological sample from a single organism of a species while accounting for error in variant call zygosity, wherein the test nucleic acid sample comprises a first set of haplotypes ($H_0$) and a second set of haplotypes ($H_1$), the one or more programs collectively executing a method comprising:
(A) obtaining a plurality of variant calls $A_{i;p}$, wherein i is an index to a position in a reference consensus sequence for all or a portion of a genome of the species, and $p \in \{0, 1\}$, in which label 0 assigns a respective variant call in $A_{i;p}$ to $H=0$ and label 1 assigns the respective variant call to $H=1$;
(B) for each respective local block of variant calls in $A_{i;p}$ that are localized to a corresponding subset of the reference consensus sequence, using a beam search over the haplotype assignments of local phasing vectors $X_k, X_{k+1}, \ldots, X_{k+j}$ in the respective local block of variant calls, wherein k is the first variant in the respective local block of variant calls, j is a number of variant calls in the respective local block of variant calls, assignments of $X_k, X_{k+1}, \ldots, X_{k+j}$ are found by computing an objective function in which the phasing vector of the objective function in respective computations is limited to $X_k, X_{k+1}, \ldots, X_{k+j}$, and the objective function is calculated by matching observed sequence reads of the test nucleic acid sample against the respective local block of variant calls in $A_{i;p}$, thereby finding a phasing solution for each respective local block of variant calls in $A_{i;p}$; and
(C) greedily joining, upon completion of the beam search for each respective local block of variant calls in $A_{i;p}$, neighboring local blocks of variant calls in $A_{i;p}$ using the phasing solution for each respective local block of variant calls, thereby obtaining a phasing configuration $\hat{X}$ for the single organism of the species.

Specification