SYSTEMS AND METHODS FOR GROUPING AND COLLAPSING SEQUENCING READS

US 20200135298A1
Filed: 10/29/2019
Published: 04/30/2020
Est. Priority Date: 10/31/2018
Status: Active Grant

Abstract

Disclosed herein are systems and methods for collapsing sequencing reads and identifying similar sequencing reads. In one example, a method includes generating a plurality of first identifier subsequences from a first identifier sequence of each nucleotide sequencing read and generating a first signature for the nucleotide sequencing read by applying hashing to the plurality of first identifier subsequences. The method may include assigning the nucleotide sequencing read to a first particular bin of a first data structure based on the first signature and determining a nucleotide sequence for each first particular bin of the first data structure with one or more nucleotide sequencing reads assigned.
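
The flow in the abstract (identifier subsequences, a hashed signature, bin assignment, one sequence per bin) can be sketched as follows. This is a minimal illustration, not the patented implementation: the MinHash-style signature, the SHA-1 based `seeded_hash`, the 25-nucleotide prefix used as the identifier, and the "most frequent read" representative are all assumptions made for the example, and every function name is hypothetical.

```python
import hashlib
from collections import Counter, defaultdict

def kmers(seq, k=4):
    """Overlapping k-mers of seq; consecutive k-mers share k-1 bases."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def seeded_hash(text, seed):
    """Deterministic 64-bit hash; the seed selects one of several hash functions."""
    digest = hashlib.sha1(f"{seed}:{text}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def signature(identifier, k=4, num_hashes=8):
    """MinHash-style signature (an assumption): the minimum hash per seed.

    Assumes the identifier is at least k nucleotides long.
    """
    parts = kmers(identifier, k)
    return tuple(min(seeded_hash(p, s) for p in parts) for s in range(num_hashes))

def collapse(reads, id_len=25):
    """Bin reads by the signature of their prefix, then keep one read per bin."""
    bins = defaultdict(list)
    for read in reads:
        bins[signature(read[:id_len])].append(read)
    # Hypothetical choice of representative: the most frequent read in the bin.
    return {sig: Counter(group).most_common(1)[0][0] for sig, group in bins.items()}
```

Reads whose identifier prefixes yield the same signature fall into the same bin, so `collapse` returns one entry per non-empty bin regardless of how many reads were assigned to it.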

29 Claims

1. A system for determining a nucleotide sequence from nucleotide sequencing reads, comprising:
- a non-transitory memory configured to store executable instructions and a first hash data structure for storing nucleotide sequencing reads in a plurality of bins; and
  
  a hardware processor programmed by the executable instructions to perform a method comprising:
  
  receiving a plurality of first nucleotide sequencing reads;
  
  for each first nucleotide sequencing read:
  
  generating a plurality of first identifier subsequences from a first identifier sequence of the first nucleotide sequencing read;
  
  generating a first signature for the first nucleotide sequencing read by applying hashing to the plurality of first identifier subsequences; and
  
  assigning the first nucleotide sequencing read to at least one first particular bin of the first hash data structure based on the first signature; and
  
  determining a nucleotide sequence for each first particular bin of the first hash data structure with one or more first nucleotide sequencing reads assigned.
- - 2. The system of claim 1, wherein assigning the first nucleotide sequencing read comprises:
    - determining a plurality of subsequences of the first signature from the first signature of the first nucleotide sequencing read; and
      
      assigning the first nucleotide sequencing read to a first particular bin of each first hash data structure of a plurality of first hash data structures based on a subsequence of the first signature.
  - 3. The system of claim 1, wherein assigning the first nucleotide sequencing read comprises:
    - determining a plurality of subsequences of the first signature from the first signature of the first nucleotide sequencing read; and
      
      assigning the first nucleotide sequencing read to a plurality of first particular bins of the first hash data structure based on the plurality of subsequences of the first signature.
  - 4. The system of claim 1, wherein the first particular bin is an existing bin of the first hash data structure, and wherein an alignment score of the first nucleotide sequencing read and another first nucleotide sequencing read assigned to the first particular bin of the first hash data structure is above an alignment score threshold.
  - 5. The system of claim 1, wherein the first particular bin is an existing bin of the first hash data structure, and wherein the highest alignment score of the first nucleotide sequencing read and any first nucleotide sequencing read assigned to the first particular bin of the first hash data structure is above an alignment score threshold.
  - 6. The system of claim 1, wherein the first particular bin is a new bin of the first hash data structure, and wherein an alignment score of the first nucleotide sequencing read and any first nucleotide sequencing read assigned to any existing bin of the first hash data structure is below an alignment score threshold.
  - 7. The system of claim 1, wherein the first signature matches a key of the first particular bin of the first hash data structure.
  - 8. The system of claim 1, wherein the first signature and the key of the first particular bin of the first hash data structure are identical.
  - 9. The system of claim 1, wherein each first nucleotide sequencing read is associated with a second nucleotide sequencing read, and wherein the first nucleotide sequencing read and the second nucleotide sequencing read form paired-end nucleotide sequencing reads.
  - 10. The system of claim 1, wherein determining the nucleotide sequence comprises determining a consensus sequence of the one or more first nucleotide sequencing reads assigned to the first particular bin.
  - 11. The system of claim 10, wherein determining the consensus sequence comprises determining a first nucleotide sequencing read with a highest quality score assigned to the first particular bin as the consensus sequence of the first particular bin.
  - 12. The system of claim 1, wherein determining the nucleotide sequence comprises selecting a sequence of the one or more first nucleotide sequencing reads assigned to the first particular bin as a representative sequence of the first particular bin.
  - 13. The system of claim 1, wherein determining the nucleotide sequence comprises determining an alignment score of two of the one or more first nucleotide sequencing reads assigned to the first particular bin is above an alignment score threshold.
  - 14. The system of claim 1, wherein the plurality of nucleotide sequencing reads is associated with an identical physical identifier sequence.
  - 15. The system of claim 1, wherein the plurality of nucleotide sequencing reads is not associated with any physical identifier sequence.
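
Claims 2 and 3 describe splitting the signature into subsequences and using each one to place the read in a bin. One way to picture this, offered only as a sketch, is a banded index with one hash table per band of the signature. The class name `BandedIndex`, the equal-width banding, and the integer signatures in the usage note are assumptions for illustration; the signature length is assumed to be a multiple of the number of bands.

```python
from collections import defaultdict

def bands(signature, band_size):
    """Split a signature (a tuple of hash values) into consecutive bands."""
    return [tuple(signature[i:i + band_size])
            for i in range(0, len(signature), band_size)]

class BandedIndex:
    """One hash table per band; a read lands in one bin of every table."""

    def __init__(self, num_bands):
        self.tables = [defaultdict(list) for _ in range(num_bands)]

    def add(self, read_id, signature):
        band_size = len(signature) // len(self.tables)
        for table, band in zip(self.tables, bands(signature, band_size)):
            table[band].append(read_id)

    def candidates(self, signature):
        """Reads that share at least one band with the given signature."""
        band_size = len(signature) // len(self.tables)
        found = set()
        for table, band in zip(self.tables, bands(signature, band_size)):
            found.update(table.get(band, []))
        return found
```

With two bands, a read stored under signature `(1, 2, 3, 4)` is returned for a query signature `(1, 2, 5, 5)` because the first band matches, even though the full signatures differ.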

16. A computer-implemented method for determining a nucleotide sequence from nucleotide sequencing reads, comprising:
- receiving a plurality of first nucleotide sequencing reads;
  
  for each first nucleotide sequencing read:
  
  generating a plurality of first identifier subsequences from a first identifier sequence of the first nucleotide sequencing read;
  
  generating a first signature for the first nucleotide sequencing read by applying hashing to the plurality of first identifier subsequences; and
  
  assigning the first nucleotide sequencing read to a first particular bin of a first data structure based on the first signature; and
  
  determining a nucleotide sequence for each first particular bin of the first data structure with one or more first nucleotide sequencing reads assigned.
- - 17. The method of claim 16, wherein generating the plurality of first identifier subsequences comprises generating a plurality of k-mers from the first identifier sequence of the sequencing read.
  - 18. The method of claim 17, wherein the subsequence comprises a nucleotide insertion, a nucleotide deletion, a nucleotide substitution, or a combination thereof.
  - 19. The method of claim 17, wherein two consecutive first identifier subsequences overlap.
  - 20. The method of claim 17, wherein the plurality of first identifier subsequences comprises a plurality of 4-mers, and wherein the first identifier sequence comprises about 25 nucleotides.
  - 21. The method of claim 17, wherein the first identifier sequence is a subsequence of the sequencing read 1.
  - 22. The system of claim 17, wherein generating the first signature comprises determining a plurality of hashes for each first identifier subsequence.
  - 23. The system of claim 17, wherein the first data structure comprises a hash table.
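
Claims 17 to 20 refer to overlapping k-mers, for example 4-mers taken from an identifier sequence of about 25 nucleotides. The sketch below works through that arithmetic: a 25-nucleotide sequence yields 22 overlapping 4-mers, and a single substitution changes only the four k-mers that cover the altered position. The identifier string is made up for the example, and the Jaccard comparison is an assumption about how k-mer sets might be compared, not something the claims specify.

```python
def kmers(seq, k=4):
    """Overlapping k-mers of seq; consecutive k-mers share k-1 bases."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def jaccard(a, b):
    """Jaccard similarity of two k-mer collections, compared as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical 25-nucleotide identifier and a copy with one substitution (T -> A).
identifier = "ACGTTGCAAGCTTAGGCATCGATCC"
mutated = identifier[:12] + "A" + identifier[13:]

original_kmers = kmers(identifier)
mutated_kmers = kmers(mutated)

# Count the k-mer positions affected by the single substitution.
changed = sum(x != y for x, y in zip(original_kmers, mutated_kmers))
similarity = jaccard(original_kmers, mutated_kmers)
```

Because most k-mers survive a single-base error, the two k-mer sets stay largely shared, which is what lets a hash-based signature group near-identical identifiers.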

24. A system for identifying similar nucleotide sequencing reads, comprising:
- non-transitory memory configured to store:
  
  executable instructions, a first hash data structure and a second hash data structure for storing a plurality of pairs of sequencing reads; and
  
  a hardware processor programmed by the executable instructions to perform a method comprising:
  
  receiving a pair of a first query nucleotide sequencing read and a second query nucleotide sequencing read;
  
  generating a plurality of first query identifier subsequences and a plurality of second query identifier subsequences from the first query nucleotide sequencing read and the second query nucleotide sequencing read, respectively;
  
  generating a first query signature and a second query signature for the first nucleotide sequencing read and the second nucleotide sequencing read, respectively, by applying hashing to the plurality of first query identifier subsequences and the plurality of second query identifier subsequences, respectively;
  
  retrieving one or more first stored pairs and one or more second stored pairs from the first hash data structure and the second hash data structure using the first query signature and the second query signature, respectively, wherein each of the first pairs and the second pairs comprises a first stored nucleotide sequencing read and a second stored nucleotide sequencing read; and
  
  determining each pair of a first stored nucleotide sequencing read and a second stored nucleotide sequencing read present in both the first stored pairs and second stored pairs as a sequencing read 1 and sequencing read 2 similar to the query sequencing read 1 and the query sequencing read 2, respectively.
- - 25. The system of claim 24, wherein each pair of sequencing reads comprises a first nucleotide sequencing read and a second nucleotide sequencing read, wherein each pair of sequencing reads is assigned to one of a plurality of first bins of the first hash data structure based on a first signature of a first nucleotide sequencing read of the pair generated by hashing first identifier subsequences of a first identifier sequence of the first nucleotide sequencing read, and wherein each pair of sequencing reads is assigned to one of a plurality of second bins of the second hash data structure based on a second signature of a second nucleotide sequencing read of the pair generated by hashing second identifier sequences of the second nucleotide sequencing read.
  - 26. The system of claim 25, wherein the hardware processor is programmed by the executable instructions to perform the method comprising:
    - for each pair of sequencing reads:
      
      generating a plurality of first identifier subsequences from a first identifier sequence of the first nucleotide sequencing read of the pair of sequencing reads;
      
      generating a first signature for the first nucleotide sequencing read by applying hashing to the plurality of first identifier subsequences; and
      
      assigning the pair of sequencing reads to at least one first particular bin of the first hash data structure based on the first signature; and
      
      determining a nucleotide sequence for each first particular bin of the first hash data structure with one or more pairs of first nucleotide sequencing reads and second nucleotide sequencing reads assigned from the first nucleotide sequencing reads and the second nucleotide sequencing reads of the one or more pairs.
  - 27. The system of claim 25, wherein each pair of sequencing reads is associated with a first identifier sequence and a second identifier sequence, and wherein the hardware processor is programmed by the executable instructions to perform the method comprising:
    - determining the first identifier sequence and the second identifier sequence of a first pair of sequencing reads and the second identifier sequence and the first identifier sequence of a second pair of sequencing reads are identical; and
      
      determining a nucleotide sequence of the first pair of sequencing reads and the second pair of sequencing reads.
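
The paired lookup of claim 24 can be illustrated with two hash tables, one keyed by the signature of read 1 and one keyed by the signature of read 2, where a stored pair counts as similar only if it is retrieved from both. This is a hedged sketch: `PairedIndex`, the MinHash-style `signature`, and the SHA-1 based hashing are assumptions for illustration, and reads are assumed to be at least `k` nucleotides long.

```python
import hashlib
from collections import defaultdict

def signature(seq, k=4, num_hashes=4):
    """MinHash-style signature over overlapping k-mers (an assumed scheme)."""
    parts = [seq[i:i + k] for i in range(len(seq) - k + 1)]

    def seeded_hash(text, seed):
        digest = hashlib.sha1(f"{seed}:{text}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return tuple(min(seeded_hash(p, s) for p in parts) for s in range(num_hashes))

class PairedIndex:
    """Stores read pairs in two tables, keyed by each mate's signature."""

    def __init__(self):
        self.by_read1 = defaultdict(set)
        self.by_read2 = defaultdict(set)

    def add(self, read1, read2):
        pair = (read1, read2)
        self.by_read1[signature(read1)].add(pair)
        self.by_read2[signature(read2)].add(pair)

    def query(self, query1, query2):
        """Pairs retrieved from both tables, i.e. similar on both mates."""
        hits1 = self.by_read1.get(signature(query1), set())
        hits2 = self.by_read2.get(signature(query2), set())
        return hits1 & hits2
```

Intersecting the two result sets filters out pairs that resemble the query on only one mate.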

28. A method for identifying similar nucleotide sequencing reads, comprising:
- receiving a first query nucleotide sequencing read;
  
  generating a plurality of first query identifier subsequences from the first query nucleotide sequencing read;
  
  generating a first query signature for the first nucleotide sequencing read by applying hashing to the plurality of first query identifier subsequences; and
  
  retrieving one or more first stored nucleotide sequencing reads from a first hash data structure using the first query signature, wherein each of the first stored nucleotide sequencing reads is similar to the first query nucleotide sequencing read.
- - 29. The method of claim 28, wherein receiving the first query nucleotide sequencing read comprises receiving a pair of the first query nucleotide sequencing read and a second query nucleotide sequencing read, wherein generating the plurality of first query identifier subsequences comprises generating a plurality of second query identifier subsequences from the second nucleotide sequencing read, wherein generating the first query signature comprises generating a second query signature for the second nucleotide sequencing read by applying hashing to the plurality of second query identifier subsequences, and wherein retrieving one or more first stored nucleotide sequencing reads comprises retrieving one or more first stored pairs from the first hash data structure, storing a plurality of pairs of sequencing reads, using the first query signature and the second query signature, wherein each of the first pairs comprises a first stored nucleotide sequencing read and a second stored nucleotide sequencing read similar to the first query nucleotide sequencing read and the second query nucleotide sequencing read, respectively.

Current Assignee: Illumina Incorporated
Original Assignee: Illumina Incorporated
Inventors: Zhao, Chen; Wu, Kevin Eric; Bilke, Sven
Granted Patent: US 11,688,489 B2

CPC Class Codes

G06F 16/2255   Hash tables

G06F 16/24578   using ranking

G16B 30/10   Sequence alignment; Homolog...

G16B 30/20   Sequence assembly
