Systems and methods for using paired-end data in directed acyclic structure

US 10,055,539 B2
Filed: 07/14/2015
Issued: 08/21/2018
Est. Priority Date: 10/21/2013
Status: Active Grant

Abstract

Methods of analyzing a transcriptome involve obtaining at least one pair of paired-end reads from a transcriptome of an organism; finding an alignment with an optimal score between a first read of the pair and a node in a directed acyclic data structure (the data structure has nodes representing RNA sequences such as exons or transcripts and edges connecting pairs of nodes); identifying candidate paths that include the node connected to a downstream node by a path having a length substantially similar to an insert length of the pair of paired-end reads; and aligning the paired-end reads to the candidate paths to determine an optimal-scoring alignment.
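
The candidate-path step described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that "path length" means the summed length of the nodes strictly between the node hit by the first read and the downstream node, and that "substantially similar" means within a fixed tolerance. The function name `candidate_paths`, the node names, and the example lengths are all hypothetical.

```python
from typing import Dict, List

def candidate_paths(
    lengths: Dict[str, int],
    edges: Dict[str, List[str]],
    start: str,
    insert_length: int,
    tolerance: int,
) -> List[List[str]]:
    """Enumerate paths from `start` whose intervening length is near `insert_length`.

    Assumption of this sketch: the length of a path is the summed length of the
    nodes strictly between `start` and the final (downstream) node.
    """
    results: List[List[str]] = []

    def walk(path: List[str], gap: int) -> None:
        for nxt in edges.get(path[-1], []):
            # `gap` is the length between `start` and `nxt` along this path.
            if abs(gap - insert_length) <= tolerance:
                results.append(path + [nxt])
            new_gap = gap + lengths[nxt]
            # Prune: walking further can only increase the gap.
            if new_gap <= insert_length + tolerance:
                walk(path + [nxt], new_gap)

    walk([start], 0)
    return results

# Hypothetical exon graph: node lengths in bases and successor lists.
lengths = {"e1": 100, "e2": 50, "e3": 200, "e4": 80}
edges = {"e1": ["e2", "e3"], "e2": ["e3", "e4"], "e3": ["e4"]}

print(candidate_paths(lengths, edges, "e1", 200, 20))
# [['e1', 'e3', 'e4']]
```

Paths that are not returned would be the non-candidate paths, which are left out of any later alignment of the second read.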

20 Claims

1. A system for analyzing a transcriptome, the system comprising:
- a processor coupled to the memory, wherein the system is operable to:
  
  obtain, from an annotated transcriptome database, a plurality of exons and introns from a genome;
  
  use the processor to transform the plurality of exons and introns into a directed acyclic data structure comprising nodes representing known RNA sequences and edges connecting the nodes;
  
  obtain a pair of paired-end reads generated by sequencing a transcriptome of an organism;
  
  use the processor to transform the first read of the pair into an alignment with an optimal score between that first read of the pair and a node in the directed acyclic data structure;
  
  identify, using the processor, candidate paths within the directed acyclic data structure that include the node connected to a downstream node by a path having a length substantially similar to an insert length of the pair of paired-end reads;
  
  exclude non-candidate paths from alignments involving the pair of paired-end reads;
  
  align, using the processor, the paired-end reads to the candidate paths to determine an optimal-scoring alignment by:
  
  calculating match scores between a second read of the pair and nodes in the candidate paths, and
  
  looking backwards to predecessor nodes in the candidate paths while not considering any nodes in the non-candidate paths to identify a back-trace through the candidate paths that gives an optimal score,
  
  wherein the back-trace that gives the optimal score corresponds to an optimal scoring alignment of the pair of paired-end reads to the candidate paths, and
  
  wherein the directed acyclic data structure held in the memory prior to obtaining the pair of paired-end reads includes at least one path that has a node that the second read of the pair aligns to but that is not included during the aligning step due to being excluded as a noncandidate path; and
  
  identify an isoform of an RNA from the organism using the optimal scoring alignment of the paired-end reads.
  - 2. The system of claim 1, further operable to determine optimal-scoring alignments for each of the plurality of pairs of paired-end reads.
  - 3. The system of claim 2, further comprising inferring an isoform frequency for the isoform based on the optimal-scoring alignments for each of the plurality of pairs of paired-end reads.
  - 4. The system of claim 1, wherein the identified isoform is a novel isoform and the processor is operable to update the directed acyclic data structure to represent the novel isoform.
  - 5. The system of claim 4, wherein updating the directed acyclic data structure to represent the novel isoform comprises adding at least one new node.
  - 6. The system of claim 5, wherein the node and the downstream node represent a pair of exons from which the pair of paired-end reads were obtained.
  - 7. The system of claim 1, further operable to:
    - align a plurality of pairs of paired-end reads to the directed acyclic data structure; and
      
      determine a set of isoform paths, wherein each path of the determined set of isoform paths represents a transcript isoform present in the organism.
  - 8. The system of claim 1, wherein the known RNA sequences represent a plurality of exons and introns from features in the annotated transcriptome database, further wherein the edges connect pairs of the nodes in their canonical genomic order.
  - 9. The system of claim 1, wherein the alignment with an optimal score between the first read of the pair and a node in the directed acyclic data structure comprises an alignment with an optimal score between a first read of the pair and a node including an exon.
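
The aligning step of claim 1 (match scores per node, with a look-back to predecessor nodes restricted to candidate paths) can be sketched in Python as below. This is an illustration under stated assumptions, not the patented implementation: it uses a Smith-Waterman-style local score with linear gap penalties, returns only the best score rather than the back-trace, and expects `order` to be a topological ordering of nodes with non-empty sequences. The scoring constants, the function name `align_to_dag`, and the three-node example graph are hypothetical.

```python
from typing import Dict, List, Set

MATCH, MISMATCH, GAP = 2, -1, -2  # illustrative scoring values (assumed)

def align_to_dag(
    read: str,
    seqs: Dict[str, str],
    preds: Dict[str, List[str]],
    order: List[str],
    allowed: Set[str],
) -> int:
    """Best local alignment score of `read` against the allowed part of a DAG.

    `order` must be a topological order; nodes outside `allowed` (the
    non-candidate nodes) are skipped and never used as predecessors.
    """
    last_col: Dict[str, List[int]] = {}
    best = 0
    n = len(read)
    for node in order:
        if node not in allowed:
            continue
        # Columns feeding the first base of this node: the last column of
        # every allowed predecessor node.
        prev_cols = [last_col[p] for p in preds.get(node, []) if p in allowed]
        for base in seqs[node]:
            col = [0] * (n + 1)
            for i in range(1, n + 1):
                s = MATCH if read[i - 1] == base else MISMATCH
                diag = max((pc[i - 1] for pc in prev_cols), default=0)
                left = max((pc[i] for pc in prev_cols), default=0)
                col[i] = max(0, diag + s, left + GAP, col[i - 1] + GAP)
            best = max(best, max(col))
            prev_cols = [col]  # within a node, look back one column
        last_col[node] = col
    return best

# Hypothetical graph: A branches to B and to C.
seqs = {"A": "ACGT", "B": "TTTT", "C": "GGCA"}
preds = {"B": ["A"], "C": ["A"]}
order = ["A", "B", "C"]

print(align_to_dag("GTGG", seqs, preds, order, {"A", "C"}))  # 8
print(align_to_dag("GTGG", seqs, preds, order, {"A", "B"}))  # 4
```

In the second call, node C contains sequence the read would align to, but it is excluded as non-candidate and so never enters the score calculation, which mirrors the final "wherein" clause of claim 1.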

10. A system for analyzing a transcriptome, the system comprising a processor coupled to a memory and operable to:
- obtain a pair of paired-end reads from a transcriptome;
  
  find an alignment with an optimal score between a first read of the pair and a node in a directed acyclic data structure, the data structure comprising nodes representing RNA sequences and edges connecting pairs of the nodes,
  
  identify candidate paths that include the node connected to a downstream node by a path having a length substantially similar to an insert length of the pair of paired-end reads;
  
  exclude any paths that are not candidate paths from any alignment calculations involving the pair of paired-end reads; and
  
  align the paired-end reads to the candidate paths to determine an optimal-scoring alignment by:
  
  calculating match scores between a second read of the pair and nodes in the candidate paths, and
  
  looking backwards to predecessor nodes in the candidate paths while not considering any nodes in the non-candidate paths to identify a back-trace through the candidate paths that gives an optimal score,
  
  wherein the back-trace that gives the optimal score corresponds to the optimal scoring alignment of the pair of paired-end reads to the candidate paths, and
  
  wherein the directed acyclic data structure held in the memory prior to obtaining the pair of paired-end reads includes at least one path that had a node that the second reads of the pair aligns to but is not included during the aligning step due to being excluded as a non-candidate path.
  - 11. The system of claim 10, further operable to obtain a plurality of paired-end reads from the transcriptome.
  - 12. The system of claim 11, further operable to determine optimal-scoring alignments for each of the plurality of pairs of paired-end reads.
  - 13. The system of claim 12, wherein the plurality of paired-end reads are obtained from cDNA fragments from the transcriptome.
  - 14. The system of claim 10, further operable to identify an isoform based on the determined optimal-scoring alignment.
  - 15. The system of claim 14, wherein the identified isoform is a novel isoform.
  - 16. The system of claim 15, further operable to update the directed acyclic data structure to represent the novel isoform.
  - 17. The system of claim 16, wherein updating the directed acyclic data structure to represent the novel isoform comprises adding at least one new node.
  - 18. The system of claim 17, wherein the node and the downstream node represent a pair of exons from which the pair of paired-end reads were obtained.
  - 19. The system of claim 10, further operable to:
    - align a plurality of pairs of paired-end reads to the directed acyclic data structure;
      
      determine distances between the aligned pairs of the plurality of pairs of paired-end reads and the frequencies of the distances between the aligned pairs; and
      
      determine a set of isoform paths and isoform frequencies such that representing the isoform paths through the structure at the isoform frequencies results in pairs of features being included in the isoform paths at the frequencies of the distances between the aligned pairs.
  - 20. The system of claim 19, wherein each path of the determined set of isoform paths represents a transcript isoform present in the organism.
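
The first part of claim 19 (tabulating the distances between aligned pairs and how often each distance occurs) can be illustrated with a short Python sketch. It assumes each aligned pair is reduced to two integer coordinates, one per read, along a common axis; that representation and the name `distance_frequencies` are hypothetical. The later step of choosing isoform paths and frequencies consistent with these values is not shown.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

def distance_frequencies(pairs: Iterable[Tuple[int, int]]) -> Dict[int, float]:
    """Map each observed pair distance to the fraction of pairs showing it."""
    distances = [abs(second - first) for first, second in pairs]
    total = len(distances)
    if total == 0:
        return {}
    counts = Counter(distances)
    return {d: c / total for d, c in counts.items()}

# Hypothetical aligned positions (first read, second read).
aligned = [(100, 350), (120, 370), (200, 700), (10, 260)]
print(distance_frequencies(aligned))
# {250: 0.75, 500: 0.25}
```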

Current Assignee: Seven Bridges Genomics, Inc.
Original Assignee: Seven Bridges Genomics, Inc.
Inventors: Kural, Deniz; Meyvis, Nathan
Primary Examiner(s): Clow, Lori A.
Application Number: US14/798,686
Publication Number: US 20150310167A1
Time in Patent Office: 1,134 Days
Field of Search: None
CPC Class Codes:
G16B 30/00   ICT specially adapted for s...
G16B 30/10   Sequence alignment; Homolog...
G16B 50/00   ICT programming tools or da...
G16B 50/10   Ontologies; Annotations
G16H 50/00   ICT specially adapted for m...