Distributed data processing platform for metagenomic monitoring and characterization

US 10,127,352 B1
Filed: 12/30/2015
Issued: 11/13/2018
Est. Priority Date: 04/06/2015
Status: Active Grant

Abstract

A method comprises configuring a first processing node for communication with one or more additional processing nodes and with one or more of a plurality of geographically-distributed metagenomics sequencing centers via one or more networks, processing metagenomics sequencing results obtained from one or more of the metagenomics sequencing centers in the first processing node, and providing surveillance functionality relating to at least one designated biological issue on behalf of one or more requesting clients based at least in part on the processing of metagenomics sequencing results performed by the first processing node and related processing performed by one or more of the additional processing nodes. Each of the metagenomics sequencing centers is configured to perform metagenomics sequencing on biological samples from respective sample sources in a corresponding data zone.

20 Claims

1. A method comprising:
- configuring a first processing node for communication with one or more additional processing nodes and with one or more of a plurality of geographically-distributed metagenomics sequencing centers via one or more networks;
  
  processing metagenomics sequencing results obtained from one or more of the metagenomics sequencing centers in the first processing node; and
  
  providing surveillance functionality relating to at least one designated biological issue on behalf of one or more requesting clients based at least in part on the processing of metagenomics sequencing results performed by the first processing node and related processing performed by one or more of the additional processing nodes;
  
  wherein each of the metagenomics sequencing centers is configured to perform metagenomics sequencing on biological samples from respective sample sources in a corresponding data zone;
  
  wherein processing the metagenomics sequencing results further comprises generating a hit abundance score vector for a given one of the biological samples wherein the hit abundance score vector comprises a plurality of entries corresponding to respective occurrence frequencies of at least one read of the given biological sample in respective target genomic sequences;
  
  wherein providing surveillance functionality further comprises:
  
  performing a preprocessing operation to reduce a biclustering sample space of a genomic comparison component;
  
  generating a hit abundance score matrix for the genomic comparison component comprising a plurality of the hit abundance score vectors wherein one of rows and columns of the hit abundance score matrix correspond to respective different ones of the biological samples and the other of the rows and columns of the hit abundance score matrix correspond to respective different ones of the target genomic sequences; and
  
  performing a biclustering operation on the hit abundance score matrix; and
  
  wherein the method is implemented by at least one processing device comprising a processor coupled to a memory.
- Dependent claims 2-11:
  - 2. The method of claim 1 wherein each of at least a subset of the processing nodes comprises at least one worldwide data node configured to perform operations in accordance with at least one supported framework of a YARN cluster on one or more corresponding portions of the metagenomics sequencing results.
  - 3. The method of claim 1 wherein the sample sources comprise one or more of water sources, food sources, agricultural sources and clinical sources.
  - 4. The method of claim 1 wherein the surveillance functionality relating to at least one designated biological issue comprises characterization of at least one of a disease, an infection and a contamination.
  - 5. The method of claim 4 wherein the characterization of at least one of a disease, an infection and a contamination comprises characterizing said disease, infection or contamination as involving genomic material from multiple ones of the biological samples sequenced by different ones of the metagenomics sequencing centers.
  - 6. The method of claim 1 wherein the processing of the metagenomics sequencing results in the first processing node comprises determining if genomic material in the metagenomics sequencing results is present in one or more known genomes.
  - 7. The method of claim 1 wherein the metagenomics sequencing results for a given one of the biological samples comprises a complete sequencing of the biological sample performed without utilization of a culture-based pathogen isolation process.
  - 8. The method of claim 7 wherein the complete sequencing of the biological sample comprises a set of reads for all organisms in the sample.
  - 9. The method of claim 1 wherein the metagenomics sequencing results for a given one of the biological samples comprises a subset of reads for the given biological sample that are determined to match existing reads from other samples.
  - 10. The method of claim 1 wherein the metagenomics sequencing results for a given one of the biological samples comprises a subset of reads for the given biological sample that excludes any reads that match a human genome.
  - 11. The method of claim 1 wherein performing a biclustering operation on the hit abundance score matrix comprises processing the hit abundance score matrix in the form of a bipartite graph in which a first set of nodes represents respective ones of the biological samples, a second set of nodes represents respective ones of the target genomic sequences, and edges between nodes in the first set and nodes in the second set represent hit abundance scores of the hit abundance score vectors of the hit abundance score matrix.
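
As an illustration of the hit abundance score vector and matrix recited in claim 1, the following is a minimal sketch, not the patented implementation. It assumes that "occurrence frequency" means the number of exact, possibly overlapping, substring matches of a sample's reads in each target genomic sequence; a production pipeline would more likely use a sequence aligner. The function names and the toy sequences are hypothetical.

```python
from typing import Dict, List


def count_occurrences(read: str, target: str) -> int:
    """Count exact, possibly overlapping, occurrences of read in target."""
    if not read:
        return 0
    count = 0
    start = target.find(read)
    while start != -1:
        count += 1
        start = target.find(read, start + 1)
    return count


def hit_abundance_vector(reads: List[str], targets: Dict[str, str]) -> List[int]:
    """One entry per target sequence: total hits of the sample's reads in it."""
    return [
        sum(count_occurrences(read, sequence) for read in reads)
        for sequence in targets.values()
    ]


def hit_abundance_matrix(
    samples: Dict[str, List[str]], targets: Dict[str, str]
) -> List[List[int]]:
    """Rows are biological samples, columns are target genomic sequences."""
    return [hit_abundance_vector(reads, targets) for reads in samples.values()]


# Hypothetical toy data: two target sequences and two samples.
targets = {"genome_a": "ACGTACGT", "genome_b": "TTACGTT"}
samples = {"sample_1": ["ACG", "CGT"], "sample_2": ["TT"]}

matrix = hit_abundance_matrix(samples, targets)
print(matrix)  # [[4, 2], [0, 2]]
```

Here rows correspond to samples and columns to target sequences; the claim permits either orientation.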

12. A computer program product comprising a non-transitory processor-readable storage medium having stored therein program code of one or more software programs, wherein the program code when executed by at least one processing device causes said at least one processing device:
- to configure a first processing node for communication with one or more additional processing nodes and with one or more of a plurality of geographically-distributed metagenomics sequencing centers via one or more networks;
  
  to process metagenomics sequencing results obtained from one or more of the metagenomics sequencing centers in the first processing node; and
  
  to provide surveillance functionality relating to at least one designated biological issue on behalf of one or more requesting clients based at least in part on the processing of metagenomics sequencing results performed by the first processing node and related processing performed by one or more of the additional processing nodes;
  
  wherein each of the metagenomics sequencing centers is configured to perform metagenomics sequencing on biological samples from respective sample sources in a corresponding data zone;
  
  wherein processing the metagenomics sequencing results further comprises generating a hit abundance score vector for a given one of the biological samples wherein the hit abundance score vector comprises a plurality of entries corresponding to respective occurrence frequencies of at least one read of the given biological sample in respective target genomic sequences; and
  
  wherein providing surveillance functionality further comprises:
  
  performing a preprocessing operation to reduce a biclustering sample space of a genomic comparison component;
  
  generating a hit abundance score matrix for the genomic comparison component comprising a plurality of the hit abundance score vectors wherein one of rows and columns of the hit abundance score matrix correspond to respective different ones of the biological samples and the other of the rows and columns of the hit abundance score matrix correspond to respective different ones of the target genomic sequences; and
  
  performing a biclustering operation on the hit abundance score matrix.
- Dependent claims 13-15:
  - 13. The computer program product of claim 12 wherein each of at least a subset of the processing nodes comprises at least one worldwide data node configured to perform operations in accordance with at least one supported framework of a YARN cluster on one or more corresponding portions of the metagenomics sequencing results.
  - 14. The computer program product of claim 12 wherein the surveillance functionality relating to at least one designated biological issue comprises characterization of at least one of a disease, an infection and a contamination.
  - 15. The computer program product of claim 12 wherein performing a biclustering operation on the hit abundance score matrix comprises processing the hit abundance score matrix in the form of a bipartite graph in which a first set of nodes represents respective ones of the biological samples, a second set of nodes represents respective ones of the target genomic sequences, and edges between nodes in the first set and nodes in the second set represent hit abundance scores of the hit abundance score vectors of the hit abundance score matrix.
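
The bipartite graph form described in claims 11, 15 and 20 can be sketched as follows. This is an illustrative reading only: it assumes that zero scores produce no edge, which the claims do not state, and the identifiers and matrix values are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (sample node, target sequence node)


def matrix_to_bipartite(
    matrix: List[List[int]], sample_ids: List[str], target_ids: List[str]
) -> Dict[Edge, int]:
    """Map each nonzero hit abundance score to a weighted sample-target edge."""
    edges: Dict[Edge, int] = {}
    for i, row in enumerate(matrix):
        for j, score in enumerate(row):
            if score > 0:  # assumption: a zero score means no edge
                edges[(sample_ids[i], target_ids[j])] = score
    return edges


def targets_hit_by(edges: Dict[Edge, int], sample_id: str) -> List[str]:
    """Target sequence nodes adjacent to a given sample node."""
    return sorted(target for (sample, target) in edges if sample == sample_id)


# Hypothetical hit abundance score matrix: rows are samples, columns are targets.
sample_ids = ["sample_1", "sample_2"]
target_ids = ["genome_a", "genome_b"]
scores = [[4, 2], [0, 2]]

graph = matrix_to_bipartite(scores, sample_ids, target_ids)
print(graph)
print(targets_hit_by(graph, "sample_2"))  # ['genome_b']
```

Edges only ever connect a sample node to a target sequence node, never two nodes of the same set, which is what makes the graph bipartite.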

16. An apparatus comprising:
- a first processing node configured for communication with one or more additional processing nodes and with one or more of a plurality of geographically-distributed metagenomics sequencing centers via one or more networks;
  
  the first processing node being further configured:
  
  to process metagenomics sequencing results obtained from one or more of the metagenomics sequencing centers; and
  
  to provide surveillance functionality relating to at least one designated biological issue on behalf of one or more requesting clients based at least in part on the processing of metagenomics sequencing results performed by the first processing node and related processing performed by one or more of the additional processing nodes;
  
  wherein each of the metagenomics sequencing centers is configured to perform metagenomics sequencing on biological samples from respective sample sources in a corresponding data zone; and
  
  wherein the first processing node is implemented using at least one processing device comprising a processor coupled to a memory;
  
  wherein processing the metagenomics sequencing results further comprises generating a hit abundance score vector for a given one of the biological samples wherein the hit abundance score vector comprises a plurality of entries corresponding to respective occurrence frequencies of at least one read of the given biological sample in respective target genomic sequences; and
  
  wherein providing the surveillance functionality further comprises:
  
  performing a preprocessing operation to reduce a biclustering sample space of a genomic comparison component;
  
  generating a hit abundance score matrix for the genomic comparison component comprising a plurality of the hit abundance score vectors wherein one of rows and columns of the hit abundance score matrix correspond to respective different ones of the biological samples and the other of the rows and columns of the hit abundance score matrix correspond to respective different ones of the target genomic sequences; and
  
  performing a biclustering operation on the hit abundance score matrix.
- Dependent claims 17-20:
  - 17. The apparatus of claim 16 wherein each of at least a subset of the processing nodes comprises at least one worldwide data node configured to perform operations in accordance with at least one supported framework of a YARN cluster on one or more corresponding portions of the metagenomics sequencing results.
  - 18. The apparatus of claim 16 wherein the surveillance functionality relating to at least one designated biological issue comprises characterization of at least one of a disease, an infection and a contamination.
  - 19. A metagenomics-based biological surveillance system comprising the apparatus of claim 16.
  - 20. The apparatus of claim 16 wherein performing a biclustering operation on the hit abundance score matrix comprises processing the hit abundance score matrix in the form of a bipartite graph in which a first set of nodes represents respective ones of the biological samples, a second set of nodes represents respective ones of the target genomic sequences, and edges between nodes in the first set and nodes in the second set represent hit abundance scores of the hit abundance score vectors of the hit abundance score matrix.
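
The independent claims recite a preprocessing operation that reduces the biclustering sample space, followed by a biclustering operation on the hit abundance score matrix, without fixing a particular algorithm. The sketch below swaps in two deliberately simple stand-ins, both assumptions: preprocessing drops samples and target sequences that have no hits at all, and biclustering groups rows and columns that are connected through nonzero entries (connected components of the sample-target graph). All names and data are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def drop_empty(
    matrix: Matrix, row_ids: List[str], col_ids: List[str]
) -> Tuple[Matrix, List[str], List[str]]:
    """Preprocessing stand-in: remove all-zero rows and all-zero columns."""
    keep_rows = [i for i, row in enumerate(matrix) if any(row)]
    keep_cols = [
        j for j in range(len(col_ids)) if any(matrix[i][j] for i in keep_rows)
    ]
    reduced = [[matrix[i][j] for j in keep_cols] for i in keep_rows]
    return reduced, [row_ids[i] for i in keep_rows], [col_ids[j] for j in keep_cols]


def biclusters(
    matrix: Matrix, row_ids: List[str], col_ids: List[str]
) -> List[Tuple[List[str], List[str]]]:
    """Biclustering stand-in: connected components over nonzero entries."""
    unseen_rows = set(range(len(row_ids)))
    result = []
    while unseen_rows:
        seed = min(unseen_rows)
        rows, cols = {seed}, set()
        frontier = [("row", seed)]
        while frontier:
            kind, index = frontier.pop()
            if kind == "row":
                # Follow nonzero scores from this sample to target sequences.
                for j, score in enumerate(matrix[index]):
                    if score > 0 and j not in cols:
                        cols.add(j)
                        frontier.append(("col", j))
            else:
                # Follow nonzero scores from this target back to samples.
                for i in range(len(matrix)):
                    if matrix[i][index] > 0 and i not in rows:
                        rows.add(i)
                        frontier.append(("row", i))
        unseen_rows -= rows
        result.append(
            (sorted(row_ids[i] for i in rows), sorted(col_ids[j] for j in cols))
        )
    return result


# Hypothetical matrix: sample s4 and target g4 have no hits.
raw = [
    [4, 2, 0, 0],
    [0, 3, 0, 0],
    [0, 0, 5, 0],
    [0, 0, 0, 0],
]
reduced, kept_samples, kept_targets = drop_empty(
    raw, ["s1", "s2", "s3", "s4"], ["g1", "g2", "g3", "g4"]
)
found = biclusters(reduced, kept_samples, kept_targets)
print(found)  # [(['s1', 's2'], ['g1', 'g2']), (['s3'], ['g3'])]
```

Connected components only recover block-diagonal structure exactly; noisy real data would call for a more tolerant biclustering method.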

Current Assignee: EMC IP Holding Company LLC (Dell Technologies Inc.)
Original Assignee: EMC IP Holding Company LLC (Dell Technologies Inc.)
Inventors: Florissi, Patricia Gomes Soares; Ukelson, Michal Ziv; Dach, Ran; Benshahar, Arnon; Gudes, Ehud
Primary Examiner(s): Negin, Russell S
Application Number: US14/983,991
Time in Patent Office: 1,049 Days
Field of Search: None
CPC Class Codes

G06N 5/04: Inference or reasoning models
G16B 20/00: ICT specially adapted for f...
G16B 30/00: ICT specially adapted for s...
G16B 30/10: Sequence alignment; Homolog...
G16B 30/20: Sequence assembly
G16B 40/00: ICT specially adapted for b...
G16B 45/00: ICT specially adapted for b...
G16B 5/00: ICT specially adapted for m...
H04L 43/065: related to network devices
H04L 47/70: Admission control; Resource...
H04L 47/762: triggered by the network
H04L 47/783: Distributed allocation of r...
H04L 47/827: Aggregation of resource all...
H04L 67/10: in which an application is ...
