Techniques for processing queries relating to task-completion times or cross-data-structure interactions
First Claim
1. A computer-implemented method for using machine learning to identify anomaly subsets of sets of iteration data, the method comprising:
accessing a structure including at least part of a definition for a workflow, the workflow including:
a first task of aligning each read of a set of reads to a portion of a reference data set, wherein the reference data set includes a reference sequence;
a second task of generating a client data set for the respective client using the aligned set of reads, the client data set including a set of values associated with each of one or more units, wherein the client data set includes a client sequence, wherein each value of the set of values represents a base, each unit of the one or more units representing a gene and corresponding to a set of defined positions within a genomic data structure; and
a third task of detecting a presence of one or more sparse indicators associated with the respective client by comparing the set of values of the client data set to corresponding values in the reference data set, each sparse indicator of the one or more sparse indicators representing a variant indicative of a distinction between the client data set and the reference data set;
for each client of a plurality of clients:
accessing a set of reads based on a material associated with a respective client, wherein the material includes a biological material;
performing an iteration of the workflow using the set of reads;
generating iteration data based on the performance of the iteration of the workflow, wherein the iteration data includes or is based on:
a result of a task in the workflow;
a time required to perform one or more tasks in the workflow;
and/or a degree of usage of a computational resource while performing one or more tasks in the workflow;
storing the iteration data in association with an identifier of the client;
collecting a set of iteration data by retrieving, for each client of the plurality of clients, at least part of the stored iteration data;
using a machine-learning technique to process the set of iteration data to identify an anomaly subset of the set of iteration data; and
generating a communication that represents the anomaly subset.
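As an informal illustration only (not the claimed implementation), the third task and the generation and storage of iteration data might be sketched as follows. The sketch assumes the client sequence has already been aligned to the reference and has the same length, so the first and second tasks are omitted. The names `detect_variants`, `run_iteration`, and `store`, and the toy sequences, are hypothetical.

```python
import time
from typing import Dict, List, Tuple

Variant = Tuple[int, str, str]  # (position, reference base, client base)


def detect_variants(client_seq: str, reference_seq: str) -> List[Variant]:
    """Compare client values to reference values position by position.

    Assumes both sequences are already aligned and of equal length.
    """
    return [
        (pos, ref, obs)
        for pos, (ref, obs) in enumerate(zip(reference_seq, client_seq))
        if ref != obs
    ]


def run_iteration(client_id: str, client_seq: str, reference_seq: str,
                  store: Dict[str, dict]) -> dict:
    """Run one (simplified) workflow iteration and store its iteration data."""
    start = time.perf_counter()
    variants = detect_variants(client_seq, reference_seq)
    elapsed = time.perf_counter() - start
    iteration_data = {
        "variants": variants,                            # result of a task
        "task_seconds": {"detect_variants": elapsed},    # time required per task
    }
    store[client_id] = iteration_data  # stored in association with the client identifier
    return iteration_data


# Hypothetical toy data: two clients compared against one short reference.
store: Dict[str, dict] = {}
reference = "ACGTACGT"
run_iteration("client-1", "ACGTACGT", reference, store)
run_iteration("client-2", "ACGAACGT", reference, store)
```

Here each entry in `store` plays the role of the iteration data kept per client; a fuller sketch would also record resource usage and the outputs of the alignment and client-data-set tasks.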
Abstract
Methods and systems disclosed herein relate generally to data processing by applying machine learning techniques to iteration data to identify anomaly subsets of iteration data. More specifically, iteration data for individual iterations of a workflow involving a set of tasks may contain a client data set, client-associated sparse indicators and their classifications, and a set of processing times for the set of tasks performed in that iteration of the workflow. These individual iterations of the workflow may also be associated with particular data sources. Using the iteration data, anomaly subsets within the iteration data can be identified, such as data items resulting from systematic error associated with particular data sources, sets of sparse indicators to be validated or double-checked, or tasks that are associated with long processing times. The anomaly subsets can be provided in a generated communication or report in order to optimize future iterations of the workflow.
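To make the idea of flagging tasks with long processing times concrete, here is a minimal sketch. The abstract does not name a specific machine-learning technique, so this example substitutes a simple robust outlier rule (a modified z-score based on the median absolute deviation) as a stand-in. The function names, the threshold of 3.5, and the sample timings are all assumptions made for illustration.

```python
from statistics import median
from typing import Dict, List


def find_anomalies(times: Dict[str, float], threshold: float = 3.5) -> List[str]:
    """Return client identifiers whose processing time is a robust outlier.

    Uses the modified z-score 0.6745 * |x - median| / MAD. This is a
    statistical stand-in, not the technique used by the source.
    """
    values = list(times.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread in the data, so nothing can be ranked as anomalous
    return sorted(
        client for client, v in times.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )


def build_report(anomalies: List[str]) -> str:
    """Generate a short communication that represents the anomaly subset."""
    if not anomalies:
        return "No anomalous iterations found."
    return "Anomalous iterations: " + ", ".join(anomalies)


# Hypothetical per-client processing times in seconds.
processing_times = {
    "client-1": 10.2,
    "client-2": 9.8,
    "client-3": 10.5,
    "client-4": 48.0,
    "client-5": 10.1,
}
anomalies = find_anomalies(processing_times)
report = build_report(anomalies)
```

With these sample values only `client-4` exceeds the threshold. In practice the same collected iteration data could be passed to a learned model instead of this fixed rule.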
20 Claims
1. A computer-implemented method for using machine learning to identify anomaly subsets of sets of iteration data, the method comprising:
accessing a structure including at least part of a definition for a workflow, the workflow including:
a first task of aligning each read of a set of reads to a portion of a reference data set, wherein the reference data set includes a reference sequence;
a second task of generating a client data set for the respective client using the aligned set of reads, the client data set including a set of values associated with each of one or more units, wherein the client data set includes a client sequence, wherein each value of the set of values represents a base, each unit of the one or more units representing a gene and corresponding to a set of defined positions within a genomic data structure; and
a third task of detecting a presence of one or more sparse indicators associated with the respective client by comparing the set of values of the client data set to corresponding values in the reference data set, each sparse indicator of the one or more sparse indicators representing a variant indicative of a distinction between the client data set and the reference data set;
for each client of a plurality of clients:
accessing a set of reads based on a material associated with a respective client, wherein the material includes a biological material;
performing an iteration of the workflow using the set of reads;
generating iteration data based on the performance of the iteration of the workflow, wherein the iteration data includes or is based on:
a result of a task in the workflow;
a time required to perform one or more tasks in the workflow;
and/or a degree of usage of a computational resource while performing one or more tasks in the workflow;
storing the iteration data in association with an identifier of the client;
collecting a set of iteration data by retrieving, for each client of the plurality of clients, at least part of the stored iteration data;
using a machine-learning technique to process the set of iteration data to identify an anomaly subset of the set of iteration data; and
generating a communication that represents the anomaly subset.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A system for using machine learning to identify anomaly subsets of sets of iteration data, the system comprising:
one or more data processors; and
a non-transitory computer readable storage medium containing instructions which when executed on the one or more data processors, cause the one or more data processors to perform actions including:
accessing a structure including at least part of a definition for a workflow, the workflow including:
a first task of aligning each read of a set of reads to a portion of a reference data set, wherein the reference data set includes a reference sequence;
a second task of generating a client data set for the respective client using the aligned set of reads, the client data set including a set of values associated with each of one or more units, wherein the client data set includes a client sequence, wherein each value of the set of values represents a base, each unit of the one or more units representing a gene and corresponding to a set of defined positions within a genomic data structure; and
a third task of detecting a presence of one or more sparse indicators associated with the respective client by comparing the set of values of the client data set to corresponding values in the reference data set, each sparse indicator of the one or more sparse indicators representing a variant indicative of a distinction between the client data set and the reference data set;
for each client of a plurality of clients:
accessing a set of reads based on a material associated with a respective client, wherein the material includes a biological material;
performing an iteration of the workflow using the set of reads;
generating iteration data based on the performance of the iteration of the workflow, wherein the iteration data includes or is based on:
a result of a task in the workflow;
a time required to perform one or more tasks in the workflow;
and/or a degree of usage of a computational resource while performing one or more tasks in the workflow;
storing the iteration data in association with an identifier of the client;
collecting a set of iteration data by retrieving, for each client of the plurality of clients, at least part of the stored iteration data;
using a machine-learning technique to process the set of iteration data to identify an anomaly subset of the set of iteration data; and
generating a communication that represents the anomaly subset.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A computer-program product tangibly embodied in a non-transitory machine-readable storage medium, including instructions configured to cause one or more data processors to perform actions including:
accessing a structure including at least part of a definition for a workflow, the workflow including:
a first task of aligning each read of a set of reads to a portion of a reference data set, wherein the reference data set includes a reference sequence;
a second task of generating a client data set for the respective client using the aligned set of reads, the client data set including a set of values associated with each of one or more units, wherein the client data set includes a client sequence, wherein each value of the set of values represents a base, each unit of the one or more units representing a gene and corresponding to a set of defined positions within a genomic data structure; and
a third task of detecting a presence of one or more sparse indicators associated with the respective client by comparing the set of values of the client data set to corresponding values in the reference data set, each sparse indicator of the one or more sparse indicators representing a variant indicative of a distinction between the client data set and the reference data set;
for each client of a plurality of clients:
accessing a set of reads based on a material associated with a respective client, wherein the material includes a biological material;
performing an iteration of the workflow using the set of reads;
generating iteration data based on the performance of the iteration of the workflow, wherein the iteration data includes or is based on:
a result of a task in the workflow;
a time required to perform one or more tasks in the workflow;
and/or a degree of usage of a computational resource while performing one or more tasks in the workflow;
storing the iteration data in association with an identifier of the client;
collecting a set of iteration data by retrieving, for each client of the plurality of clients, at least part of the stored iteration data;
using a machine-learning technique to process the set of iteration data to identify an anomaly subset of the set of iteration data; and
generating a communication that represents the anomaly subset.
Dependent claims: 16, 17, 18, 19, 20
Specification