Techniques for processing queries relating to task-completion times or cross-data-structure interactions

US 9,811,438 B1
Filed: 05/11/2017
Issued: 11/07/2017
Est. Priority Date: 12/02/2015
Status: Active Grant

Abstract

Methods and systems disclosed herein relate generally to data processing by applying machine learning techniques to iteration data to identify anomaly subsets of iteration data. More specifically, iteration data for individual iterations of a workflow involving a set of tasks may contain a client data set, client-associated sparse indicators and their classifications, and a set of processing times for the set of tasks performed in that iteration of the workflow. These individual iterations of the workflow may also be associated with particular data sources. Using the iteration data, anomaly subsets within the iteration data can be identified, such as data items resulting from systematic error associated with particular data sources, sets of sparse indicators to be validated or double-checked, or tasks that are associated with long processing times. The anomaly subsets can be provided in a generated communication or report in order to optimize future iterations of the workflow.
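
As a rough illustration of the iteration data the abstract describes, the sketch below models one record per workflow iteration and groups records by data source. It is a minimal sketch, not the patented implementation: the class name `IterationData`, its field names, the helper `collect_by_source`, and the sample values are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class IterationData:
    """One workflow iteration for one client (all field names are hypothetical)."""
    client_id: str
    source_id: str                  # data source the reads came from
    sparse_indicators: list[str]    # sparse indicators (variants) detected
    task_seconds: dict[str, float] = field(default_factory=dict)  # task -> time


def collect_by_source(records: list[IterationData]) -> dict[str, list[IterationData]]:
    """Group stored iteration data by data source for later comparison."""
    grouped: dict[str, list[IterationData]] = defaultdict(list)
    for rec in records:
        grouped[rec.source_id].append(rec)
    return dict(grouped)


# Made-up sample records, for illustration only.
records = [
    IterationData("client-1", "lab-A", ["variant-x"], {"align": 120.0, "call": 45.0}),
    IterationData("client-2", "lab-A", [], {"align": 118.0, "call": 47.0}),
    IterationData("client-3", "lab-B", ["variant-y", "variant-z"], {"align": 125.0, "call": 44.0}),
]
by_source = collect_by_source(records)
```

Grouping by source is only one possible view; the same records could equally be grouped by client or by task before any anomaly detection is applied.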

20 Claims

1. A computer-implemented method for using machine learning to identify anomaly subsets of sets of iteration data, the method comprising:
- accessing a structure including at least part of a definition for a workflow, the workflow including:
  
  a first task of aligning each read of a set of reads to a portion of a reference data set, wherein the reference data set includes a reference sequence;
  
  a second task of generating a client data set for the respective client using the aligned set of reads, the client data set including a set of values associated with each of one or more units, wherein the client data set includes a client sequence, wherein each value of the set of values represents a base, each unit of the one or more units representing a gene and corresponding to a set of defined positions within a genomic data structure; and
  
  a third task of detecting a presence of one or more sparse indicators associated with the respective client by comparing the set of values of the client data set to corresponding values in the reference data set, each sparse indicator of the one or more sparse indicators representing a variant indicative of a distinction between the client data set and the reference data set;
  
  for each client of a plurality of clients:
  
  accessing a set of reads based on a material associated with a respective client, wherein the material includes a biological material;
  
  performing an iteration of the workflow using the set of reads;
  
  generating iteration data based on the performance of the iteration of the workflow, wherein the iteration data includes or is based on:
  
  a result of a task in the workflow;
  
  a time required to perform one or more tasks in the workflow;
  
  and/or a degree of usage of a computational resource while performing one or more tasks in the workflow;
  
  storing the iteration data in association with an identifier of the client;
  
  collecting a set of iteration data by retrieving, for each client of the plurality of clients, at least part of the stored iteration data;
  
  using a machine-learning technique to process the set of iteration data to identify an anomaly subset of the set of iteration data; and
  
  generating a communication that represents the anomaly subset.
  - 2. The computer-implemented method for using machine learning to identify anomaly subsets of sets of iteration data as recited in claim 1, wherein:
    - the iteration data includes, for each task of a plurality of tasks of the workflow, a processing-time variable that indicates when a performance of the task was completed or a duration of performance of the task; and
      
      the anomaly subset of the set of iteration data identified using the machine-learning technique identifies a task of the plurality of tasks associated with long processing times relative to past processing times or normalized or unnormalized processing times of one or more other tasks of the plurality of tasks.
  - 3. The computer-implemented method for using machine learning to identify anomaly subsets of sets of iteration data as recited in claim 1, wherein:
    - the iteration data identifies one or more sparse indicators associated with the client, such that the set of iteration data identifies a plurality of sparse indicators;
      
      the anomaly subset of the set of iteration data identified using the machine-learning technique identifies a subset of the plurality of sparse indicators; and
      
      the communication facilitates selective confirmatory processing to be performed to determine whether data corresponding to the subset of the plurality of sparse indicators is validated.
  - 4. The computer-implemented method for using machine learning to identify anomaly subsets of sets of iteration data as recited in claim 1, wherein:
    - the iteration data further includes, for each client of the plurality of clients, an origination identifier associated with a source of the set of reads and a timestamp;
      
      using the machine-learning technique to process the set of iteration data includes determining that results corresponding to a first origination identifier are statistically different than results corresponding to one or more second origination identifiers or than results corresponding to a prior time period and the first origination identifier; and
      
      the communication identifies the source associated with the first origination identifier.
  - 5. The computer-implemented method for using machine learning to identify anomaly subsets of sets of iteration data as recited in claim 1, wherein:
    - the iteration data further includes one or more data-source variables that identify or characterize a source of the iteration data; and
      
      using the machine-learning technique includes updating or generating a model to identify data-source variables predictive of the result.
  - 6. The computer-implemented method for using machine learning to identify anomaly subsets of sets of iteration data as recited in claim 1, wherein using the machine-learning technique comprises:
    - for each portion of multiple portions:
      
      retrieving a parameter for a machine-learning model trained on another set of iteration data, the parameter reflecting a degree of variability observed in the another set of iteration data across clients or iterations;
      
      generating an observed variability for the portion using the set of iteration data that reflects a degree of variability observed in the set of iteration data across clients or iterations;
      
      determining whether the observed variability for the portion corresponds with the parameter; and
      
      for each portion of the multiple portions for which it is determined that the observed variability for the portion does not correspond with the parameter, identifying the portion in the anomaly subset.
  - 7. The computer-implemented method for using machine learning to identify anomaly subsets of sets of iteration data as recited in claim 1, further comprising:
    - receiving, from a source, a request to perform an anomaly-detection assessment, wherein the set of iteration data is collected and processed in response to receiving the request; and
      
      availing the communication to the source.
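
Claim 2 describes flagging a task whose processing time is long relative to past processing times. The sketch below shows one way that comparison could look. It swaps in a simple modified z-score (median and median absolute deviation) rather than any particular machine-learning technique, since the claims do not name one; the function `slow_tasks`, the threshold of 3.5, and the sample timings are assumptions made for illustration.

```python
from statistics import median


def slow_tasks(history: dict[str, list[float]],
               latest: dict[str, float],
               threshold: float = 3.5) -> list[str]:
    """Return tasks whose latest duration is unusually long versus their own history.

    Uses a modified z-score based on the median and the median absolute
    deviation (MAD). This statistic is an assumption of the sketch, not a
    technique named in the claims.
    """
    flagged = []
    for task, past in history.items():
        if task not in latest or len(past) < 3:
            continue  # too little history to judge
        med = median(past)
        mad = median(abs(x - med) for x in past)
        if mad == 0:
            continue  # no spread in history; skip rather than divide by zero
        score = 0.6745 * (latest[task] - med) / mad
        if score > threshold:  # only long times are of interest here
            flagged.append(task)
    return flagged


# Hypothetical past durations (seconds) per task, and the newest iteration.
history = {
    "align": [118.0, 120.0, 122.0, 119.0, 121.0],
    "call": [44.0, 45.0, 47.0, 46.0, 45.0],
}
latest = {"align": 240.0, "call": 46.0}
flagged = slow_tasks(history, latest)  # ["align"]
```

A median-based score is used here because a few slow historical runs would inflate a mean and standard deviation and could hide later slow runs; a trained model could replace it without changing the surrounding flow.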

8. A system for using machine learning to identify anomaly subsets of sets of iteration data, the system comprising:
- one or more data processors; and
  
  a non-transitory computer readable storage medium containing instructions which when executed on the one or more data processors, cause the one or more data processors to perform actions including:
  
  accessing a structure including at least part of a definition for a workflow, the workflow including:
  
  a first task of aligning each read of a set of reads to a portion of a reference data set, wherein the reference data set includes a reference sequence;
  
  a second task of generating a client data set for the respective client using the aligned set of reads, the client data set including a set of values associated with each of one or more units, wherein the client data set includes a client sequence, wherein each value of the set of values represents a base, each unit of the one or more units representing a gene and corresponding to a set of defined positions within a genomic data structure; and
  
  a third task of detecting a presence of one or more sparse indicators associated with the respective client by comparing the set of values of the client data set to corresponding values in the reference data set, each sparse indicator of the one or more sparse indicators representing a variant indicative of a distinction between the client data set and the reference data set;
  
  for each client of a plurality of clients:
  
  accessing a set of reads based on a material associated with a respective client, wherein the material includes a biological material;
  
  performing an iteration of the workflow using the set of reads;
  
  generating iteration data based on the performance of the iteration of the workflow, wherein the iteration data includes or is based on:
  
  a result of a task in the workflow;
  
  a time required to perform one or more tasks in the workflow;
  
  and/or a degree of usage of a computational resource while performing one or more tasks in the workflow;
  
  storing the iteration data in association with an identifier of the client;
  
  collecting a set of iteration data by retrieving, for each client of the plurality of clients, at least part of the stored iteration data;
  
  using a machine-learning technique to process the set of iteration data to identify an anomaly subset of the set of iteration data; and
  
  generating a communication that represents the anomaly subset.
  - 9. The system for using machine learning to identify anomaly subsets of sets of iteration data as recited in claim 8, wherein:
    - the iteration data includes, for each task of a plurality of tasks of the workflow, a processing-time variable that indicates when a performance of the task was completed or a duration of performance of the task; and
      
      the anomaly subset of the set of iteration data identified using the machine-learning technique identifies a task of the plurality of tasks associated with long processing times relative to past processing times or normalized or unnormalized processing times of one or more other tasks of the plurality of tasks.
  - 10. The system for using machine learning to identify anomaly subsets of sets of iteration data as recited in claim 8, wherein:
    - the iteration data identifies one or more sparse indicators associated with the client, such that the set of iteration data identifies a plurality of sparse indicators;
      
      the anomaly subset of the set of iteration data identified using the machine-learning technique identifies a subset of the plurality of sparse indicators; and
      
      the communication facilitates selective confirmatory processing to be performed to determine whether data corresponding to the subset of the plurality of sparse indicators is validated.
  - 11. The system for using machine learning to identify anomaly subsets of sets of iteration data as recited in claim 8, wherein:
    - the iteration data further includes, for each client of the plurality of clients, an origination identifier associated with a source of the set of reads and a timestamp;
      
      using the machine-learning technique to process the set of iteration data includes determining that results corresponding to a first origination identifier are statistically different than results corresponding to one or more second origination identifiers or than results corresponding to a prior time period and the first origination identifier; and
      
      the communication identifies the source associated with the first origination identifier.
  - 12. The system for using machine learning to identify anomaly subsets of sets of iteration data as recited in claim 8, wherein:
    - the iteration data further includes one or more data-source variables that identify or characterize a source of the iteration data; and
      
      using the machine-learning technique includes updating or generating a model to identify data-source variables predictive of the result.
  - 13. The system for using machine learning to identify anomaly subsets of sets of iteration data as recited in claim 8, wherein using the machine-learning technique comprises:
    - for each portion of multiple portions:
      
      retrieving a parameter for a machine-learning model trained on another set of iteration data, the parameter reflecting a degree of variability observed in the another set of iteration data across clients or iterations;
      
      generating an observed variability for the portion using the set of iteration data that reflects a degree of variability observed in the set of iteration data across clients or iterations;
      
      determining whether the observed variability for the portion corresponds with the parameter; and
      
      for each portion of the multiple portions for which it is determined that the observed variability for the portion does not correspond with the parameter, identifying the portion in the anomaly subset.
  - 14. The system for using machine learning to identify anomaly subsets of sets of iteration data as recited in claim 8, wherein the actions further include:
    - receiving, from a source, a request to perform an anomaly-detection assessment, wherein the set of iteration data is collected and processed in response to receiving the request; and
      
      availing the communication to the source.
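
Claim 11 turns on determining that results for one origination identifier are statistically different from results for others. The sketch below shows one way such a check could be made, using a two-sided permutation test on the difference of group means. The choice of test, the function `permutation_p_value`, the per-client variant counts, and the source labels are all assumptions; the claims do not specify a statistical method.

```python
import random
from statistics import mean


def permutation_p_value(first: list[float], others: list[float],
                        n_permutations: int = 2000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    observed = abs(mean(first) - mean(others))
    pooled = list(first) + list(others)
    k = len(first)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:k]) - mean(pooled[k:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical number of sparse indicators detected per client, by source.
variant_counts = {
    "lab-A": [10.0, 11.0, 12.0, 10.0, 11.0, 12.0],
    "lab-B": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
}
p = permutation_p_value(variant_counts["lab-A"], variant_counts["lab-B"])
source_is_suspect = p < 0.05  # the 0.05 cutoff is an illustrative choice
```

A permutation test makes no assumption about how the counts are distributed, which suits small per-source samples. Comparing against a prior time period for the same source, as the claim also allows, would use the same function with the earlier results as the second group.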

15. A computer-program product tangibly embodied in a non-transitory machine-readable storage medium, including instructions configured to cause one or more data processors to perform actions including:
- accessing a structure including at least part of a definition for a workflow, the workflow including:
  
  a first task of aligning each read of a set of reads to a portion of a reference data set, wherein the reference data set includes a reference sequence;
  
  a second task of generating a client data set for the respective client using the aligned set of reads, the client data set including a set of values associated with each of one or more units, wherein the client data set includes a client sequence, wherein each value of the set of values represents a base, each unit of the one or more units representing a gene and corresponding to a set of defined positions within a genomic data structure; and
  
  a third task of detecting a presence of one or more sparse indicators associated with the respective client by comparing the set of values of the client data set to corresponding values in the reference data set, each sparse indicator of the one or more sparse indicators representing a variant indicative of a distinction between the client data set and the reference data set;
  
  for each client of a plurality of clients:
  
  accessing a set of reads based on a material associated with a respective client, wherein the material includes a biological material;
  
  performing an iteration of the workflow using the set of reads;
  
  generating iteration data based on the performance of the iteration of the workflow, wherein the iteration data includes or is based on:
  
  a result of a task in the workflow;
  
  a time required to perform one or more tasks in the workflow;
  
  and/or a degree of usage of a computational resource while performing one or more tasks in the workflow;
  
  storing the iteration data in association with an identifier of the client;
  
  collecting a set of iteration data by retrieving, for each client of the plurality of clients, at least part of the stored iteration data;
  
  using a machine-learning technique to process the set of iteration data to identify an anomaly subset of the set of iteration data; and
  
  generating a communication that represents the anomaly subset.
  - 16. The computer-program product as recited in claim 15, wherein:
    - the iteration data includes, for each task of a plurality of tasks of the workflow, a processing-time variable that indicates when a performance of the task was completed or a duration of performance of the task; and
      
      the anomaly subset of the set of iteration data identified using the machine-learning technique identifies a task of the plurality of tasks associated with long processing times relative to past processing times or normalized or unnormalized processing times of one or more other tasks of the plurality of tasks.
  - 17. The computer-program product as recited in claim 15, wherein:
    - the iteration data identifies one or more sparse indicators associated with the client, such that the set of iteration data identifies a plurality of sparse indicators;
      
      the anomaly subset of the set of iteration data identified using the machine-learning technique identifies a subset of the plurality of sparse indicators; and
      
      the communication facilitates selective confirmatory processing to be performed to determine whether data corresponding to the subset of the plurality of sparse indicators is validated.
  - 18. The computer-program product as recited in claim 15, wherein:
    - the iteration data further includes, for each client of the plurality of clients, an origination identifier associated with a source of the set of reads and a timestamp;
      
      using the machine-learning technique to process the set of iteration data includes determining that results corresponding to a first origination identifier are statistically different than results corresponding to one or more second origination identifiers or than results corresponding to a prior time period and the first origination identifier; and
      
      the communication identifies the source associated with the first origination identifier.
  - 19. The computer-program product as recited in claim 15, wherein:
    - the iteration data further includes one or more data-source variables that identify or characterize a source of the iteration data; and
      
      using the machine-learning technique includes updating or generating a model to identify data-source variables predictive of the result.
  - 20. The computer-program product as recited in claim 15, wherein using the machine-learning technique comprises:
    - for each portion of multiple portions:
      
      retrieving a parameter for a machine-learning model trained on another set of iteration data, the parameter reflecting a degree of variability observed in the another set of iteration data across clients or iterations;
      
      generating an observed variability for the portion using the set of iteration data that reflects a degree of variability observed in the set of iteration data across clients or iterations;
      
      determining whether the observed variability for the portion corresponds with the parameter; and
      
      for each portion of the multiple portions for which it is determined that the observed variability for the portion does not correspond with the parameter, identifying the portion in the anomaly subset.
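
The per-portion variability check recited in claim 20 can be sketched as follows. This is only one reading of the claim language: "portion" is taken to mean something like a gene or genomic region, the stored "parameter" is taken to be an expected standard deviation, and "corresponds with" is interpreted as the observed-to-expected ratio falling within a tolerance band. The function `anomalous_portions`, the tolerance of 2.0, and the sample values are hypothetical.

```python
from statistics import pstdev


def anomalous_portions(observations: dict[str, list[float]],
                       expected_sd: dict[str, float],
                       tolerance: float = 2.0) -> list[str]:
    """Flag portions whose observed spread departs from a stored parameter.

    A portion is flagged when observed_sd / expected_sd lies outside
    [1 / tolerance, tolerance]. That rule is an assumption of this sketch.
    """
    flagged = []
    for portion, values in observations.items():
        expected = expected_sd.get(portion)
        if expected is None or expected <= 0 or len(values) < 2:
            continue  # nothing to compare against
        ratio = pstdev(values) / expected
        if ratio > tolerance or ratio < 1 / tolerance:
            flagged.append(portion)
    return flagged


# Hypothetical per-portion measurements across clients in the current set.
observations = {
    "gene-1": [30.0, 31.0, 29.0, 30.0],
    "gene-2": [30.0, 60.0, 5.0, 45.0],
}
# Hypothetical parameters retrieved from a model trained on another data set.
expected_sd = {"gene-1": 1.0, "gene-2": 1.0}
flagged = anomalous_portions(observations, expected_sd)  # ["gene-2"]
```

The band is two-sided on purpose: unexpectedly low variability can be as suspicious as high variability, for example when a source reports identical values for every client.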

Current Assignee: Color Health Incorporated
Original Assignee: Color Genomics, Inc.
Inventors: Barrett, Ryan; Noguchi, Katsuya; Bhat, Nishant; Li, Zhengua; Smith, Kurt
Primary Examiner(s): Kim, Sisley
Application Number: US15/592,949
Time in Patent Office: 180 Days
Field of Search: None

CPC Class Codes

G06F 11/3024   where the computing system ...

G06F 11/3419   by assessing time

G06F 9/4881   Scheduling strategies for d...

G06F 9/4887   involving deadlines, e.g. r...

G06N 20/00   Machine learning

G16H 50/20   for computer-aided diagnosi...

G16Z 99/00   Subject matter not provided...
