Machine learning architecture for lifelong learning

US 10,055,685 B1
Filed: 10/16/2017
Issued: 08/21/2018
Est. Priority Date: 10/16/2017
Status: Active Grant


Abstract

Some embodiments described herein cover a machine learning architecture with a separated perception subsystem and application subsystem. These subsystems can be co-trained. In one example embodiment, a data item is received and information from the data item is processed by a first node to generate a first feature vector comprising a plurality of features, each of the plurality of features having a similarity value representing a similarity to one of a plurality of centroids. The first node selects a subset of the features from the first feature vector, the subset containing one or more features that have highest similarity values. The first node generates a second feature vector from the first feature vector by replacing similarity values of features in the first feature vector that are not in the subset with zeros. A second node then processes the second feature vector to determine an output.
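The forward pass described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the similarity measure (an inverse-distance form), the value of `k`, the linear form of the second node, and all function names are hypothetical, since the abstract specifies none of them.

```python
import numpy as np

def first_node_features(x, centroids):
    """Similarity of x to each centroid (assumed form: 1 / (1 + Euclidean distance))."""
    distances = np.linalg.norm(centroids - x, axis=1)
    return 1.0 / (1.0 + distances)

def sparsify_top_k(features, k):
    """Keep the k highest similarity values and replace the rest with zeros (k >= 1)."""
    sparse = np.zeros_like(features)
    top = np.argsort(features)[-k:]   # indices of the k largest similarities
    sparse[top] = features[top]
    return sparse

def second_node_output(sparse_features, weights, bias):
    """Assumed linear second node: output = W @ features + b."""
    return weights @ sparse_features + bias

# Four centroids in a 2-D space and one input point (illustrative values).
centroids = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]])
x = np.array([0.9, 0.1])

first = first_node_features(x, centroids)    # first feature vector
second = sparsify_top_k(first, k=2)          # second feature vector
output = second_node_output(second, weights=np.ones((3, 4)), bias=np.zeros(3))
```

Because only the `k` most similar centroids contribute non-zero values, downstream weights tied to the other centroids are untouched by this data item, which is the property the claims rely on to limit forgetting.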

24 Claims

1. A computer-implemented method for a machine learning system that mitigates catastrophic forgetting, comprising:
- receiving a data item;
  
  processing, by a first node that comprises a plurality of centroids, information from at least a portion of the data item to generate a first feature vector, wherein the first feature vector comprises a plurality of feature elements, each of the plurality of feature elements having a similarity value representing a similarity to one of the plurality of centroids;
  
  selecting a subset of the plurality of feature elements from the first feature vector, the subset containing one or more feature elements of the plurality of feature elements that have highest similarity values;
  
  generating a second feature vector from the first feature vector by replacing similarity values of feature elements in the first feature vector that are not in the subset with zeros;
  
  processing the second feature vector by a second node to determine an output;
  
  determining, by the first node, a novelty rating for the data item based on similarity values of the plurality of feature elements in at least one of the first feature vector or the second feature vector;
  
  determining a relevancy rating for the data item; and
  
  determining whether to update the first node based on the novelty rating and the relevancy rating.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)
  - 2. The method of claim 1, wherein the output is selected from a group consisting of a classification of the data item, a prediction of a future state, and an action to be applied to an environment.
  - 3. The method of claim 1, wherein the data item comprises a target, the method further comprising:
    - determining an error based on a difference between the output and the target, wherein the relevancy rating is based at least in part on the error; and
      
      updating the second node based on the error.
  - 4. The method of claim 1, wherein updating the second node comprises updating weights associated with feature elements in the second feature vector that have non-zero values without updating weights associated with a remainder of feature elements in the second feature vector.
  - 5. The method of claim 4, wherein an amount that a weight associated with a particular feature element is updated is based at least in part on a plasticity factor associated with the particular feature element, the method further comprising:
    - decreasing the plasticity factor associated with the particular feature element after adjusting the weight associated with the particular feature element.
  - 6. The method of claim 1, wherein the second feature vector is input into a linear function of the second node, wherein the linear function is:
  - 7. The method of claim 6, wherein $w_{ij}$ is updated based on the function:
    - $w_{ij} = w_{ij} + \eta\, e_i X_j p_{ij}$

      where $\eta$ is a step size, $e_i$ is an error associated with $i$, and $p_{ij}$ is a plasticity factor associated with $w_{ij}$, the method further comprising:

      decreasing a value of the plasticity factor $p_{ij}$ responsive to adjusting $w_{ij}$.
  - 8. The method of claim 1, further comprising:
    - determining that the novelty rating exceeds a novelty threshold and that the relevancy rating is below a relevancy threshold; and
      
      determining that no update to the first node is to be made.
  - 9. The method of claim 1, further comprising:
    - determining that the novelty rating is below a novelty threshold and that the relevancy rating is below a relevancy threshold;
      
      determining updates for a subset of the plurality of centroids that are associated with the subset of the plurality of feature elements; and
      
      updating the subset of the plurality of centroids without updating other centroids of the plurality of centroids.
  - 10. The method of claim 9, wherein the data item does not include a target.
  - 11. The method of claim 9, wherein the information from at least the portion of the data item comprises a point in a multi-dimensional space, and wherein updating a centroid comprises moving the centroid towards the point in the multi-dimensional space.
  - 12. The method of claim 11, wherein determining a distance to move a particular centroid comprises applying a centroid update rule of:
    - $M_i = M_i + \eta\,(X - M_i)\,p_i$

      where $i$ is an index for the particular centroid, $M_i$ is a position of the particular centroid in the multi-dimensional space, $\eta$ is a step size, $X$ is the point in the multi-dimensional space, and $p_i$ is a plasticity factor for the particular centroid, the plasticity factor having a value that is based on a number of times that the particular centroid has been updated.
  - 13. The method of claim 12, further comprising:
    - reducing the value of the plasticity factor for the particular centroid after updating the particular centroid.
  - 14. The method of claim 1, further comprising:
    - determining that the novelty rating is above a novelty threshold and that the relevancy rating is above a relevancy threshold; and
      
      allocating a new centroid for the first node.
  - 15. The method of claim 14, further comprising:
    - initiating a refractory period for the first node, wherein no new centroids can be allocated for the first node during the refractory period.
  - 16. The method of claim 1, further comprising:
    - determining that the novelty rating is below a novelty threshold and that the relevancy rating exceeds a relevancy threshold; and
      
      determining that no update to the first node is to be made.
  - 17. The method of claim 1, wherein the first node is a component of a perception subsystem of the machine learning system and wherein the second node is a component of an application subsystem of the machine learning system, the method further comprising:
    - generating additional feature vectors by one or more additional nodes in the perception subsystem; and
      
      processing the second feature vector and the additional feature vectors by the second node to determine the output, wherein the output is based on a combination of the second feature vector and the additional feature vectors.
  - 18. The method of claim 17, further comprising:
    - co-training the perception subsystem and the application subsystem based on labeled data items and unlabeled data items, wherein a first function is used to train nodes in the perception subsystem and a second function is used to train nodes in the application subsystem.
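
The update decision in claims 8, 9, 14 and 16 and the centroid update rule in claims 12 and 13 can be sketched together as below. This is a hedged illustration, not the patented implementation: the function names, the action labels, the multiplicative `decay` schedule, and the treatment of a rating exactly equal to its threshold as "below" are all assumptions that the claims do not specify.

```python
import numpy as np

def decide_update(novelty, relevancy, novelty_threshold, relevancy_threshold):
    """Map the (novelty, relevancy) pair to one of the actions in claims 8, 9, 14 and 16."""
    novel = novelty > novelty_threshold
    relevant = relevancy > relevancy_threshold
    if novel and relevant:
        return "allocate_centroid"      # claim 14: novel and relevant
    if not novel and not relevant:
        return "update_centroids"       # claim 9: familiar and low relevancy
    return "no_update"                  # claims 8 and 16: mixed cases

def update_centroids(centroids, plasticity, x, active, eta=0.5, decay=0.9):
    """Apply M_i = M_i + eta * (X - M_i) * p_i to the active centroids only."""
    for i in active:
        centroids[i] += eta * (x - centroids[i]) * plasticity[i]
        plasticity[i] *= decay          # assumed schedule; claim 13 only says "reduce"
    return centroids, plasticity

# Illustrative values: two centroids, one input point near the first centroid.
centroids = np.array([[0.0, 0.0], [4.0, 4.0]])
plasticity = np.array([1.0, 1.0])
x = np.array([1.0, 0.0])

action = decide_update(novelty=0.1, relevancy=0.2,
                       novelty_threshold=0.5, relevancy_threshold=0.5)
if action == "update_centroids":
    # Only the centroid selected by the top-k step (index 0 here) moves toward x.
    update_centroids(centroids, plasticity, x, active=[0])
```

In this run the first centroid moves halfway toward the point and its plasticity factor shrinks, while the second centroid is left unchanged, matching the "without updating other centroids" language of claim 9.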

19. A system comprising:
- at least one memory to store instructions for a machine learning system that mitigates catastrophic forgetting; and
  
  at least one processing device, operatively coupled to the at least one memory, to execute the instructions, wherein the instructions cause the processing device to:
  
  receive a data item;
  
  process, by a first node that comprises a plurality of centroids, information from at least a portion of the data item to generate a first feature vector, wherein the first feature vector comprises a plurality of feature elements, each of the plurality of feature elements having a similarity value representing a similarity to one of the plurality of centroids;
  
  select a subset of the plurality of feature elements from the first feature vector, the subset containing one or more feature elements of the plurality of feature elements that have highest similarity values;
  
  generate a second feature vector from the first feature vector by replacing similarity values of feature elements in the first feature vector that are not in the subset with zeros;
  
  process the second feature vector by a second node to determine an output;
  
  determine, by the first node, a novelty rating for the data item based on similarity values of the plurality of feature elements in at least one of the first feature vector or the second feature vector;
  
  determine a relevancy rating for the data item; and
  
  determine whether to update the first node based on the novelty rating and the relevancy rating.
- Dependent Claims (20, 21, 22, 23, 24)
  - 20. The system of claim 19, wherein the data item comprises a target, and wherein the instructions further cause the processing device to:
    - determine an error based on a difference between the output and the target, wherein the relevancy rating is based at least in part on the error; and
      
      update the second node based on the error.
  - 21. The system of claim 19, wherein updating the second node comprises updating weights associated with feature elements in the second feature vector that have non-zero values without updating weights associated with a remainder of feature elements in the second feature vector, wherein an amount that a weight associated with a particular feature element is updated is based at least in part on a plasticity factor associated with the particular feature element, and wherein the instructions further cause the processing device to:
    - decrease the plasticity factor associated with the particular feature element after adjusting the weight associated with the particular feature element.
  - 22. The system of claim 19, wherein the instructions further cause the processing device to:
    - determine that the novelty rating is below a novelty threshold and that the relevancy rating is below a relevancy threshold;
      
      determine updates for a subset of the plurality of centroids that are associated with the subset of the plurality of feature elements; and
      
      update the subset of the plurality of centroids without updating other centroids of the plurality of centroids.
  - 23. The system of claim 19, wherein the instructions further cause the processing device to:
    - determine that the novelty rating is above a novelty threshold and that the relevancy rating is above a relevancy threshold; and
      
      allocate a new centroid for the first node.
  - 24. The system of claim 23, wherein the instructions further cause the processing device to:
    - determine after a time period that the new centroid fails to satisfy a centroid retention criterion; and
      
      remove the new centroid from the first node.
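
The sparse, plasticity-scaled weight update of claim 21 can be sketched using the update function recited in claim 7, where each weight changes by the step size times the error, the feature value and the plasticity factor. This is a minimal sketch under stated assumptions: the function name, the array shapes, the sign convention of the error, and the multiplicative `decay` schedule are hypothetical and are not taken from the patent.

```python
import numpy as np

def update_second_node(weights, plasticity, features, error, eta=0.1, decay=0.9):
    """Update w_ij += eta * e_i * X_j * p_ij only for columns j where X_j is non-zero."""
    active = np.nonzero(features)[0]          # feature elements kept by the top-k selection
    weights[:, active] += eta * np.outer(error, features[active]) * plasticity[:, active]
    plasticity[:, active] *= decay            # assumed schedule; the claims only say "decrease"
    return weights, plasticity

# Illustrative values: 2 outputs, 4 feature elements, 2 of them non-zero.
weights = np.zeros((2, 4))
plasticity = np.ones((2, 4))
features = np.array([0.5, 0.0, 0.0, 1.0])     # second feature vector (sparse)
error = np.array([1.0, -2.0])                 # e_i = target_i - output_i (assumed convention)

update_second_node(weights, plasticity, features, error)
```

Weights and plasticity factors in the columns for the zeroed feature elements are left untouched, so knowledge stored there is not overwritten by this data item.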

Current Assignee: International Business Machines Corporation
Original Assignee: Apprente, Inc. (International Business Machines Corporation)
Inventors: Arel, Itamar; Looks, Joshua Benjamin
Primary Examiner(s): Chen, Alan
Application Number: US15/785,270
Time in Patent Office: 309 Days
Field of Search: None

CPC Class Codes

G06F 16/285   Clustering or classification

G06N 20/00   Machine learning

G06N 3/042   Knowledge-based neural netw...

G06N 3/08   Learning methods
