DISCRETE VARIATIONAL AUTO-ENCODER SYSTEMS AND METHODS FOR MACHINE LEARNING USING ADIABATIC QUANTUM COMPUTERS
Abstract
A computational system can include digital circuitry and analog circuitry, for instance a digital processor and a quantum processor. The quantum processor can operate as a sample generator providing samples. Samples can be employed by the digital processor in implementing various machine learning techniques. For example, the computational system can perform unsupervised learning over an input space, for example via a discrete variational auto-encoder, attempting to maximize the log-likelihood of an observed dataset. Maximizing the log-likelihood of the observed dataset can include generating a hierarchical approximating posterior.
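The abstract casts the quantum processor as a sample generator over a Boltzmann-distributed prior on discrete latent variables. As a purely classical, illustrative stand-in (not the patent's hardware method), a block-Gibbs sampler over a restricted Boltzmann machine produces the same kind of discrete samples; all names, shapes, and parameters below are assumptions made for the sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sample_rbm(W, b_v, b_h, n_samples=64, n_steps=100, rng=None):
    """Block-Gibbs sampling from an RBM p(v, h) proportional to
    exp(v @ W @ h + b_v @ v + b_h @ h), with binary units.

    Stands in for the quantum processor's role as a sample generator over
    a Boltzmann prior; a quantum annealer would return such samples natively.
    """
    rng = rng or np.random.default_rng(0)
    n_v, n_h = W.shape
    v = rng.integers(0, 2, size=(n_samples, n_v)).astype(float)
    h = np.zeros((n_samples, n_h))
    for _ in range(n_steps):
        # Alternate conditional updates: p(h|v) then p(v|h).
        h = (rng.random((n_samples, n_h)) < sigmoid(v @ W + b_h)).astype(float)
        v = (rng.random((n_samples, n_v)) < sigmoid(h @ W.T + b_v)).astype(float)
    return v, h

# Example usage with small illustrative weights.
rng = np.random.default_rng(7)
W = 0.1 * rng.standard_normal((16, 8))
v, h = gibbs_sample_rbm(W, np.zeros(16), np.zeros(8))
```

In the patent's setting, the digital processor would consume such samples during training while the analog (quantum) circuitry generates them.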
Claims
1. A method for unsupervised learning over an input space comprising discrete or continuous variables, and at least a subset of a training dataset of samples of the respective variables, to attempt to identify the value of at least one parameter that increases the log-likelihood of the at least a subset of a training dataset with respect to a model, the model expressible as a function of the at least one parameter, the method executed by circuitry including at least one processor and comprising:
forming a first latent space comprising a plurality of random variables, the plurality of random variables comprising one or more discrete random variables;
forming a second latent space comprising the first latent space and a set of supplementary continuous random variables;
forming a first transforming distribution comprising a conditional distribution over the set of supplementary continuous random variables, conditioned on the one or more discrete random variables of the first latent space;
forming an encoding distribution comprising an approximating posterior distribution over the first latent space, conditioned on the input space;
forming a prior distribution over the first latent space;
forming a decoding distribution comprising a conditional distribution over the input space conditioned on the set of supplementary continuous random variables;
determining an ordered set of conditional cumulative distribution functions of the supplementary continuous random variables, each cumulative distribution function comprising functions of a full distribution of at least one of the one or more discrete random variables of the first latent space;
determining an inversion of the ordered set of conditional cumulative distribution functions of the supplementary continuous random variables;
constructing a first stochastic approximation to a lower bound on the log-likelihood of the at least a subset of a training dataset;
constructing a second stochastic approximation to a gradient of the lower bound on the log-likelihood of the at least a subset of a training dataset; and
increasing the lower bound on the log-likelihood of the at least a subset of a training dataset based at least in part on the gradient of the lower bound on the log-likelihood of the at least a subset of a training dataset.
Dependent claims: 7, 8, 10, 11, 13, 14, 15, 16.
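The "ordered set of conditional cumulative distribution functions" and its inversion in claim 1 is what makes the discrete latent variables reparameterizable: uniform noise pushed through the inverse CDF yields a continuous variable zeta that is differentiable in the posterior parameters. A hedged sketch using one concrete transforming distribution, the spike-and-exponential r(zeta|z=0) = delta(zeta), r(zeta|z=1) = beta*exp(beta*zeta)/(exp(beta)-1) on [0, 1]; this is one instance of the family the claim covers, and the function name is illustrative.

```python
import numpy as np

def inverse_smoothing_cdf(q, rho, beta=5.0):
    """Invert the marginal CDF of the spike-and-exponential transform.

    q    : q(z=1 | x), approximating-posterior probability per binary latent.
    rho  : uniform(0, 1) noise of the same shape.
    beta : sharpness; as beta -> infinity, zeta approaches the discrete z.

    The marginal CDF conditioned on q is
        F(zeta) = (1 - q)*[zeta >= 0] + q*(exp(beta*zeta) - 1)/(exp(beta) - 1),
    so zeta = 0 when rho <= 1 - q (the spike), and otherwise the exponential
    branch below, which is smooth in q (reparameterization).
    """
    spike = rho <= (1.0 - q)
    t = np.clip((rho - (1.0 - q)) / np.maximum(q, 1e-12), 0.0, 1.0)
    zeta = np.log1p(t * np.expm1(beta)) / beta
    return np.where(spike, 0.0, zeta)

# Example usage: three latents with posteriors 0.1, 0.5, 0.9.
q = np.array([0.1, 0.5, 0.9])
rho = np.random.default_rng(0).random(3)
print(inverse_smoothing_cdf(q, rho))
```

For rho <= 1 - q the inverse CDF returns the spike at zeta = 0; otherwise the exponential branch varies smoothly with q, so gradients of a downstream decoder can flow back to the encoder even though the underlying latent variable is discrete.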
2. The method of claim 1 wherein increasing the lower bound on the log-likelihood of the at least a subset of a training dataset based at least in part on the gradient of the lower bound on the log-likelihood of the at least a subset of a training dataset includes increasing the lower bound on the log-likelihood of the at least a subset of a training dataset using a method of gradient descent.

3. The method of claim 2 wherein increasing the lower bound on the log-likelihood of the at least a subset of a training dataset using a method of gradient descent includes attempting to maximize the lower bound on the log-likelihood of the at least a subset of a training dataset using a method of gradient descent.
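Claims 2 and 3 phrase maximization of the bound as gradient descent: one descends the negative of the lower bound, which ascends the bound itself. A minimal, self-contained toy; the quadratic "bound" and all names are illustrative, not the patent's objective.

```python
import torch

def lower_bound(theta):
    # Toy stand-in for the lower bound L(theta); maximized at theta = 1.
    return -((theta - 1.0) ** 2).sum()

theta = torch.zeros(4, requires_grad=True)
opt = torch.optim.SGD([theta], lr=0.1)

for _ in range(200):
    opt.zero_grad()
    loss = -lower_bound(theta)  # descending -L(theta) ascends L(theta)
    loss.backward()
    opt.step()
# theta is now close to 1, where the bound is maximized.
```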
4-6. (canceled)

9. (canceled)

12. (canceled)
17. A computational system, comprising:
at least one processor; and
at least one nontransitory processor-readable storage medium communicatively coupled to the at least one processor and storing processor-executable instructions which, when executed by the at least one processor, cause the at least one processor to:
form a first latent space comprising a plurality of random variables, the plurality of random variables comprising one or more discrete random variables;
form a second latent space comprising the first latent space and a set of supplementary continuous random variables;
form a first transforming distribution comprising a conditional distribution over the set of supplementary continuous random variables, conditioned on the one or more discrete random variables of the first latent space;
form an encoding distribution comprising an approximating posterior distribution over the first latent space, conditioned on the input space;
form a prior distribution over the first latent space;
form a decoding distribution comprising a conditional distribution over the input space conditioned on the set of supplementary continuous random variables;
determine an ordered set of conditional cumulative distribution functions of the supplementary continuous random variables, each cumulative distribution function comprising functions of a full distribution of at least one of the one or more discrete random variables of the first latent space;
determine an inversion of the ordered set of conditional cumulative distribution functions of the supplementary continuous random variables;
construct a first stochastic approximation to a lower bound on the log-likelihood of the at least a subset of a training dataset;
construct a second stochastic approximation to a gradient of the lower bound on the log-likelihood of the at least a subset of a training dataset; and
increase the lower bound on the log-likelihood of the at least a subset of a training dataset based at least in part on the gradient of the lower bound on the log-likelihood of the at least a subset of a training dataset.
18. A method for unsupervised learning by a computational system, the method executable by circuitry including at least one processor and comprising:
forming a model, the model comprising one or more model parameters;
initializing the model parameters;
receiving a training dataset comprising a plurality of subsets of the training dataset;
testing to determine if a stopping criterion has been met;
in response to determining the stopping criterion has not been met:
fetching a mini-batch comprising one of the plurality of subsets of the training dataset, the mini-batch comprising input data;
performing propagation through an encoder that computes an approximating posterior distribution over a discrete space;
sampling from the approximating posterior distribution over a set of continuous random variables via a sampler;
performing propagation through a decoder that computes an auto-encoded distribution over the input data;
performing backpropagation through the decoder of a log-likelihood of the input data with respect to the auto-encoded distribution over the input data;
performing backpropagation through the sampler that samples from the approximating posterior distribution over the set of continuous random variables to generate an auto-encoding gradient;
determining a first gradient of a KL-divergence, with respect to the approximating posterior, between the approximating posterior distribution and a true prior distribution over the discrete space;
performing backpropagation through the encoder of a sum of the auto-encoding gradient and the first gradient of the KL-divergence with respect to the approximating posterior;
determining a second gradient of a KL-divergence, with respect to parameters of the true prior distribution, between the approximating posterior and the true prior distribution over the discrete space;
determining at least one of a gradient or at least a stochastic approximation of a gradient, of a bound on the log-likelihood of the input data; and
updating the model parameters based at least in part on the determined at least one of the gradient or at least a stochastic approximation of the gradient, of the bound on the log-likelihood of the input data.
Dependent claims: 22, 23, 24, 27, 28, 29, 33, 34, 35.
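Read end to end, claim 18 is a stochastic-gradient training loop with the discrete-VAE pieces slotted in. A compact sketch under loudly simplifying assumptions: a factorial Bernoulli prior replaces the Boltzmann (quantum-sampled) prior so the KL term is closed-form, the sampler is the spike-and-exponential inverse CDF from the earlier sketch, and every module name, size, and hyperparameter is illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscreteVAE(nn.Module):
    """Sketch of the claim-18 loop. A factorial Bernoulli prior stands in
    for the patent's Boltzmann (quantum-sampled) prior so KL(q || p) is
    closed-form; an illustrative simplification, not the claimed system."""

    def __init__(self, n_in=784, n_hid=256, n_lat=64, beta=5.0):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_hid), nn.ReLU(),
                                     nn.Linear(n_hid, n_lat))
        self.decoder = nn.Sequential(nn.Linear(n_lat, n_hid), nn.ReLU(),
                                     nn.Linear(n_hid, n_in))
        self.prior_logits = nn.Parameter(torch.zeros(n_lat))  # true prior params
        self.beta = beta

    def sample_zeta(self, q):
        # Sampler: inverse CDF of the spike-and-exponential smoothing
        # transform; differentiable in q on the exponential branch.
        rho = torch.rand_like(q)
        t = ((rho - (1 - q)) / q.clamp_min(1e-6)).clamp(0, 1)
        zeta = torch.log1p(t * math.expm1(self.beta)) / self.beta
        return torch.where(rho <= 1 - q, torch.zeros_like(zeta), zeta)

    def elbo(self, x):
        q = torch.sigmoid(self.encoder(x)).clamp(1e-6, 1 - 1e-6)  # q(z=1|x)
        zeta = self.sample_zeta(q)            # continuous latent sample
        logits = self.decoder(zeta)           # auto-encoded distribution
        rec = -F.binary_cross_entropy_with_logits(
            logits, x, reduction='none').sum(1)     # log p(x | zeta)
        p = torch.sigmoid(self.prior_logits).clamp(1e-6, 1 - 1e-6)
        kl = (q * torch.log(q / p)
              + (1 - q) * torch.log((1 - q) / (1 - p))).sum(1)
        return (rec - kl).mean()  # stochastic bound on the log-likelihood

model = DiscreteVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
data = torch.rand(512, 784).round()         # toy binary training dataset
for step in range(100):                     # stopping criterion: step budget
    x = data[torch.randint(0, 512, (64,))]  # fetch a mini-batch
    loss = -model.elbo(x)
    opt.zero_grad()
    loss.backward()  # backprop through decoder, sampler, and encoder,
                     # plus KL gradients w.r.t. posterior and prior params
    opt.step()       # update the model parameters
```

Autograd supplies both of claim 18's KL gradients at once: backpropagation through decoder, sampler, and encoder yields the auto-encoding gradient plus the first KL gradient, and the graph through prior_logits yields the second KL gradient with respect to the true prior's parameters.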
19-21. (canceled)

25-26. (canceled)

30-32. (canceled)

36-42. (canceled)