Exemplar-Based Latent Perceptual Modeling for Automatic Speech Recognition

US 20140088964A1
Filed: 09/25/2012
Published: 03/27/2014
Est. Priority Date: 09/25/2012
Status: Active Grant

Abstract

Methods, systems, and computer-readable media related to selecting observation-specific training data (also referred to as “observation-specific exemplars”) from a general training corpus, and then creating, from the observation-specific training data, a focused, observation-specific acoustic model for recognizing the observation in an output domain are disclosed. In one aspect, a global speech recognition model is established based on an initial set of training data; a plurality of input speech segments to be recognized in an output domain are received; and for each of the plurality of input speech segments: a respective set of focused training data relevant to the input speech segment is identified in the global speech recognition model; a respective focused speech recognition model is generated based on the respective set of focused training data; and the respective focused speech recognition model is provided to a recognition device for recognizing the input speech segment in the output domain.
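
The per-segment flow described in the abstract can be pictured with a small sketch. It is an illustration under stated assumptions, not the patented implementation: feature vectors are plain lists of floats, relevance is measured with cosine similarity, and the "focused model" is reduced to one centroid per label. The function names (`select_focused_data`, `build_focused_model`, `recognize_all`), the threshold value, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_focused_data(global_data, segment, threshold):
    """Keep only the training items that are similar enough to the segment."""
    return [(vec, label) for vec, label in global_data
            if cosine(vec, segment) >= threshold]

def build_focused_model(focused_data):
    """Toy focused model (assumption): one mean vector per label."""
    groups = defaultdict(list)
    for vec, label in focused_data:
        groups[label].append(vec)
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in groups.items()}

def recognize(model, segment):
    """Return the label of the most similar centroid, or None for an empty model."""
    if not model:
        return None
    return max(model, key=lambda label: cosine(model[label], segment))

def recognize_all(global_data, segments, threshold=0.8):
    """For each segment: select focused data, build a focused model, recognize."""
    results = []
    for segment in segments:
        focused = select_focused_data(global_data, segment, threshold)
        model = build_focused_model(focused)
        results.append(recognize(model, segment))
    return results

# Hypothetical training data: (feature vector, label) pairs.
GLOBAL_DATA = [
    ([1.0, 0.1, 0.0], "ah"),
    ([0.9, 0.2, 0.1], "ah"),
    ([0.0, 1.0, 0.1], "ee"),
    ([0.1, 0.9, 0.0], "ee"),
    ([0.0, 0.1, 1.0], "oo"),
]

print(recognize_all(GLOBAL_DATA, [[1.0, 0.0, 0.0], [0.0, 0.2, 0.9]]))
```

Each input segment gets its own small model built only from the training items near it, which is the idea the abstract calls observation-specific exemplars.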

93 Citations

30 Claims

1. A method for recognizing speech in an output domain, the method comprising:
- at a device comprising one or more processors and memory:
  - establishing a global speech recognition model based on an initial set of training data;
  - receiving a plurality of input speech segments to be recognized in the output domain; and
  - for each of the plurality of input speech segments:
    - identifying in the global speech recognition model a respective set of focused training data relevant to the input speech segment;
    - generating a respective focused speech recognition model based on the respective set of focused training data; and
    - providing the respective focused speech recognition model to a recognition device for recognizing the input speech segment in the output domain.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11):
  - 2. The method of claim 1, wherein the recognition device is a user device, and the plurality of input speech segments have been derived from a speech input received from a user by the user device.
  - 3. The method of claim 1, wherein, for at least one of the plurality of input speech segments, the global speech recognition model is a respective focused speech recognition model generated in a previous iteration of the identifying and generating performed for the at least one input speech segment.
  - 4. The method of claim 1, wherein establishing the global speech recognition model based on the initial set of training data further comprises:
    - generating the initial set of training data from a plurality of training speech samples, the initial set of training data including an initial set of speech segments and an initial set of speech templates; and
      
      deriving a global latent space from the initial set of speech segments and the initial set of speech templates.
  - 5. The method of claim 1, wherein identifying in the global speech model the respective set of focused training data relevant to the input speech segment further comprises:
    - mapping the input speech segment and a set of candidate training data into the global latent space, the set of candidate training data including candidate speech segments and candidate speech templates; and
      
      identifying, from the candidate speech segments and candidate speech templates, a plurality of exemplar segments and a plurality of exemplar templates for inclusion in the respective set of focused training data, wherein the exemplar segments and exemplar templates satisfy a threshold degree of similarity to the input speech segment as measured in the global latent space.
  - 6. The method of claim 5, further comprising:
    - generating additional training data from the plurality of training speech samples, the additional training data includes additional speech segments and additional speech templates outside of the initial set of speech segments and the initial set of speech templates.
  - 7. The method of claim 5, wherein generating the respective focused speech recognition model based on the respective set of focused training data comprises:
    - deriving a focused latent space from the plurality of exemplar segments and the plurality of exemplar templates.
  - 8. The method of claim 5, wherein deriving the focused latent space from the plurality of exemplar segments and the plurality of exemplar templates comprises:
    - modifying at least one of the pluralities of exemplar templates and exemplar segments based on the pluralities of exemplar segments and exemplar templates; and
      
      deriving the focused latent space from the pluralities of exemplar segments and exemplar templates after the modification.
  - 9. The method of claim 5, wherein modifying at least one of the pluralities of exemplar templates and exemplar segments based on the pluralities of exemplar segments and exemplar templates comprises:
    - merging two or more of the plurality of exemplar templates into a new exemplar template in the plurality of exemplar templates.
  - 10. The method of claim 5, wherein modifying at least one of the pluralities of exemplar templates and exemplar segments based on the pluralities of exemplar segments and exemplar templates comprises:
    - generating at least one new exemplar template from the plurality of exemplar segments; and
      
      including the at least one new exemplar template in the plurality of exemplar templates.
  - 11. The method of claim 5, wherein modifying at least one of the pluralities of exemplar templates and exemplar segments based on the pluralities of exemplar segments and exemplar templates comprises:
    - removing at least one exemplar template from the plurality of exemplar templates.
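
The selection step recited in claim 5 (map the input segment and the candidates into the global latent space, then keep the candidates that meet a similarity threshold) can be sketched as follows. The claims do not say how the latent space is derived or how similarity is measured, so this sketch assumes a precomputed projection (`BASIS`, a hypothetical stand-in for the global latent space) and cosine similarity. All names, vectors, and the threshold are illustrative.

```python
import math

def project(vec, basis):
    """Map a feature vector into the latent space spanned by the basis rows."""
    return [sum(b * x for b, x in zip(row, vec)) for row in basis]

def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_exemplars(segment, candidates, basis, threshold):
    """Names of candidates whose latent-space similarity meets the threshold."""
    target = project(segment, basis)
    return sorted(name for name, vec in candidates.items()
                  if cosine(project(vec, basis), target) >= threshold)

def focused_training_data(segment, cand_segments, cand_templates, basis,
                          threshold=0.9):
    """Exemplar segments and exemplar templates for one input segment."""
    return {
        "segments": select_exemplars(segment, cand_segments, basis, threshold),
        "templates": select_exemplars(segment, cand_templates, basis, threshold),
    }

# Hypothetical 2-D latent basis over 3-D features (assumed to be given).
BASIS = [[0.6, 0.8, 0.0],
         [0.0, 0.0, 1.0]]
CANDIDATE_SEGMENTS = {
    "seg_a": [0.6, 0.8, 0.1],
    "seg_b": [-0.6, -0.8, 0.0],
    "seg_c": [0.0, 0.0, 1.0],
}
CANDIDATE_TEMPLATES = {
    "tpl_x": [0.3, 0.4, 0.0],
    "tpl_y": [0.0, 0.0, -1.0],
}

print(focused_training_data([0.6, 0.8, 0.0], CANDIDATE_SEGMENTS,
                            CANDIDATE_TEMPLATES, BASIS))
```

Segments and templates are filtered in the same latent space so that both kinds of exemplar are judged against the input with one measure.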

12. A method for recognizing speech in an output domain, the method comprising:
- at a client device comprising one or more processors and memory:
  - receiving a speech input from a user;
  - for each of a plurality of input speech segments in the speech input:
    - receiving a respective focused speech recognition model, wherein the respective focused speech recognition model is generated based on a respective set of focused training data relevant to the input speech segment, wherein the respective set of focused training data is selected for the input speech segment in a global speech recognition model, and wherein the global speech recognition model is generated based on a set of global training data; and
    - recognizing the input speech segment using the respective focused speech recognition model.
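
The client-side loop of claim 12 can be sketched as below. This is a minimal illustration, assuming the focused model arrives as a mapping from label to reference vector and that recognition is nearest-reference matching. The claim specifies neither, and `fetch_focused_model` and `fake_server` are hypothetical stand-ins for whatever supplies the model.

```python
from typing import Callable, Dict, List, Optional, Sequence

Vector = Sequence[float]
FocusedModel = Dict[str, Vector]  # label -> reference vector (assumed format)

def squared_distance(u: Vector, v: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def recognize_speech_input(
    segments: List[Vector],
    fetch_focused_model: Callable[[Vector], FocusedModel],
) -> List[Optional[str]]:
    """For each segment, obtain its own focused model and recognize with it."""
    labels: List[Optional[str]] = []
    for segment in segments:
        model = fetch_focused_model(segment)  # one model per segment
        if not model:
            labels.append(None)
            continue
        labels.append(min(
            model, key=lambda label: squared_distance(model[label], segment)))
    return labels

def fake_server(segment: Vector) -> FocusedModel:
    """Stand-in provider: returns a tiny model that depends on the segment."""
    if segment[0] >= segment[1]:
        return {"ah": [1.0, 0.0], "oh": [0.7, 0.3]}
    return {"ee": [0.0, 1.0], "ih": [0.3, 0.7]}

print(recognize_speech_input([[0.9, 0.1], [0.2, 0.8]], fake_server))
```

Passing the provider as a callable keeps the client logic independent of where the focused model is built.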

13. A non-transitory computer-readable medium having instructions stored thereon, the instructions, when executed by one or more processors, cause the processors to perform operations comprising:
- establishing a global speech recognition model based on an initial set of training data;
- receiving a plurality of input speech segments to be recognized in an output domain; and
- for each of the plurality of input speech segments:
  - identifying in the global speech recognition model a respective set of focused training data relevant to the input speech segment;
  - generating a respective focused speech recognition model based on the respective set of focused training data; and
  - providing the respective focused speech recognition model to a recognition device for recognizing the input speech segment in the output domain.
- Dependent claims (14, 15, 16, 17, 18, 19, 20):
  - 14. The computer-readable medium of claim 13, wherein establishing the global speech recognition model based on the initial set of training data further comprises:
    - generating the initial set of training data from a plurality of training speech samples, the initial set of training data including an initial set of speech segments and an initial set of speech templates; and
      
      deriving a global latent space from the initial set of speech segments and the initial set of speech templates.
  - 15. The computer-readable medium of claim 13, wherein identifying in the global speech model the respective set of focused training data relevant to the input speech segment further comprises:
    - mapping the input speech segment and a set of candidate training data into the global latent space, the set of candidate training data including candidate speech segments and candidate speech templates; and
      
      identifying, from the candidate speech segments and candidate speech templates, a plurality of exemplar segments and a plurality of exemplar templates for inclusion in the respective set of focused training data, wherein the exemplar segments and exemplar templates satisfy a threshold degree of similarity to the input speech segment as measured in the global latent space.
  - 16. The computer-readable medium of claim 15, wherein the operations further comprise:
    - generating additional training data from the plurality of training speech samples, the additional training data includes additional speech segments and additional speech templates outside of the initial set of speech segments and the initial set of speech templates.
  - 17. The computer-readable medium of claim 15, wherein deriving the focused latent space from the plurality of exemplar segments and the plurality of exemplar templates comprises:
    - modifying at least one of the pluralities of exemplar templates and exemplar segments based on the pluralities of exemplar segments and exemplar templates; and
      
      deriving the focused latent space from the pluralities of exemplar segments and exemplar templates after the modification.
  - 18. The computer-readable medium of claim 15, wherein modifying at least one of the pluralities of exemplar templates and exemplar segments based on the pluralities of exemplar segments and exemplar templates comprises:
    - merging two or more of the plurality of exemplar templates into a new exemplar template in the plurality of exemplar templates.
  - 19. The computer-readable medium of claim 15, wherein modifying at least one of the pluralities of exemplar templates and exemplar segments based on the pluralities of exemplar segments and exemplar templates comprises:
    - generating at least one new exemplar template from the plurality of exemplar segments; and
      
      including the at least one new exemplar template in the plurality of exemplar templates.
  - 20. The computer-readable medium of claim 15, wherein modifying at least one of the pluralities of exemplar templates and exemplar segments based on the pluralities of exemplar segments and exemplar templates comprises:
    - removing at least one exemplar template from the plurality of exemplar templates.
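
The three template modifications recited in claims 18 to 20 (merging templates, generating a new template from exemplar segments, and removing a template) can be sketched as simple operations on a dictionary of templates. The claims do not define how templates are merged or generated. Averaging is assumed here purely for illustration, and every name below is hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def merge_templates(templates, names, new_name):
    """Replace the named templates with a single averaged template."""
    merged = mean_vector([templates[name] for name in names])
    result = {name: vec for name, vec in templates.items() if name not in names}
    result[new_name] = merged
    return result

def generate_template(templates, segments, new_name):
    """Add a new template computed as the mean of the given exemplar segments."""
    result = dict(templates)
    result[new_name] = mean_vector(segments)
    return result

def remove_template(templates, name):
    """Return a copy of the template set without the named template."""
    return {key: vec for key, vec in templates.items() if key != name}

# Hypothetical exemplar templates and segments.
TEMPLATES = {"t1": [1.0, 0.0], "t2": [3.0, 2.0], "t3": [0.0, 5.0]}
SEGMENTS = [[0.0, 2.0], [2.0, 4.0]]

print(merge_templates(TEMPLATES, ["t1", "t2"], "t12"))
print(generate_template(TEMPLATES, SEGMENTS, "t_new"))
print(remove_template(TEMPLATES, "t3"))
```

Each function returns a new dictionary rather than mutating its input, so the modified exemplar set can be compared against the original before a focused latent space is derived from it.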

21. A non-transitory computer-readable medium having instructions stored thereon, the instructions, when executed by one or more processors, cause the processors to perform operations comprising:
- at a client device:
  - receiving a speech input from a user;
  - for each of a plurality of input speech segments in the speech input:
    - receiving a respective focused speech recognition model, wherein the respective focused speech recognition model is generated based on a respective set of focused training data relevant to the input speech segment, wherein the respective set of focused training data is selected for the input speech segment in a global speech recognition model, and wherein the global speech recognition model is generated based on a set of global training data; and
    - recognizing the input speech segment using the respective focused speech recognition model.

22. A system, comprising:
- one or more processors; and
- memory having instructions stored thereon, the instructions, when executed by the one or more processors, cause the processors to perform operations comprising:
  - establishing a global speech recognition model based on an initial set of training data;
  - receiving a plurality of input speech segments to be recognized in an output domain; and
  - for each of the plurality of input speech segments:
    - identifying in the global speech recognition model a respective set of focused training data relevant to the input speech segment;
    - generating a respective focused speech recognition model based on the respective set of focused training data; and
    - providing the respective focused speech recognition model to a recognition device for recognizing the input speech segment in the output domain.
- Dependent claims (23, 24, 25, 26, 27, 28, 29):
  - 23. The system of claim 22, wherein establishing the global speech recognition model based on the initial set of training data further comprises:
    - generating the initial set of training data from a plurality of training speech samples, the initial set of training data including an initial set of speech segments and an initial set of speech templates; and
      
      deriving a global latent space from the initial set of speech segments and the initial set of speech templates.
  - 24. The system of claim 22, wherein identifying in the global speech model the respective set of focused training data relevant to the input speech segment further comprises:
    - mapping the input speech segment and a set of candidate training data into the global latent space, the set of candidate training data including candidate speech segments and candidate speech templates; and
      
      identifying, from the candidate speech segments and candidate speech templates, a plurality of exemplar segments and a plurality of exemplar templates for inclusion in the respective set of focused training data, wherein the exemplar segments and exemplar templates satisfy a threshold degree of similarity to the input speech segment as measured in the global latent space.
  - 25. The system of claim 24, wherein the operations further comprise:
    - generating additional training data from the plurality of training speech samples, the additional training data includes additional speech segments and additional speech templates outside of the initial set of speech segments and the initial set of speech templates.
  - 26. The system of claim 24, wherein deriving the focused latent space from the plurality of exemplar segments and the plurality of exemplar templates comprises:
    - modifying at least one of the pluralities of exemplar templates and exemplar segments based on the pluralities of exemplar segments and exemplar templates; and
      
      deriving the focused latent space from the pluralities of exemplar segments and exemplar templates after the modification.
  - 27. The system of claim 24, wherein modifying at least one of the pluralities of exemplar templates and exemplar segments based on the pluralities of exemplar segments and exemplar templates comprises:
    - merging two or more of the plurality of exemplar templates into a new exemplar template in the plurality of exemplar templates.
  - 28. The system of claim 24, wherein modifying at least one of the pluralities of exemplar templates and exemplar segments based on the pluralities of exemplar segments and exemplar templates comprises:
    - generating at least one new exemplar template from the plurality of exemplar segments; and
      
      including the at least one new exemplar template in the plurality of exemplar templates.
  - 29. The system of claim 24, wherein modifying at least one of the pluralities of exemplar templates and exemplar segments based on the pluralities of exemplar segments and exemplar templates comprises:
    - removing at least one exemplar template from the plurality of exemplar templates.
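
Claims 23 and 26 speak of deriving a latent space from segments and templates without saying how. One common way to obtain such a space, assumed here only for illustration, is a singular-value-style decomposition of a segment-by-template matrix. The sketch below approximates just the dominant latent direction with power iteration. The matrix entries are hypothetical closeness scores, and none of the names come from the patent.

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def transpose(matrix):
    """Swap rows and columns."""
    return [list(col) for col in zip(*matrix)]

def normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dominant_direction(matrix, iterations=50):
    """Approximate the top right singular vector via power iteration on M^T M."""
    m_t = transpose(matrix)
    vec = normalize([1.0] * len(matrix[0]))
    for _ in range(iterations):
        vec = normalize(matvec(m_t, matvec(matrix, vec)))
    return vec

def latent_coordinates(matrix, direction):
    """Coordinate of each row (segment) along the latent direction."""
    return matvec(matrix, direction)

# Hypothetical scores: rows are exemplar segments, columns are exemplar templates.
SEGMENT_TEMPLATE = [
    [2.0, 0.0],
    [0.0, 1.0],
    [2.0, 0.0],
]

direction = dominant_direction(SEGMENT_TEMPLATE)
print(direction)
print(latent_coordinates(SEGMENT_TEMPLATE, direction))
```

A full decomposition would keep several directions rather than one. The single-direction version is kept short to show the idea of re-deriving the space from only the exemplars selected for a given input.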

30. A system, comprising:
- one or more processors; and
- memory having instructions stored thereon, the instructions, when executed by one or more processors, cause the processors to perform operations comprising:
  - at a client device:
    - receiving a speech input from a user;
    - for each of a plurality of input speech segments in the speech input:
      - receiving a respective focused speech recognition model from a server, wherein the respective focused speech recognition model is generated based on a respective set of focused training data relevant to the input speech segment, wherein the respective set of focused training data is selected for the input speech segment in a global speech recognition model, and wherein the global speech recognition model is generated based on a set of global training data; and
      - recognizing the input speech segment using the respective focused speech recognition model.

Current Assignee: Apple Inc.
Original Assignee: Apple Inc.
Inventors: Bellegarda, Jerome
Granted Patent: US 8,935,167 B2
US Class Current: 704/243
CPC Class Codes: G10L 15/063 Training
